[Wikidata-bugs] [Maniphest] T291089: Proposal: Generate Wikidata JSON & RDF dumps from Hadoop

2024-04-08 Thread Lydia_Pintscher
Lydia_Pintscher added a parent task: T88991: improve Wikidata dumps [tracking]. TASK DETAIL https://phabricator.wikimedia.org/T291089 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: Lydia_Pintscher Cc: Michael, So9q, WMDE-leszek, Zbyszko,

[Wikidata-bugs] [Maniphest] T291089: Proposal: Generate Wikidata JSON & RDF dumps from Hadoop

2021-09-16 Thread odimitrijevic
odimitrijevic edited projects, added Analytics-Radar; removed Analytics. odimitrijevic moved this task from Incoming to Datasets on the Analytics board. WORKBOARD https://phabricator.wikimedia.org/project/board/11/

[Wikidata-bugs] [Maniphest] T291089: Proposal: Generate Wikidata JSON & RDF dumps from Hadoop

2021-09-15 Thread Addshore
Addshore added a comment. From IRC > 7:32 PM <+dcausse> addshore: I'm not convinced that RecentChanges is more reliable than the revision-create stream, using this stream did improve consistency of wdqs IIRC This is indeed true, RC is also not a totally reliable thing. Really

[Wikidata-bugs] [Maniphest] T291089: Proposal: Generate Wikidata JSON & RDF dumps from Hadoop

2021-09-15 Thread Addshore
Addshore added a comment. > Perhaps the existing logic in the WDQS updater to generate its RDF stream could be factored out into its own service? Or, at least, it could emit its RDF stream as a side output into a Kafka topic? This relies on the Kafka event streams right now, thus also

[Wikidata-bugs] [Maniphest] T291089: Proposal: Generate Wikidata JSON & RDF dumps from Hadoop

2021-09-15 Thread Ottomata
Ottomata added a comment. > I imagine other sources like https://wikitech.wikimedia.org/wiki/Event_Platform/EventStreams would all have the same problems? Yes, EventStreams uses the same data.

[Wikidata-bugs] [Maniphest] T291089: Proposal: Generate Wikidata JSON & RDF dumps from Hadoop

2021-09-15 Thread Ottomata
Ottomata added a comment. > And the new query service flink updater could also make use of the RDF stream Perhaps the existing logic in the WDQS updater to generate its RDF stream could be factored out into its own service? Or, at least, it could emit its RDF stream as a side output

[Wikidata-bugs] [Maniphest] T291089: Proposal: Generate Wikidata JSON & RDF dumps from Hadoop

2021-09-15 Thread Addshore
Addshore added a comment. In T291089#7356181, @Ottomata wrote: >> a reliable and consistent input (such as MediaWiki recent changes) > > I guess by this you mean polling the MW RecentChanges API? Yes I imagine that is the only

[Wikidata-bugs] [Maniphest] T291089: Proposal: Generate Wikidata JSON & RDF dumps from Hadoop

2021-09-15 Thread Ottomata
Ottomata added a comment. > a reliable and consistent input (such as MediaWiki recent changes) I guess by this you mean polling the MW RecentChanges API?

[Wikidata-bugs] [Maniphest] T291089: Proposal: Generate Wikidata JSON & RDF dumps from Hadoop

2021-09-15 Thread Addshore
Addshore updated the task description.

[Wikidata-bugs] [Maniphest] T291089: Proposal: Generate Wikidata JSON & RDF dumps from Hadoop

2021-09-15 Thread Addshore
Addshore added a comment. Cross linking to T290839: Evaluate a double backend strategy for WDQS which I was reading when I decided to write this down, particularly when reading T290839#7354690

[Wikidata-bugs] [Maniphest] T291089: Proposal: Generate Wikidata JSON & RDF dumps from Hadoop

2021-09-15 Thread Addshore
Addshore created this task. Addshore added projects: Analytics, Dumps-Generation, Wikidata, wdwb-tech. TASK DESCRIPTION Wikidata dumps currently come directly from the SQL servers. The general process here is to iterate through all pages, and slowly write all content to files (possibly in