JAllemandou added a comment. |
Hi @Isaac, I have generated some parquet data here /user/joal/wmf/data/wmf/wikidata/item_page_link/20190204 with the following query:
spark.sql("SET spark.sql.shuffle.partitions=128") val wikidataParquetPath = "/user/joal/wmf/data/wmf/mediawiki/wikidata_parquet/20190204" spark.read.parquet(wikidataParquetPath).createOrReplaceTempView("wikidata")
spark.sql(""" WITH namespaced_revisions AS ( SELECT wiki_db, page_id, page_title, page_namespace, CASE WHEN (LENGTH(namespace_localized_name) > 0) THEN CONCAT(namespace_localized_name, ':', page_title) ELSE page_title END AS title_namespace_localized FROM ( SELECT wiki_db, page_id, page_title, page_namespace, row_number() OVER (PARTITION BY wiki_db, page_id ORDER BY start_timestamp DESC) as row_num FROM wmf.mediawiki_page_history WHERE snapshot = '2019-01' AND page_id IS NOT NULL AND page_id > 0 AND page_title IS NOT NULL and LENGTH(page_title) > 0 ) ph INNER JOIN wmf_raw.mediawiki_project_namespace_map nsm ON ( ph.wiki_db = nsm.dbname AND ph.page_namespace = nsm.namespace AND nsm.snapshot = '2019-01' ) WHERE row_num = 1 ), wikidata_sitelinks AS ( SELECT id as item_id, EXPLODE(siteLinks) AS sitelink FROM wikidata WHERE size(siteLinks) > 0 ) SELECT item_id, wiki_db, page_id, page_title, page_namespace, title_namespace_localized FROM wikidata_sitelinks ws INNER JOIN namespaced_revisions nsr ON ( ws.sitelink.site = nsr.wiki_db AND ws.sitelink.title = title_namespace_localized ) """).repartition(16).write.parquet("/user/joal/wmf/data/wmf/wikidata/item_page_link/20190204")
@diego : I can generate similar data for by-revision, but before doing wanted to be sure we agree on the fact that it'll flag every historical revision associated to an item currently linked to a page. Is that what you're after ? Or more of the history of linkagebetween page and item ?
Thanks!
TASK DETAIL
EMAIL PREFERENCES
To: JAllemandou
Cc: Isaac, Tbayer, jcrespo, EBernhardson, Halfak, Nuria, JAllemandou, diego, Nandana, Akovalyov, Banyek, AndyTan, Rayssa-, Lahi, Gq86, GoranSMilovanovic, QZanden, Marostegui, LawExplorer, Avner, Minhnv-2809, _jensen, Luke081515, Wikidata-bugs, aude, Capt_Swing, Dinoguy1000, Mbch331, Jay8g, Krenair, jeremyb
Cc: Isaac, Tbayer, jcrespo, EBernhardson, Halfak, Nuria, JAllemandou, diego, Nandana, Akovalyov, Banyek, AndyTan, Rayssa-, Lahi, Gq86, GoranSMilovanovic, QZanden, Marostegui, LawExplorer, Avner, Minhnv-2809, _jensen, Luke081515, Wikidata-bugs, aude, Capt_Swing, Dinoguy1000, Mbch331, Jay8g, Krenair, jeremyb
_______________________________________________ Wikidata-bugs mailing list Wikidata-bugs@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs