JAllemandou added a comment.
@diego :
This has worked for me (takes some time to compute and needs a bunch of resources). I hope it's close enough to what you want :) :
spark.sql("SET spark.sql.shuffle.partitions=512")

val wikidataParquetPath = "/user/joal/wmf/data/wmf/mediawiki/wikidata_parquet/20181001"
spark.read.parquet(wikidataParquetPath).createOrReplaceTempView("wikidata")

val df = spark.sql("""
WITH namespaced_revisions AS (
  SELECT
    wiki_db,
    revision_id,
    event_timestamp,
    page_title,
    page_namespace,
    CASE
      WHEN (LENGTH(namespace_localized_name) > 0)
        THEN CONCAT(namespace_localized_name, ':', page_title)
      ELSE page_title
    END AS title_namespace_localized
  FROM wmf.mediawiki_history mwh
    INNER JOIN wmf_raw.mediawiki_project_namespace_map nsm
      ON (
        mwh.wiki_db = nsm.dbname
        AND mwh.page_namespace = nsm.namespace
        AND mwh.snapshot = nsm.snapshot
      )
  WHERE mwh.snapshot = '2019-01'
    AND nsm.snapshot = '2019-01'
    AND event_entity = 'revision'
    AND NOT revision_is_deleted
),

wikidata_sitelinks AS (
  SELECT
    id as item_id,
    EXPLODE(siteLinks) AS sitelink
  FROM wikidata
  WHERE size(siteLinks) > 0
)

SELECT
  item_id,
  wiki_db,
  revision_id,
  event_timestamp,
  page_title,
  page_namespace
FROM wikidata_sitelinks ws
  INNER JOIN namespaced_revisions nsr
    ON (
      ws.sitelink.site = nsr.wiki_db
      AND ws.sitelink.title = title_namespace_localized
    )
""")
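The join key is the least obvious part of the query, so here is a minimal plain-Scala sketch of what the CASE WHEN expression computes for title_namespace_localized. The helper name titleNamespaceLocalized and the sample values are hypothetical and only for illustration; they are not part of the query or of any table.

```scala
// Hypothetical helper mirroring the CASE WHEN expression in the query:
// prefix the page title with the localized namespace name when there is one,
// otherwise keep the bare title (main namespace, or a missing name).
def titleNamespaceLocalized(namespaceLocalizedName: String, pageTitle: String): String =
  Option(namespaceLocalizedName).filter(_.nonEmpty) match {
    case Some(ns) => s"$ns:$pageTitle" // e.g. a talk or user page
    case None     => pageTitle         // empty or null name falls through to ELSE
  }

// Illustrative values only (assumed, not taken from the data).
val mainTitle = titleNamespaceLocalized("", "Some_page")
val talkTitle = titleNamespaceLocalized("Talk", "Some_page")
```

The sitelink title is then compared to this value with plain equality, so the join only matches when both sides spell the title identically.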
To: JAllemandou
Cc: Nuria, JAllemandou, diego, Nandana, Akovalyov, Banyek, AndyTan, Rayssa-, Lahi, Gq86, GoranSMilovanovic, QZanden, Marostegui, LawExplorer, Avner, Minhnv-2809, _jensen, Luke081515, Wikidata-bugs, aude, Capt_Swing, Dinoguy1000, Mbch331, Jay8g, Krenair, jeremyb
_______________________________________________
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs