JAllemandou added a comment.

@diego :
This worked for me (it takes a while to compute and needs a fair amount of resources). I hope it's close enough to what you want :)

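// Raise shuffle parallelism above Spark's default of 200 partitions for the large joins below
spark.sql("SET spark.sql.shuffle.partitions=512")
// Wikidata items stored as Parquet, registered as the temp view `wikidata`
val wikidataParquetPath = "/user/joal/wmf/data/wmf/mediawiki/wikidata_parquet/20181001"
spark.read.parquet(wikidataParquetPath).createOrReplaceTempView("wikidata")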

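// Link every non-deleted revision to its Wikidata item via sitelinks (site + namespaced page title)
val df = spark.sql("""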

WITH namespaced_revisions AS (
  SELECT
    wiki_db,
    revision_id,
    event_timestamp,
    page_title,
    page_namespace,
    CASE WHEN (LENGTH(namespace_localized_name) > 0)
      THEN CONCAT(namespace_localized_name, ':', page_title)
      ELSE page_title
    END AS title_namespace_localized
  FROM wmf.mediawiki_history mwh
    INNER JOIN wmf_raw.mediawiki_project_namespace_map nsm
      ON (
        mwh.wiki_db = nsm.dbname
        AND mwh.page_namespace = nsm.namespace
        AND mwh.snapshot = nsm.snapshot
      )
  WHERE mwh.snapshot = '2019-01'
    AND nsm.snapshot = '2019-01'
    AND event_entity = 'revision'
    AND NOT revision_is_deleted
),

wikidata_sitelinks AS (
  SELECT
    id as item_id,
    EXPLODE(siteLinks) AS sitelink
  FROM wikidata
  WHERE size(siteLinks) > 0
)

SELECT
  item_id,
  wiki_db,
  revision_id,
  event_timestamp,
  page_title,
  page_namespace
FROM wikidata_sitelinks ws
  INNER JOIN namespaced_revisions nsr
    ON (
      ws.sitelink.site = nsr.wiki_db
      AND ws.sitelink.title = title_namespace_localized
    )
""")
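
To make the matching rule in the query easier to check, here is a minimal sketch of the same logic on plain Scala collections, without Spark. It is only an illustration: the case classes, the object name and all the sample rows (wikis, IDs, titles) are made up for the example and do not come from the real tables.

```scala
// Toy, in-memory version of the join logic; every value below is invented for illustration.
final case class Revision(wikiDb: String, revisionId: Long, pageTitle: String, namespaceLocalizedName: String)
final case class Sitelink(itemId: String, site: String, title: String)
final case class Match(itemId: String, wikiDb: String, revisionId: Long, pageTitle: String)

object SitelinkJoinSketch {
  // Mirrors the CASE WHEN in namespaced_revisions: prefix the title with the
  // localized namespace name only when that name is non-empty.
  def titleNamespaceLocalized(r: Revision): String =
    if (r.namespaceLocalizedName.nonEmpty) s"${r.namespaceLocalizedName}:${r.pageTitle}"
    else r.pageTitle

  // Mirrors the final INNER JOIN: a sitelink matches a revision when
  // (site, title) equals (wiki_db, title_namespace_localized).
  def joinSitelinks(sitelinks: Seq[Sitelink], revisions: Seq[Revision]): Seq[Match] = {
    val byKey: Map[(String, String), Seq[Revision]] =
      revisions.groupBy(r => (r.wikiDb, titleNamespaceLocalized(r)))
    for {
      s <- sitelinks
      r <- byKey.getOrElse((s.site, s.title), Seq.empty)
    } yield Match(s.itemId, r.wikiDb, r.revisionId, r.pageTitle)
  }

  def main(args: Array[String]): Unit = {
    val revisions = Seq(
      Revision("enwiki", 101L, "Paris", ""),         // main namespace: no prefix
      Revision("enwiki", 102L, "Paris", ""),         // second revision of the same page
      Revision("enwiki", 103L, "Paris", "Category")  // becomes "Category:Paris"
    )
    val sitelinks = Seq(
      Sitelink("Q_A", "enwiki", "Paris"),
      Sitelink("Q_B", "enwiki", "Category:Paris"),
      Sitelink("Q_C", "dewiki", "Paris")             // no revisions for this wiki: dropped
    )
    joinSitelinks(sitelinks, revisions).foreach(println)
  }
}
```

As in the SQL, an item gets one output row per matching revision, and sitelinks without any matching revision disappear because the join is an inner join.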

TASK DETAIL
https://phabricator.wikimedia.org/T215616
