JAllemandou added a comment.

Hi @Isaac, I have generated some parquet data here /user/joal/wmf/data/wmf/wikidata/item_page_link/20190204 with the following query:

spark.sql("SET spark.sql.shuffle.partitions=128")
val wikidataParquetPath = "/user/joal/wmf/data/wmf/mediawiki/wikidata_parquet/20190204"
spark.read.parquet(wikidataParquetPath).createOrReplaceTempView("wikidata")

spark.sql("""

WITH namespaced_revisions AS (
  SELECT
    wiki_db,
    page_id,
    page_title,
    page_namespace,
    CASE WHEN (LENGTH(namespace_localized_name) > 0)
      THEN CONCAT(namespace_localized_name, ':', page_title)
      ELSE page_title
    END AS title_namespace_localized
  FROM (
    SELECT
      wiki_db,
      page_id,
      page_title,
      page_namespace,
      row_number() OVER (PARTITION BY wiki_db, page_id ORDER BY start_timestamp DESC) as row_num
    FROM wmf.mediawiki_page_history
    WHERE snapshot = '2019-01'
      AND page_id IS NOT NULL AND page_id > 0
      AND page_title IS NOT NULL and LENGTH(page_title) > 0
  ) ph
    INNER JOIN wmf_raw.mediawiki_project_namespace_map nsm
      ON (
        ph.wiki_db = nsm.dbname
        AND ph.page_namespace = nsm.namespace
        AND nsm.snapshot = '2019-01'
      )
  WHERE row_num = 1
),

wikidata_sitelinks AS (
  SELECT
    id as item_id,
    EXPLODE(siteLinks) AS sitelink
  FROM wikidata
  WHERE size(siteLinks) > 0
)

SELECT
  item_id,
  wiki_db,
  page_id,
  page_title,
  page_namespace,
  title_namespace_localized
FROM wikidata_sitelinks ws
  INNER JOIN namespaced_revisions nsr
    ON (
      ws.sitelink.site = nsr.wiki_db
      AND ws.sitelink.title = title_namespace_localized
    )
""").repartition(16).write.parquet("/user/joal/wmf/data/wmf/wikidata/item_page_link/20190204")

@diego : I can generate similar data for by-revision, but before doing wanted to be sure we agree on the fact that it'll flag every historical revision associated to an item currently linked to a page. Is that what you're after ? Or more of the history of linkagebetween page and item ?
Thanks!


TASK DETAIL
https://phabricator.wikimedia.org/T215616

EMAIL PREFERENCES
https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: JAllemandou
Cc: Isaac, Tbayer, jcrespo, EBernhardson, Halfak, Nuria, JAllemandou, diego, Nandana, Akovalyov, Banyek, AndyTan, Rayssa-, Lahi, Gq86, GoranSMilovanovic, QZanden, Marostegui, LawExplorer, Avner, Minhnv-2809, _jensen, Luke081515, Wikidata-bugs, aude, Capt_Swing, Dinoguy1000, Mbch331, Jay8g, Krenair, jeremyb
_______________________________________________
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs

Reply via email to