[Wikidata-bugs] [Maniphest] [Commented On] T215616: Improve interlingual links across wikis through Wikidata IDs

2019-04-22 Thread mforns
mforns added a comment. @diego Hi! Is there anything additional for us in Analytics here? Thanks. TASK DETAIL https://phabricator.wikimedia.org/T215616

[Wikidata-bugs] [Maniphest] [Commented On] T215616: Improve interlingual links across wikis through Wikidata IDs

2019-02-26 Thread Isaac
Isaac added a comment. Hey @JAllemandou - this is great! thanks for catching that - looks all good to me now too.

[Wikidata-bugs] [Maniphest] [Commented On] T215616: Improve interlingual links across wikis through Wikidata IDs

2019-02-26 Thread JAllemandou
JAllemandou added a comment. Hi @Isaac, sorry for the issue. I corrected the query above (last query, join criteria): `AND ws.sitelink.title = title_namespace_localized` --> `AND REPLACE(ws.sitelink.title, ' ', '_') = title_namespace_localized`. We were not joining correctly on title as
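
The join fix quoted above can be illustrated outside Spark. The following is a minimal plain-Python sketch, not the query from the thread: it assumes that Wikidata sitelink titles use spaces while the page-side column `title_namespace_localized` uses underscores, which is what the `REPLACE(..., ' ', '_')` correction implies. The sample rows, item IDs, page IDs, and the helper names `to_db_title` and `join_items_to_pages` are hypothetical.

```python
def to_db_title(sitelink_title: str) -> str:
    """Mirror REPLACE(title, ' ', '_') from the corrected join criteria."""
    return sitelink_title.replace(" ", "_")


def join_items_to_pages(sitelinks, pages, normalize):
    """Inner-join sitelinks to pages on (wiki_db, normalized title)."""
    # Index the page side once so each sitelink lookup is O(1).
    page_index = {
        (p["wiki_db"], p["title_namespace_localized"]): p["page_id"]
        for p in pages
    }
    joined = []
    for s in sitelinks:
        key = (s["wiki_db"], normalize(s["title"]))
        if key in page_index:
            joined.append((s["item_id"], s["wiki_db"], page_index[key]))
    return joined


# Hypothetical sample data: one multi-word title, one single-word title.
sitelinks = [
    {"item_id": "Q-hypothetical-1", "wiki_db": "dewiki", "title": "Some Example Page"},
    {"item_id": "Q-hypothetical-2", "wiki_db": "dewiki", "title": "Single"},
]
pages = [
    {"wiki_db": "dewiki", "page_id": 101, "title_namespace_localized": "Some_Example_Page"},
    {"wiki_db": "dewiki", "page_id": 102, "title_namespace_localized": "Single"},
]

# Joining on the raw title silently drops every multi-word title.
naive = join_items_to_pages(sitelinks, pages, lambda t: t)
# Normalizing spaces to underscores recovers them.
fixed = join_items_to_pages(sitelinks, pages, to_db_title)
```

In this sketch the unnormalized join only matches single-word titles, which is consistent with the symptom reported on 2019-02-25 of ordinary articles going missing from the output.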

[Wikidata-bugs] [Maniphest] [Commented On] T215616: Improve interlingual links across wikis through Wikidata IDs

2019-02-25 Thread Isaac
Isaac added a comment. Hey @JAllemandou, some debugging: a number of items aren't showing up and I can't for the life of me figure out why. The few I've looked at are pretty normal articles (for example: https://de.wikipedia.org/wiki/Gregor_Grillemeier) and show up in the original parquet files

[Wikidata-bugs] [Maniphest] [Commented On] T215616: Improve interlingual links across wikis through Wikidata IDs

2019-02-21 Thread JAllemandou
JAllemandou added a comment. We're on the same page @diego :) I can precompute the table described in ii) if needed, and will surely do it once we have the wikidata dump productionized. Let me know if you need it before then.

[Wikidata-bugs] [Maniphest] [Commented On] T215616: Improve interlingual links across wikis through Wikidata IDs

2019-02-21 Thread diego
diego added a comment. I think we are talking about three different things: i) page_id -> CurrentWikidataItem: this was my original request, and I think @JAllemandou's script solves this issue. Having that table updated would be great. ii) revision_id -> CurrentWikidataItem: This can

[Wikidata-bugs] [Maniphest] [Commented On] T215616: Improve interlingual links across wikis through Wikidata IDs

2019-02-19 Thread JAllemandou
JAllemandou added a comment. Thanks @Isaac for reformulating the question I tried to explain above :) @diego: Can you confirm there is value for you in having revisions tied to wikidata-items regardless of when the link happened?

[Wikidata-bugs] [Maniphest] [Commented On] T215616: Improve interlingual links across wikis through Wikidata IDs

2019-02-19 Thread Isaac
Isaac added a comment. @diego: my interpretation is that right now in the revision history version, the same wikidb/page ID/title is associated with the same wikidata ID regardless of when the revision occurred. what is the use for that over a table that has just one entry per wikidb/page

[Wikidata-bugs] [Maniphest] [Commented On] T215616: Improve interlingual links across wikis through Wikidata IDs

2019-02-19 Thread diego
diego added a comment. @JAllemandou, yes. Having this by revision would be great!

[Wikidata-bugs] [Maniphest] [Commented On] T215616: Improve interlingual links across wikis through Wikidata IDs

2019-02-19 Thread Isaac
Isaac added a comment. thank you @JAllemandou this is awesome!!! completely unblocks me (i have a bunch of page titles across all the wikipedias and need to check whether a pair of them match the same wikidata item)!

[Wikidata-bugs] [Maniphest] [Commented On] T215616: Improve interlingual links across wikis through Wikidata IDs

2019-02-18 Thread JAllemandou
JAllemandou added a comment. Hi @Isaac, I have generated some parquet data here /user/joal/wmf/data/wmf/wikidata/item_page_link/20190204 with the following query: spark.sql("SET spark.sql.shuffle.partitions=128") val wikidataParquetPath =

[Wikidata-bugs] [Maniphest] [Commented On] T215616: Improve interlingual links across wikis through Wikidata IDs

2019-02-11 Thread diego
diego added a comment. @Tbayer, great. Thanks.

[Wikidata-bugs] [Maniphest] [Commented On] T215616: Improve interlingual links across wikis through Wikidata IDs

2019-02-11 Thread diego
diego added a comment. @jcrespo, the API works well for querying specific pages/entities, but not, for example, for finding which pages that exist in X_wiki are missing from Y_wiki. My point here is that the wikidata identifier is currently the main identifier for a page/concept, and that this fact
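
The bulk question raised here (which pages in X_wiki have no counterpart in Y_wiki) becomes a simple grouping once an item-to-page link table exists. This is a rough plain-Python sketch under that assumption, not code from the thread; the `(item_id, wiki_db, page_id)` row layout, the sample values, and the helper name `missing_on` are all hypothetical.

```python
from collections import defaultdict


def missing_on(links, source_wiki, target_wiki):
    """Return (item_id, source_page_id) for items linked on source_wiki
    but with no sitelink on target_wiki."""
    # Group page IDs by Wikidata item, then by wiki.
    wikis_by_item = defaultdict(dict)
    for item_id, wiki_db, page_id in links:
        wikis_by_item[item_id][wiki_db] = page_id
    return sorted(
        (item_id, wikis[source_wiki])
        for item_id, wikis in wikis_by_item.items()
        if source_wiki in wikis and target_wiki not in wikis
    )


# Hypothetical link rows: (item_id, wiki_db, page_id).
links = [
    ("Q-hypothetical-1", "x_wiki", 11),
    ("Q-hypothetical-1", "y_wiki", 21),
    ("Q-hypothetical-2", "x_wiki", 12),
    ("Q-hypothetical-3", "y_wiki", 23),
]

gaps = missing_on(links, "x_wiki", "y_wiki")
```

Keying on the Wikidata item rather than on page titles is what makes the comparison language-independent, which is the point diego makes about the identifier.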

[Wikidata-bugs] [Maniphest] [Commented On] T215616: Improve interlingual links across wikis through Wikidata IDs

2019-02-11 Thread jcrespo
jcrespo added a comment. diego added a project: DBA. I don't understand what the actionable is here for us. Without context, I would say that: (1) querying the wb_items_per_site table in the wikidatawiki on MariaDB, or (2) through the sitelinks on Wikidata JSON dumps. That is not accurate,

[Wikidata-bugs] [Maniphest] [Commented On] T215616: Improve interlingual links across wikis through Wikidata IDs

2019-02-11 Thread EBernhardson
EBernhardson added a comment. In T215616#4944986, @diego wrote: @EBernhardson, this looks like exactly what I was looking for, initially. Thank you very much for that. However, I won't close this task, because wikibase_item is still missing the page_id information. Joining by page_title does not

[Wikidata-bugs] [Maniphest] [Commented On] T215616: Improve interlingual links across wikis through Wikidata IDs

2019-02-11 Thread diego
diego added a comment. @EBernhardson, this looks like exactly what I was looking for, initially. Thank you very much for that. However, I won't close this task, because wikibase_item is still missing the page_id information. Joining by page_title does not seem very 'healthy'. We should keep

[Wikidata-bugs] [Maniphest] [Commented On] T215616: Improve interlingual links across wikis through Wikidata IDs

2019-02-11 Thread EBernhardson
EBernhardson added a comment. I don't know if this meets your needs, but the cirrussearch dumps have the Wikidata IDs broken out. This is the wikibase_item field of the ebernhardson.cirrus2hive table in Hive. Alternatively there are full dumps with each article as a JSON object:

[Wikidata-bugs] [Maniphest] [Commented On] T215616: Improve interlingual links across wikis through Wikidata IDs

2019-02-11 Thread diego
diego added a comment. Looks good @JAllemandou, thanks. This is a good workaround, but imho, we should have a structure or schema that makes this kind of task easier, especially for outside people without access to a cluster.

[Wikidata-bugs] [Maniphest] [Commented On] T215616: Improve interlingual links across wikis through Wikidata IDs

2019-02-11 Thread JAllemandou
JAllemandou added a comment. @diego : This has worked for me (takes some time to compute and needs a bunch of resources). I hope it's close enough to what you want :) : spark.sql("SET spark.sql.shuffle.partitions=512") val wikidataParquetPath =