mforns added a comment.
@diego Hi! Is there anything additional for us Analytics here? Thanks
TASK DETAIL
https://phabricator.wikimedia.org/T215616
EMAIL PREFERENCES
https://phabricator.wikimedia.org/settings/panel/emailpreferences/
To: mforns
Cc: mforns, Marostegui, Isaac, Tbayer,
Isaac added a comment.
Hey @JAllemandou - this is great! thanks for catching that - looks all good
to me now too.
JAllemandou added a comment.
Hi @Isaac
Sorry for the issue. I corrected the query above (last query, join criteria:
`AND ws.sitelink.title = title_namespace_localized` --> `AND
REPLACE(ws.sitelink.title, ' ', '_') = title_namespace_localized`).
We were not joining correctly on title, as sitelink titles use spaces while
localized page titles use underscores.
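The mismatch behind the fix can be illustrated outside Spark. This is a minimal Python sketch (the function name is hypothetical); it mirrors what the corrected SQL `REPLACE(ws.sitelink.title, ' ', '_')` does:

```python
def normalize_title(sitelink_title: str) -> str:
    """Sitelink titles in the Wikidata dump use spaces; MediaWiki
    page titles (title_namespace_localized) use underscores.
    Mirrors the SQL fix: REPLACE(ws.sitelink.title, ' ', '_')."""
    return sitelink_title.replace(" ", "_")

# Before the fix, the raw values never match, so the join drops rows:
assert "Gregor Grillemeier" != "Gregor_Grillemeier"
# After normalization they do:
assert normalize_title("Gregor Grillemeier") == "Gregor_Grillemeier"
```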
Isaac added a comment.
Hey @JAllemandou, some debugging: a number of items aren't showing up and I
can't for the life of me figure out why. The few I've looked at are pretty normal
articles (for example: https://de.wikipedia.org/wiki/Gregor_Grillemeier) and
show up in the original parquet files
JAllemandou added a comment.
We're on the same page @diego :)
I can precompute the table described in ii) if needed, and will surely do it
once we have the wikidata dump productionized - let me know if you need it
before then.
diego added a comment.
I think we are talking about three different things:
i) page_id -> CurrentWikidataItem: this was my original request, and I think
@JAllemandou's script solves this issue. Having that table updated would be
great.
ii) revision_id-> CurrentWikidataItem: This can
JAllemandou added a comment.
Thanks @Isaac for reformulating the question I tried to explain above :)
@diego: Can you confirm there is value for you in having revisions tied to wikidata-items regardless of when the link happened?
Isaac added a comment.
@diego: my interpretation is that right now in the revision history version, the same wikidb/page ID/title is associated with the same wikidata ID regardless of when the revision occurred. What is the use for that over a table that has just one entry per wikidb/page
diego added a comment.
@JAllemandou , yes. Having this by revision would be great!
Isaac added a comment.
thank you @JAllemandou this is awesome!!! completely unblocks me (i have a bunch of page titles across all the wikipedias and need to check whether a pair of them match the same wikidata item)!
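That use case can be sketched in plain Python. This is a hypothetical illustration, assuming the item_page_link parquet output has already been loaded into a `(wiki_db, page_title) -> item_id` mapping (all names and sample values below are made up):

```python
# Hypothetical mapping loaded from the item_page_link parquet output:
# (wiki_db, page_title) -> wikidata item id
page_to_item = {
    ("enwiki", "Berlin"): "Q64",
    ("dewiki", "Berlin"): "Q64",
    ("enwiki", "Paris"): "Q90",
}

def same_item(a, b):
    """True if both (wiki_db, page_title) pairs resolve to the same item."""
    ia, ib = page_to_item.get(a), page_to_item.get(b)
    return ia is not None and ia == ib

assert same_item(("enwiki", "Berlin"), ("dewiki", "Berlin"))
assert not same_item(("enwiki", "Berlin"), ("enwiki", "Paris"))
```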
JAllemandou added a comment.
Hi @Isaac, I have generated some parquet data here /user/joal/wmf/data/wmf/wikidata/item_page_link/20190204 with the following query:
spark.sql("SET spark.sql.shuffle.partitions=128")
val wikidataParquetPath =
diego added a comment.
@Tbayer , great. Thanks.
diego added a comment.
@jcrespo, the API works well for querying specific pages/entities, but not, for example, for finding which pages that exist on X_wiki are missing on Y_wiki.
My point here is that the wikidata identifier is currently the main identifier for a page/concept, and that this fact
jcrespo added a comment.
diego added a project: DBA.
I don't understand what the actionable item here is for us. Without context, I would say that you can get this by:
(1) querying the wb_items_per_site table on wikidatawiki in MariaDB, or
(2) through the sitelinks in the Wikidata JSON dumps.
That is not accurate,
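For option (2), each entity in the Wikidata JSON dump carries its sitelinks. A minimal Python sketch of reading them (the entity below is a hand-made sample in the dump's shape, not real dump data):

```python
import json

# Hand-made sample in the shape of a Wikidata JSON dump entity
entity_json = '''{
  "id": "Q64",
  "sitelinks": {
    "enwiki": {"site": "enwiki", "title": "Berlin"},
    "dewiki": {"site": "dewiki", "title": "Berlin"}
  }
}'''

entity = json.loads(entity_json)
# Map each wiki to (page title, item id)
links = {s["site"]: (s["title"], entity["id"])
         for s in entity["sitelinks"].values()}
assert links["enwiki"] == ("Berlin", "Q64")
```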
EBernhardson added a comment.
In T215616#4944986, @diego wrote:
@EBernhardson , this looks exactly what I was looking for, initially. Thank you very much for that.
However, I won't close this task, because wikibase_item is still missing the page_id information. Joining by page_title does not
diego added a comment.
@EBernhardson , this looks exactly what I was looking for, initially. Thank you very much for that.
However, I won't close this task, because wikibase_item is still missing the page_id information. Joining by page_title does not seem very 'healthy'. We should keep
EBernhardson added a comment.
I don't know if this meets your needs, but the cirrussearch dumps have the wikidata IDs broken out. This is the wikibase_item field of the ebernhardson.cirrus2hive table in hive. Alternatively there are full dumps with each article as a json object:
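A hedged sketch of pulling `wikibase_item` out of one such per-article JSON object (the document below is a made-up sample; real cirrus dump documents carry many more fields, and, as noted later in the thread, not the page_id):

```python
import json

# Made-up sample line in the shape of a cirrus dump document
doc_json = '{"title": "Berlin", "wikibase_item": "Q64"}'
doc = json.loads(doc_json)

# Not every page has a linked item, so use .get() rather than indexing
item = doc.get("wikibase_item")  # None for pages without a linked item
assert item == "Q64"
```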
diego added a comment.
Looks good @JAllemandou, thanks.
This is a good workaround, but imho, we should have a structure or schema that makes these kinds of tasks easier, especially for people without access to a cluster.
JAllemandou added a comment.
@diego :
This has worked for me (takes some time to compute and needs a bunch of resources). I hope it's close enough to what you want :) :
spark.sql("SET spark.sql.shuffle.partitions=512")
val wikidataParquetPath =