[Wikidata-bugs] [Maniphest] [Commented On] T191639: Wikidata JSON dumps do not have the 'ns' (namespace)
marcmiquel added a comment. I assumed that Categories and Wikipedia: pages coming from language editions would maintain the ns from their origin wiki. We can close this. Now it is clear. Thanks, Addshore. TASK DETAIL https://phabricator.wikimedia.org/T191639 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: marcmiquel Cc: Chicocvenancio, Addshore, marcmiquel, jannee_e, darthmon_wmde, Nandana, Lahi, Gq86, GoranSMilovanovic, Lunewa, QZanden, LawExplorer, _jensen, rosalieper, Scott_WUaS, gnosygnu, Wikidata-bugs, aude, Svick, Mbch331, jeremyb ___ Wikidata-bugs mailing list Wikidata-bugs@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs
[Wikidata-bugs] [Maniphest] [Commented On] T191639: Wikidata JSON dumps do not have the 'ns' (namespace)
Addshore added a comment. ns is not part of the data model for Wikidata / Wikibase entities, which is why it does not appear in the JSON dumps. The docs on that page don't mention ns other than in the example, which comes from the API and does include the ns. All Items on wikidata.org are in ns 0.
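A minimal sketch of how a consumer could cope with the point above, assuming entities arrive as already-parsed JSON objects from either the API (which may carry "ns") or the dump (which does not). The helper name entity_namespace is hypothetical; the only fact it relies on is the statement in this thread that all Items on wikidata.org are in ns 0.

```python
from typing import Optional


def entity_namespace(entity: dict) -> Optional[int]:
    """Return the namespace of an entity, or None if it cannot be told.

    Hypothetical helper: API serializations may include "ns", dump
    entities do not. Per this thread, Items are always in namespace 0,
    so that value is assumed for items lacking the field.
    """
    if "ns" in entity:
        return entity["ns"]
    if entity.get("type") == "item":
        return 0  # assumption taken from the thread: Items live in ns 0
    return None  # other entity types: not derivable from the dump alone


# An entity shaped like a dump entry (no "ns" key).
print(entity_namespace({"type": "item", "id": "Q42"}))
```

Returning None for non-item entities avoids guessing a namespace that the dump does not state.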
[Wikidata-bugs] [Maniphest] [Commented On] T191639: Wikidata JSON dumps do not have the 'ns' (namespace)
marcmiquel added a comment. The wikidata dump still does not include the namespace tag. It is specified in the JSON DataModel and it would be useful for the same use I explained in this task. https://www.mediawiki.org/wiki/Wikibase/DataModel/JSON Could you give me an update on this?
[Wikidata-bugs] [Maniphest] [Commented On] T191639: Wikidata JSON dumps do not have the 'ns' (namespace)
marcmiquel added a comment. I need all the Wikidata qitems that relate to Wikipedia articles. If I understand it correctly, these are qitems that have namespace 0, although not all qitems with namespace 0 necessarily have sitelinks (they could be just qitems without an article). The thing is that I'm not sure all Wikidata qitems are in the main namespace (0). Let me explain what I did.

Since I cannot use a namespace tag in the dump to parse only namespace 0 and skip the rest, I used the Wikidata MySQL replica database instead. I ran:

    select count(page_namespace), page_namespace from page group by page_namespace order by 1 desc;

This is the result:

+-----------------------+----------------+
| count(page_namespace) | page_namespace |
+-----------------------+----------------+
|              56986053 |              0 |
|                152250 |           1198 |
|                 45022 |              3 |
|                 42573 |            146 |
|                 36204 |              4 |
|                 33320 |              2 |
|                 16541 |              1 |
|                 10874 |           2600 |
|                  7464 |             10 |
|                  7371 |              5 |
|                  5940 |            121 |
|                  5887 |            120 |
|                  3675 |             14 |
|                  3032 |              8 |
|                  1800 |             12 |
|                   462 |            828 |
|                   298 |              9 |
|                   193 |             11 |
|                   131 |             13 |
|                    66 |            829 |
|                    62 |            147 |
|                    14 |             15 |
|                     3 |              7 |
|                     3 |           1199 |
+-----------------------+----------------+

So it seems that there are many pages with namespace 1198, 146, 2600... besides 3, 4, 2, 1, which are user talk, project, user page and talk page. I don't know how many of these are in the dump, but I only need those in namespace 0.

The solution I found is to retrieve all the qitems with namespace 0 from the Wikidata replica MySQL database and store them in a database. I then query this database while parsing and skip any qitem that was not previously inserted. This way the parsing is shorter. Do you think there is any other way to do it? Thanks.
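One possible alternative to the replica lookup, sketched here under assumptions: filter on the sitelinks carried by each entity rather than on a namespace. The sketch assumes the JSON dump layout of one entity per line inside a single JSON array (so lines end with a comma and the first and last lines are "[" and "]"). The function names and the NON_WIKIPEDIA_SITES set are hypothetical and illustrative; the "ends in wiki" test is a heuristic, and a Wikipedia sitelink may still point at a non-article page such as a category.

```python
import json
from typing import Iterable, Iterator

# Site ids that end in "wiki" but are not Wikipedia language editions.
# Illustrative assumption only; this set is not exhaustive.
NON_WIKIPEDIA_SITES = {
    "commonswiki", "specieswiki", "metawiki",
    "mediawikiwiki", "wikidatawiki", "sourceswiki",
}


def is_wikipedia_site(site_id: str) -> bool:
    """Heuristic: treat '<lang>wiki' site ids as Wikipedia editions."""
    return site_id.endswith("wiki") and site_id not in NON_WIKIPEDIA_SITES


def items_with_wikipedia_sitelinks(lines: Iterable[str]) -> Iterator[dict]:
    """Yield item entities that have at least one Wikipedia sitelink."""
    for line in lines:
        line = line.strip().rstrip(",")  # drop the array separator
        if line in ("", "[", "]"):       # skip the array brackets
            continue
        entity = json.loads(line)
        if entity.get("type") != "item":
            continue                     # e.g. properties
        sitelinks = entity.get("sitelinks", {})
        if any(is_wikipedia_site(site) for site in sitelinks):
            yield entity


# Tiny in-memory stand-in for a dump file.
sample = [
    "[",
    '{"type":"item","id":"Q1","sitelinks":{"enwiki":{"site":"enwiki","title":"Universe"}}},',
    '{"type":"item","id":"Q2","sitelinks":{}},',
    '{"type":"property","id":"P31"},',
    '{"type":"item","id":"Q3","sitelinks":{"commonswiki":{"site":"commonswiki","title":"X"}}}',
    "]",
]
kept_ids = [entity["id"] for entity in items_with_wikipedia_sitelinks(sample)]
print(kept_ids)
```

Because the function takes any iterable of lines, a compressed dump could be streamed through it from a file object opened in text mode, so the whole file never has to be held in memory.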
[Wikidata-bugs] [Maniphest] [Commented On] T191639: Wikidata JSON dumps do not have the 'ns' (namespace)
Addshore added a comment.

In T191639#4937630, @marcmiquel wrote:
> The use case is to process the dumps and filter out qitems which do not relate to articles, this is why we put NS0.

That sounds like you are referring to the namespace of the sitelinks of the entity? On wikidata.org all "qitems" are in the main namespace, which is namespace 0. The sitelinks held within those items can be in any number of different namespaces on the Wikidata clients.

> The JSON dump sample says there is ns field but in the final dump there is no such field.

This is the namespace of the item itself, not of the sitelinks. Could you link to those docs? It could be that they are only meant for the API serialization.
[Wikidata-bugs] [Maniphest] [Commented On] T191639: Wikidata JSON dumps do not have the 'ns' (namespace)
marcmiquel added a comment. The use case is to process the dumps and filter out qitems which do not relate to articles; this is why we put NS0. The JSON dump sample says there is an ns field, but in the final dump there is no such field.
[Wikidata-bugs] [Maniphest] [Commented On] T191639: Wikidata JSON dumps do not have the 'ns' (namespace)
Addshore added a comment. Hmm, what's the use case here? Is this for Wikidata dumps? Right now Items being in NS 0 is a pretty safe assumption; they don't appear anywhere else.