Hi gnosygnu! The JSON in the XML dumps is the raw contents of the storage backend. It can't be changed retroactively, and re-encoding everything on the fly would be too expensive. Also, the JSON embedded in the XML files is not officially supported as a stable interface of Wikibase. The JSON format in the XML files can change without notice, and you may encounter different representations even within the same dump.
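
As a rough sketch of what this means for code that reads the XML dumps anyway: a reader should not assume that a snak carries a "datatype" key, and can fill it in from a separate lookup when it is missing. The names `fill_datatype` and `property_datatypes` are hypothetical, and the mapping from property ID to datatype is assumed to have been built elsewhere (for example from the property entities themselves). The sample snak is the one from the excerpt quoted in this thread.

```python
def fill_datatype(snak, property_datatypes):
    """Return a copy of `snak` that has a "datatype" key when one can be determined.

    `property_datatypes` is an assumed, externally built mapping such as
    {"P41": "commonsMedia"}; it is not something the XML dump provides directly.
    """
    result = dict(snak)  # shallow copy, the input snak is left untouched
    if "datatype" in result:
        # Already present (e.g. canonical JSON): keep it as-is.
        return result
    datatype = property_datatypes.get(result.get("property"))
    if datatype is not None:
        result["datatype"] = datatype
    return result


# Snak as it appears in the XML dump excerpt: no "datatype" key.
xml_snak = {
    "snaktype": "value",
    "property": "P41",
    "hash": "a3bd1e026c51f5e0bdf30b2323a7a1fb913c9863",
    "datavalue": {"value": "Flag of Italy.svg", "type": "string"},
}

completed = fill_datatype(xml_snak, {"P41": "commonsMedia"})
print(completed["datatype"])  # commonsMedia
```

Snaks whose property is not in the mapping are returned without a "datatype" key, so callers still need to handle that case.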

I recommend using the JSON dumps; they contain our data in canonical form. To avoid downloading redundant information, you can use one of the wikidatawiki-20161120-stub-* dumps instead of the full page dumps. These don't contain the actual page content, just metadata.

Caveat: there is currently no dump that contains the JSON of old revisions of entities in canonical form. You can only get them individually from Special:EntityData, e.g.
<https://www.wikidata.org/wiki/Special:EntityData/Q23.json?oldid=30279>

HTH
-- daniel

On 26.11.2016 at 02:13, gnosygnu wrote:
> Hi everyone. I have a question about the Wikidata xml dump, but I'm
> posting this question here, because it looks more related to Wikidata.
>
> In short, it seems that the "pages-articles.xml" does not include the
> datatype property for snaks. For example, the xml dump does not list a
> datatype for Q38 (Italy) and P41 (flag image). In contrast, the json
> dump does list a datatype of "commonsMedia".
>
> Can this datatype property be included in future xml dumps? The
> alternative would be to download two large and redundant dumps (xml
> and json) in order to reconstruct a Wikidata instance.
>
> More information is provided below the break. Let me know if you need
> anything else.
>
> Thanks.
>
> ----
>
> Here's an excerpt from the xml data dump for Q38 (Italy) and P41 (flag
> image). Notice that there is no "datatype" property
> // https://dumps.wikimedia.org/wikidatawiki/20161120/wikidatawiki-20161120-pages-articles.xml.bz2
> "mainsnak": {
>   "snaktype": "value",
>   "property": "P41",
>   "hash": "a3bd1e026c51f5e0bdf30b2323a7a1fb913c9863",
>   "datavalue": {
>     "value": "Flag of Italy.svg",
>     "type": "string"
>   }
> },
>
> Meanwhile, the API and the JSON dump lists a datatype property of
> "commonsMedia":
> // https://www.wikidata.org/w/api.php?action=wbgetentities&ids=q38
> // https://dumps.wikimedia.org/wikidatawiki/entities/20161114/wikidata-20161114-all.json.bz2
> "P41": [{
>   "mainsnak": {
>     "snaktype": "value",
>     "property": "P41",
>     "datavalue": {
>       "value": "Flag of Italy.svg",
>       "type": "string"
>     },
>     "datatype": "commonsMedia"
>   },
>
> As far as I can tell, the Turtle (ttl) dump does not list a datatype
> property either, but this may be because I don't understand its
> format.
> wd:Q38 p:P41 wds:q38-574446A6-FD05-47AE-86E3-AA745993B65D .
> wds:q38-574446A6-FD05-47AE-86E3-AA745993B65D a wikibase:Statement,
>     wikibase:BestRank ;
>   wikibase:rank wikibase:NormalRank ;
>   ps:P41 <http://commons.wikimedia.org/wiki/Special:FilePath/Flag%20of%20Italy.svg> ;
>   pq:P580 "1946-06-19T00:00:00Z"^^xsd:dateTime ;
>   pqv:P580 wdv:204e90b1bce9f96d6d4ff632a8da0ecc .
>
> _______________________________________________
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata
>

--
Daniel Kinzler
Senior Software Developer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.