https://bugzilla.wikimedia.org/show_bug.cgi?id=54369

       Web browser: ---
            Bug ID: 54369
           Summary: Set up generation of JSON dumps for wikidata.org
           Product: Datasets
           Version: unspecified
          Hardware: All
                OS: All
            Status: NEW
          Severity: normal
          Priority: Unprioritized
         Component: General/Unknown
          Assignee: ar...@wikimedia.org
          Reporter: daniel.kinz...@wikimedia.de
    Classification: Unclassified
   Mobile Platform: ---

We would like to make the contents of wikidata.org available as a dump using
our canonical JSON format. The maintenance script for doing this is 

  extensions/Wikibase/repo/maintenance/dumpJson.php

This will send a JSON serialization of all data entities to standard output, so
I suppose the output would best be piped through bzip2.
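
As a rough illustration of the piping step, here is a minimal Python sketch
that runs a command and compresses its standard output on the fly. The script
path is the one given above; invoking it through the `php` command line binary
with no further arguments is an assumption, and the helper names
(`compress_stream`, `run_dump`) are made up for this example.

```python
import bz2
import shutil
import subprocess

# Path taken from the report; running it via the `php` CLI is an assumption.
DUMP_CMD = ["php", "extensions/Wikibase/repo/maintenance/dumpJson.php"]


def compress_stream(src, dest_path, chunk_size=1 << 20):
    """Copy a binary stream into a bzip2-compressed file, chunk by chunk."""
    with bz2.open(dest_path, "wb") as out:
        shutil.copyfileobj(src, out, chunk_size)


def run_dump(dest_path, cmd=DUMP_CMD):
    """Run the dump command and compress its standard output as it arrives."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE)
    try:
        compress_stream(proc.stdout, dest_path)
    finally:
        proc.stdout.close()
        returncode = proc.wait()
    # A non-zero exit status means the compressed file is probably incomplete.
    if returncode != 0:
        raise RuntimeError(f"dump command exited with status {returncode}")
```

Because the data is copied in fixed-size chunks, the memory used by this
wrapper does not grow with the size of the dump; a shell pipe into bzip2 would
achieve the same thing.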

This should work as-is, but there are several things that we should look out
for or try out:

* I don't know how long it will take to make a complete dump. I expect that
it'll be roughly the same as making an XML dump of the current revisions.
* I don't know how much RAM is required. Currently, all the IDs of the entities
to output will be loaded into memory (by virtue of how the MySQL client library
works) - that's a few tens of millions of rows. As a guess, 1GB should be enough.
* We may have to make the script more resilient to sporadic failures,
especially since a failure would currently mean restarting the dump. 
* Perhaps sharding would be useful: the script supports --sharding-factor and
--shard to control how many shards there should be, and which shard the script
should process. Combining the output files is not as seamless as it could be,
though (it involves chopping off lines at the beginning and the end of files).
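
To illustrate what combining shard outputs might look like, here is a minimal
Python sketch. It assumes, without having checked the script's actual output,
that each shard file is a JSON array with `[` alone on the first line, `]`
alone on the last line, and one entity per line separated by trailing commas.
The function names are hypothetical.

```python
def iter_entities(lines):
    """Yield the entity lines of one shard, skipping the array brackets.

    Assumed layout: '[' and ']' on their own lines, one entity per line,
    with a trailing comma after every entity except the last.
    """
    for line in lines:
        line = line.strip()
        if line in ("", "[", "]"):
            continue
        # An entity is a JSON object ending in '}', so a trailing comma
        # can only be the array separator.
        yield line.rstrip(",")


def merge_shards(shards, out):
    """Write the entities of all shards to `out` as a single JSON array."""
    out.write("[\n")
    first = True
    for shard in shards:
        for entity in iter_entities(shard):
            if not first:
                out.write(",\n")
            out.write(entity)
            first = False
    out.write("\n]\n")
```

Each shard is read line by line, so the merge does not need to hold a whole
shard in memory. Entities appear in the merged file shard by shard, not in
global ID order.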

-- 
You are receiving this mail because:
You are on the CC list for the bug.
_______________________________________________
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l
