Hello Mitar! I'm glad you are finding the Wikimedia Enterprise dumps useful.
For your tar.gz question, this is the format that Wikimedia Enterprise's dataset consumers prefer, as I understand it. But if you are interested in other formats, I would suggest opening a task on Phabricator with a feature request and adding the Wikimedia Enterprise project tag ( https://phabricator.wikimedia.org/project/view/4929/ ).

As to the API, I'm only familiar with the endpoints for bulk download, so you'll want to ask the Wikimedia Enterprise folks, or have a look at their API documentation here: https://www.mediawiki.org/wiki/Wikimedia_Enterprise/Documentation

Ariel

On Sat, Jan 1, 2022 at 4:30 PM Mitar <mmi...@gmail.com> wrote:
> Hi!
>
> Awesome!
>
> Is there any reason these are tar.gz files of a single file, rather than
> simply bzip2 of the file contents? Wikidata dumps are bzip2 of one JSON
> file, and that allows parallel decompression. Having both tar (why tar of
> one file at all?) and gz in there really requires one to decompress the
> whole thing before being able to process it in parallel. Is there some
> other way I am missing?
>
> Wikipedia dumps are done with multistream bzip2 plus an additional
> index file. That could be nice here too: one could consult the index
> file and then immediately jump to the JSON line for the corresponding
> article.
>
> Also, is there an API endpoint or Special page which can return the
> same JSON for a single Wikipedia page? The JSON structure looks very
> useful by itself (e.g., not in bulk).
>
>
> Mitar
>
>
> On Tue, Oct 19, 2021 at 4:57 PM Ariel Glenn WMF <ar...@wikimedia.org> wrote:
> >
> > I am pleased to announce that Wikimedia Enterprise's HTML dumps [1] for
> > October 17-18th are available for public download; see
> > https://dumps.wikimedia.org/other/enterprise_html/ for more information.
> > We expect to make updated versions of these files available around the
> > 1st/2nd of the month and the 20th/21st of the month, following the
> > cadence of the standard SQL/XML dumps.
> >
> > This is still an experimental service, so there may be hiccups from
> > time to time. Please be patient and report issues as you find them.
> > Thanks!
> >
> > Ariel "Dumps Wrangler" Glenn
> >
> > [1] See https://www.mediawiki.org/wiki/Wikimedia_Enterprise for much
> > more about Wikimedia Enterprise and its API.
> > _______________________________________________
> > Wiki-research-l mailing list -- wiki-researc...@lists.wikimedia.org
> > To unsubscribe send an email to wiki-research-l-le...@lists.wikimedia.org
>
> --
> http://mitar.tnode.com/
> https://twitter.com/mitar_m
> _______________________________________________
> Wikitech-l mailing list -- wikitech-l@lists.wikimedia.org
> To unsubscribe send an email to wikitech-l-le...@lists.wikimedia.org
> https://lists.wikimedia.org/postorius/lists/wikitech-l.lists.wikimedia.org/
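P.S. The multistream approach Mitar describes is worth illustrating. In the existing pages-articles-multistream dumps, the data file is a concatenation of independent bzip2 streams, and a separate index maps each page to the byte offset where its stream starts, so a reader can seek straight there and decompress only that stream. A minimal Python sketch of the seek-and-decompress step, using a toy two-stream file in place of a real dump (the file name and contents here are illustrative only, not the Enterprise format):

```python
# Minimal sketch of the multistream technique: the dump file is several
# independent bzip2 streams concatenated together, and an index file (in the
# Wikipedia dumps, lines of "offset:page_id:title") tells you at which byte
# offset the stream containing a given page starts. The toy file built below
# stands in for a real dump; names and contents are illustrative only.
import bz2

def read_stream(path, offset):
    """Seek to `offset` and decompress just the one bzip2 stream found there."""
    decomp = bz2.BZ2Decompressor()
    out = bytearray()
    with open(path, "rb") as f:
        f.seek(offset)
        while not decomp.eof:
            chunk = f.read(65536)
            if not chunk:
                break
            out.extend(decomp.decompress(chunk))
    return bytes(out)

# Build a toy "dump" of two independent streams, one JSON line each.
stream1 = bz2.compress(b'{"title": "First article"}\n')
stream2 = bz2.compress(b'{"title": "Second article"}\n')
with open("toy-multistream.bz2", "wb") as f:
    f.write(stream1)
    f.write(stream2)

# An index would record that the second stream starts at len(stream1);
# with that offset we decompress only the stream we want.
print(read_stream("toy-multistream.bz2", len(stream1)))
# -> b'{"title": "Second article"}\n'
```

With a real multistream dump one would look the page up in the index file to get the offset rather than computing it, but the seek-and-decompress step is the same, and the independent streams are what make parallel decompression possible.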
_______________________________________________
Wikitech-l mailing list -- wikitech-l@lists.wikimedia.org
To unsubscribe send an email to wikitech-l-le...@lists.wikimedia.org
https://lists.wikimedia.org/postorius/lists/wikitech-l.lists.wikimedia.org/