Hello Mitar! I'm glad you are finding the Wikimedia Enterprise dumps useful.
For your tar.gz question, this is the format that Wikimedia Enterprise's dataset consumers prefer, as I understand it. But if you are interested in other formats, I would suggest opening a task on Phabricator with a feature request and adding the Wikimedia Enterprise project tag ( https://phabricator.wikimedia.org/project/view/4929/ ).

As to the API, I'm only familiar with the endpoints for bulk download, so you'll want to ask the Wikimedia Enterprise folks, or have a look at their API documentation here: https://www.mediawiki.org/wiki/Wikimedia_Enterprise/Documentation

Ariel

On Sat, Jan 1, 2022 at 4:30 PM Mitar <mmi...@gmail.com> wrote:
> Hi!
>
> Awesome!
>
> Is there any reason these are tar.gz files of a single file, rather than
> simply bzip2 of the file contents? Wikidata dumps are bzip2 of one JSON
> file, and that allows parallel decompression. Having both tar (why tar of
> one file at all?) and gz in there really requires one to decompress the
> whole thing before being able to process it in parallel. Is there some
> other way I am missing?
>
> Wikipedia dumps are done with multistream bzip2 plus an additional
> index file. That could be nice here too: one could consult the index
> file and then immediately jump to the JSON line for the corresponding
> article.
>
> Also, is there an API endpoint or Special page which can return the
> same JSON for a single Wikipedia page? The JSON structure looks very
> useful by itself (e.g., not in bulk).
>
>
> Mitar
>
>
> On Tue, Oct 19, 2021 at 4:57 PM Ariel Glenn WMF <ar...@wikimedia.org> wrote:
> >
> > I am pleased to announce that Wikimedia Enterprise's HTML dumps [1] for
> > October 17-18th are available for public download; see
> > https://dumps.wikimedia.org/other/enterprise_html/ for more information.
> > We expect to make updated versions of these files available around the
> > 1st/2nd of the month and the 20th/21st of the month, following the
> > cadence of the standard SQL/XML dumps.
> >
> > This is still an experimental service, so there may be hiccups from
> > time to time. Please be patient and report issues as you find them.
> > Thanks!
> >
> > Ariel "Dumps Wrangler" Glenn
> >
> > [1] See https://www.mediawiki.org/wiki/Wikimedia_Enterprise for much
> > more about Wikimedia Enterprise and its API.
> > _______________________________________________
> > Wiki-research-l mailing list -- wiki-researc...@lists.wikimedia.org
> > To unsubscribe send an email to wiki-research-l-le...@lists.wikimedia.org
>
> --
> http://mitar.tnode.com/
> https://twitter.com/mitar_m
> _______________________________________________
> Wikitech-l mailing list -- wikitech-l@lists.wikimedia.org
> To unsubscribe send an email to wikitech-l-le...@lists.wikimedia.org
> https://lists.wikimedia.org/postorius/lists/wikitech-l.lists.wikimedia.org/
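P.S. The multistream approach Mitar describes is worth illustrating. In the existing pages-articles-multistream dumps, the data file is a concatenation of independent bzip2 streams, and a separate index maps each page to the byte offset where its stream starts, so a reader can seek straight there and decompress only that stream. A minimal Python sketch of the seek-and-decompress step, using a toy two-stream file in place of a real dump (the file name and contents here are illustrative only, not the Enterprise format):

```python
# Minimal sketch of the multistream technique: the dump file is several
# independent bzip2 streams concatenated together, and an index file (in the
# Wikipedia dumps, lines of "offset:page_id:title") tells you at which byte
# offset the stream containing a given page starts. The toy file built below
# stands in for a real dump; names and contents are illustrative only.
import bz2

def read_stream(path, offset):
    """Seek to `offset` and decompress just the one bzip2 stream found there."""
    decomp = bz2.BZ2Decompressor()
    out = bytearray()
    with open(path, "rb") as f:
        f.seek(offset)
        while not decomp.eof:
            chunk = f.read(65536)
            if not chunk:
                break
            out.extend(decomp.decompress(chunk))
    return bytes(out)

# Build a toy "dump" of two independent streams, one JSON line each.
stream1 = bz2.compress(b'{"title": "First article"}\n')
stream2 = bz2.compress(b'{"title": "Second article"}\n')
with open("toy-multistream.bz2", "wb") as f:
    f.write(stream1)
    f.write(stream2)

# An index would record that the second stream starts at len(stream1);
# with that offset we decompress only the stream we want.
print(read_stream("toy-multistream.bz2", len(stream1)))
# -> b'{"title": "Second article"}\n'
```

With a real multistream dump one would look the page up in the index file to get the offset rather than computing it, but the seek-and-decompress step is the same, and the independent streams are what make parallel decompression possible.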
_______________________________________________
Wikitech-l mailing list -- wikitech-l@lists.wikimedia.org
To unsubscribe send an email to wikitech-l-le...@lists.wikimedia.org
https://lists.wikimedia.org/postorius/lists/wikitech-l.lists.wikimedia.org/