[Wikitech-l] Re: [Wiki-research-l] Wikimedia Enterprise HTML dumps available for public download

2022-01-02 Thread Mitar
Hi! Thank you for the reply. I made the following tasks:
https://phabricator.wikimedia.org/T298436
https://phabricator.wikimedia.org/T298437
Mitar
On Sat, Jan 1, 2022 at 6:07 PM Ariel Glenn WMF wrote:
>
> Hello Mitar! I'm glad you are finding the Wikimedia Enterprise dumps useful.
>
> For

[Wikitech-l] Re: [Wiki-research-l] Wikimedia Enterprise HTML dumps available for public download

2022-01-01 Thread Ariel Glenn WMF
Hello Mitar! I'm glad you are finding the Wikimedia Enterprise dumps useful. For your tar.gz question, this is the format that the Wikimedia Enterprise dataset consumers prefer, from what I understand. But I would suggest that if you are interested in other formats, you might open a task on

[Wikitech-l] Re: [Wiki-research-l] Wikimedia Enterprise HTML dumps available for public download

2022-01-01 Thread Mitar
Hi! Awesome! Is there any reason they are tar.gz files of one file and not simply bzip2 of the file contents? Wikidata dumps are bzip2 of one json and that allows parallel decompression. Having both tar (why tar of one file at all?) and gz in there really requires one to first decompress the
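For readers weighing the format question above, here is a minimal sketch of reading a single-member tar.gz as a stream in Python, without unpacking it to disk first. It assumes the archive member holds one JSON object per line, which the thread does not state; the function name `iter_json_lines` is hypothetical. It does not address the parallel-decompression point, since gzip is still decoded sequentially here.

```python
import io
import json
import tarfile


def iter_json_lines(fileobj):
    """Yield parsed JSON objects from each regular file in a gzip-compressed tar stream.

    Assumption: every member is newline-delimited JSON (one object per line).
    """
    # Mode "r|gz" reads the archive strictly sequentially, so the input
    # does not need to be seekable and nothing is written to disk.
    with tarfile.open(fileobj=fileobj, mode="r|gz") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories and other non-file entries
            extracted = archive.extractfile(member)
            for raw_line in extracted:
                if raw_line.strip():  # ignore blank lines
                    yield json.loads(raw_line)
```

The generator can be fed any binary file object, for example one opened with `open(path, "rb")` on a downloaded dump, and consumed one record at a time.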

[Wikitech-l] Re: [Wiki-research-l] Wikimedia Enterprise HTML dumps available for public download

2021-10-19 Thread Andrew Otto
Wow very cool!
On Tue, Oct 19, 2021 at 10:57 AM Ariel Glenn WMF wrote:
> I am pleased to announce that Wikimedia Enterprise's HTML dumps [1] for
> October 17-18th are available for public download; see
> https://dumps.wikimedia.org/other/enterprise_html/ for more information.
> We expect to