[Wikitech-l] ANN: A Go package providing utilities for processing Wikipedia and Wikidata dumps

2022-01-09 Thread Mitar
Hi!

I just published the first version of a Go package which provides
utilities for processing Wikidata entity JSON dumps and Wikimedia
Enterprise HTML dumps. It processes them in parallel across multiple
cores, so processing is rather fast. I hope it will be useful to
others, too.

https://gitlab.com/tozd/go/mediawiki
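
To give a taste of what the package automates, here is a minimal
hand-rolled sketch of the same parallel pattern using only the standard
library (the package's actual API differs, and the filename is just an
example; see the documentation for details):

  package main

  import (
      "bufio"
      "bytes"
      "compress/bzip2"
      "encoding/json"
      "fmt"
      "os"
      "runtime"
      "sync"
  )

  func main() {
      f, err := os.Open("wikidata-20220103-all.json.bz2")
      if err != nil {
          panic(err)
      }
      defer f.Close()

      lines := make(chan []byte, 1000)
      var wg sync.WaitGroup

      // Fan JSON decoding out to one worker per core (decompression
      // itself stays sequential in this sketch).
      for i := 0; i < runtime.NumCPU(); i++ {
          wg.Add(1)
          go func() {
              defer wg.Done()
              for line := range lines {
                  var entity struct {
                      ID string `json:"id"`
                  }
                  if err := json.Unmarshal(line, &entity); err != nil {
                      continue
                  }
                  fmt.Println(entity.ID) // do real work here instead
              }
          }()
      }

      // The dump is one big JSON array with one entity per line, so
      // skip the brackets and strip the trailing commas.
      scanner := bufio.NewScanner(bzip2.NewReader(f))
      scanner.Buffer(nil, 64*1024*1024) // some entities are huge
      for scanner.Scan() {
          line := bytes.TrimSuffix(bytes.TrimSpace(scanner.Bytes()), []byte{','})
          if len(line) <= 1 { // "[" or "]"
              continue
          }
          lines <- append([]byte(nil), line...) // copy; Scanner reuses its buffer
      }
      close(lines)
      wg.Wait()
      if err := scanner.Err(); err != nil {
          panic(err)
      }
  }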

Any feedback is welcome.


Mitar

-- 
http://mitar.tnode.com/
https://twitter.com/mitar_m


[Wikitech-l] Re: [Wiki-research-l] Wikimedia Enterprise HTML dumps available for public download

2022-01-02 Thread Mitar
Hi!

Thank you for the reply. I created the following tasks:

https://phabricator.wikimedia.org/T298436
https://phabricator.wikimedia.org/T298437


Mitar

On Sat, Jan 1, 2022 at 6:07 PM Ariel Glenn WMF wrote:
>
> Hello Mitar! I'm glad you are finding the Wikimedia Enterprise dumps useful.
>
> For your tar.gz question, this is the format that the Wikimedia Enterprise
> dataset consumers prefer, from what I understand. But I would suggest that
> if you are interested in other formats, you might open a task on
> Phabricator with a feature request, and add the Wikimedia Enterprise
> project tag (https://phabricator.wikimedia.org/project/view/4929/).
>
> As to the API, I'm only familiar with the endpoints for bulk download, so 
> you'll want to ask the Wikimedia Enterprise folks, or have a look at their 
> API documentation here: 
> https://www.mediawiki.org/wiki/Wikimedia_Enterprise/Documentation
>
> Ariel
>
>
>> On Sat, Jan 1, 2022 at 4:30 PM Mitar wrote:
>>
>> Hi!
>>
>> Awesome!
>>
>> Is there any reason these are tar.gz files of a single file, and not
>> simply the file contents compressed with bzip2? Wikidata dumps are bzip2
>> of one JSON file, which allows parallel decompression. Having both tar
>> (why tar of one file at all?) and gzip in there really requires one to
>> decompress the whole thing before one can process it in parallel. Is
>> there some other way I am missing?
>>
>> Wikipedia dumps are done as multistream bzip2 with an additional index
>> file. That would be nice here too: with an index file one could jump
>> immediately to the JSON line for the corresponding article.
>>
>> Also, is there an API endpoint or Special page which can return the same
>> JSON for a single Wikipedia page? The JSON structure looks very useful by
>> itself (i.e., not in bulk).
>>
>>
>> Mitar
>>
>>
>> On Tue, Oct 19, 2021 at 4:57 PM Ariel Glenn WMF wrote:
>> >
>> > I am pleased to announce that Wikimedia Enterprise's HTML dumps [1] for
>> > October 17-18th are available for public download; see
>> > https://dumps.wikimedia.org/other/enterprise_html/ for more information. We
>> > expect to make updated versions of these files available around the 1st/2nd
>> > of the month and the 20th/21st of the month, following the cadence of the
>> > standard SQL/XML dumps.
>> >
>> > This is still an experimental service, so there may be hiccups from time to
>> > time. Please be patient and report issues as you find them. Thanks!
>> >
>> > Ariel "Dumps Wrangler" Glenn
>> >
>> > [1] See https://www.mediawiki.org/wiki/Wikimedia_Enterprise for much more
>> > about Wikimedia Enterprise and its API.
>>
>>
>>
>> --
>> http://mitar.tnode.com/
>> https://twitter.com/mitar_m
>



-- 
http://mitar.tnode.com/
https://twitter.com/mitar_m


[Wikitech-l] Re: [Wiki-research-l] Wikimedia Enterprise HTML dumps available for public download

2022-01-01 Thread Mitar
Hi!

Awesome!

Is there any reason these are tar.gz files of a single file, and not
simply the file contents compressed with bzip2? Wikidata dumps are bzip2
of one JSON file, which allows parallel decompression. Having both tar
(why tar of one file at all?) and gzip in there really requires one to
decompress the whole thing before one can process it in parallel. Is
there some other way I am missing?
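
To illustrate: with the current layout a consumer has to stream through
both layers from the beginning. Roughly (in Go; the filename is just an
example):

  package main

  import (
      "archive/tar"
      "compress/gzip"
      "io"
      "os"
  )

  func main() {
      f, err := os.Open("enwiki-NS0-20220101-ENTERPRISE-HTML.json.tar.gz")
      if err != nil {
          panic(err)
      }
      defer f.Close()

      // gzip is a single stream: decompression can only start at the
      // beginning, so there is no random access and no parallelism.
      gz, err := gzip.NewReader(f)
      if err != nil {
          panic(err)
      }

      // The tar layer wraps just one file, so skip its one header...
      tr := tar.NewReader(gz)
      if _, err := tr.Next(); err != nil {
          panic(err)
      }

      // ...and only then reach the NDJSON itself, one article per line.
      if _, err := io.Copy(os.Stdout, tr); err != nil {
          panic(err)
      }
  }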

Wikipedia dumps are done as multistream bzip2 with an additional index
file. That would be nice here too: with an index file one could jump
immediately to the JSON line for the corresponding article.
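
For comparison, this is roughly how the index of the existing multistream
dumps can be used today: each block of ~100 pages is a self-contained
bzip2 stream, and the index lines ("offset:page_id:title") give the byte
offsets, so a single block can be decompressed in isolation. A sketch in
Go (the filename and offsets are made up):

  package main

  import (
      "compress/bzip2"
      "fmt"
      "io"
      "os"
  )

  // readBlock decompresses one block of a multistream dump. offset and
  // next are the byte offsets of this block and of the following one,
  // both taken from the index file.
  func readBlock(path string, offset, next int64) ([]byte, error) {
      f, err := os.Open(path)
      if err != nil {
          return nil, err
      }
      defer f.Close()
      // Each block is a complete bzip2 stream, so decompressing just
      // the bytes between the two offsets yields ~100 whole pages.
      section := io.NewSectionReader(f, offset, next-offset)
      return io.ReadAll(bzip2.NewReader(section))
  }

  func main() {
      pages, err := readBlock(
          "enwiki-20220101-pages-articles-multistream.xml.bz2",
          654, 1320903)
      if err != nil {
          panic(err)
      }
      fmt.Printf("%d bytes of XML\n", len(pages))
  }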

Also, is there an API endpoint or Special page which can return the same
JSON for a single Wikipedia page? The JSON structure looks very useful by
itself (i.e., not in bulk).


Mitar


On Tue, Oct 19, 2021 at 4:57 PM Ariel Glenn WMF wrote:
>
> I am pleased to announce that Wikimedia Enterprise's HTML dumps [1] for
> October 17-18th are available for public download; see
> https://dumps.wikimedia.org/other/enterprise_html/ for more information. We
> expect to make updated versions of these files available around the 1st/2nd
> of the month and the 20th/21st of the month, following the cadence of the
> standard SQL/XML dumps.
>
> This is still an experimental service, so there may be hiccups from time to
> time. Please be patient and report issues as you find them. Thanks!
>
> Ariel "Dumps Wrangler" Glenn
>
> [1] See https://www.mediawiki.org/wiki/Wikimedia_Enterprise for much more
> about Wikimedia Enterprise and its API.



--
http://mitar.tnode.com/
https://twitter.com/mitar_m


Re: [Wikitech-l] ANN: Wikimedia recent changes DDP API

2014-12-18 Thread Mitar
Hi!

A new version with a nicer UI/UX is up. Check it out. :-)


Mitar

On Mon, Dec 15, 2014 at 4:02 AM, Ori Livneh o...@wikimedia.org wrote:
> On Sat, Dec 13, 2014 at 11:01 AM, Mitar mmi...@gmail.com wrote:
>>
>> Hi!
>>
>> I made a Meteor DDP API for the stream of recent changes on all
>> Wikimedia wikis. Now you can simply use DDP.connect in your Meteor
>> application to connect to the stream of changes on Wikipedia. :-)
>>
>> http://wikimedia.meteor.com/
>>
>>
>> Mitar
>
>
> This is really cool!



-- 
http://mitar.tnode.com/
https://twitter.com/mitar_m


[Wikitech-l] ANN: Wikimedia recent changes DDP API

2014-12-13 Thread Mitar
Hi!

I made a Meteor DDP API for the stream of recent changes on all
Wikimedia wikis. Now you can simply use DDP.connect in your Meteor
application to connect to the stream of changes on Wikipedia. :-)

http://wikimedia.meteor.com/


Mitar

-- 
http://mitar.tnode.com/
https://twitter.com/mitar_m
