I would hazard a guess that your bzip2 decompression tool does not handle
multistream files correctly, Wurgl. A multistream file consists of
several bzip2-compressed streams concatenated together; see
https://meta.wikimedia.org/wiki/Data_dumps/Dump_format#Multistream_dumps
for details.
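For example (a throwaway sketch; the demo file name is made up), the
reference bzip2 tool decompresses all concatenated streams, whereas a
reader that stops at the first end-of-stream marker silently truncates:

--- 8< ---
# build a two-stream .bz2 from two independent compressions
echo "stream one" | bzip2 >  demo-multi.bz2
echo "stream two" | bzip2 >> demo-multi.bz2
# the reference bzip2 prints both lines; a reader that only handles
# a single stream would emit just "stream one"
bzip2 -dc demo-multi.bz2
--- >8 ---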

Try downloading the entire file via curl first, and then look into any
bzip2 tooling issues separately. Maybe it will turn out that you are
hitting some other problem. But first, see whether you can download the
whole file and get its hash to check out.
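
Something along these lines should work (a sketch; the URL is the one from
your command, and -C - just lets curl resume a partial transfer):

--- 8< ---
# download to a file named after the remote one, resuming if interrupted
curl -C - -O \
  https://dumps.wikimedia.org/wikidatawiki/latest/wikidatawiki-latest-pages-articles-multistream.xml.bz2
# compare against the published checksum before decompressing anything
sha1sum wikidatawiki-latest-pages-articles-multistream.xml.bz2
--- >8 ---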

Ariel

On Wed, Jan 10, 2024 at 5:15 PM Xabriel Collazo Mojica <
xcoll...@wikimedia.org> wrote:

> Gerhard: Thanks for the extra checks!
>
> Wolfgang: I can confirm Gerhard's findings. The file appears correct, and
> ends with the right footer.
>
> On Wed, Jan 10, 2024 at 10:50 AM Gerhard Gonter <ggon...@gmail.com> wrote:
>
>> On Fri, Jan 5, 2024 at 5:03 PM Wurgl <heisewu...@gmail.com> wrote:
>> >
>> > Hello!
>> >
>> > I am getting some unexpected error messages, so I tried the following:
>> >
>> > curl -s
>> https://dumps.wikimedia.org/wikidatawiki/latest/wikidatawiki-latest-pages-articles-multistream.xml.bz2
>> | bzip2 -d | tail
>> >
>> > and got this:
>> >
>> > bzip2: Compressed file ends unexpectedly;
>> >         perhaps it is corrupted?  *Possible* reason follows.
>> > bzip2: Inappropriate ioctl for device
>> >         Input file = (stdin), output file = (stdout)
>> >
>> > It is possible that the compressed file(s) have become corrupted.
>>
>> The file I received was fine and its sha1sum matches that of
>> wikidatawiki-20240101-pages-articles-multistream.xml.bz2 mentioned in
>> the posting of Xabriel Collazo Mojica:
>>
>> --- 8< ---
>> $ sha1sum wikidatawiki-latest-pages-articles-multistream.xml.bz2
>> 1be753ba90e0390c8b65f9b80b08015922da12f1
>> wikidatawiki-latest-pages-articles-multistream.xml.bz2
>> --- >8 ---
>>
>> bunzip2 did not report any problem; however, my first attempt to
>> decompress it ended with a full disk after more than 2.3 TB of XML.
>>
>> The second attempt
>> --- 8< ---
>> $  bunzip2 -cv wikidatawiki-latest-pages-articles-multistream.xml.bz2
>> | tail -n 10000 >
>> wikidatawiki-latest-pages-articles-multistream_tail_-n_10000.xml
>>   wikidatawiki-latest-pages-articles-multistream.xml.bz2: done
>> --- >8 ---
>>
>> resulted in a nice XML fragment which ends with
>> --- 8< ---
>>   <page>
>>     <title>Q124069752</title>
>>     <ns>0</ns>
>>     <id>118244259</id>
>>     <revision>
>>       <id>2042727399</id>
>>       <parentid>2042727216</parentid>
>>       <timestamp>2024-01-01T20:37:28Z</timestamp>
>>       <contributor>
>>         <username>Kalepom</username>
>>         <id>1900170</id>
>>       </contributor>
>>       <comment>/* wbsetclaim-create:2||1 */ [[Property:P2789]]:
>> [[Q16506931]]</comment>
>>       <model>wikibase-item</model>
>>       <format>application/json</format>
>>       <text bytes="2535" xml:space="preserve">...</text>
>>       <sha1>9gw926vh84k1b5h6wnuvlvnd2zc3a9b</sha1>
>>     </revision>
>>   </page>
>> </mediawiki>
>> --- >8 ---
>>
>> So I assume your curl did not return the full 142 GB of
>> wikidatawiki-latest-pages-articles-multistream.xml.bz2.
>>
>> P.S.: I'll start a new bunzip2 run to a larger scratch disk just to find
>> out how big this XML file really is.
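>>
>> A way to measure that without filling another disk (just an untested
>> sketch) would be to count the decompressed bytes instead of storing them:
>> --- 8< ---
>> $ bunzip2 -c wikidatawiki-latest-pages-articles-multistream.xml.bz2 | wc -c
>> --- >8 ---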
>>
>> regards, Gerhard
>>
>
>
> --
> Xabriel J. Collazo Mojica (he/him, pronunciation
> <https://commons.wikimedia.org/wiki/File:Xabriel_Collazo_Mojica_-_pronunciation.ogg>
> )
> Sr Software Engineer
> Wikimedia Foundation
>
_______________________________________________
Xmldatadumps-l mailing list -- xmldatadumps-l@lists.wikimedia.org
To unsubscribe send an email to xmldatadumps-l-le...@lists.wikimedia.org
