Hello Ariel!

It is not "my bzip2"; it is the bzip2 on tools-sgebastion-11 in the
toolserver-cloud … well, actually on one of the servers that are used when I
start a script within the Kubernetes environment there (with PHP 7.4).
If you have an account there, you can look at:
/data/project/persondata/dumps/wikidata_sitelinks.sh

The relevant line is this one:
  curl -s https://dumps.wikimedia.org/wikidatawiki/latest/wikidatawiki-latest-pages-articles-multistream.xml.bz2 \
    | bzip2 -d | php ~/dumps/wikidata_sitelinks.php
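
To tell a truncated download apart from a decompressor that stops after the
first stream, the decompression step can be sketched in Python. This is only
an illustration, not what wikidata_sitelinks.sh does: the function name
iter_decompressed and the chunk size are made up here, and it assumes
Python 3 with the standard bz2 module. It starts a fresh decompressor at every
stream boundary (multistream files are several bzip2 streams concatenated) and
raises an error if the input ends in the middle of a stream.

```python
import bz2
import io


def iter_decompressed(stream, chunk_size=1 << 20):
    """Yield decompressed bytes from a possibly multistream bzip2 input.

    Raises EOFError if the input ends in the middle of a bzip2 stream,
    which is what a truncated download looks like.
    """
    decomp = bz2.BZ2Decompressor()
    mid_stream = False  # True once a stream has started but not yet ended
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        while chunk:
            out = decomp.decompress(chunk)
            mid_stream = True
            if out:
                yield out
            if decomp.eof:
                # One stream ended; leftover bytes belong to the next stream.
                chunk = decomp.unused_data
                decomp = bz2.BZ2Decompressor()
                mid_stream = False
            else:
                chunk = b""
    if mid_stream:
        raise EOFError("input ended inside a bzip2 stream (truncated?)")


if __name__ == "__main__":
    # Two independent bzip2 streams concatenated, as in a multistream dump.
    data = bz2.compress(b"<page>1</page>\n") + bz2.compress(b"<page>2</page>\n")
    print(b"".join(iter_decompressed(io.BytesIO(data))))
    try:
        # Cut off the last bytes to imitate an incomplete download.
        b"".join(iter_decompressed(io.BytesIO(data[:-10])))
    except EOFError as exc:
        print("truncated:", exc)
```

Reading from sys.stdin.buffer instead of the in-memory sample would let the
function sit behind curl in a pipe. If it then fails with the truncation
error too, that would point to a short download rather than to bzip2 itself.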

Yes, I double-checked it on my machine at home and the same type of error
happened.
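
Checking a fully downloaded file against the published hash, as suggested in
the quoted mail below, can be done with sha1sum, or with a few lines of Python
where sha1sum is not at hand. A minimal sketch, assuming Python 3; the
function name sha1_of_file and the command-line handling are illustrative
only. The file is read in chunks so that a dump of this size does not need to
fit into memory.

```python
import hashlib
import sys


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-1 of a file, reading it in fixed-size chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical usage: python3 sha1check.py FILE EXPECTED_SHA1
    path, expected = sys.argv[1], sys.argv[2]
    actual = sha1_of_file(path)
    print(actual, "OK" if actual == expected.lower() else "MISMATCH")
```

The expected value would be the one published for the specific dump run;
"latest" may point to a newer dump than the 20240101 one whose hash is quoted
further down in this thread.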

Wolfgang

On Wed, 10 Jan 2024 at 16:29, Ariel Glenn WMF <
ar...@wikimedia.org> wrote:

> I would hazard a guess that your bz2 unzip app does not handle multistream
> files in an appropriate way, Wurgl. The multistream files consist of
> several bzip2-compressed files concatenated together; see
> https://meta.wikimedia.org/wiki/Data_dumps/Dump_format#Multistream_dumps
> for details.  Try downloading the entire file via curl, and then look into
> the question of the bzip app issues separately. Maybe it will turn out that
> you are encountering some other problem. But first, see if you can download
> the entire file and get its hash to check out.
>
> Ariel
>
> On Wed, Jan 10, 2024 at 5:15 PM Xabriel Collazo Mojica <
> xcoll...@wikimedia.org> wrote:
>
>> Gerhard: Thanks for the extra checks!
>>
>> Wolfgang: I can confirm Gerhard's findings. The file appears correct, and
>> ends with the right footer.
>>
>> On Wed, Jan 10, 2024 at 10:50 AM Gerhard Gonter <ggon...@gmail.com>
>> wrote:
>>
>>> On Fri, Jan 5, 2024 at 5:03 PM Wurgl <heisewu...@gmail.com> wrote:
>>> >
>>> > Hello!
>>> >
>>> > I am having some unexpected messages, so I tried the following:
>>> >
>>> > curl -s https://dumps.wikimedia.org/wikidatawiki/latest/wikidatawiki-latest-pages-articles-multistream.xml.bz2 | bzip2 -d | tail
>>> >
>>> > and got this:
>>> >
>>> > bzip2: Compressed file ends unexpectedly;
>>> >         perhaps it is corrupted?  *Possible* reason follows.
>>> > bzip2: Inappropriate ioctl for device
>>> >         Input file = (stdin), output file = (stdout)
>>> >
>>> > It is possible that the compressed file(s) have become corrupted.
>>>
>>> The file I received was fine and the sha1sum matches that of
>>> wikidatawiki-20240101-pages-articles-multistream.xml.bz2 mentioned in
>>> the posting of Xabriel Collazo Mojica:
>>>
>>> --- 8< ---
>>> $ sha1sum wikidatawiki-latest-pages-articles-multistream.xml.bz2
>>> 1be753ba90e0390c8b65f9b80b08015922da12f1  wikidatawiki-latest-pages-articles-multistream.xml.bz2
>>> --- >8 ---
>>>
>>> bunzip2 did not report any problem; however, my first attempt to
>>> decompress ended with a full disk after more than 2.3 TB of XML.
>>>
>>> The second attempt
>>> --- 8< ---
>>> $ bunzip2 -cv wikidatawiki-latest-pages-articles-multistream.xml.bz2 \
>>>     | tail -n 10000 > wikidatawiki-latest-pages-articles-multistream_tail_-n_10000.xml
>>>   wikidatawiki-latest-pages-articles-multistream.xml.bz2: done
>>> --- >8 ---
>>>
>>> resulted in a nice XML fragment which ends with
>>> --- 8< ---
>>>   <page>
>>>     <title>Q124069752</title>
>>>     <ns>0</ns>
>>>     <id>118244259</id>
>>>     <revision>
>>>       <id>2042727399</id>
>>>       <parentid>2042727216</parentid>
>>>       <timestamp>2024-01-01T20:37:28Z</timestamp>
>>>       <contributor>
>>>         <username>Kalepom</username>
>>>         <id>1900170</id>
>>>       </contributor>
>>>       <comment>/* wbsetclaim-create:2||1 */ [[Property:P2789]]: [[Q16506931]]</comment>
>>>       <model>wikibase-item</model>
>>>       <format>application/json</format>
>>>       <text bytes="2535" xml:space="preserve">...</text>
>>>       <sha1>9gw926vh84k1b5h6wnuvlvnd2zc3a9b</sha1>
>>>     </revision>
>>>   </page>
>>> </mediawiki>
>>> --- >8 ---
>>>
>>> So I assume your curl did not return the full 142 GB of
>>> wikidatawiki-latest-pages-articles-multistream.xml.bz2.
>>>
>>> P.S.: I'll start a new bunzip2 to a larger scratch disk just to find
>>> out how big this XML file really is.
>>>
>>> regards, Gerhard
>>> _______________________________________________
>>> Xmldatadumps-l mailing list -- xmldatadumps-l@lists.wikimedia.org
>>> To unsubscribe send an email to xmldatadumps-l-le...@lists.wikimedia.org
>>>
>>
>>
>> --
>> Xabriel J. Collazo Mojica (he/him, pronunciation
>> <https://commons.wikimedia.org/wiki/File:Xabriel_Collazo_Mojica_-_pronunciation.ogg>
>> )
>> Sr Software Engineer
>> Wikimedia Foundation
>>
>