Okay, yesterday evening I did the following:
I started this script:

##
#!/bin/bash
curl https://dumps.wikimedia.org/wikidatawiki/latest/wikidatawiki-latest-pages-articles-multistream.xml.bz2 | bzip2 -d | tail -200
##

with this command:

##
tools.persondata@tools-sgebastion-11:~$ toolforge jobs run --command /data/project/persondata/spielwiese/curltest.sh --image php7.4 -o /data/project/persondata/logs/curltest.out -e /data/project/persondata/logs/curltest.err startcurltest
##

The error file curltest.err looks like this:

##
tools.persondata@tools-sgebastion-11:~$ tr '\r' '\n' </data/project/persondata/logs/curltest.err | head -2
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
tools.persondata@tools-sgebastion-11:~$ tr '\r' '\n' </data/project/persondata/logs/curltest.err | tail -20
 22  141G   22 31.6G    0     0   748k      0 55:09:59 12:18:21 42:51:38  755k
 22  141G   22 31.6G    0     0   748k      0 55:09:59 12:18:22 42:51:37  787k
 22  141G   22 31.6G    0     0   748k      0 55:09:59 12:18:23 42:51:36  770k
 22  141G   22 31.6G    0     0   748k      0 55:09:59 12:18:24 42:51:35  764k
 22  141G   22 31.6G    0     0   748k      0 55:10:00 12:18:25 42:51:35  727k
 22  141G   22 31.6G    0     0   748k      0 55:10:00 12:18:26 42:51:34  708k
 22  141G   22 31.6G    0     0   748k      0 55:10:00 12:18:26 42:51:34  698k
curl: (18) transfer closed with 118232009816 bytes remaining to read

bzip2: Compressed file ends unexpectedly;
        perhaps it is corrupted?  *Possible* reason follows.
bzip2: Inappropriate ioctl for device
        Input file = (stdin), output file = (stdout)

It is possible that the compressed file(s) have become corrupted.
You can use the -tvv option to test integrity of such files.

You can use the `bzip2recover' program to attempt to recover
data from undamaged sections of corrupted files.
##

The stdout file curltest.out looks like this:

##
tools.persondata@tools-sgebastion-11:~$ tail -3 /data/project/persondata/logs/curltest.out
      <sha1>s3raizvae6sd42yw49j2gy63ecyqclk</sha1>
    </revision>
  </page>
##

Something does not like me very much :-( Maybe some timeout? Maybe some transfer limitation? Maybe something different.

Wolfgang
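P.S.: In case it really is a timeout or a dropped connection, my next test will download to disk first, with resume and retries, and only decompress once the hash checks out. Untested sketch; the output path is just an example:

##
#!/bin/bash
set -e
URL=https://dumps.wikimedia.org/wikidatawiki/latest/wikidatawiki-latest-pages-articles-multistream.xml.bz2
OUT=/data/project/persondata/spielwiese/wikidatawiki-latest.xml.bz2

# -C - resumes a partial file, --retry re-attempts after transient errors
curl --fail --retry 10 --retry-delay 60 -C - -o "$OUT" "$URL"

# compare this against the sha1 published next to the dump before trusting the file
sha1sum "$OUT"

# only then decompress, same tail test as before
bzip2 -dc "$OUT" | tail -200
##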
On Wed, Jan 10, 2024 at 4:29 PM Ariel Glenn WMF <ar...@wikimedia.org> wrote:

> I would hazard a guess that your bz2 unzip app does not handle multistream
> files in an appropriate way, Wurgl. The multistream files consist of
> several bzip2-compressed files concatenated together; see
> https://meta.wikimedia.org/wiki/Data_dumps/Dump_format#Multistream_dumps
> for details. Try downloading the entire file via curl, and then look into
> the question of the bzip app issues separately. Maybe it will turn out
> that you are encountering some other problem. But first, see if you can
> download the entire file and get its hash to check out.
>
> Ariel
>
> On Wed, Jan 10, 2024 at 5:15 PM Xabriel Collazo Mojica <xcoll...@wikimedia.org> wrote:
>
>> Gerhard: Thanks for the extra checks!
>>
>> Wolfgang: I can confirm Gerhard's findings. The file appears correct,
>> and ends with the right footer.
>>
>> On Wed, Jan 10, 2024 at 10:50 AM Gerhard Gonter <ggon...@gmail.com> wrote:
>>
>>> On Fri, Jan 5, 2024 at 5:03 PM Wurgl <heisewu...@gmail.com> wrote:
>>> >
>>> > Hello!
>>> >
>>> > I am having some unexpected messages, so I tried the following:
>>> >
>>> > curl -s https://dumps.wikimedia.org/wikidatawiki/latest/wikidatawiki-latest-pages-articles-multistream.xml.bz2 | bzip2 -d | tail
>>> >
>>> > and got this:
>>> >
>>> > bzip2: Compressed file ends unexpectedly;
>>> >         perhaps it is corrupted?  *Possible* reason follows.
>>> > bzip2: Inappropriate ioctl for device
>>> >         Input file = (stdin), output file = (stdout)
>>> >
>>> > It is possible that the compressed file(s) have become corrupted.
>>>
>>> The file I received was fine, and its sha1sum matches that of
>>> wikidatawiki-20240101-pages-articles-multistream.xml.bz2 mentioned in
>>> the posting of Xabriel Collazo Mojica:
>>>
>>> --- 8< ---
>>> $ sha1sum wikidatawiki-latest-pages-articles-multistream.xml.bz2
>>> 1be753ba90e0390c8b65f9b80b08015922da12f1  wikidatawiki-latest-pages-articles-multistream.xml.bz2
>>> --- >8 ---
>>>
>>> bunzip2 did not report any problem; however, my first attempt to
>>> decompress ended with a full disk after more than 2.3 TB of XML.
>>>
>>> The second attempt
>>>
>>> --- 8< ---
>>> $ bunzip2 -cv wikidatawiki-latest-pages-articles-multistream.xml.bz2 | tail -n 10000 > wikidatawiki-latest-pages-articles-multistream_tail_-n_10000.xml
>>>   wikidatawiki-latest-pages-articles-multistream.xml.bz2: done
>>> --- >8 ---
>>>
>>> resulted in a nice XML fragment which ends with:
>>>
>>> --- 8< ---
>>>   <page>
>>>     <title>Q124069752</title>
>>>     <ns>0</ns>
>>>     <id>118244259</id>
>>>     <revision>
>>>       <id>2042727399</id>
>>>       <parentid>2042727216</parentid>
>>>       <timestamp>2024-01-01T20:37:28Z</timestamp>
>>>       <contributor>
>>>         <username>Kalepom</username>
>>>         <id>1900170</id>
>>>       </contributor>
>>>       <comment>/* wbsetclaim-create:2||1 */ [[Property:P2789]]: [[Q16506931]]</comment>
>>>       <model>wikibase-item</model>
>>>       <format>application/json</format>
>>>       <text bytes="2535" xml:space="preserve">...</text>
>>>       <sha1>9gw926vh84k1b5h6wnuvlvnd2zc3a9b</sha1>
>>>     </revision>
>>>   </page>
>>> </mediawiki>
>>> --- >8 ---
>>>
>>> So I assume your curl did not return the full 142 GB of
>>> wikidatawiki-latest-pages-articles-multistream.xml.bz2.
>>>
>>> P.S.: I'll start a new bunzip2 to a larger scratch disk just to find
>>> out how big this XML file really is.
>>>
>>> regards, Gerhard
>>
>> --
>> Xabriel J. Collazo Mojica (he/him, pronunciation
>> <https://commons.wikimedia.org/wiki/File:Xabriel_Collazo_Mojica_-_pronunciation.ogg>)
>> Sr Software Engineer
>> Wikimedia Foundation
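P.P.S., regarding Gerhard's plan to find out how big the XML really is: the decompressed size can be counted without writing anything to disk. Sketch, assuming the complete .bz2 file is already local:

##
# count decompressed bytes; nothing is stored on disk
bunzip2 -c wikidatawiki-latest-pages-articles-multistream.xml.bz2 | wc -c
##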
_______________________________________________
Xmldatadumps-l mailing list -- xmldatadumps-l@lists.wikimedia.org
To unsubscribe send an email to xmldatadumps-l-le...@lists.wikimedia.org