Okay, yesterday evening I did the following:
I started this script:

##
#!/bin/bash
curl https://dumps.wikimedia.org/wikidatawiki/latest/wikidatawiki-latest-pages-articles-multistream.xml.bz2 | bzip2 -d | tail -200
##

with this command:

##
tools.persondata@tools-sgebastion-11:~$ toolforge jobs run --command /data/project/persondata/spielwiese/curltest.sh --image php7.4 -o /data/project/persondata/logs/curltest.out -e /data/project/persondata/logs/curltest.err startcurltest
##

The error file curltest.err looks like this:

##
tools.persondata@tools-sgebastion-11:~$ tr '\r' '\n' </data/project/persondata/logs/curltest.err | head -2
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
tools.persondata@tools-sgebastion-11:~$ tr '\r' '\n' </data/project/persondata/logs/curltest.err | tail -20
 22  141G   22 31.6G    0     0   748k      0 55:09:59 12:18:21 42:51:38  755k
 22  141G   22 31.6G    0     0   748k      0 55:09:59 12:18:22 42:51:37  787k
 22  141G   22 31.6G    0     0   748k      0 55:09:59 12:18:23 42:51:36  770k
 22  141G   22 31.6G    0     0   748k      0 55:09:59 12:18:24 42:51:35  764k
 22  141G   22 31.6G    0     0   748k      0 55:10:00 12:18:25 42:51:35  727k
 22  141G   22 31.6G    0     0   748k      0 55:10:00 12:18:26 42:51:34  708k
 22  141G   22 31.6G    0     0   748k      0 55:10:00 12:18:26 42:51:34  698k
curl: (18) transfer closed with 118232009816 bytes remaining to read

bzip2: Compressed file ends unexpectedly;
        perhaps it is corrupted?  *Possible* reason follows.
bzip2: Inappropriate ioctl for device
        Input file = (stdin), output file = (stdout)

It is possible that the compressed file(s) have become corrupted.
You can use the -tvv option to test integrity of such files.

You can use the `bzip2recover' program to attempt to recover
data from undamaged sections of corrupted files.
##

The stdout file curltest.out looks like this:

##
tools.persondata@tools-sgebastion-11:~$ tail -3 /data/project/persondata/logs/curltest.out
      <sha1>s3raizvae6sd42yw49j2gy63ecyqclk</sha1>
    </revision>
  </page>
##

Something does not like me very much :-( Maybe some timeout? Maybe some transfer limitation? Maybe something different.

Wolfgang
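P.S.: In case it really is a timeout or a dropped connection, my next test will download to disk first, with resume and retries, and only decompress once the hash checks out. Untested sketch; the output path is just an example:

##
#!/bin/bash
set -e
URL=https://dumps.wikimedia.org/wikidatawiki/latest/wikidatawiki-latest-pages-articles-multistream.xml.bz2
OUT=/data/project/persondata/spielwiese/wikidatawiki-latest.xml.bz2

# -C - resumes a partial file, --retry re-attempts after transient errors
curl --fail --retry 10 --retry-delay 60 -C - -o "$OUT" "$URL"

# compare this against the sha1 published next to the dump before trusting the file
sha1sum "$OUT"

# only then decompress, same tail test as before
bzip2 -dc "$OUT" | tail -200
##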
On Wed, Jan 10, 2024 at 4:29 PM Ariel Glenn WMF <ar...@wikimedia.org> wrote:

> I would hazard a guess that your bz2 unzip app does not handle multistream
> files in an appropriate way, Wurgl. The multistream files consist of
> several bzip2-compressed files concatenated together; see
> https://meta.wikimedia.org/wiki/Data_dumps/Dump_format#Multistream_dumps
> for details. Try downloading the entire file via curl, and then look into
> the question of the bzip app issues separately. Maybe it will turn out
> that you are encountering some other problem. But first, see if you can
> download the entire file and get its hash to check out.
>
> Ariel
>
> On Wed, Jan 10, 2024 at 5:15 PM Xabriel Collazo Mojica <xcoll...@wikimedia.org> wrote:
>
>> Gerhard: Thanks for the extra checks!
>>
>> Wolfgang: I can confirm Gerhard's findings. The file appears correct,
>> and ends with the right footer.
>>
>> On Wed, Jan 10, 2024 at 10:50 AM Gerhard Gonter <ggon...@gmail.com> wrote:
>>
>>> On Fri, Jan 5, 2024 at 5:03 PM Wurgl <heisewu...@gmail.com> wrote:
>>> >
>>> > Hello!
>>> >
>>> > I am having some unexpected messages, so I tried the following:
>>> >
>>> > curl -s https://dumps.wikimedia.org/wikidatawiki/latest/wikidatawiki-latest-pages-articles-multistream.xml.bz2 | bzip2 -d | tail
>>> >
>>> > and got this:
>>> >
>>> > bzip2: Compressed file ends unexpectedly;
>>> >         perhaps it is corrupted?  *Possible* reason follows.
>>> > bzip2: Inappropriate ioctl for device
>>> >         Input file = (stdin), output file = (stdout)
>>> >
>>> > It is possible that the compressed file(s) have become corrupted.
>>>
>>> The file I received was fine, and its sha1sum matches that of
>>> wikidatawiki-20240101-pages-articles-multistream.xml.bz2 mentioned in
>>> the posting of Xabriel Collazo Mojica:
>>>
>>> --- 8< ---
>>> $ sha1sum wikidatawiki-latest-pages-articles-multistream.xml.bz2
>>> 1be753ba90e0390c8b65f9b80b08015922da12f1  wikidatawiki-latest-pages-articles-multistream.xml.bz2
>>> --- >8 ---
>>>
>>> bunzip2 did not report any problem; however, my first attempt to
>>> decompress ended with a full disk after more than 2.3 TB of XML.
>>>
>>> The second attempt
>>>
>>> --- 8< ---
>>> $ bunzip2 -cv wikidatawiki-latest-pages-articles-multistream.xml.bz2 | tail -n 10000 > wikidatawiki-latest-pages-articles-multistream_tail_-n_10000.xml
>>>   wikidatawiki-latest-pages-articles-multistream.xml.bz2: done
>>> --- >8 ---
>>>
>>> resulted in a nice XML fragment which ends with:
>>>
>>> --- 8< ---
>>>   <page>
>>>     <title>Q124069752</title>
>>>     <ns>0</ns>
>>>     <id>118244259</id>
>>>     <revision>
>>>       <id>2042727399</id>
>>>       <parentid>2042727216</parentid>
>>>       <timestamp>2024-01-01T20:37:28Z</timestamp>
>>>       <contributor>
>>>         <username>Kalepom</username>
>>>         <id>1900170</id>
>>>       </contributor>
>>>       <comment>/* wbsetclaim-create:2||1 */ [[Property:P2789]]: [[Q16506931]]</comment>
>>>       <model>wikibase-item</model>
>>>       <format>application/json</format>
>>>       <text bytes="2535" xml:space="preserve">...</text>
>>>       <sha1>9gw926vh84k1b5h6wnuvlvnd2zc3a9b</sha1>
>>>     </revision>
>>>   </page>
>>> </mediawiki>
>>> --- >8 ---
>>>
>>> So I assume your curl did not return the full 142 GB of
>>> wikidatawiki-latest-pages-articles-multistream.xml.bz2.
>>>
>>> P.S.: I'll start a new bunzip2 to a larger scratch disk just to find
>>> out how big this XML file really is.
>>>
>>> regards, Gerhard
>>
>> --
>> Xabriel J. Collazo Mojica (he/him, pronunciation
>> <https://commons.wikimedia.org/wiki/File:Xabriel_Collazo_Mojica_-_pronunciation.ogg>)
>> Sr Software Engineer
>> Wikimedia Foundation
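P.P.S., regarding Gerhard's plan to find out how big the XML really is: the decompressed size can be counted without writing anything to disk. Sketch, assuming the complete .bz2 file is already local:

##
# count decompressed bytes; nothing is stored on disk
bunzip2 -c wikidatawiki-latest-pages-articles-multistream.xml.bz2 | wc -c
##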
_______________________________________________
Xmldatadumps-l mailing list -- xmldatadumps-l@lists.wikimedia.org
To unsubscribe send an email to xmldatadumps-l-le...@lists.wikimedia.org