Hello Ariel!

It is not "my bzip2"; it is the bzip2 on tools-sgebastion-11 in the toolserver-cloud … well, actually on one of the servers that are used when I start a script within the Kubernetes environment there (with PHP 7.4).

If you have an account there, you can look at: /data/project/persondata/dumps/wikidata_sitelinks.sh
The relevant line is this one:

curl -s https://dumps.wikimedia.org/wikidatawiki/latest/wikidatawiki-latest-pages-articles-multistream.xml.bz2 | bzip2 -d | php ~/dumps/wikidata_sitelinks.php

Yes, I double-checked it on my machine at home and the same type of error occurred.

Wolfgang

On Wed, 10 Jan 2024 at 16:29, Ariel Glenn WMF <ar...@wikimedia.org> wrote:

> I would hazard a guess that your bz2 unzip app does not handle multistream
> files in an appropriate way, Wurgl. The multistream files consist of
> several bzip2-compressed files concatenated together; see
> https://meta.wikimedia.org/wiki/Data_dumps/Dump_format#Multistream_dumps
> for details. Try downloading the entire file via curl, and then look into
> the question of the bzip app issues separately. Maybe it will turn out that
> you are encountering some other problem. But first, see if you can download
> the entire file and get its hash to check out.
>
> Ariel
>
> On Wed, Jan 10, 2024 at 5:15 PM Xabriel Collazo Mojica <xcoll...@wikimedia.org> wrote:
>
>> Gerhard: Thanks for the extra checks!
>>
>> Wolfgang: I can confirm Gerhard's findings. The file appears correct, and
>> ends with the right footer.
>>
>> On Wed, Jan 10, 2024 at 10:50 AM Gerhard Gonter <ggon...@gmail.com> wrote:
>>
>>> On Fri, Jan 5, 2024 at 5:03 PM Wurgl <heisewu...@gmail.com> wrote:
>>> >
>>> > Hello!
>>> >
>>> > I am having some unexpected messages, so I tried the following:
>>> >
>>> > curl -s https://dumps.wikimedia.org/wikidatawiki/latest/wikidatawiki-latest-pages-articles-multistream.xml.bz2 | bzip2 -d | tail
>>> >
>>> > and got this:
>>> >
>>> > bzip2: Compressed file ends unexpectedly;
>>> >         perhaps it is corrupted?  *Possible* reason follows.
>>> > bzip2: Inappropriate ioctl for device
>>> >         Input file = (stdin), output file = (stdout)
>>> >
>>> > It is possible that the compressed file(s) have become corrupted.
>>>
>>> The file I received was fine and the sha1sum matches that of
>>> wikidatawiki-20240101-pages-articles-multistream.xml.bz2 mentioned in
>>> the posting of Xabriel Collazo Mojica:
>>>
>>> --- 8< ---
>>> $ sha1sum wikidatawiki-latest-pages-articles-multistream.xml.bz2
>>> 1be753ba90e0390c8b65f9b80b08015922da12f1  wikidatawiki-latest-pages-articles-multistream.xml.bz2
>>> --- >8 ---
>>>
>>> bunzip2 did not report any problem; however, my first attempt to
>>> decompress ended with a full disk after more than 2.3 TB of XML.
>>>
>>> The second attempt
>>> --- 8< ---
>>> $ bunzip2 -cv wikidatawiki-latest-pages-articles-multistream.xml.bz2 | tail -n 10000 > wikidatawiki-latest-pages-articles-multistream_tail_-n_10000.xml
>>>   wikidatawiki-latest-pages-articles-multistream.xml.bz2: done
>>> --- >8 ---
>>>
>>> resulted in a nice XML fragment which ends with
>>> --- 8< ---
>>>   <page>
>>>     <title>Q124069752</title>
>>>     <ns>0</ns>
>>>     <id>118244259</id>
>>>     <revision>
>>>       <id>2042727399</id>
>>>       <parentid>2042727216</parentid>
>>>       <timestamp>2024-01-01T20:37:28Z</timestamp>
>>>       <contributor>
>>>         <username>Kalepom</username>
>>>         <id>1900170</id>
>>>       </contributor>
>>>       <comment>/* wbsetclaim-create:2||1 */ [[Property:P2789]]: [[Q16506931]]</comment>
>>>       <model>wikibase-item</model>
>>>       <format>application/json</format>
>>>       <text bytes="2535" xml:space="preserve">...</text>
>>>       <sha1>9gw926vh84k1b5h6wnuvlvnd2zc3a9b</sha1>
>>>     </revision>
>>>   </page>
>>> </mediawiki>
>>> --- >8 ---
>>>
>>> So, I assume, your curl did not return the full 142 GB of
>>> wikidatawiki-latest-pages-articles-multistream.xml.bz2.
>>>
>>> P.S.: I'll start a new bunzip2 to a larger scratch disk just to find
>>> out how big this XML file really is.
>>>
>>> regards, Gerhard
>>> _______________________________________________
>>> Xmldatadumps-l mailing list -- xmldatadumps-l@lists.wikimedia.org
>>> To unsubscribe send an email to xmldatadumps-l-le...@lists.wikimedia.org
>>
>> --
>> Xabriel J. Collazo Mojica (he/him, pronunciation
>> <https://commons.wikimedia.org/wiki/File:Xabriel_Collazo_Mojica_-_pronunciation.ogg>)
>> Sr Software Engineer
>> Wikimedia Foundation
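
Ariel's advice quoted above (download the whole file first, confirm its hash, and only then look at bzip2) might be sketched roughly as below. This is only an illustration, not something anyone in the thread actually ran: the function names verify_sha1 and fetch_dump and the local file name dump.xml.bz2 are made up for this example, and the hash in the usage comment is the one Gerhard reported for the 20240101 dump, so it should not be expected to match a later "latest" file.

```shell
#!/bin/sh
# Sketch only: keep the download separate from the decompression, so that a
# truncated transfer is noticed before bzip2 ever reads the data.

# verify_sha1 FILE EXPECTED_HEX
# Succeeds (exit status 0) only if the SHA-1 of FILE equals EXPECTED_HEX.
verify_sha1() {
    # Reading from stdin keeps the file name out of sha1sum's output.
    actual=$(sha1sum < "$1" | cut -d ' ' -f 1)
    [ "$actual" = "$2" ]
}

# fetch_dump URL FILE
# Downloads URL to FILE on disk instead of piping it onward.
#   -f    fail with a non-zero exit status on HTTP errors
#   -L    follow redirects
#   -C -  resume a partially downloaded FILE if one already exists
fetch_dump() {
    curl -fL -C - -o "$2" "$1"
}

# Hypothetical usage (commented out so that nothing is downloaded here):
#
#   URL=https://dumps.wikimedia.org/wikidatawiki/latest/wikidatawiki-latest-pages-articles-multistream.xml.bz2
#   fetch_dump "$URL" dump.xml.bz2 &&
#     verify_sha1 dump.xml.bz2 1be753ba90e0390c8b65f9b80b08015922da12f1 &&
#     bzip2 -dc dump.xml.bz2 | tail
```

With the steps split like this, a "Compressed file ends unexpectedly" message that appears after a successful hash check would point at the decompressor rather than at the transfer; if the hash check itself fails, the download is the more likely culprit.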