Gerhard: Thanks for the extra checks!

Wolfgang: I can confirm Gerhard's findings. The file appears correct and
ends with the right footer.
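
For anyone who wants to repeat the check on a local copy, here is a minimal sketch, assuming the dump was saved to disk first rather than piped straight from curl. The function name verify_dump is hypothetical, and the expected hash has to come from the published checksum for the dump. It compares the SHA-1 and then runs bzip2 -t, which tests the compressed stream without writing any decompressed output (on a file this size that will take a long time).

```shell
# Hypothetical helper: check a downloaded dump against a published SHA-1,
# then test the bzip2 stream for truncation or corruption.
# Usage: verify_dump FILE EXPECTED_SHA1
verify_dump() {
    file=$1
    expected=$2

    # sha1sum prints "<hash>  <filename>"; keep only the hash.
    actual=$(sha1sum "$file" | cut -d' ' -f1)
    if [ "$actual" != "$expected" ]; then
        echo "checksum mismatch for $file: got $actual" >&2
        return 1
    fi

    # -t decompresses internally and discards the result; a truncated
    # download fails here with a non-zero exit status.
    bzip2 -t "$file"
}

# Example call, using the hash quoted below:
# verify_dump wikidatawiki-latest-pages-articles-multistream.xml.bz2 \
#     1be753ba90e0390c8b65f9b80b08015922da12f1
```

A mismatch or a failing bzip2 -t on the local copy would point at the download rather than at the file on the server.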

On Wed, Jan 10, 2024 at 10:50 AM Gerhard Gonter <ggon...@gmail.com> wrote:

> On Fri, Jan 5, 2024 at 5:03 PM Wurgl <heisewu...@gmail.com> wrote:
> >
> > Hello!
> >
> > I am getting some unexpected messages, so I tried the following:
> >
> > curl -s https://dumps.wikimedia.org/wikidatawiki/latest/wikidatawiki-latest-pages-articles-multistream.xml.bz2 | bzip2 -d | tail
> >
> > and got this:
> >
> > bzip2: Compressed file ends unexpectedly;
> >         perhaps it is corrupted?  *Possible* reason follows.
> > bzip2: Inappropriate ioctl for device
> >         Input file = (stdin), output file = (stdout)
> >
> > It is possible that the compressed file(s) have become corrupted.
>
> The file I received was fine and the sha1sum matches that of
> wikidatawiki-20240101-pages-articles-multistream.xml.bz2 mentioned in
> the posting of Xabriel Collazo Mojica:
>
> --- 8< ---
> $ sha1sum wikidatawiki-latest-pages-articles-multistream.xml.bz2
> 1be753ba90e0390c8b65f9b80b08015922da12f1  wikidatawiki-latest-pages-articles-multistream.xml.bz2
> --- >8 ---
>
> bunzip2 did not report any problem; however, my first attempt to
> decompress ended with a full disk after more than 2.3 TB of XML.
>
> The second attempt
> --- 8< ---
> $ bunzip2 -cv wikidatawiki-latest-pages-articles-multistream.xml.bz2 |
>     tail -n 10000 > wikidatawiki-latest-pages-articles-multistream_tail_-n_10000.xml
>   wikidatawiki-latest-pages-articles-multistream.xml.bz2: done
> --- >8 ---
>
> resulted in a nice XML fragment which ends with
> --- 8< ---
>   <page>
>     <title>Q124069752</title>
>     <ns>0</ns>
>     <id>118244259</id>
>     <revision>
>       <id>2042727399</id>
>       <parentid>2042727216</parentid>
>       <timestamp>2024-01-01T20:37:28Z</timestamp>
>       <contributor>
>         <username>Kalepom</username>
>         <id>1900170</id>
>       </contributor>
>       <comment>/* wbsetclaim-create:2||1 */ [[Property:P2789]]: [[Q16506931]]</comment>
>       <model>wikibase-item</model>
>       <format>application/json</format>
>       <text bytes="2535" xml:space="preserve">...</text>
>       <sha1>9gw926vh84k1b5h6wnuvlvnd2zc3a9b</sha1>
>     </revision>
>   </page>
> </mediawiki>
> --- >8 ---
>
> So I assume your curl did not return the full 142 GB of
> wikidatawiki-latest-pages-articles-multistream.xml.bz2.
>
> P.S.: I'll start a new bunzip2 to a larger scratch disk just to find
> out how big this XML file really is.
>
> regards, Gerhard
> _______________________________________________
> Xmldatadumps-l mailing list -- xmldatadumps-l@lists.wikimedia.org
> To unsubscribe send an email to xmldatadumps-l-le...@lists.wikimedia.org
>
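
On the question of how big the XML really is: a minimal sketch that counts the decompressed bytes as they stream past, so nothing is written to disk. The name decompressed_size is hypothetical. Note that in a plain sh pipeline the exit status of bzip2 is not checked, so a truncated input would give an undercount along with an error on stderr.

```shell
# Hypothetical helper: print the decompressed size of a .bz2 file in bytes
# without storing the decompressed output anywhere.
# Usage: decompressed_size FILE
decompressed_size() {
    # -d decompress, -c write to stdout; wc -c counts the bytes.
    # tr strips the padding some wc implementations add.
    bzip2 -dc "$1" | wc -c | tr -d ' '
}

# Example call:
# decompressed_size wikidatawiki-latest-pages-articles-multistream.xml.bz2
```

The result is a byte count; divide by 10^12 for TB.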


-- 
Xabriel J. Collazo Mojica (he/him, pronunciation
<https://commons.wikimedia.org/wiki/File:Xabriel_Collazo_Mojica_-_pronunciation.ogg>
)
Sr Software Engineer
Wikimedia Foundation
