If you really want to read all of the data as a single stream, I would
suggest writing a preprocessor using a SAX library (from Python, Java, or
whatever language you want to use) to break the Wikimedia stream into separate
XML files for each page element, or else use the same language to do the
streaming CSV conversion.

However, for a file that large, you may have issues if there is a network 
interruption. 

Depending on how reliable your connection is, you might be better off
downloading the separate chunks. That gives you easily recognizable restart
points.
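
To illustrate the restart-point idea, here is a rough Python sketch. It
assumes the chunks have already been downloaded as local .bz2 files; the
function name process_chunks, the handle_chunk callback, and the done.txt
checkpoint file are all made up for the example. Each chunk is logged only
after it has been handled completely, so a rerun skips the finished ones.

```python
import bz2
from pathlib import Path


def process_chunks(chunk_paths, handle_chunk, done_file="done.txt"):
    """Run handle_chunk on each bz2 chunk, skipping ones already recorded.

    handle_chunk and done_file are hypothetical names for this sketch.
    Returns the names of the chunks processed during this call.
    """
    done_path = Path(done_file)
    done = set(done_path.read_text().splitlines()) if done_path.exists() else set()
    processed = []
    for path in chunk_paths:
        name = Path(path).name
        if name in done:
            continue  # finished in an earlier run
        # Decompress on the fly; the callback sees a binary stream.
        with bz2.open(path, "rb") as stream:
            handle_chunk(stream)
        # Record the chunk only after it has been handled completely.
        with done_path.open("a") as log:
            log.write(name + "\n")
        processed.append(name)
    return processed
```

If a run dies halfway through a chunk, that chunk is not in the log, so it
gets redone from its start on the next run.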


Otherwise: Saxon can do streaming XSLT, but only with the paid Enterprise
Edition (Saxon-EE) license. I have no idea whether Saxon XQuery can also
handle streaming input, or whether any of the non-Java versions of Saxon
handle streaming.


If all that is needed is to convert the XML stream into CSV records to dump
into Postgres, I would probably use Python/SAX, but I wonder if Postgres is
really a requirement, or if you can do your final queries in BaseX? If
dumping everything into a BaseX database is just an intermediary step, then
it’s probably not the most efficient way to go.
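
For the Python/SAX route, something along these lines might work as a
starting point. It is only a sketch: the handler class and xml_to_csv are
names I made up, and it assumes each <page> carries a <title> and a <text>
element and that those two columns are what you want in the CSV. It relies
on the default non-namespace mode of xml.sax, so element names arrive as
written in the file.

```python
import csv
import io
import xml.sax


class PageToCsvHandler(xml.sax.ContentHandler):
    """Write one CSV row (title, text) per <page> element as it closes."""

    def __init__(self, writer):
        super().__init__()
        self.writer = writer
        self.in_page = False
        self.field = None   # name of the element currently being captured
        self.buffers = {}

    def startElement(self, name, attrs):
        if name == "page":
            self.in_page = True
            self.buffers = {"title": [], "text": []}
        elif self.in_page and name in self.buffers:
            self.field = name

    def characters(self, content):
        # SAX may deliver one text node in several pieces, so accumulate.
        if self.field is not None:
            self.buffers[self.field].append(content)

    def endElement(self, name):
        if name == self.field:
            self.field = None
        elif name == "page":
            self.writer.writerow(
                ["".join(self.buffers["title"]), "".join(self.buffers["text"])]
            )
            self.in_page = False


def xml_to_csv(source, out):
    """Stream-parse source (a path or binary file object) into CSV on out."""
    writer = csv.writer(out)
    writer.writerow(["title", "text"])
    xml.sax.parse(source, PageToCsvHandler(writer))
```

Only the page currently being read is held in memory, and each row is
written as soon as its </page> tag is seen. Since source can be any binary
file object, a stream from bz2.open() should work as well as a plain file.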

— Steve M.


> On Feb 23, 2020, at 6:54 PM, maxzor <
> max...@maxzor.eu> wrote:
> 
> 
>> Do you mean stream a single large XML file ? A series of XML files, or 
>> stream a file thru a series of XQuery|XSLT|XPath transforms.
>> 
> Possibly poor wording, I meant read a large XML file and produce i.e. a csv 
> file.
>> I don’t believe BaseX uses a streaming XML parser, so probably can’t handle 
>> streaming a single large XML file and produce output before it’s parsed the 
>> complete file.
> Do you know of a streaming xml lib? other than StAX (no Java here :<)?
>> But it looks like, from the link in your stackoverflow post that the data is 
>> already sharded into a collection of separate XML files that each contain 
>> multiple <page> elements.
> 
> This is the alternative, instead of processing the monolithic multistream 
> file, I could crawl over the ~150MB bz2-compressed chunks.
> 
> Regards, Maxime
> 
> 
