If you really want to read in all of the data as a single stream, I would suggest writing a preprocessor using a SAX library (from Python, Java or whatever language you want to use) to break the Wikimedia stream into separate XML files, one per page element, or else using the same language to do the streaming CSV conversion directly.
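A minimal sketch of the streaming CSV conversion with Python's built-in xml.sax module, assuming each <page> element carries <title> and <id> as direct children (as the Wikimedia dumps appear to). The handler class, the function name and the choice of columns are only illustrative:

```python
import csv
import io
import xml.sax


class PageToCsvHandler(xml.sax.ContentHandler):
    """Write one CSV row (id, title) each time a <page> element closes."""

    WANTED = ("id", "title")  # assumed direct children of <page>

    def __init__(self, writer):
        super().__init__()
        self.writer = writer
        self.path = []       # stack of currently open element names
        self.buf = None      # text buffer, only set inside a wanted field
        self.row = {}

    def startElement(self, name, attrs):
        if name == "page":
            self.row = {}
        # Capture text only for id/title directly under <page>, so the
        # (large) article text is never held in memory.
        if name in self.WANTED and self.path and self.path[-1] == "page":
            self.buf = []
        self.path.append(name)

    def characters(self, content):
        if self.buf is not None:
            self.buf.append(content)

    def endElement(self, name):
        self.path.pop()
        if self.buf is not None:
            self.row[name] = "".join(self.buf)
            self.buf = None
        elif name == "page":
            self.writer.writerow(
                [self.row.get("id", ""), self.row.get("title", "")]
            )


def pages_to_csv(source, out):
    """Stream XML from `source` (path or binary file object) to CSV on `out`."""
    xml.sax.parse(source, PageToCsvHandler(csv.writer(out)))
```

Since xml.sax.parse accepts a file object, a compressed chunk could probably be fed in without unpacking it first, e.g. pages_to_csv(bz2.open("chunk.xml.bz2", "rb"), sys.stdout), where the file name is a placeholder.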
However, for a file that large, you may have issues if there is a network interruption. Depending on how reliable your connection is, you might be better off downloading the separate chunks, which gives you easily recognizable restart points.

Otherwise: Saxon can do streaming XSLT, but only with one of the paid Enterprise license versions. I have no idea whether Saxon XQuery can also handle streaming input, or whether any of the non-Java versions of Saxon handle streaming.

If all that is needed is to convert the XML stream into CSV records to dump into Postgres, I would probably use Python/SAX, but I wonder if Postgres is really a requirement, or if you can do your final queries in BaseX? If dumping everything into a BaseX database is just an intermediary step, then it's probably not the most efficient way to go.

-- Steve M.

> On Feb 23, 2020, at 6:54 PM, maxzor <max...@maxzor.eu> wrote:
>
>> Do you mean stream a single large XML file? A series of XML files, or
>> stream a file thru a series of XQuery|XSLT|XPath transforms.
>
> Possibly poor wording, I meant read a large XML file and produce i.e. a csv
> file.
>
>> I don't believe BaseX uses a streaming XML parser, so probably can't handle
>> streaming a single large XML file and produce output before it's parsed the
>> complete file.
>
> Do you know of a streaming xml lib? other than StAX (no Java here :<)?
>
>> But it looks like, from the link in your stackoverflow post that the data is
>> already sharded into a collection of separate XML files that each contain
>> multiple <page> elements.
>
> This is the alternative, instead of processing the monolithic multistream
> file, I could crawl over the ~150MB bz2-compressed chunks.
>
> Regards, Maxime