Petr, could you please elaborate on this last claim? If making dump generation incremental is the task you are interested in solving, then I don't understand how text constitutes a problem. Text files can be appended to like any regular file, and it shouldn't be difficult to do this in a way that keeps the XML structure valid.
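To make the appending idea concrete, here is a minimal sketch in Python. It rests on several assumptions that are mine and not taken from the current dump tooling: the dump is stored uncompressed, its root element closes with `</mediawiki>` near the very end of the file, and the new pages arrive already serialized as bytes. The function name `append_pages` is hypothetical. It only covers adding new pages, not changing ones already in the file.

```python
import os

ROOT_CLOSE = b"</mediawiki>"  # assumed closing tag of the dump's root element


def append_pages(path, new_pages_xml, tail_size=1024):
    """Append serialized <page> elements to an uncompressed XML dump.

    The closing root tag is located in the last `tail_size` bytes, the file
    is cut just before it, and the new pages plus a fresh closing tag are
    written, so the file stays well-formed without rewriting older content.
    """
    with open(path, "r+b") as f:
        f.seek(0, os.SEEK_END)
        size = f.tell()

        # Read only the tail of the file; the rest is never touched.
        tail_len = min(size, tail_size)
        f.seek(size - tail_len)
        tail = f.read(tail_len)

        idx = tail.rfind(ROOT_CLOSE)
        if idx == -1:
            raise ValueError("closing root tag not found near end of file")

        # Drop the old closing tag, then write the new pages and close again.
        f.seek(size - tail_len + idx)
        f.truncate()
        f.write(new_pages_xml)
        f.write(ROOT_CLOSE + b"\n")
```

Calling `append_pages("dump.xml", b"<page>...</page>\n")` would leave everything before the old closing tag byte-for-byte unchanged, so the cost is proportional to the amount of new data and not to the size of the dump.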
As I said, having the possibility to seek and inspect the files manually is a tremendous boon when debugging your code. With what you propose that would still be possible but more complicated, since one cannot seek to a specific position in stdout without reading through the whole contents.

Best,
Giovanni

On Jul 3, 2013 4:05 PM, "Petr Onderka" <gsv...@gmail.com> wrote:
> A reply to all those who basically want to keep the current XML dumps:
>
> I have decided to change the primary way of reading the dumps: it will now
> be a command line application that outputs the data as uncompressed XML, in
> the same format as current dumps.
>
> This way, you should be able to use the new dumps with minimal changes to
> your code.
>
> Keeping the dumps in a text-based format doesn't make sense, because that
> can't be updated efficiently, which is the whole reason for the new dumps.
>
> Petr Onderka
>
>
> On Mon, Jul 1, 2013 at 11:10 PM, Byrial Jensen <byr...@vip.cybercity.dk> wrote:
>
>> Hi,
>>
>> As a regular user of dump files I would not want a "fancy" file format
>> with indexes stored as trees etc.
>>
>> I parse all the dump files (both for SQL tables and the XML files) with a
>> one pass parser which inserts the data I want (which sometimes is only a
>> small fraction of the total amount of data in the file) into my local
>> database. I will normally never store uncompressed dump files, but pipe the
>> uncompressed data directly from bunzip or gunzip to my parser to save disk
>> space. Therefore it is important to me that the format is simple enough for
>> a one pass parser.
>>
>> I cannot really imagine who would use a library with object oriented API
>> to read dump files. No matter what it would be inefficient and have fewer
>> features and possibilities than using a real database.
>>
>> I could live with a binary format, but I have doubts if it is a good
>> idea. It will be harder to make sure that your parser is working correctly,
>> and you have to consider things like endianness, size of integers, format
>> of floats etc. which give no problems in text formats. The binary files may
>> be smaller uncompressed (which I don't store anyway) but not necessarily when
>> compressed, as the compression will do better on text files.
>>
>> Regards,
>> - Byrial
>>
>> _______________________________________________
>> Xmldatadumps-l mailing list
>> Xmldatadumps-l@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l