Re: [Xmldatadumps-l] [Wikitech-l] Suggested file format of new incremental dumps

Giovanni Luca Ciampaglia Wed, 03 Jul 2013 07:47:03 -0700

Petr, could you please elaborate on this last claim? If turning the
dump generation into an incremental process is the task you are interested
in solving, then I don't understand how text constitutes a problem. Text
files can be appended to like any regular file, and it shouldn't be difficult
to do this in a way that keeps the XML structure valid.
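
A minimal sketch of the append idea, in Python, assuming the dump is an
uncompressed file whose root element is closed by a final </mediawiki> tag,
as in the current dumps. The function name append_pages and the 1 KiB tail
window are hypothetical choices for illustration. It only covers adding new
pages at the end; changing or deleting existing revisions would need more
than an append.

```python
import os

CLOSING_TAG = b"</mediawiki>"  # root element of the current XML dumps


def append_pages(path, new_pages_xml, tail_window=1024):
    """Insert already-serialized <page> elements before the closing root tag.

    Only the tail of the file is read, so the cost does not depend on
    the size of the existing dump.
    """
    with open(path, "r+b") as f:
        f.seek(0, os.SEEK_END)
        size = f.tell()
        tail_len = min(size, tail_window)
        f.seek(size - tail_len)
        tail = f.read(tail_len)

        idx = tail.rfind(CLOSING_TAG)
        if idx < 0:
            raise ValueError("closing root tag not found near end of file")

        # Cut the file just before the closing tag, then write the new
        # pages followed by the closing tag again.
        f.seek(size - tail_len + idx)
        f.truncate()
        f.write(new_pages_xml)
        f.write(CLOSING_TAG + b"\n")
```

Calling append_pages("dump.xml", b"<page>...</page>\n") on such a file would
leave it well-formed, provided the appended fragment is itself well-formed.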


As I said, being able to seek and inspect the files manually is
a tremendous boon when debugging your code. With what you propose that
would still be possible but more complicated, since one cannot seek to a
specific position in stdout without reading through everything before it.
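
To illustrate the difference, here is a rough Python sketch, assuming a
binary stream that is either a regular file or a pipe positioned at its
start. The helper name read_at and the chunk size are hypothetical. On a
file the jump is a single seek; on a pipe every byte before the offset has
to be read and thrown away.

```python
def read_at(stream, offset, length, chunk_size=1 << 20):
    """Return up to `length` bytes starting at byte `offset` of `stream`."""
    if stream.seekable():
        # Regular file: jump straight to the position of interest.
        stream.seek(offset)
    else:
        # Pipe (e.g. the stdout of another process): consume and discard
        # everything that comes before the offset.
        remaining = offset
        while remaining > 0:
            chunk = stream.read(min(chunk_size, remaining))
            if not chunk:
                break  # stream ended before reaching the offset
            remaining -= len(chunk)
    return stream.read(length)
```

For a multi-gigabyte dump the second branch means waiting for the whole
prefix to be generated and decoded before the interesting bytes arrive.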

Best

Giovanni
On Jul 3, 2013 4:05 PM, "Petr Onderka" <gsv...@gmail.com> wrote:

> A reply to all those who basically want to keep the current XML dumps:
>
> I have decided to change the primary way of reading the dumps: it will now
> be a command line application that outputs the data as uncompressed XML, in
> the same format as current dumps.
>
> This way, you should be able to use the new dumps with minimal changes to
> your code.
>
> Keeping the dumps in a text-based format doesn't make sense, because that
> can't be updated efficiently, which is the whole reason for the new dumps.
>
> Petr Onderka
>
>
> On Mon, Jul 1, 2013 at 11:10 PM, Byrial Jensen <byr...@vip.cybercity.dk> wrote:
>
>> Hi,
>>
>> As a regular user of dump files, I would not want a "fancy" file format
>> with indexes stored as trees etc.
>>
>> I parse all the dump files (both for SQL tables and the XML files) with a
>> one-pass parser which inserts the data I want (which sometimes is only a
>> small fraction of the total amount of data in the file) into my local
>> database. I normally never store uncompressed dump files, but pipe the
>> uncompressed data directly from bunzip or gunzip to my parser to save disk
>> space. Therefore it is important to me that the format is simple enough for
>> a one-pass parser.
>>
>> I cannot really imagine who would use a library with an object-oriented
>> API to read dump files. No matter what, it would be inefficient and have
>> fewer features and possibilities than using a real database.
>>
>> I could live with a binary format, but I have doubts whether it is a good
>> idea. It will be harder to make sure that your parser is working correctly,
>> and you have to consider things like endianness, size of integers, format
>> of floats etc., which cause no problems in text formats. The binary files
>> may be smaller uncompressed (which I don't store anyway) but not
>> necessarily when compressed, as the compression will do better on text files.
>>
>> Regards,
>> - Byrial
>>
>>
>> _______________________________________________
>> Xmldatadumps-l mailing list
>> Xmldatadumps-l@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l
>>
>
>
