[Xmldatadumps-l] Suggested file format of new incremental dumps

2013-07-01 Thread Petr Onderka
For my GSoC project Incremental data dumps [1], I'm creating a new file
format to replace Wikimedia's XML data dumps.
A sketch of how I imagine the file format will look is at
http://www.mediawiki.org/wiki/User:Svick/Incremental_dumps/File_format.

What do you think? Does it make sense? Would it work for your use case?
Any comments or suggestions are welcome.

Petr Onderka
[[User:Svick]]

[1]: http://www.mediawiki.org/wiki/User:Svick/Incremental_dumps
___
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l


Re: [Xmldatadumps-l] [Wikitech-l] Suggested file format of new incremental dumps

2013-07-01 Thread Tyler Romeo
What is the intended format of the dump files? The page makes it sound like
it will have a binary format, which I'm not opposed to, but is definitely
something you should decide on.

Also, I really like the idea of writing it in a low level language and then
having bindings for something higher. However, unless you plan on having
multiple language bindings (e.g., *both* C# and Python), you may want to
pick a different route. For example, if you decide to only bind to Python,
you can use something like Cython, which would allow you to write
pseudo-Python that is still compiled to C. Of course, if you want multiple
language bindings, this is likely no longer an option.

--
*Tyler Romeo*
Stevens Institute of Technology, Class of 2016
Major in Computer Science
www.whizkidztech.com | tylerro...@gmail.com


On Mon, Jul 1, 2013 at 10:00 AM, Petr Onderka gsv...@gmail.com wrote:

 For my GSoC project Incremental data dumps [1], I'm creating a new file
 format to replace Wikimedia's XML data dumps.
 A sketch of how I imagine the file format will look is at
 http://www.mediawiki.org/wiki/User:Svick/Incremental_dumps/File_format.

 What do you think? Does it make sense? Would it work for your use case?
 Any comments or suggestions are welcome.

 Petr Onderka
 [[User:Svick]]

 [1]: http://www.mediawiki.org/wiki/User:Svick/Incremental_dumps
 ___
 Wikitech-l mailing list
 wikitec...@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wikitech-l
___
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l


Re: [Xmldatadumps-l] [Wikitech-l] Suggested file format of new incremental dumps

2013-07-01 Thread Ariel T. Glenn
On Mon, 2013-07-01 at 16:00 +0200, Petr Onderka wrote:
 For my GSoC project Incremental data dumps [1], I'm creating a new file
 format to replace Wikimedia's XML data dumps.
 A sketch of how I imagine the file format will look is at
 http://www.mediawiki.org/wiki/User:Svick/Incremental_dumps/File_format.
 
 What do you think? Does it make sense? Would it work for your use case?
 Any comments or suggestions are welcome.
 
 Petr Onderka
 [[User:Svick]]
 
 [1]: http://www.mediawiki.org/wiki/User:Svick/Incremental_dumps

Dumps v 2.0 finally on the horizon!

A few comments/questions:

I was envisioning that we would produce diff dumps in one pass
(presumably in a much shorter time than the fulls we generate now) and
would apply those against previous fulls (in the new format) to produce
new fulls, hopefully also in less time.  What do you have in mind for
the production of the new fulls?

It might be worth seeing how large the resulting en wp history files are
going to be if you compress each revision separately for version 1 of
this project.  My fear is that even with 7z it's going to make the size
unwieldy.  If the thought is that it's a first round prototype, not
meant to be run on large projects, that's another story.

I'm not sure about removing the restrictions data; someone must have
wanted it, like the other various fields that have crept in over time.
And we should expect there will be more such fields over time...

We need to get some of the wikidata users in on the model/format
discussion, to see what use they plan to make of those fields and what
would be most convenient for them.

It's quite likely that these new fulls will need to be split into chunks
much as we do with the current en wp files.  I don't know what that
would mean for the diff files.  Currently we split in an arbitrary way
based on sequences of page numbers, writing out separate stub files and
using those for the content dumps.  Any thoughts?

Ariel




___
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l


Re: [Xmldatadumps-l] [Wikitech-l] Suggested file format of new incremental dumps

2013-07-01 Thread Petr Onderka

 What is the intended format of the dump files? The page makes it sound like
 it will have a binary format, which I'm not opposed to, but is definitely
 something you should decide on.


Yes, it is a binary format; I will make that clearer on the page.

The advantage of a binary format is that it's smaller, which I think is
quite important.

I think the main advantages of text-based formats are that there are lots of
tools for the common ones (XML and JSON) and that they are human readable.
But those tools wouldn't be very useful, because we certainly want to have
some sort of custom compression scheme and the tools wouldn't be able to
work with that.
And I think human readability is mostly useful if we want others to be able
to write their own code that directly accesses the data.
And, because of the custom compression, doing that won't be that easy
anyway. And hopefully, it won't be necessary, because there will be a nice
library usable by everyone (see below).


 Also, I really like the idea of writing it in a low level language and then
 having bindings for something higher. However, unless you plan on having
 multiple language bindings (e.g., *both* C# and Python), you may want to
 pick a different route. For example, if you decide to only bind to Python,
 you can use something like Cython, which would allow you to write
 pseudo-Python that is still compiled to C. Of course, if you want multiple
 language bindings, this is likely no longer an option.


Right now, everyone can read the dumps in their favorite language.
If I write the library interface well, writing bindings for it for another
language should be relatively trivial, so everyone can keep using their
favorite language.
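
(For illustration, here is a minimal sketch of what such a binding could look like
from Python via ctypes. The library and function names (libincrdumps, dumps_open,
dumps_next_page_title, dumps_close) are hypothetical; the point is only that the
binding layer stays thin once the low-level interface is a flat C-style API.)

    import ctypes

    # Hypothetical shared library built from the low-level dump code.
    lib = ctypes.CDLL("libincrdumps.so")

    # Declare the flat C signatures once; the wrapper below is all a binding needs.
    lib.dumps_open.argtypes = [ctypes.c_char_p]
    lib.dumps_open.restype = ctypes.c_void_p
    lib.dumps_next_page_title.argtypes = [ctypes.c_void_p]
    lib.dumps_next_page_title.restype = ctypes.c_char_p
    lib.dumps_close.argtypes = [ctypes.c_void_p]

    class Dump:
        """Thin object-oriented wrapper over the flat C API."""
        def __init__(self, path):
            self._handle = lib.dumps_open(path.encode("utf-8"))

        def page_titles(self):
            while True:
                title = lib.dumps_next_page_title(self._handle)
                if title is None:
                    return
                yield title.decode("utf-8")

        def close(self):
            lib.dumps_close(self._handle)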

And I admit, I'm proposing doing it this way partly for selfish reasons:
I'd like to use this library in my future C# code.
But I realize creating something that works only in C# doesn't make sense,
because most people in this community don't use it.
So, to me writing the code so that it can be used from anywhere makes the
most sense.

Petr Onderka


  On Mon, Jul 1, 2013 at 10:00 AM, Petr Onderka gsv...@gmail.com wrote:

  For my GSoC project Incremental data dumps [1], I'm creating a new file
  format to replace Wikimedia's XML data dumps.
  A sketch of how I imagine the file format will look is at
  http://www.mediawiki.org/wiki/User:Svick/Incremental_dumps/File_format.
 
  What do you think? Does it make sense? Would it work for your use case?
  Any comments or suggestions are welcome.
 
  Petr Onderka
  [[User:Svick]]
 
  [1]: http://www.mediawiki.org/wiki/User:Svick/Incremental_dumps
  ___
  Wikitech-l mailing list
  wikitec...@lists.wikimedia.org
  https://lists.wikimedia.org/mailman/listinfo/wikitech-l
 ___
 Wikitech-l mailing list
 wikitec...@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wikitech-l
___
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l


Re: [Xmldatadumps-l] [Wikitech-l] Suggested file format of new incremental dumps

2013-07-01 Thread Petr Onderka

 I was envisioning that we would produce diff dumps in one pass
 (presumably in a much shorter time than the fulls we generate now) and
 would apply those against previous fulls (in the new format) to produce
 new fulls, hopefully also in less time.  What do you have in mind for
 the production of the new fulls?


What I originally imagined is that the full dump would be modified directly
and a description of the changes made to it would also be written to the
diff dump.
Now I think that creating the diff and then applying it makes more sense,
because it's simpler; but doing the two at the same time will still be faster,
because it's less work (no need to read and parse the diff again).
So what I imagine now is something like this (a rough sketch follows the list):

1. Read information about a change in a page/revision
2. Create diff object in memory
3. Write the diff object to the diff file
4. Apply the diff object to the full dump
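
(A rough sketch of that loop in Python; make_diff_object, serialize and apply are
hypothetical helpers, since the real library interface is still undecided.)

    def update_dumps(changes, diff_file, full_dump):
        # One pass: write each change to the diff dump and immediately apply it
        # to the full dump, so the diff never has to be re-read and re-parsed.
        for change in changes:                 # 1. read a page/revision change
            diff = make_diff_object(change)    # 2. build the diff object in memory
            diff_file.write(diff.serialize())  # 3. append it to the diff file
            full_dump.apply(diff)              # 4. apply it to the full dump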


 It might be worth seeing how large the resulting en wp history files are
 going to be if you compress each revision separately for version 1 of
 this project.  My fear is that even with 7z it's going to make the size
 unwieldy.  If the thought is that it's a first round prototype, not
 meant to be run on large projects, that's another story.


I do expect that a full dump of enwiki using this compression would be way
too big.
So yes, this was meant just to have something working, so that I can
concentrate on doing compression properly later (after the mid-term).


 I'm not sure about removing the restrictions data; someone must have
 wanted it, like the other various fields that have crept in over time.
 And we should expect there will be more such fields over time...


If I understand the code in XmlDumpWriter.openPage correctly, that data
comes from the page_restrictions field [1], which doesn't seem to be used in
non-ancient versions of MediaWiki.

I did think about versioning the page and revision objects in the dump, but
I'm not sure how exactly to handle upgrades from one version to another.
For now, I think I'll have just one global data version per file, but
I'll make sure that adding a version to each object in the future will be
possible.
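
(As an illustration of the one-global-version approach, the file header could carry
a single data version that readers check up front; the magic bytes, field widths and
layout below are made up and are not the actual format.)

    import struct

    MAGIC = b"MWID"              # hypothetical magic bytes
    SUPPORTED_DATA_VERSION = 1

    def read_header(f):
        # Hypothetical fixed header: 4-byte magic, u16 format version, u16 data version.
        magic, format_version, data_version = struct.unpack("<4sHH", f.read(8))
        if magic != MAGIC:
            raise ValueError("not an incremental dump file")
        if data_version > SUPPORTED_DATA_VERSION:
            raise ValueError("dump data version is newer than this reader supports")
        return format_version, data_version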


 We need to get some of the wikidata users in on the model/format
 discussion, to see what use they plan to make of those fields and what
 would be most convenient for them.

 It's quite likely that these new fulls will need to be split into chunks
 much as we do with the current en wp files.  I don't know what that
 would mean for the diff files.  Currently we split in an arbitrary way
 based on sequences of page numbers, writing out separate stub files and
 using those for the content dumps.  Any thoughts?


If possible, I would prefer to keep everything in a single file.
If that won't be possible, I think it makes sense to split on page ids, but
make the split id visible (probably in the file name) and unchanging from
month to month.
If it turns out that a single chunk grows too big, we might consider adding
a split instruction to diff dumps, but that's probably not necessary now.
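
(For example, a naming scheme keyed on fixed page-id ranges would keep the chunk ids
stable from month to month; the range size and file name pattern here are made up.)

    CHUNK_SIZE = 1000000  # pages per chunk, an arbitrary illustrative value

    def chunk_name(page_id, wiki="enwiki", date="20130701"):
        # Map a page id to a chunk whose id never changes between dump runs.
        first = (page_id - 1) // CHUNK_SIZE * CHUNK_SIZE + 1
        last = first + CHUNK_SIZE - 1
        return "{0}-{1}-pages-p{2}-p{3}.dump".format(wiki, date, first, last)

    # chunk_name(1234567) == 'enwiki-20130701-pages-p1000001-p2000000.dump'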

Petr Onderka

[1]: http://www.mediawiki.org/wiki/Manual:Page_table#page_restrictions
___
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l


Re: [Xmldatadumps-l] [Wikitech-l] Suggested file format of new incremental dumps

2013-07-01 Thread Giovanni Luca Ciampaglia
+1

And given how messy the revision data can be, having the possibility of
actually inspecting it with a text editor is a great boon.

That said, there may be other use cases that I am not aware of for which a
binary format might be useful, but if you just need to parse and pipe to a
DB, text is the best option.

Giovanni
On Jul 1, 2013 5:10 PM, Byrial Jensen byr...@vip.cybercity.dk wrote:

 Hi,

 As a regular user of dump files, I would not want a fancy file format
 with indexes stored as trees etc.

 I parse all the dump files (both for SQL tables and the XML files) with a
 one pass parser which inserts the data I want (which sometimes is only a
 small fraction of the total amount of data in the file) into my local
 database. I will normally never store uncompressed dump files, but pipe the
 uncompressed data directly from bunzip2 or gunzip to my parser to save disk
 space. Therefore it is important to me that the format is simple enough for
 a one pass parser.
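
(A minimal sketch of that kind of one-pass pipeline in Python, reading the current
XML dumps from stdin; the export-0.8 namespace and the script name are assumptions
and may differ per dump.)

    import sys
    import xml.etree.ElementTree as ET

    NS = "{http://www.mediawiki.org/xml/export-0.8/}"

    def page_titles(stream):
        # Stream page titles without ever holding the whole dump in memory.
        for event, elem in ET.iterparse(stream):
            if elem.tag == NS + "title":
                yield elem.text
            elif elem.tag == NS + "page":
                elem.clear()  # free the finished page subtree

    # usage:  bzcat pages-articles.xml.bz2 | python parse_titles.py
    if __name__ == "__main__":
        for title in page_titles(sys.stdin.buffer):
            print(title)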

 I cannot really imagine who would use a library with an object-oriented API
 to read dump files. No matter what, it would be inefficient and have fewer
 features and possibilities than using a real database.

 I could live with a binary format, but I have doubts if it is a good idea.
 It will be harder to make sure that your parser is working correctly, and
 you have to consider things like endianness, size of integers, format of
 floats, etc., which cause no problems in text formats. The binary files may be
 smaller uncompressed (which I don't store anyway) but not necessarily when
 compressed, as the compression will do better on text files.
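
(As an illustration of the endianness and integer-size point: a binary reader has to
pin down the byte order and field widths explicitly, something a text parser never
worries about; the record layout below is invented.)

    import struct

    def read_revision_record(buf, offset=0):
        # Invented fixed layout: little-endian u32 revision id, u64 timestamp, u32 text length.
        rev_id, timestamp, text_length = struct.unpack_from("<IQI", buf, offset)
        return rev_id, timestamp, text_length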

 Regards,
 - Byrial


 ___
 Xmldatadumps-l mailing list
 Xmldatadumps-l@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l

___
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l


Re: [Xmldatadumps-l] [Wikitech-l] Suggested file format of new incremental dumps

2013-07-01 Thread Nicolas Torzec
Hi there,

In principle, I understand the need for binary formats and compression in a 
context with limited resources.
On the other hand, plain text formats are easy to work with, especially for 
third-party users and organizations.

Playing the devil's advocate, I could even argue that you should keep the data 
dumps in plain text, and keep your processing dead simple, and then let 
distributed processing systems such as Hadoop MapReduce (or Storm, Spark, etc.) 
handle the scale and compute diffs whenever needed or on the fly.

Reading the wiki page mentioned at the beginning of this thread, it is not clear to
me what the requirements are for this new incremental dump format, or why.
Therefore, it is not easy to provide input and help.


Cheers.
- Nicolas Torzec.


PS: Anyway, thanks a lot for your great work on the data backends, behind the 
scenes ;)




___
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l