[Xmldatadumps-l] Suggested file format of new incremental dumps
For my GSoC project Incremental data dumps [1], I'm creating a new file format to replace Wikimedia's XML data dumps. A sketch of what I imagine the file format will look like is at http://www.mediawiki.org/wiki/User:Svick/Incremental_dumps/File_format.

What do you think? Does it make sense? Would it work for your use case? Any comments or suggestions are welcome.

Petr Onderka
[[User:Svick]]

[1]: http://www.mediawiki.org/wiki/User:Svick/Incremental_dumps
Re: [Xmldatadumps-l] [Wikitech-l] Suggested file format of new incremental dumps
What is the intended format of the dump files? The page makes it sound like it will be a binary format, which I'm not opposed to, but it is definitely something you should decide on.

Also, I really like the idea of writing it in a low-level language and then having bindings for something higher. However, unless you plan on having multiple language bindings (e.g., *both* C# and Python), you may want to pick a different route. For example, if you decide to only bind to Python, you can use something like Cython, which would allow you to write pseudo-Python that is still compiled to C. Of course, if you want multiple language bindings, this is likely no longer an option.

--
Tyler Romeo
Stevens Institute of Technology, Class of 2016
Major in Computer Science
www.whizkidztech.com | tylerro...@gmail.com
Re: [Xmldatadumps-l] [Wikitech-l] Suggested file format of new incremental dumps
On Mon, 01-07-2013 at 16:00 +0200, Petr Onderka wrote:

> For my GSoC project Incremental data dumps [1], I'm creating a new file
> format to replace Wikimedia's XML data dumps. A sketch of what I imagine
> the file format will look like is at
> http://www.mediawiki.org/wiki/User:Svick/Incremental_dumps/File_format.

Dumps v 2.0 finally on the horizon! A few comments/questions:

I was envisioning that we would produce diff dumps in one pass (presumably in a much shorter time than the fulls we generate now) and would apply those against previous fulls (in the new format) to produce new fulls, hopefully also in less time. What do you have in mind for the production of the new fulls?

It might be worth seeing how large the resulting en wp history files are going to be if you compress each revision separately for version 1 of this project. My fear is that even with 7z it's going to make the size unwieldy. If the thought is that it's a first-round prototype, not meant to be run on large projects, that's another story.

I'm not sure about removing the restrictions data; someone must have wanted it, like the other various fields that have crept in over time. And we should expect there will be more such fields over time... We need to get some of the wikidata users in on the model/format discussion, to see what use they plan to make of those fields and what would be most convenient for them.

It's quite likely that these new fulls will need to be split into chunks much as we do with the current en wp files. I don't know what that would mean for the diff files. Currently we split in an arbitrary way based on sequences of page numbers, writing out separate stub files and using those for the content dumps. Any thoughts?

Ariel
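Ariel's size worry can be made concrete with a toy measurement in Python (a sketch only: the random bytes stand in for article text, and Python's lzma is the same LZMA family that 7z uses). Compressing two near-identical revisions together lets the compressor reuse the shared text; compressing them separately cannot:

    import lzma
    import os

    rev1 = os.urandom(100_000)        # "revision 1" of a page
    rev2 = rev1 + b"one small edit"   # "revision 2": a tiny change on top

    # Compressed separately, the shared 100 KB is paid for twice (~200 KB);
    # compressed together, the second copy collapses to a back-reference (~100 KB).
    separate = len(lzma.compress(rev1)) + len(lzma.compress(rev2))
    together = len(lzma.compress(rev1 + rev2))
    print(separate, together)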
Re: [Xmldatadumps-l] [Wikitech-l] Suggested file format of new incremental dumps
> What is the intended format of the dump files? The page makes it sound
> like it will be a binary format, which I'm not opposed to, but it is
> definitely something you should decide on.

Yes, it is a binary format; I will make that clearer on the page.

The advantage of a binary format is that it's smaller, which I think is quite important. I think the main advantages of text-based formats are that there are lots of tools for the common ones (XML and JSON) and that they are human-readable. But those tools wouldn't be very useful here, because we certainly want to have some sort of custom compression scheme, and the tools wouldn't be able to work with that. And I think human readability is mostly useful if we want others to be able to write their own code that directly accesses the data. Because of the custom compression, doing that won't be easy anyway, and hopefully it won't be necessary, because there will be a nice library usable by everyone (see below).

> Also, I really like the idea of writing it in a low-level language and
> then having bindings for something higher. However, unless you plan on
> having multiple language bindings (e.g., *both* C# and Python), you may
> want to pick a different route. For example, if you decide to only bind
> to Python, you can use something like Cython, which would allow you to
> write pseudo-Python that is still compiled to C. Of course, if you want
> multiple language bindings, this is likely no longer an option.

Right now, everyone can read the dumps in their favorite language. If I write the library interface well, writing bindings for it for another language should be relatively trivial, so everyone can keep using their favorite language.

And I admit, I'm proposing doing it this way partially for selfish reasons: I'd like to use this library in my future C# code. But I realize creating something that works only in C# doesn't make sense, because most people in this community don't use it. So, to me, writing the code so that it can be used from anywhere makes the most sense.

Petr Onderka
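As an illustration of how thin such bindings could be, here is a minimal Python sketch using ctypes. The library name and every function in it (idumps_open, idumps_next_page_title, idumps_close) are hypothetical placeholders, not the project's actual API:

    import ctypes

    # Load the (hypothetical) native library that implements the dump format;
    # parsing and the custom decompression stay on the native side.
    lib = ctypes.CDLL("libincrdumps.so")
    lib.idumps_open.argtypes = [ctypes.c_char_p]
    lib.idumps_open.restype = ctypes.c_void_p
    lib.idumps_next_page_title.argtypes = [ctypes.c_void_p]
    lib.idumps_next_page_title.restype = ctypes.c_char_p
    lib.idumps_close.argtypes = [ctypes.c_void_p]

    def iter_page_titles(path):
        """Yield page titles from a dump file, one pass, in order."""
        handle = lib.idumps_open(path.encode("utf-8"))
        try:
            while True:
                title = lib.idumps_next_page_title(handle)
                if title is None:  # NULL from the C side signals end of dump
                    break
                yield title.decode("utf-8")
        finally:
            lib.idumps_close(handle)

A binding like this is little more than type declarations, which is why one well-designed native interface could serve Python, C#, and other languages at once.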
Re: [Xmldatadumps-l] [Wikitech-l] Suggested file format of new incremental dumps
> I was envisioning that we would produce diff dumps in one pass
> (presumably in a much shorter time than the fulls we generate now) and
> would apply those against previous fulls (in the new format) to produce
> new fulls, hopefully also in less time. What do you have in mind for the
> production of the new fulls?

What I originally imagined is that the full dump would be modified directly, and a description of the changes made to it would also be written to the diff dump. Now I think that creating the diff and then applying it makes more sense, because it's simpler. But I also think that doing the two at the same time will be faster, because it's less work (no need to read and parse the diff). So what I imagine now is something like this:

1. Read information about a change in a page/revision.
2. Create a diff object in memory.
3. Write the diff object to the diff file.
4. Apply the diff object to the full dump.

> It might be worth seeing how large the resulting en wp history files are
> going to be if you compress each revision separately for version 1 of
> this project. My fear is that even with 7z it's going to make the size
> unwieldy. If the thought is that it's a first-round prototype, not meant
> to be run on large projects, that's another story.

I do expect that a full dump of enwiki using this compression would be way too big. So yes, this was meant just to have something working, so that I can concentrate on doing compression properly later (after the mid-term).

> I'm not sure about removing the restrictions data; someone must have
> wanted it, like the other various fields that have crept in over time.
> And we should expect there will be more such fields over time...

If I understand the code in XmlDumpWriter.openPage correctly, that data comes from the page_restrictions field [1], which doesn't seem to be used in non-ancient versions of MediaWiki.

I did think about versioning the page and revision objects in the dump, but I'm not sure how exactly to handle upgrades from one version to another. For now, I think I'll have just one global data version per file, but I'll make sure that adding a version to each object in the future will be possible.

> We need to get some of the wikidata users in on the model/format
> discussion, to see what use they plan to make of those fields and what
> would be most convenient for them.
>
> It's quite likely that these new fulls will need to be split into chunks
> much as we do with the current en wp files. I don't know what that would
> mean for the diff files. Currently we split in an arbitrary way based on
> sequences of page numbers, writing out separate stub files and using
> those for the content dumps. Any thoughts?

If possible, I would prefer to keep everything in a single file. If that won't be possible, I think it makes sense to split on page ids, but make the split id visible (probably in the file name) and unchanging from month to month. If it turns out that a single chunk grows too big, we might consider adding a split instruction to diff dumps, but that's probably not necessary now.

Petr Onderka

[1]: http://www.mediawiki.org/wiki/Manual:Page_table#page_restrictions
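A minimal sketch of the four numbered steps in the message above, with deliberately simplified stand-ins (a dict as the "full dump", JSON lines as the diff file; the real formats are still being designed):

    import json

    def update_dumps(changes, diff_path, full_dump):
        """changes: iterable of (page_id, rev_id, text); full_dump: dict of dicts."""
        with open(diff_path, "w") as diff_file:
            for page_id, rev_id, text in changes:                      # 1. read a change
                diff = {"page": page_id, "rev": rev_id, "text": text}  # 2. diff object
                diff_file.write(json.dumps(diff) + "\n")               # 3. write to diff file
                full_dump.setdefault(page_id, {})[rev_id] = text       # 4. apply to full dump

    def apply_diff(diff_path, full_dump):
        # Anyone holding only last month's full dump repeats step 4 from the
        # diff file; this read-and-parse pass is exactly the work the combined
        # loop above avoids on the server side.
        with open(diff_path) as diff_file:
            for line in diff_file:
                diff = json.loads(line)
                full_dump.setdefault(diff["page"], {})[diff["rev"]] = diff["text"]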
Re: [Xmldatadumps-l] [Wikitech-l] Suggested file format of new incremental dumps
+1

And given how messy the revision data can be, having the possibility of actually inspecting it with a text editor is a great boon. That said, there may be other use cases that I am not aware of for which a binary format might be useful, but if you just need to parse and pipe to a DB, text is the best option.

Giovanni

On Jul 1, 2013 5:10 PM, Byrial Jensen byr...@vip.cybercity.dk wrote:

> Hi,
>
> As a regular user of dump files I would not want a fancy file format with
> indexes stored as trees etc. I parse all the dump files (both for SQL
> tables and the XML files) with a one-pass parser which inserts the data I
> want (which sometimes is only a small fraction of the total amount of
> data in the file) into my local database. I will normally never store
> uncompressed dump files, but pipe the uncompressed data directly from
> bunzip or gunzip to my parser to save disk space. Therefore it is
> important to me that the format is simple enough for a one-pass parser.
>
> I cannot really imagine who would use a library with an object-oriented
> API to read dump files. No matter what, it would be inefficient and have
> fewer features and possibilities than using a real database.
>
> I could live with a binary format, but I have doubts about whether it is
> a good idea. It will be harder to make sure that your parser is working
> correctly, and you have to consider things like endianness, size of
> integers, format of floats etc., which give no problems in text formats.
> The binary files may be smaller uncompressed (which I don't store anyway)
> but not necessarily when compressed, as the compression will do better on
> text files.
>
> Regards,
> - Byrial
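Byrial's pipe-based, one-pass workflow looks roughly like this in Python (a sketch; the file name is an example, and the XML namespace varies with the dump schema version):

    import bz2
    import xml.etree.ElementTree as ET

    NS = "{http://www.mediawiki.org/xml/export-0.8/}"  # schema version may differ

    # Stream the compressed dump; the uncompressed XML never touches the disk.
    with bz2.open("enwiki-latest-pages-articles.xml.bz2") as stream:
        for event, elem in ET.iterparse(stream):
            if elem.tag == NS + "page":
                title = elem.findtext(NS + "title")
                # ...insert whatever fields you need into the local database...
                elem.clear()  # free the finished page subtree to keep memory flat

This kind of single-pass consumer is why Byrial asks that any replacement format stay simple enough to read front to back.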
Re: [Xmldatadumps-l] [Wikitech-l] Suggested file format of new incremental dumps
Hi there,

In principle, I understand the need for binary formats and compression in a context with limited resources. On the other hand, plain-text formats are easy to work with, especially for third-party users and organizations.

Playing the devil's advocate, I could even argue that you should keep the data dumps in plain text, keep your processing dead simple, and then let distributed processing systems such as Hadoop MapReduce (or Storm, Spark, etc.) handle the scale and compute diffs whenever needed, or on the fly.

Reading the wiki page mentioned at the beginning of this thread, it is not clear to me what the requirements for this new incremental update format are, and why. That makes it hard to provide input and help.

Cheers.
- Nicolas Torzec.

PS: Anyway, thanks a lot for your great work on the data backends, behind the scenes ;)
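For concreteness, the plain-text route Nicolas describes might look like this Hadoop Streaming pair in Python (the input format, one "page_id<TAB>rev_id" line per revision, is a hypothetical flattened dump, not an existing Wikimedia file). The job reduces each page to its newest revision id, the kind of building block a distributed diff between two monthly dumps could use:

    import sys

    def mapper():
        # Identity map: emit page_id as the key so Hadoop groups and sorts by page.
        for line in sys.stdin:
            page_id, rev_id = line.rstrip("\n").split("\t")
            print(f"{page_id}\t{rev_id}")

    def reducer():
        # Reducer input arrives sorted by key, so emit a page's highest
        # revision id whenever the key changes.
        current, best = None, -1
        for line in sys.stdin:
            page_id, rev_id = line.rstrip("\n").split("\t")
            if page_id != current:
                if current is not None:
                    print(f"{current}\t{best}")
                current, best = page_id, -1
            best = max(best, int(rev_id))
        if current is not None:
            print(f"{current}\t{best}")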