Re: [Xmldatadumps-l] [Wikitech-l] Suggested file format of new incremental dumps

2013-07-01 Thread Nicolas Torzec
Hi there,

In principle, I understand the need for binary formats and compression in a 
context with limited resources.
On the other hand, plain text formats are easy to work with, especially for 
third-party users and organizations.

Playing the devil's advocate, I could even argue that you should keep the data 
dumps in plain text and your processing dead simple, and then let distributed 
processing systems such as Hadoop MapReduce (or Storm, Spark, etc.) handle the 
scale and compute diffs whenever needed, or on the fly.
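
Just to make the "compute diffs on the fly" idea concrete, here is a minimal 
sketch of mine (not part of any proposal) of the per-page step, using Python's 
standard difflib; in practice this function would sit inside the map task of a 
Hadoop/Spark job, with the old and new revision texts arriving keyed by page id:

    import difflib

    def revision_diff(page_title, old_text, new_text):
        # Unified diff between two plain-text revisions of the same page.
        return "\n".join(difflib.unified_diff(
            old_text.splitlines(),
            new_text.splitlines(),
            fromfile=page_title + " (old)",
            tofile=page_title + " (new)",
            lineterm="",
        ))

    # Toy example with two three-line revisions.
    print(revision_diff("Example", "a\nb\nc", "a\nB\nc"))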

Reading the wiki mentioned at the beginning of this thread, it is not clear to 
me what the requirements are for this new incremental update format, or why 
they exist.
That makes it difficult to provide input and help.


Cheers.
- Nicolas Torzec.


PS: Anyway, thanks a lot for your great work on the data backends, behind the 
scenes ;)




From: Petr Onderka gsv...@gmail.com
Date: Monday, July 1, 2013 11:15 AM
To: Wikimedia developers wikitec...@lists.wikimedia.org
Cc: Wikipedia Xmldatadumps-l xmldatadumps-l@lists.wikimedia.org
Subject: Re: [Xmldatadumps-l] [Wikitech-l] Suggested file format of new 
incremental dumps

I was envisioning that we would produce diff dumps in one pass
(presumably in a much shorter time than the fulls we generate now) and
would apply those against previous fulls (in the new format) to produce
new fulls, hopefully also in less time.  What do you have in mind for
the production of the new fulls?

What I originally imagined is that the full dump would be modified directly and 
that a description of the changes made to it would also be written to the diff dump.
But now I think that creating the diff and then applying it makes more sense, 
because it's simpler.
On the other hand, doing the two at the same time will be faster, because it's 
less work (there is no need to read the diff back and parse it).
So what I imagine now is something like this:

1. Read information about a change in a page/revision
2. Create diff object in memory
3. Write the diff object to the diff file
4. Apply the diff object to the full dump
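
A rough sketch of that single-pass loop in Python; the names (the shape of the 
changes, the diff record layout, the dict standing in for the full dump) are 
hypothetical placeholders, and pickle is only a stand-in for the real binary 
diff format:

    import pickle

    def update_dumps(changes, diff_path, full_dump):
        # `changes` yields (page_id, rev_id, new_text) tuples;
        # `full_dump` is modeled here as a dict {page_id: {rev_id: text}}.
        with open(diff_path, "wb") as diff_file:
            for page_id, rev_id, new_text in changes:
                # 1.-2. Read the change and build the diff object in memory.
                record = {"page": page_id, "rev": rev_id, "text": new_text}
                # 3. Write the diff object to the diff file.
                pickle.dump(record, diff_file)
                # 4. Apply the same in-memory object to the full dump.
                full_dump.setdefault(page_id, {})[rev_id] = new_text

The point is simply that steps 3 and 4 share the in-memory object, so the diff 
never has to be re-read and re-parsed.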

It might be worth seeing how large the resulting en wp history files are
going to be if you compress each revision separately for version 1 of
this project.  My fear is that even with 7z it's going to make the size
unwieldy.  If the thought is that it's a first round prototype, not
meant to be run on large projects, that's another story.

I do expect that a full dump of enwiki using this compression would be way too 
big.
So yes, this was meant just to have something working, so that I can 
concentrate on doing compression properly later (after the mid-term).
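
The size worry is easy to demonstrate on toy data: compressing each revision 
separately discards the redundancy between consecutive revisions, which is 
exactly what makes full-history dumps compress so well. A quick experiment of 
mine with Python's lzma (the same algorithm family 7z uses), not a measurement 
on real dumps:

    import lzma

    # Ten slightly different revisions of the same toy article.
    revisions = [("Lorem ipsum dolor sit amet, consectetur. " * 200) + "edit %d" % i
                 for i in range(10)]

    separately = sum(len(lzma.compress(r.encode("utf-8"))) for r in revisions)
    together = len(lzma.compress("".join(revisions).encode("utf-8")))

    print("each revision compressed separately:", separately, "bytes")
    print("all revisions compressed together:  ", together, "bytes")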

I'm not sure about removing the restrictions data; someone must have
wanted it, like the other various fields that have crept in over time.
And we should expect there will be more such fields over time...

If I understand the code in XmlDumpWriter.openPage correctly, that data comes 
from the page_restrictions field [1], which doesn't seem to be used in 
non-ancient versions of MediaWiki.

I did think about versioning the page and revision objects in the dump, but I'm 
not sure how exactly to handle upgrades from one version to another.
For now, I think I'll have just one global data version per file, but I'll 
make sure that adding a version to each object in the future will be possible.
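
As an illustration only (this is not the actual format), the global version 
could live in a small fixed header at the start of each file, which keeps the 
door open for per-object versions later:

    import struct

    MAGIC = b"IDMP"                      # hypothetical magic bytes
    HEADER = struct.Struct("<4sHH")      # magic, major version, minor version

    def write_header(f, major=1, minor=0):
        f.write(HEADER.pack(MAGIC, major, minor))

    def read_header(f):
        magic, major, minor = HEADER.unpack(f.read(HEADER.size))
        if magic != MAGIC:
            raise ValueError("not an incremental dump file")
        return major, minor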

We need to get some of the Wikidata users in on the model/format
discussion, to see what use they plan to make of those fields and what
would be most convenient for them.

It's quite likely that these new fulls will need to be split into chunks
much as we do with the current en wp files.  I don't know what that
would mean for the diff files.  Currently we split in an arbitrary way
based on sequences of page numbers, writing out separate stub files and
using those for the content dumps.  Any thoughts?

If possible, I would prefer to keep everything in a single file.
If that won't be possible, I think it makes sense to split on page ids, but 
make the split id visible (probably in the file name) and unchanging from 
month to month.
If it turns out that a single chunk grows too big, we might consider adding a 
split instruction to diff dumps, but that's probably not necessary now.
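
A sketch of what such a stable split could look like: chunks are fixed page-id 
ranges, so a given page always stays in the same chunk and the chunk id can be 
baked into the file name (the chunk size and the naming pattern below are 
invented for illustration):

    CHUNK_SIZE = 1000000   # pages per chunk; an arbitrary illustrative value

    def chunk_id(page_id):
        # Page ids 1..1000000 -> chunk 0, 1000001..2000000 -> chunk 1, ...
        return (page_id - 1) // CHUNK_SIZE

    def chunk_file_name(wiki, page_id):
        # The split id in the name never changes from month to month.
        return "%s-history-chunk%04d.dd" % (wiki, chunk_id(page_id))

    print(chunk_file_name("enwiki", 1234567))   # enwiki-history-chunk0001.dd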

Petr Onderka

[1]: http://www.mediawiki.org/wiki/Manual:Page_table#page_restrictions


Re: [Xmldatadumps-l] Wikidata project and interwiki links removed in wiki text

2013-03-04 Thread Nicolas Torzec
Indeed, that will be an issue for everyone who consumes Wikipedia data 
automatically, especially as more structured data (e.g. infoboxes) will 
eventually move from MediaWiki to Wikidata. DBpedia will have the same issue at 
some point.

Nicolas.

--
Nicolas Torzec
Yahoo! Labs.

From: François Bonzon francois.bon...@gmail.com
Date: Monday, March 4, 2013 7:35 AM
To: xmldatadumps-l@lists.wikimedia.org
Subject: [Xmldatadumps-l] Wikidata project and interwiki links removed in wiki 
text

Hi,

I understand from http://www.wikidata.org/wiki/Wikidata:News that
- enwiki since February 13, 2013
- hewiki and itwiki since January 30, 2013
- huwiki since January 14, 2013
have migrated to the Wikidata project. And more wikis will follow shortly.

One consequence is that the wiki markup for interwiki links (cross-language 
links) is being gradually removed from articles, because the MediaWiki software 
can now read them from the centralized Wikidata repository.
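
For reference, the markup being removed is the trailing [[hu:Budapest]]-style 
cross-language links; a rough way to check a page's wikitext for them (the 
language-code pattern below is deliberately loose and will also catch some 
non-language prefixes):

    import re

    # Matches cross-language links such as [[hu:Budapest]] or [[de:Beispiel]].
    INTERWIKI = re.compile(r"\[\[([a-z]{2,3}(?:-[a-z]+)?):([^\]|]+)\]\]")

    def interwiki_links(wikitext):
        return INTERWIKI.findall(wikitext)

    print(interwiki_links("Article text ...\n[[en:Example]]\n[[de:Beispiel]]"))
    # -> [('en', 'Example'), ('de', 'Beispiel')]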

I verified in the latest huwiki dump that some articles indeed no longer have 
interwiki links. Can you confirm my statements above?

How can I now extract interwiki links from dumps? Is there a separate Wikidata 
dump I should download? What attributes should I look for to join Wikidata and 
the separate language wiki dumps? Thanks for your help.

-François