On Fri, Jan 8, 2010 at 8:25 PM, Robert Rohde <raro...@gmail.com> wrote:
> While I certainly can't fault your good will, I do find it disturbing
> that it was necessary.  Ideally, Wikimedia should have internal
> backups of sufficient quality that we don't have to depend on what
> third parties happen to have saved for any circumstance short of
> meteors falling from the heavens.

Yeah, well, you can't easily eliminate all the internal points of
failure. "Someone with root loses control of their access and someone
nasty wipes everything" is really hard to protect against with online
systems.

Avoiding the case where some failure is reliably replicated among all
of WMF's copies (which is what happened with the deletions I
recovered: the redundant copies were deleted too) is best
accomplished with an air gap.
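
To make that concrete, here is a minimal sketch of the kind of
integrity check an independent offline copy enables. The manifest
format ("<hash>  <relative path>", one entry per line) and the paths
are invented for illustration:

    # Sketch: verify an offline copy against a manifest of
    # known-good SHA-1 hashes, one "<hash>  <path>" entry per line.
    import hashlib
    import os
    import sys

    def sha1_of(path, bufsize=1 << 20):
        h = hashlib.sha1()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(bufsize), b""):
                h.update(chunk)
        return h.hexdigest()

    def verify(root, manifest):
        bad = 0
        for line in open(manifest):
            want, rel = line.rstrip("\n").split("  ", 1)
            full = os.path.join(root, rel)
            if not os.path.exists(full) or sha1_of(full) != want:
                print("MISMATCH:", rel)
                bad += 1
        return bad

    if __name__ == "__main__":
        sys.exit(1 if verify(sys.argv[1], sys.argv[2]) else 0)

An air-gapped copy plus a manifest like that is what lets you notice
when something has quietly vanished everywhere else.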

And meteors *do* fall, if rarely. WMF can be robust against even
that, for only the price of making all the data available, which is
something worth doing for other principled and practical reasons
anyway.

Keeping everything within Wikimedia means that Wikimedia remains a
single point of failure, and that is a failure mode too easy to
avoid. Disk space is cheap, and not your problem. At least a few
third parties will create and maintain full copies, and that is a
good thing.

>> Moreover it allowed things like image hashing before
>> we had that in the database, and it would allow perceptual lossy hash
>> matching if I ever got around to implementing tools to access the
>> output.
>
> If the goal is some version of "do something useful for Wikimedia",
> then it actually seems rather bizarre to have the first step be "copy
> X TB of gradually changing data to privately owned and managed
> servers".  For Wikimedia applications, it would seem much more natural
> to make tools and technology available to do such things within
> Wikimedia.  That way developers could  work on such problems without
> having to worry about how much disk space they can personally afford.
> Again, there is nothing wrong with you generously doing such things
> with your own resources, but ideally running duplicate repositories
> for the benefit of Wikimedia should be unnecessary.


Working within Wikimedia means working within Wikimedia's means,
priorities, and politics. Having the data locally means that if I
decide I want to saturate a dozen cores computing perceptual hashes
for a week, I don't have to convince anyone else that it's a good use
of resources. I don't have to convince Wikimedia to fund a project, I
don't have to take up resources which might be better used by someone
else, and I don't have to set any expectations that I might not live
up to.
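
For concreteness, the perceptual hashing I have in mind is roughly
the dHash family. A minimal sketch, assuming Pillow is installed (not
the code I actually used):

    # dHash-style perceptual hash: near-duplicate images land within
    # a small Hamming distance of each other.
    from PIL import Image

    def dhash(path, size=8):
        # 9x8 grayscale thumbnail; each bit records whether a pixel
        # is brighter than its right-hand neighbour.
        img = Image.open(path).convert("L").resize((size + 1, size))
        px = list(img.getdata())
        bits = 0
        for row in range(size):
            for col in range(size):
                i = row * (size + 1) + col
                bits = (bits << 1) | (px[i] > px[i + 1])
        return bits  # 64-bit fingerprint

    def hamming(a, b):
        return bin(a ^ b).count("1")

Fan that out over a multiprocessing.Pool across a few terabytes of
images and you can see why I'd rather not have to justify the CPU
time to anyone else.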

Of course, it's great to have public resources 'locally' (which is
what the toolserver is for), but it doesn't cover all cases.


>> There really are use cases.  Moreover, making complete copies of the
>> public data available as dumps to the public is a WMF board supported
>> initiative.
>
> I agree with the goal of making WMF content available, but given that
> we don't offer any image dump right now and a comprehensive dump as
> such would be usable to almost no one, then I don't think a classic
> dump is where we should start.  Even you don't seem to want that.  If
> I understand correctly, you'd like to have an easier way to reliably
> download individual image files.  You wouldn't actually want to be
> presented with some form of monolithic multi-terabyte tarball each
> month.

No one wants the monolithic tarball. The way I got updates previously
was via an rsync push.
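
If a public rsync module existed (it doesn't; the host and module
name below are invented), keeping a mirror current would be about
this much work:

    # Hypothetical: pull incremental updates from a public rsync
    # module. Host and module name are made up for illustration.
    import subprocess

    subprocess.check_call([
        "rsync", "--archive", "--partial", "--delete",
        "rsync://mirror.example.org/commons-media/",
        "/data/commons-mirror/",
    ])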

No one sane would suggest a monolithic tarball: it's too much of a
pain to produce!

Image dump != monolithic tarball.

But I think producing subsets is pretty much worthless: I can't think
of a valid use for any reasonably sized subset. ("All media used on
big wiki X" is a useful subset I've produced for people before, but
it's not small enough to be a big win versus a full copy.)

[snip]
> The general point I am trying to make is that if we think about what
> people really want, and how the files are likely to be used, then
> there may be better delivery approaches than trying to create huge
> image dumps.

If everything is made available, then everyone's wants can be
satisfied; no subset is going to get us there. Of course, there are a
lot of possibilities for the means of transmission, but I think it
would be most useful to assume that at least a few people are going
to want to grab everything.
