On Fri, Jan 8, 2010 at 10:31 AM, Jamie Morken <jmor...@shaw.ca> wrote:
> I am not sure about the cost of the bandwidth, but the wikipedia image dumps 
> are no longer available on the wikipedia dump anyway.  I am guessing they 
> were removed partly because of the bandwidth cost, or else image licensing 
> issues perhaps.

I think we just don't have the infrastructure set up to dump images.  I'm
very sure bandwidth is not an issue -- the number of people with a spare
terabyte (or is it more?) of disk handy to download a Wikipedia image
dump onto will be vanishingly small compared to the number of normal users.
Licensing wouldn't be an issue for Commons, at least, as long as it's
easy to link the images up to their license pages.  (I imagine it
would technically violate some licenses, but probably no one would
worry about it.)

> Bittorrent is simply a more efficient method to distribute files, especially 
> if the much larger wikipedia image files were made available again.  The last 
> dump from english wikipedia including images is over 200GB but is 
> understandably not available for download. Even if there are only 10 people 
> per month who download these large files, bittorrent should be able to reduce 
> the bandwidth cost to wikipedia significantly.

Wikipedia uses an average of multiple gigabits per second of
bandwidth, as I recall.  One gigabit per second adds up to about 10.5
terabytes per day, so say 300 terabytes per month.  I'm pretty sure
the average figure is more like five or ten Gbps than one, so let's
say a petabyte a month at least.  Ten people per month downloading an
extra terabyte is not a big issue.  And I really doubt we'd see that
many people downloading a full image dump every month.
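
For anyone who wants to sanity-check that arithmetic, here is a quick
back-of-the-envelope calculation (the 5-10 Gbps figure is from memory,
not an official number):

    # Rough traffic arithmetic -- all inputs are estimates.
    BYTES_PER_GBPS = 1e9 / 8       # 1 Gbps = 125 MB/s
    SECONDS_PER_DAY = 86400

    def tb_per_day(gbps):
        """Terabytes moved per day at a sustained rate of `gbps` gigabits/s."""
        return gbps * BYTES_PER_GBPS * SECONDS_PER_DAY / 1e12

    print(tb_per_day(1))           # ~10.8 TB/day (the "about 10.5" above), ~325 TB over 30 days
    print(tb_per_day(5) * 30)      # ~1,620 TB (about 1.6 PB) per month at 5 Gbps
    print(tb_per_day(10) * 30)     # ~3,240 TB (about 3.2 PB) per month at 10 Gbps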

The sensible bandwidth-saving way to do it would be to set up an rsync
daemon on the image servers, and let people use that.  Then you could
get an old copy of the files from anywhere (including BitTorrent, if
you like) and only have to download the changes.  Plus, you could get
up-to-the-minute copies if you want, although probably some throttling
should be put into place to stop dozens of people from all running
rsync in a loop to make sure they have the absolute latest version.  I
believe rsync 2 doesn't handle such huge numbers of files acceptably,
but I heard rsync 3 is supposed to be much better.  That sounds like a
better direction to look in than BitTorrent -- nobody's going to want
to redownload the same files constantly to get an up-to-date set.
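
To make that concrete, here's a minimal sketch (the host name, module
name, path, and connection limit below are all made up for
illustration).  An rsyncd.conf stanza on the image servers could look
roughly like

    # /etc/rsyncd.conf -- hypothetical module exporting the image store read-only
    [images]
        path = /export/upload
        comment = Wikimedia image store
        read only = yes
        list = yes
        # crude throttle against people polling rsync in a tight loop
        max connections = 10

and mirror operators would then periodically run something like

    rsync -a --delete rsync://images.example.wikimedia.org/images/ /data/image-mirror/

which transfers only the files that have changed since their last run.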

> Unless there are legal reasons for not allowing images to be downloaded, I 
> think the wikipedia image files should be made available for efficient 
> download again.

I'm pretty sure the reason there's no image dump is purely because not
enough resources have been devoted to getting it working acceptably.
