On Fri, Jan 8, 2010 at 10:56 AM, Aryeh Gregor
<simetrical+wikil...@gmail.com> wrote:
> On Fri, Jan 8, 2010 at 10:31 AM, Jamie Morken <jmor...@shaw.ca> wrote:
>> I am not sure about the cost of the bandwidth, but the wikipedia image dumps 
>> are no longer available on the wikipedia dump anyway.  I am guessing they 
>> were removed partly because of the bandwidth cost, or else image licensing 
>> issues perhaps.
>
> I think we just don't have infrastructure set up to dump images.  I'm
> very sure bandwidth is not an issue -- the number of people with a

Correct. The space wasn't available for the required intermediate cop(y|ies).

> terabyte (or is it more?) handy that they want to download a Wikipedia
> image dump to will be vanishingly small compared to normal users.

s/terabyte/several terabytes/  My copy is not up to date, but it's not
smaller than 4 TB.

> Licensing wouldn't be an issue for Commons, at least, as long as it's
> easy to link the images up to their license pages.  (I imagine it
> would technically violate some licenses, but no one would probably
> worry about it.)

We also dump the licensing information. If we can lawfully put the
images on the website, then we can also distribute them in dump form.
There is and can be no licensing problem.

> Wikipedia uses an average of multiple gigabits per second of
> bandwidth, as I recall.

http://www.nedworks.org/~mark/reqstats/trafficstats-daily.png

Though only this part is paid for:
http://www.nedworks.org/~mark/reqstats/transitstats-daily.png

The rest is peering, etc., which is only paid for in the form of
equipment, port fees, and operational costs.

> The sensible bandwidth-saving way to do it would be to set up an rsync
> daemon on the image servers, and let people use that.

This was how I maintained a running mirror for a considerable time.

Unfortunately the process broke when WMF ran out of space and needed
to switch servers.
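
(For anyone who wants to try the same thing: the pull side of such a
mirror is little more than a periodic rsync run against whatever module
the image servers export. A rough sketch in Python follows; the host
and module names are invented for illustration, not the real ones.)

#!/usr/bin/env python
# Hypothetical mirror pull: sync an upload tree from an rsync daemon,
# resuming partial transfers and deleting files removed upstream.
import subprocess
import sys

SOURCE = "rsync://images.example.org/uploads/"  # hypothetical module
DEST = "/srv/mirror/uploads/"

def pull():
    cmd = [
        "rsync", "-av", "--partial", "--delete",
        "--bwlimit=10000",  # be polite: cap the pull at ~10 MB/s
        SOURCE, DEST,
    ]
    return subprocess.call(cmd)

if __name__ == "__main__":
    sys.exit(pull())

Run something like that from cron every night and you only transfer
what changed, which is the whole point of rsync over BT or plain HTTP.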

On Fri, Jan 8, 2010 at 10:31 AM, Jamie Morken <jmor...@shaw.ca> wrote:
> Bittorrent is simply a more efficient method to distribute files,

No. In a very real absolute sense bittorrent is considerably less
efficient than other means.

Bittorrent moves more of the outbound traffic to the edges of the
network, where the real cost per Gbit/sec is much greater than at major
datacenters: a megabit on a low-speed link costs more than a megabit on
a high-speed link, and a megabit over a mile of fiber costs more than a
megabit over ten feet of fiber.

Moreover, bittorrent is topology-unaware, so the path length tends to
approach the Internet's mean path length. Datacenters tend to be more
centrally located topology-wise, and topology-aware distribution is
easily applied to centralized stores. (E.g. WMF satisfies requests
from Europe in Europe, though not for the dump downloads, as there
simply isn't enough traffic to justify it.)

Bittorrent is also a more complicated, higher-overhead service, which
requires more memory and more disk I/O than traditional transfer
mechanisms.

There are certainly cases where bittorrent is valuable, such as the
flash mob case of a new OS release. This really isn't one of those
cases.

On Thu, Jan 7, 2010 at 11:52 AM, William Pietri <will...@scissor.com> wrote:
> On 01/07/2010 01:40 AM, Jamie Morken wrote:
>> I have a
>> suggestion for wikipedia!!  I think that the database dumps including
>> the image files should be made available by a wikipedia bittorrent
>> tracker so that people would be able to download the wikipedia backups
>> including the images (which currently they can't do) and also so that
>> wikipedia's bandwidth costs would be reduced. [...]
>>
>
> Is the bandwidth used really a big problem? Bandwidth is pretty cheap
> these days, and given Wikipedia's total draw, I suspect the occasional
> dump download isn't much of a problem.
>
> Bittorrent's real strength is when a lot of people want to download the
> same thing at once. E.g., when a new Ubuntu release comes out. Since
> Bittorrent requires all downloaders to be uploaders, it turns the flood
> of users into a benefit. But unless somebody has stats otherwise, I'd
> guess that isn't the problem here.

We tried BT for the Commons POTY (Picture of the Year) archive once
while I was watching, and we never had a downloader stay connected long
enough to help another downloader... and that was only 500 MB, which is
much easier to seed.

BT also makes the server costs a lot higher: it has more CPU/memory
overhead and creates a lot of random disk I/O. For low-volume large
files it's often not much of a win.

I haven't seen the numbers for a long time, but when I last looked
download.wikimedia.org was producing fairly little traffic... and much
of what it was producing was outside of the peak busy hour for the
sites.  Since the transit is paid for on the 95th percentile and the
WMF still has a decent day/night swing, out-of-peak traffic is
effectively free.  The bandwidth is nothing to worry about.
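
(If the billing model isn't familiar: with 95th-percentile billing the
top 5% of the five-minute traffic samples in a month are thrown away
before the bill is computed, so anything that fits inside the existing
night-time trough never raises it. A toy illustration in Python, with
made-up numbers:)

#!/usr/bin/env python
# Toy illustration of 95th-percentile billing: the top ~5% of samples
# are discarded, so traffic tucked into the night-time trough does not
# change the billable rate.
def billable_rate(samples_mbps):
    """Return roughly the 95th percentile of traffic samples (Mbps)."""
    ordered = sorted(samples_mbps)
    idx = max(int(len(ordered) * 0.95) - 1, 0)  # drop the top ~5%
    return ordered[idx]

# Hypothetical day: 12 hours peaking at 1000 Mbps, 12 hours at 400 Mbps.
day = [1000] * 144    # 144 five-minute samples
night = [400] * 144
print(billable_rate(day + night))              # -> 1000
# Add 200 Mbps of dump traffic at night; the bill does not move.
print(billable_rate(day + [400 + 200] * 144))  # -> 1000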
