Re: [Wikitech-l] downloading wikipedia database dumps

2010-01-09 Thread Robert Rohde
On Fri, Jan 8, 2010 at 6:06 PM, Gregory Maxwell gmaxw...@gmail.com wrote:
snip

 No one wants the monolithic tarball. The way I got updates previously
 was via a rsync push.

 No one sane would suggest a monolithic tarball: it's too much of a
 pain to produce!

I know that you didn't want or use a tarball, but requests for an
image dump are not that uncommon, and often the requester is
envisioning something like a tarball.  Arguably, that is what the
originator of this thread was asking for.  I think you and I are
probably mostly on the same page about the virtue of ensuring that
images can be distributed and that monolithic approaches are bad.

snip

 But I think producing subsets is pretty much worthless. I can't think
 of a valid use for any reasonably sized subset.  (All media used on
 big wiki X is a useful subset I've produced for people before, but
 it's not small enough to be a big win vs a full copy)

Wikipedia itself has gotten so large that increasingly people are
mirroring subsets rather than allocating the space for a full mirror
(e.g. just the pages on cooking, or medicine, or whatever).  Grabbing
the images needed for such an application would be useful.  I can also
see virtue in having a way to grab all images in a category (or set of
categories) -- for example, all images of dogs, or all images of
Barack Obama.  In case you think this is all hypothetical, I've
actually downloaded tens of thousands of images on more than one
occasion to support topical projects.

snip

 If all is made available then everyone's wants can be satisfied. No
 subset is going to get us there. Of course, there are a lot of
 possibilities for the means of transmission, but I think it would be
 most useful to assume that at least a few people are going to want to
 grab everything.

Of course, strictly speaking we already provide HTTP access to
everything.  So the real question is how can we make access easier,
more reliable, and less burdensome.  You or someone else suggested an
API for grabbing files and that seems like a good idea.  Ultimately
the best answer may well be to take multiple approaches to accommodate
both people like you who want everything as well as people that want
only more modest collections.
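
For what it's worth, the existing api.php already gets most of the way
there for the per-category use case.  A rough sketch (plain PHP,
untested; the category name and output directory are just made-up
examples, and continuation and error handling are omitted) might look
like this:

<?php
// Sketch: fetch the original file for every image in one category via the
// standard MediaWiki web API.  Illustrative only -- add proper throttling
// and continuation before pointing this at a real server.
$api      = 'http://commons.wikimedia.org/w/api.php';
$category = 'Category:Dogs';   // hypothetical example category
$outDir   = 'images';
if ( !is_dir( $outDir ) ) {
    mkdir( $outDir, 0755, true );
}

$params = array(
    'action'    => 'query',
    'format'    => 'json',
    'generator' => 'categorymembers',
    'gcmtitle'  => $category,
    'gcmtype'   => 'file',
    'gcmlimit'  => 50,
    'prop'      => 'imageinfo',
    'iiprop'    => 'url',
);
$result = json_decode( file_get_contents( $api . '?' . http_build_query( $params ) ), true );

foreach ( $result['query']['pages'] as $page ) {
    $url  = $page['imageinfo'][0]['url'];
    $dest = $outDir . '/' . rawurldecode( basename( $url ) );
    if ( !file_exists( $dest ) ) {
        file_put_contents( $dest, file_get_contents( $url ) );
        sleep( 1 );   // crude politeness throttle
    }
}
// A real tool would also follow the API's continuation parameters so that
// categories with more than 50 files are fully covered.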

-Robert Rohde



Re: [Wikitech-l] downloading wikipedia database dumps

2010-01-09 Thread Platonides
Robert Rohde wrote:
 Of course, strictly speaking we already provide HTTP access to
 everything.  So the real question is how can we make access easier,
 more reliable, and less burdensome.  You or someone else suggested an
 API for grabbing files and that seems like a good idea.  Ultimately
 the best answer may well be to take multiple approaches to accommodate
 both people like you who want everything as well as people that want
 only more modest collections.
 
 -Robert Rohde

Anthony wrote:
 The bandwidth-saving way to do things would be to just allow mirrors to use
 hotlinking.  Requiring a middle man to temporarily store images (many, and
 possibly even most of which will never even be downloaded by end users) just
 wastes bandwidth.


There is already a way to instruct a wiki to use images from a foreign
wiki as they are needed, with proper caching.

On 1.16 it will even be much easier, as you will only need to set
$wgUseInstantCommons = true; to use Wikimedia Commons images.
http://www.mediawiki.org/wiki/Manual:$wgUseInstantCommons
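
A minimal LocalSettings.php fragment for that, assuming a 1.16 (or
later) install:

# Fetch images, thumbnails and file description pages from Wikimedia
# Commons on demand, with local caching.
$wgUseInstantCommons = true;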




Re: [Wikitech-l] downloading wikipedia database dumps

2010-01-09 Thread Chad
On Sat, Jan 9, 2010 at 7:44 AM, Platonides platoni...@gmail.com wrote:
 Robert Rohde wrote:
 Of course, strictly speaking we already provide HTTP access to
 everything.  So the real question is how can we make access easier,
 more reliable, and less burdensome.  You or someone else suggested an
 API for grabbing files and that seems like a good idea.  Ultimately
 the best answer may well be to take multiple approaches to accommodate
 both people like you who want everything as well as people that want
 only more modest collections.

 -Robert Rohde

 Anthony wrote:
 The bandwidth-saving way to do things would be to just allow mirrors to use
 hotlinking.  Requiring a middle man to temporarily store images (many, and
 possibly even most of which will never even be downloaded by end users) just
 wastes bandwidth.


 There is already a way to instruct a wiki to use images from a foreign
 wiki as they are needed. With proper caching.

 On 1.16 it will even be much easier, as you will only need to set
 $wgUseInstantCommons = true; to use Wikimedia Commons images.
 http://www.mediawiki.org/wiki/Manual:$wgUseInstantCommons




I'd really like to underline this last piece, as it's something I feel
we're not promoting as heavily as we should be -- with 1.16 making
it a one-line switch to turn on, perhaps we should publicize this.
Thanks to work Brion did in 1.13, and which I picked up later on, we
have the ability to use files from Wikimedia Commons (or potentially
any MediaWiki installation).  As pointed out above, this has
configurable caching that can be set as aggressively as you'd like.

To mirror Wikipedia these days, all you'd need to do is import the
article and template dumps and point ForeignAPIRepo at Commons and
enwiki, and you've got yourself a working mirror.  No need to dump
the images and reimport them somewhere.  Cache the thumbnails
aggressively enough and, in effect, you'll be hosting the images
locally.
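
As a sketch of what that configuration might look like (the exact keys
are documented with ForeignAPIRepo; the cache expiry values below are
only illustrative):

# LocalSettings.php: use Commons and the English Wikipedia as read-only
# foreign file repositories, caching thumbnails and descriptions locally.
$wgForeignFileRepos[] = array(
    'class'                  => 'ForeignAPIRepo',
    'name'                   => 'commonswiki',
    'apibase'                => 'http://commons.wikimedia.org/w/api.php',
    'fetchDescription'       => true,
    'descriptionCacheExpiry' => 43200,  // 12 hours
    'apiThumbCacheExpiry'    => 86400,  // keep fetched thumbnails for a day
);
$wgForeignFileRepos[] = array(
    'class'                  => 'ForeignAPIRepo',
    'name'                   => 'enwiki',
    'apibase'                => 'http://en.wikipedia.org/w/api.php',
    'fetchDescription'       => true,
    'descriptionCacheExpiry' => 43200,
    'apiThumbCacheExpiry'    => 86400,
);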

-Chad


Re: [Wikitech-l] downloading wikipedia database dumps

2010-01-09 Thread Chad
On Sat, Jan 9, 2010 at 9:27 AM, Carl (CBM) cbm.wikipe...@gmail.com wrote:
 On Sat, Jan 9, 2010 at 8:50 AM, Anthony wikim...@inbox.org wrote:
 The original version of Instant Commons had it right.  The files were sent
 straight from the WMF to the client.  That version still worked last I
 checked, but my understanding is that it was deprecated in favor of the
 bandwidth-wasting store files in a caching middle-man.

 If I were a site admin using InstantCommons, I would want to keep a
 copy of all the images used anyway, in case they were deleted on
 commons but I still wanted to use them on my wiki.

 - Carl



A valid suggestion, but I think it should be configurable either
way.  Some sites would like to use Wikimedia Commons but don't
necessarily have the space to store thumbnails (much less
the original sources).

However, a "copy the source file too" option could be added for
sites that would also like to fetch the original source file and
then import it locally.  None of this is out of the realm of
possibility.

The main reason we went for the "render there, show the thumbnail
here" idea was to increase compatibility.  Not everyone has their
wiki set up to render things like SVGs.  By rendering remotely,
you're assuming the source repo (like Commons) was set up to
render it, which is a valid assumption.  By importing the image
locally, you may end up requesting remote files that you can't render.

Again, more configuration options for the different use cases
are possible.

-Chad


Re: [Wikitech-l] downloading wikipedia database dumps

2010-01-09 Thread Dmitriy Sintsov
* Gregory Maxwell gmaxw...@gmail.com [Fri, 8 Jan 2010 21:06:11 -0500]:

 No one wants the monolithic tarball. The way I got updates previously
 was via a rsync push.

 No one sane would suggest a monolithic tarball: it's too much of a
 pain to produce!

 Image dump != monolithic tarball.

Why not extend the filerepo to make rsync or similar (maybe more
efficient) incremental backups easy?  An incremental, distributed filerepo.
Dmitriy



Re: [Wikitech-l] downloading wikipedia database dumps

2010-01-09 Thread Aryeh Gregor
On Fri, Jan 8, 2010 at 9:40 PM, Anthony wikim...@inbox.org wrote:
 Isn't that what the system immutable flag is for?

No, that's for confusing the real roots while providing only a speed
bump to an actual hacker.  Anyone with root access can always just
unset the flag.  Or, failing that, dd if=/dev/zero of=/dev/sda works
pretty well.



Re: [Wikitech-l] downloading wikipedia database dumps

2010-01-09 Thread Anthony
On Sat, Jan 9, 2010 at 11:09 PM, Aryeh Gregor
simetrical+wikil...@gmail.com
 wrote:

 On Fri, Jan 8, 2010 at 9:40 PM, Anthony wikim...@inbox.org wrote:
  Isn't that what the system immutable flag is for?

 No, that's for confusing the real roots while providing only a speed
 bump to an actual hacker.  Anyone with root access can always just
 unset the flag.  Or, failing that, dd if=/dev/zero of=/dev/sda works
 pretty well.


Depends on the machine's securelevel.


Re: [Wikitech-l] downloading wikipedia database dumps

2010-01-09 Thread Anthony
On Sat, Jan 9, 2010 at 11:40 PM, Aryeh Gregor
simetrical+wikil...@gmail.com
 wrote:

 On Sat, Jan 9, 2010 at 11:26 PM, Anthony wikim...@inbox.org wrote:
  Depends on the machine's securelevel.

 Google informs me that securelevel is a BSD feature.  Wikimedia uses
 Linux and Solaris.


Well, Greg's comment wasn't specific to Linux or Solaris.  In any case, I
don't know about Solaris, but Linux seems to have some sort of
CAP_LINUX_IMMUTABLE and CAP_SYS_RAWIO.  I'm sure Solaris has something
similar.


 It doesn't hurt to have extra copies out there


Certainly not.


Re: [Wikitech-l] downloading wikipedia database dumps

2010-01-08 Thread Jamie Morken

Hello,

 Is the bandwidth used really a big problem? Bandwidth is pretty cheap
 these days, and given Wikipedia's total draw, I suspect the occasional
 dump download isn't much of a problem.

I am not sure about the cost of the bandwidth, but the Wikipedia image dumps
are no longer available on the dump site anyway.  I am guessing they were
removed partly because of the bandwidth cost, or perhaps because of image
licensing issues.

from:
http://en.wikipedia.org/wiki/Wikipedia_database#Images_and_uploaded_files

Currently Wikipedia does not allow or provide facilities to download all
images. As of 17 May 2007, Wikipedia disabled or neglected all viable bulk
downloads of images, including torrent trackers. Therefore, there is no way
to download image dumps other than scraping Wikipedia pages or using Wikix,
which converts a database dump into a series of scripts to fetch the images.

Unlike most article text, images are not necessarily licensed under the GFDL
and CC-BY-SA-3.0. They may be under one of many free licenses, in the public
domain, believed to be fair use, or even copyright infringements (which
should be deleted). In particular, use of fair use images outside the
context of Wikipedia or similar works may be illegal. Images under most
licenses require a credit, and possibly other attached copyright
information. This information is included in image description pages, which
are part of the text dumps available from download.wikimedia.org. In
conclusion, download these images at your own risk.
 
 Bittorrent's real strength is when a lot of people want to download the
 same thing at once. E.g., when a new Ubuntu release comes out. Since
 Bittorrent requires all downloaders to be uploaders, it turns the flood
 of users into a benefit. But unless somebody has stats otherwise, I'd
 guess that isn't the problem here.

Bittorrent is simply a more efficient method of distributing files, especially
if the much larger Wikipedia image files were made available again.  The last
dump from English Wikipedia including images is over 200 GB but is
understandably not available for download.  Even if there are only 10 people
per month who download these large files, bittorrent should be able to reduce
the bandwidth cost to Wikipedia significantly.  Also, I think that setting up
bittorrent for this would cost Wikipedia only a small amount, may save money
in the long run, and would encourage people to experiment with offline
encyclopedia usage, etc.  Making people crawl Wikipedia with Wikix if they
want to download the images is a bad solution, as it means the images are
downloaded inefficiently.  Also, one Wikix user reported that his download
connection was cut off by a Wikipedia admin for remote downloading.

Unless there are legal reasons for not allowing images to be downloaded, I
think the Wikipedia image files should be made available for efficient
download again.  And since Wikix can already be used to download the images
in theory, I think it would also be legal to allow the image dump itself to
be downloaded.  Thoughts?

cheers,
Jamie



 
 William
 


Re: [Wikitech-l] downloading wikipedia database dumps

2010-01-08 Thread Bryan Tong Minh
On Fri, Jan 8, 2010 at 4:31 PM, Jamie Morken jmor...@shaw.ca wrote:

 Bittorrent is simply a more efficient method to distribute files, especially 
 if the much larger wikipedia image files were made available again.  The last 
 dump from english wikipedia including images is over 200GB but is 
 understandably not available for download. Even if there are only 10 people 
 per month who download these large files, bittorrent should be able to reduce 
 the bandwidth cost to wikipedia significantly.  Also I think that having 
 bittorrent setup for this would cost wikipedia a small amount, and may save 
 money in the long run, as well as encourage people to experiment with offline 
 encyclopedia usage etc.  To make people have to crawl wikipedia with Wikix if 
 they want to download the images is a bad solution, as it means that the 
 images are downloaded inefficiently.  Also one wikix user reported that his 
 download connection was cutoff by a wikipedia admin for remote downloading.


The problem with BitTorrent is that it is unsuitable for rapidly
changing data sets, such as images. If you want to add a single file
to the torrent, the entire torrent hash changes, meaning that you end
up with separate peer pools for every different data set, although
they mostly contain the same files.

That said, it could of course be beneficial for an initial dump
download, and it would be better than the current situation, where
nothing is available at all.


Bryan



Re: [Wikitech-l] downloading wikipedia database dumps

2010-01-08 Thread Aryeh Gregor
On Fri, Jan 8, 2010 at 10:31 AM, Jamie Morken jmor...@shaw.ca wrote:
 I am not sure about the cost of the bandwidth, but the wikipedia image dumps 
 are no longer available on the wikipedia dump anyway.  I am guessing they 
 were removed partly because of the bandwidth cost, or else image licensing 
 issues perhaps.

I think we just don't have the infrastructure set up to dump images.  I'm
very sure bandwidth is not an issue -- the number of people with a spare
terabyte (or is it more?) handy to download a Wikipedia image dump to will
be vanishingly small compared to normal users.  Licensing wouldn't be an
issue for Commons, at least, as long as it's easy to link the images up to
their license pages.  (I imagine it would technically violate some licenses,
but probably no one would worry about it.)

 Bittorrent is simply a more efficient method to distribute files, especially 
 if the much larger wikipedia image files were made available again.  The last 
 dump from english wikipedia including images is over 200GB but is 
 understandably not available for download. Even if there are only 10 people 
 per month who download these large files, bittorrent should be able to reduce 
 the bandwidth cost to wikipedia significantly.

Wikipedia uses an average of multiple gigabits per second of
bandwidth, as I recall.  One gigabit per second adds up to about 10.5
terabytes per day, so say 300 terabytes per month.  I'm pretty sure
the average figure is more like five or ten Gbps than one, so let's
say a petabyte a month at least.  Ten people per month downloading an
extra terabyte is not a big issue.  And I really doubt we'd see that
many people downloading a full image dump every month.
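
A rough back-of-the-envelope check of those figures (decimal units,
ignoring protocol overhead) bears that out:

\[
1\ \text{Gbps} = 0.125\ \text{GB/s}, \qquad
0.125\ \text{GB/s} \times 86\,400\ \text{s/day} \approx 10.8\ \text{TB/day}
\approx 324\ \text{TB/month},
\]

so multiple Gbps does put the monthly total comfortably into the
hundreds of terabytes, against which a few extra terabytes of dump
downloads is noise.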

The sensible bandwidth-saving way to do it would be to set up an rsync
daemon on the image servers, and let people use that.  Then you could
get an old copy of the files from anywhere (including Bittorrent, if
you like) and only have to download the changes.  Plus, you could get
up-to-the-minute copies if you like, although probably some throttling
should be put into place to stop dozens of people from all running
rsync in a loop to make sure they have the absolute latest version.  I
believe rsync 2 doesn't handle such huge numbers of files acceptably,
but I heard rsync 3 is supposed to be much better.  That sounds like a
better direction to look in than Bittorrent -- nobody's going to want
to redownload the same files constantly to get an up-to-date set.

 Unless there are legal reasons for not allowing images to be downloaded, I 
 think the wikipedia image files should be made available for efficient 
 download again.

I'm pretty sure the reason there's no image dump is purely because not
enough resources have been devoted to getting it working acceptably.


Re: [Wikitech-l] downloading wikipedia database dumps

2010-01-08 Thread Gregory Maxwell
On Fri, Jan 8, 2010 at 10:56 AM, Aryeh Gregor
simetrical+wikil...@gmail.com wrote:
 On Fri, Jan 8, 2010 at 10:31 AM, Jamie Morken jmor...@shaw.ca wrote:
 I am not sure about the cost of the bandwidth, but the wikipedia image dumps 
 are no longer available on the wikipedia dump anyway.  I am guessing they 
 were removed partly because of the bandwidth cost, or else image licensing 
 issues perhaps.

 I think we just don't have infrastructure set up to dump images.  I'm
 very sure bandwidth is not an issue -- the number of people with a

Correct. The space wasn't available for the required intermediate cop(y|ies).

 terabyte (or is it more?) handy that they want to download a Wikipedia
 image dump to will be vanishingly small compared to normal users.

s/terabyte/several terabytes/  My copy is not up to date, but it's not
smaller than 4.

 Licensing wouldn't be an issue for Commons, at least, as long as it's
 easy to link the images up to their license pages.  (I imagine it
 would technically violate some licenses, but no one would probably
 worry about it.)

We also dump the licensing information. If we can lawfully put the
images on the website, then we can also distribute them in dump form.
There is and can be no licensing problem.

 Wikipedia uses an average of multiple gigabits per second of
 bandwidth, as I recall.

http://www.nedworks.org/~mark/reqstats/trafficstats-daily.png

Though only this part is paid for:
http://www.nedworks.org/~mark/reqstats/transitstats-daily.png

The rest is peering, etc. which is only paid for in the form of
equipment, port fees, and operational costs.

 The sensible bandwidth-saving way to do it would be to set up an rsync
 daemon on the image servers, and let people use that.

This was how I maintained a running mirror for a considerable time.

Unfortunately the process broke when WMF ran out of space and needed
to switch servers.

On Fri, Jan 8, 2010 at 10:31 AM, Jamie Morken jmor...@shaw.ca wrote:
 Bittorrent is simply a more efficient method to distribute files,

No. In a very real absolute sense bittorrent is considerably less
efficient than other means.

Bittorrent moves more of the outbound traffic to the edges of the
network where the real cost per gbit/sec is much greater than at major
datacenters, because a megabit on a low speed link is more costly than
a megabit on a high speed link and a megabit on 1 mile of fiber is
more expensive than a megabit on 10 feet of fiber.

Moreover, bittorrent is topology-unaware, so the path length tends to
approach the internet's mean path length. Datacenters tend to be
more centrally located topology-wise, and topology-aware distribution
is easily applied to centralized stores. (E.g. WMF satisfies requests
from Europe in Europe, though not for the dump downloads, as there
simply isn't enough traffic to justify it.)

Bittorrent is also a more complicated, higher-overhead service, which
requires more memory and more disk IO than traditional transfer
mechanisms.

There are certainly cases where bittorrent is valuable, such as the
flash mob case of a new OS release. This really isn't one of those
cases.

On Thu, Jan 7, 2010 at 11:52 AM, William Pietri will...@scissor.com wrote:
 On 01/07/2010 01:40 AM, Jamie Morken wrote:
 I have a
 suggestion for wikipedia!!  I think that the database dumps including
 the image files should be made available by a wikipedia bittorrent
 tracker so that people would be able to download the wikipedia backups
 including the images (which currently they can't do) and also so that
 wikipedia's bandwidth costs would be reduced. [...]


 Is the bandwidth used really a big problem? Bandwidth is pretty cheap
 these days, and given Wikipedia's total draw, I suspect the occasional
 dump download isn't much of a problem.

 Bittorrent's real strength is when a lot of people want to download the
 same thing at once. E.g., when a new Ubuntu release comes out. Since
 Bittorrent requires all downloaders to be uploaders, it turns the flood
 of users into a benefit. But unless somebody has stats otherwise, I'd
 guess that isn't the problem here.

We tried BT for the Commons POTY (Picture of the Year) archive once while I
was watching, and we never had a downloader stay connected long enough to
help another downloader... and that was only 500 MB, which is much easier
to seed.

BT also makes the server costs a lot higher: it has more CPU/memory
overhead and creates a lot of random disk IO.  For low-volume, large
files it's often not much of a win.

I haven't seen the numbers for a long time, but when I last looked,
download.wikimedia.org was producing fairly little traffic... and much
of what it was producing was outside of the peak busy hour for the
sites.  Since the transit is paid for on the 95th percentile and the
WMF still has a decent day/night swing, out-of-peak traffic is
effectively free.  The bandwidth is nothing to worry about.


Re: [Wikitech-l] downloading wikipedia database dumps

2010-01-08 Thread Tomasz Finc
William Pietri wrote:
 On 01/07/2010 01:40 AM, Jamie Morken wrote:
 I have a
 suggestion for wikipedia!!  I think that the database dumps including
 the image files should be made available by a wikipedia bittorrent
 tracker so that people would be able to download the wikipedia backups
 including the images (which currently they can't do) and also so that
 wikipedia's bandwidth costs would be reduced. [...]

 
 Is the bandwidth used really a big problem? Bandwidth is pretty cheap 
 these days, and given Wikipedia's total draw, I suspect the occasional 
 dump download isn't much of a problem.

No, bandwidth is not really the problem here. I think the core issue is
having bulk access to images.

There have been a number of these requests in the past, and after talking
back and forth, it has usually been the case that a smaller subset of
the data works just as well.

A good example of this was the Deutsche Fotothek archive made late last
year.

http://download.wikipedia.org/images/Deutsche_Fotothek.tar ( 11GB )

This provided an easily retrievable high quality subset of our image 
data which researchers could use.

Now, if we were to snapshot image data and store it per project, the
amount of duplicate image data would become significant.  That's because
we re-use a ton of image data between projects, and rightfully so.

If instead we package all of Commons into a tarball, then we get roughly
6 TB of image data, which, after numerous conversations, has been a bit
more than most people want to process.

So what does everyone think of going down the collections route?

If we provide enough different and up-to-date ones, then we could easily
give people a large but manageable amount of data to work with.

If there is already a page for this then please feel free to point me to
it; otherwise I'll create one.

--tomasz




Re: [Wikitech-l] downloading wikipedia database dumps

2010-01-08 Thread Robert Rohde
On Fri, Jan 8, 2010 at 8:24 AM, Gregory Maxwell gmaxw...@gmail.com wrote:
 s/terabyte/several terabytes/  My copy is not up to date, but it's not
 smaller than 4.

The topmost versions of Commons files total about 4.9 TB; files on enwiki
but not Commons add another 200 GB or so.

-Robert Rohde



Re: [Wikitech-l] downloading wikipedia database dumps

2010-01-08 Thread Bilal Abdul Kader
I think having access to them on the Commons repository is much easier to
handle. A subset should be good enough.

Handling 11 TB of images requires huge research capabilities in order to
store and work with all of them.

Maybe a special API or advanced API functions would give people enough
access and at the same time save the bandwidth and the hassle of handling
this behemoth collection.

bilal
--
Verily, with hardship comes ease.


On Fri, Jan 8, 2010 at 1:57 PM, Tomasz Finc tf...@wikimedia.org wrote:

 William Pietri wrote:
  On 01/07/2010 01:40 AM, Jamie Morken wrote:
  I have a
  suggestion for wikipedia!!  I think that the database dumps including
  the image files should be made available by a wikipedia bittorrent
  tracker so that people would be able to download the wikipedia backups
  including the images (which currently they can't do) and also so that
  wikipedia's bandwidth costs would be reduced. [...]
 
 
  Is the bandwidth used really a big problem? Bandwidth is pretty cheap
  these days, and given Wikipedia's total draw, I suspect the occasional
  dump download isn't much of a problem.

 No, bandwidth is not really the problem here. I think the core issue is
 to have bulk access to images.

 There have been a number of these requests in the past and after talking
  back and forth, it has usually been the case that a smaller subset of
 the data works just as well.

 A good example of this was the Deutsche Fotokek archive made late last
 year.

 http://download.wikipedia.org/images/Deutsche_Fotothek.tar ( 11GB )

 This provided an easily retrievable high quality subset of our image
 data which researchers could use.

 Now if we were to snapshot image data and store them for a particular
 project the amount of duplicate image data would become significant.
 That's because we re-use a ton of image data between projects and
 rightfully so.

 If instead we package all of commons into a tarball then we get roughly
 6T's of image data which after numerous conversation has been a bit more
 then most people want to process.

 So what does everyone think of going down the collections route?

 If we provide enough different and up to date ones then we could easily
 give people a large but manageable amount of data to work with.

 If there is a page already for this then please feel free to point me to
 it otherwise I'll create one.

 --tomasz




Re: [Wikitech-l] downloading wikipedia database dumps

2010-01-08 Thread Aryeh Gregor
On Fri, Jan 8, 2010 at 3:28 PM, Bilal Abdul Kader bila...@gmail.com wrote:
 I think having access to them on Commons repository is much easier to
 handle. A subset should be good enough.

 Having 11 TB of images needs huge research capabilities in order to handle
 all of them and work with all of them.

 Maybe a special API or advanced API functions would allow people enough
 access and at the same time save the bandwidth and the hassle to handle this
 behemoth collection.

Well, if there were an rsyncd you could just fetch the ones you wanted
arbitrarily.



Re: [Wikitech-l] downloading wikipedia database dumps

2010-01-08 Thread Domas Mituzas
 Well, if there were an rsyncd you could just fetch the ones you wanted
 arbitrarily.

rsyncd is fail for large file mass delivery, and it is fail when exposed to 
masses. 

Domas


Re: [Wikitech-l] downloading wikipedia database dumps

2010-01-08 Thread Robert Rohde
Can someone articulate what the use case is?

Is there someone out there who could use a 5 TB image archive but is
disappointed it doesn't exist?  That seems rather implausible.

If not, then I assume that everyone is really after only some subset
of the files.  If that's the case, we should try to figure out what
kinds of subsets are wanted and the best way to handle them.

-Robert Rohde



Re: [Wikitech-l] downloading wikipedia database dumps

2010-01-08 Thread Platonides
Gregory Maxwell wrote:
 Er. I've maintained a non-WMF disaster recovery archive for a long
 time, though its no longer completely current since the rsync went
 away and web fetching is lossy.

And the box ran out of disk space. We could try until it fills again,
though.

A sysadmin fixing the images with wrong hashes would also be nice:
https://bugzilla.wikimedia.org/show_bug.cgi?id=17057#c3

 It saved our rear a number of times, saving thousands of images from
 irreparable loss. Moreover it allowed things like image hashing before
 we had that in the database, and it would allow perceptual lossy hash
 matching if I ever got around to implementing tools to access the
 output.

IMHO the problem is not accessing it, but hashing those terabytes of images.


 There really are use cases.  Moreover, making complete copies of the
 public data available as dumps to the public is a WMF board supported
 initiative.




Re: [Wikitech-l] downloading wikipedia database dumps

2010-01-08 Thread Anthony
On Fri, Jan 8, 2010 at 10:56 AM, Aryeh Gregor
simetrical+wikil...@gmail.com
 wrote:

 The sensible bandwidth-saving way to do it would be to set up an rsync
 daemon on the image servers, and let people use that.


The bandwidth-saving way to do things would be to just allow mirrors to use
hotlinking.  Requiring a middle man to temporarily store images (many, and
possibly even most of which will never even be downloaded by end users) just
wastes bandwidth.


Re: [Wikitech-l] downloading wikipedia database dumps

2010-01-08 Thread Anthony
On Fri, Jan 8, 2010 at 9:06 PM, Gregory Maxwell gmaxw...@gmail.com wrote:

 Yea, well, you can't easily eliminate all the internal points of
 failure. someone with root loses control of their access and someone
 nasty wipes everything is really hard to protect against with online
 systems.


Isn't that what the system immutable flag is for?

It's easy, as long as you're willing to put up with a bit of whining from
the person with root access.


[Wikitech-l] downloading wikipedia database dumps

2010-01-07 Thread Jamie Morken
Hi,

I have a
suggestion for Wikipedia!  I think that the database dumps, including
the image files, should be made available via a Wikipedia bittorrent
tracker so that people would be able to download the Wikipedia backups
including the images (which currently they can't do), and also so that
Wikipedia's bandwidth costs would be reduced.  I think it is important
that people can download Wikipedia for offline use, now and in the
future.

best regards,
Jamie Morken


Re: [Wikitech-l] downloading wikipedia database dumps

2010-01-07 Thread Platonides
Jamie Morken wrote:
 Hi,
 
 I have a
 suggestion for wikipedia!!  I think that the database dumps including
 the image files should be made available by a wikipedia bittorrent
 tracker so that people would be able to download the wikipedia backups
 including the images (which currently they can't do) and also so that
 wikipedia's bandwidth costs would be reduced.  I think it is important
 that wikipedia can be downloaded for using it offline now and in the
 future for people.
 
 best regards,
 Jamie Morken

It has been tried before (when the dumps were smaller).
How many people do you think will have the necessary space and be
willing to download it?




Re: [Wikitech-l] downloading wikipedia database dumps

2010-01-07 Thread Bilal Abdul Kader
I have been using the dumps for a few months, and I think this kind of dump
is much better than a torrent. Yes, bandwidth can be saved, but I do not
think the cost of bandwidth is higher than the cost of maintaining the
torrents.

If people are not hosting the files, the value of torrents is limited.

I think regular mirroring is much better, but it all depends on the
willingness of people to host the files.

bilal
--
Verily, with hardship comes ease.


On Thu, Jan 7, 2010 at 11:30 AM, Platonides platoni...@gmail.com wrote:

 Jamie Morken wrote:
  Hi,
 
  I have a
  suggestion for wikipedia!!  I think that the database dumps including
  the image files should be made available by a wikipedia bittorrent
  tracker so that people would be able to download the wikipedia backups
  including the images (which currently they can't do) and also so that
  wikipedia's bandwidth costs would be reduced.  I think it is important
  that wikipedia can be downloaded for using it offline now and in the
  future for people.
 
  best regards,
  Jamie Morken

 Has been tried before (when they were smaller).
 How many people do you think will have the necessary space and be
 willing to download it?




Re: [Wikitech-l] downloading wikipedia database dumps

2010-01-07 Thread William Pietri
On 01/07/2010 01:40 AM, Jamie Morken wrote:
 I have a
 suggestion for wikipedia!!  I think that the database dumps including
 the image files should be made available by a wikipedia bittorrent
 tracker so that people would be able to download the wikipedia backups
 including the images (which currently they can't do) and also so that
 wikipedia's bandwidth costs would be reduced. [...]


Is the bandwidth used really a big problem? Bandwidth is pretty cheap 
these days, and given Wikipedia's total draw, I suspect the occasional 
dump download isn't much of a problem.

Bittorrent's real strength is when a lot of people want to download the 
same thing at once. E.g., when a new Ubuntu release comes out. Since 
Bittorrent requires all downloaders to be uploaders, it turns the flood 
of users into a benefit. But unless somebody has stats otherwise, I'd 
guess that isn't the problem here.

William

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l