I think having access to them on the Commons repository is much easier to
handle. A subset should be good enough.

Handling 11 TB of images requires serious research infrastructure just to
store and work with all of them.

Maybe a special API, or more advanced API functions, would give people
enough access while saving both the bandwidth and the hassle of dealing
with this behemoth of a collection.
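For what it's worth, a rough sketch of what such subset access could look
like with the API that already exists today: listing the direct file URLs
for a single Commons category, so a researcher can pull just that slice.
This is only an illustration, not a proposed design; the category name and
the helper names are examples.

```python
# Sketch only: fetch direct file URLs for one Commons category through
# the existing MediaWiki API, as a stand-in for the "advanced API"
# subset access suggested above.
import json
import urllib.parse
import urllib.request

API = "https://commons.wikimedia.org/w/api.php"

def build_query(category, limit=50):
    """Build the query-string parameters for one category of files."""
    return {
        "action": "query",
        "generator": "categorymembers",   # walk the members of the category
        "gcmtitle": "Category:" + category,
        "gcmtype": "file",                # files only, skip subcategories
        "gcmlimit": str(limit),
        "prop": "imageinfo",
        "iiprop": "url",                  # ask for the direct file URL
        "format": "json",
    }

def category_file_urls(category, limit=50):
    """Return direct URLs for up to `limit` files in a Commons category."""
    url = API + "?" + urllib.parse.urlencode(build_query(category, limit))
    req = urllib.request.Request(url,
                                 headers={"User-Agent": "subset-sketch/0.1"})
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    pages = data.get("query", {}).get("pages", {})
    return [p["imageinfo"][0]["url"]
            for p in pages.values() if "imageinfo" in p]
```

A client like this would let people mirror one collection at a time
instead of downloading the whole repository.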

bilal
--
Verily, with hardship comes ease.


On Fri, Jan 8, 2010 at 1:57 PM, Tomasz Finc <tf...@wikimedia.org> wrote:

> William Pietri wrote:
> > On 01/07/2010 01:40 AM, Jamie Morken wrote:
> >> I have a
> >> suggestion for wikipedia!!  I think that the database dumps including
> >> the image files should be made available by a wikipedia bittorrent
> >> tracker so that people would be able to download the wikipedia backups
> >> including the images (which currently they can't do) and also so that
> >> wikipedia's bandwidth costs would be reduced. [...]
> >>
> >
> > Is the bandwidth used really a big problem? Bandwidth is pretty cheap
> > these days, and given Wikipedia's total draw, I suspect the occasional
> > dump download isn't much of a problem.
>
> No, bandwidth is not really the problem here. I think the core issue is
> to have bulk access to images.
>
> There have been a number of these requests in the past and after talking
>  back and forth, it has usually been the case that a smaller subset of
> the data works just as well.
>
> A good example of this was the Deutsche Fotothek archive made late last
> year.
>
> http://download.wikipedia.org/images/Deutsche_Fotothek.tar ( 11GB )
>
> This provided an easily retrievable high quality subset of our image
> data which researchers could use.
>
> Now if we were to snapshot image data and store them for a particular
> project the amount of duplicate image data would become significant.
> That's because we re-use a ton of image data between projects and
> rightfully so.
>
> If instead we package all of Commons into a tarball, then we get roughly
> 6 TB of image data, which after numerous conversations has been a bit
> more than most people want to process.
>
> So what does everyone think of going down the collections route?
>
> If we provide enough different, up-to-date ones, then we could easily
> give people a large but manageable amount of data to work with.
>
> If there is a page already for this then please feel free to point me to
> it otherwise I'll create one.
>
> --tomasz
>
>
> _______________________________________________
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>
