Re: [Wikitech-l] downloading wikipedia database dumps
On Fri, Jan 8, 2010 at 6:06 PM, Gregory Maxwell gmaxw...@gmail.com wrote:
[snip]
> No one wants the monolithic tarball. The way I got updates previously was via a rsync push. No one sane would suggest a monolithic tarball: it's too much of a pain to produce!

I know that you didn't want or use a tarball, but requests for an image dump are not that uncommon, and often the requester is envisioning something like a tarball. Arguably that is what the originator of this thread was asking for. I think you and I are mostly on the same page about the virtue of ensuring that images can be distributed, and about monolithic approaches being bad.

[snip]
> But I think producing subsets is pretty much worthless. I can't think of a valid use for any reasonably sized subset. ("All media used on big wiki X" is a useful subset I've produced for people before, but it's not small enough to be a big win vs. a full copy.)

Wikipedia itself has gotten so large that increasingly people are mirroring subsets rather than allocating the space for a full mirror (e.g. pages on cooking, or medicine, or whatever). Grabbing the images needed for such an application would be useful. I can also see virtue in having a way to grab all images in a category (or set of categories): for example, all images of dogs, or all images of Barack Obama. In case you think this is all hypothetical, I've actually downloaded tens of thousands of images on more than one occasion to support topical projects.

[snip]
> If all is made available then everyone's wants can be satisfied. No subset is going to get us there.

Of course, there are a lot of possibilities for the means of transmission, but I think it would be most useful to assume that at least a few people are going to want to grab everything. Strictly speaking, we already provide HTTP access to everything, so the real question is how we can make access easier, more reliable, and less burdensome.
You or someone else suggested an API for grabbing files, and that seems like a good idea. Ultimately the best answer may well be to take multiple approaches, to accommodate both people like you who want everything and people who want only more modest collections.

-Robert Rohde

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] downloading wikipedia database dumps
Robert Rohde wrote:
> Of course, strictly speaking we already provide HTTP access to everything. So the real question is how can we make access easier, more reliable, and less burdensome. You or someone else suggested an API for grabbing files and that seems like a good idea. Ultimately the best answer may well be to take multiple approaches to accommodate both people like you who want everything as well as people who want only more modest collections. -Robert Rohde

Anthony wrote:
> The bandwidth-saving way to do things would be to just allow mirrors to use hotlinking. Requiring a middle man to temporarily store images (many, and possibly even most of which will never even be downloaded by end users) just wastes bandwidth.

There is already a way to instruct a wiki to use images from a foreign wiki as they are needed, with proper caching. On 1.16 it will be even easier, as you will only need to set $wgUseInstantCommons = true; to use Wikimedia Commons images. http://www.mediawiki.org/wiki/Manual:$wgUseInstantCommons
Re: [Wikitech-l] downloading wikipedia database dumps
On Sat, Jan 9, 2010 at 7:44 AM, Platonides platoni...@gmail.com wrote:
> There is already a way to instruct a wiki to use images from a foreign wiki as they are needed, with proper caching. On 1.16 it will be even easier, as you will only need to set $wgUseInstantCommons = true; to use Wikimedia Commons images. http://www.mediawiki.org/wiki/Manual:$wgUseInstantCommons

I'd really like to underline this last piece, as it's something I feel we're not promoting as heavily as we should be. With 1.16 making it a one-line switch to turn on, perhaps we should publicize this. Thanks to work Brion did in 1.13, which I picked up later on, any wiki can use files from Wikimedia Commons (or potentially any MediaWiki installation). As pointed out above, this has configurable caching that can be set as aggressively as you'd like. To mirror Wikipedia these days, all you'd need is the article and template dumps; point the ForeignAPIRepos at Commons and enwiki, and you've got yourself a working mirror. No need to dump the images and reimport them somewhere. Cache the thumbnails aggressively enough and you'll be hosting the images locally, in effect.
-Chad
Re: [Wikitech-l] downloading wikipedia database dumps
On Sat, Jan 9, 2010 at 9:27 AM, Carl (CBM) cbm.wikipe...@gmail.com wrote:
> On Sat, Jan 9, 2010 at 8:50 AM, Anthony wikim...@inbox.org wrote:
> > The original version of Instant Commons had it right. The files were sent straight from the WMF to the client. That version still worked last I checked, but my understanding is that it was deprecated in favor of the bandwidth-wasting "store files in a caching middle-man" approach.
>
> If I were a site admin using InstantCommons, I would want to keep a copy of all the images used anyway, in case they were deleted on Commons but I still wanted to use them on my wiki. - Carl

A valid suggestion, but I think it should be configurable either way. Some sites will want to use Wikimedia Commons but don't necessarily have the space to store thumbnails (much less the original sources). However, a "copy the source file too" option could be added for sites that would also like to fetch the original source file and import it locally. None of this is out of the realm of possibility. The main reason we went for the "render there, show thumbnail here" idea was to increase compatibility: not everyone has their wiki set up to render things like SVGs. By rendering remotely, you're only assuming the source repo (like Commons) is set up to render it, which is a valid assumption. By importing the image locally, you're possibly requesting remote files that you then can't render. Again, more configuration options for the different use cases are possible.

-Chad
Re: [Wikitech-l] downloading wikipedia database dumps
* Gregory Maxwell gmaxw...@gmail.com [Fri, 8 Jan 2010 21:06:11 -0500]:
> No one wants the monolithic tarball. The way I got updates previously was via a rsync push. No one sane would suggest a monolithic tarball: it's too much of a pain to produce!

Image dump != monolithic tarball. Why not extend the filerepo to make rsync or similar (maybe more efficient) incremental backups easy? An incremental, distributed filerepo.

Dmitriy
Re: [Wikitech-l] downloading wikipedia database dumps
On Fri, Jan 8, 2010 at 9:40 PM, Anthony wikim...@inbox.org wrote:
> Isn't that what the system immutable flag is for?

No, that's for confusing the real roots while providing only a speed bump to an actual hacker. Anyone with root access can always just unset the flag. Or, failing that, dd if=/dev/zero of=/dev/sda works pretty well.
Re: [Wikitech-l] downloading wikipedia database dumps
On Sat, Jan 9, 2010 at 11:09 PM, Aryeh Gregor simetrical+wikil...@gmail.com wrote:
> On Fri, Jan 8, 2010 at 9:40 PM, Anthony wikim...@inbox.org wrote:
> > Isn't that what the system immutable flag is for?
>
> No, that's for confusing the real roots while providing only a speed bump to an actual hacker. Anyone with root access can always just unset the flag. Or, failing that, dd if=/dev/zero of=/dev/sda works pretty well.

Depends on the machine's securelevel.
Re: [Wikitech-l] downloading wikipedia database dumps
On Sat, Jan 9, 2010 at 11:40 PM, Aryeh Gregor simetrical+wikil...@gmail.com wrote:
> On Sat, Jan 9, 2010 at 11:26 PM, Anthony wikim...@inbox.org wrote:
> > Depends on the machine's securelevel.
>
> Google informs me that securelevel is a BSD feature. Wikimedia uses Linux and Solaris.

Well, Greg's comment wasn't specific to Linux or Solaris. In any case, I don't know about Solaris, but Linux seems to have some sort of CAP_LINUX_IMMUTABLE and CAP_SYS_RAWIO. I'm sure Solaris has something similar.

> It doesn't hurt to have extra copies out there

Certainly not.
Re: [Wikitech-l] downloading wikipedia database dumps
Hello,

> Is the bandwidth used really a big problem? Bandwidth is pretty cheap these days, and given Wikipedia's total draw, I suspect the occasional dump download isn't much of a problem.

I am not sure about the cost of the bandwidth, but the wikipedia image dumps are no longer available on the wikipedia dump site anyway. I am guessing they were removed partly because of the bandwidth cost, or else image licensing issues perhaps. From http://en.wikipedia.org/wiki/Wikipedia_database#Images_and_uploaded_files:

"Currently Wikipedia does not allow or provide facilities to download all images. As of 17 May 2007, Wikipedia disabled or neglected all viable bulk downloads of images, including torrent trackers. Therefore, there is no way to download image dumps other than scraping Wikipedia pages or using Wikix, which converts a database dump into a series of scripts to fetch the images. Unlike most article text, images are not necessarily licensed under the GFDL or CC-BY-SA-3.0. They may be under one of many free licenses, in the public domain, believed to be fair use, or even copyright infringements (which should be deleted). In particular, use of fair-use images outside the context of Wikipedia or similar works may be illegal. Images under most licenses require a credit, and possibly other attached copyright information. This information is included in image description pages, which are part of the text dumps available from download.wikimedia.org. In conclusion, download these images at your own risk."

> Bittorrent's real strength is when a lot of people want to download the same thing at once. E.g., when a new Ubuntu release comes out. Since Bittorrent requires all downloaders to be uploaders, it turns the flood of users into a benefit. But unless somebody has stats otherwise, I'd guess that isn't the problem here.
Bittorrent is simply a more efficient method of distributing files, especially if the much larger wikipedia image files were made available again. The last dump from English wikipedia including images is over 200 GB, but is understandably not available for download. Even if there are only 10 people per month who download these large files, bittorrent should be able to reduce the bandwidth cost to wikipedia significantly. Also, I think that setting up bittorrent for this would cost wikipedia only a small amount, and may save money in the long run, as well as encourage people to experiment with offline encyclopedia usage etc.

Making people crawl wikipedia with Wikix if they want to download the images is a bad solution, as it means that the images are downloaded inefficiently. Also, one wikix user reported that his download connection was cut off by a wikipedia admin for remote downloading. Unless there are legal reasons for not allowing images to be downloaded, I think the wikipedia image files should be made available for efficient download again. However, since wikix can theoretically be used to download the images anyway, I think it would also be legal to allow the image dump to be downloaded as well. Thoughts?

cheers,
Jamie Morken
Re: [Wikitech-l] downloading wikipedia database dumps
On Fri, Jan 8, 2010 at 4:31 PM, Jamie Morken jmor...@shaw.ca wrote:
> Bittorrent is simply a more efficient method to distribute files, especially if the much larger wikipedia image files were made available again. The last dump from english wikipedia including images is over 200GB but is understandably not available for download. Even if there are only 10 people per month who download these large files, bittorrent should be able to reduce the bandwidth cost to wikipedia significantly.

The problem with BitTorrent is that it is unsuitable for rapidly changing data sets, such as images. If you want to add a single file to the torrent, the entire torrent hash changes, meaning that you end up with separate peer pools for every different data set, even though they mostly contain the same files. That said, it could of course be beneficial for an initial dump download, and is better than the current situation where there is nothing available at all.

Bryan
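The hash problem described above can be illustrated without any torrent tooling: a torrent's identity (its infohash) is a SHA-1 over metadata covering the complete file list and piece hashes, so adding one file yields a brand-new identity and a disjoint peer pool. The sketch below hashes a manifest of each file set as a stand-in for a real info dictionary; the files and paths are made up for illustration:

```shell
# Two snapshots of an image dump that differ by one added file.
mkdir -p /tmp/dump_v1 /tmp/dump_v2
echo "image a" > /tmp/dump_v1/a.jpg
cp /tmp/dump_v1/a.jpg /tmp/dump_v2/a.jpg
echo "image b" > /tmp/dump_v2/b.jpg

# Hash each set's manifest (file names plus content hashes), the way an
# infohash covers a torrent's whole file list and piece hashes.
manifest_hash() { (cd "$1" && sha1sum *) | sha1sum | cut -d' ' -f1; }
h1=$(manifest_hash /tmp/dump_v1)
h2=$(manifest_hash /tmp/dump_v2)

# The identities differ, so peers on the v1 "torrent" cannot serve v2
# peers, even though a.jpg is byte-for-byte identical in both sets.
echo "v1: $h1"
echo "v2: $h2"
```

This is why each updated dump would spawn a fresh swarm instead of reusing the seeders of the previous one.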
Re: [Wikitech-l] downloading wikipedia database dumps
On Fri, Jan 8, 2010 at 10:31 AM, Jamie Morken jmor...@shaw.ca wrote:
> I am not sure about the cost of the bandwidth, but the wikipedia image dumps are no longer available on the wikipedia dump anyway. I am guessing they were removed partly because of the bandwidth cost, or else image licensing issues perhaps.

I think we just don't have infrastructure set up to dump images. I'm very sure bandwidth is not an issue: the number of people with a terabyte (or is it more?) handy that they want to download a Wikipedia image dump to will be vanishingly small compared to normal users. Licensing wouldn't be an issue for Commons, at least, as long as it's easy to link the images up to their license pages. (I imagine it would technically violate some licenses, but probably no one would worry about it.)

> Bittorrent is simply a more efficient method to distribute files, especially if the much larger wikipedia image files were made available again. Even if there are only 10 people per month who download these large files, bittorrent should be able to reduce the bandwidth cost to wikipedia significantly.

Wikipedia uses an average of multiple gigabits per second of bandwidth, as I recall. One gigabit per second adds up to about 10.5 terabytes per day, so say 300 terabytes per month. I'm pretty sure the average figure is more like five or ten Gbps than one, so let's say a petabyte a month at least. Ten people per month downloading an extra terabyte is not a big issue, and I really doubt we'd see that many people downloading a full image dump every month.

The sensible bandwidth-saving way to do it would be to set up an rsync daemon on the image servers, and let people use that. Then you could get an old copy of the files from anywhere (including Bittorrent, if you like) and only have to download the changes.
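The arithmetic above is easy to check (the rates are the rough estimates from the email, not measured figures):

```shell
# 1 Gbit/s = 125 MB/s; a day has 86400 seconds.
mb_per_day=$((125 * 86400))                   # 10,800,000 MB, i.e. ~10.8 TB/day
tb_per_month=$((mb_per_day * 30 / 1000000))   # TB per 30-day month at 1 Gbit/s
echo "${tb_per_month} TB/month"               # prints "324 TB/month"
```

So "about 10.5 TB per day, say 300 TB per month" is the right order of magnitude, and five to ten times that is indeed roughly a petabyte a month.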
Plus, you could get up-to-the-minute copies if you like, although some throttling should probably be put in place to stop dozens of people from all running rsync in a loop to make sure they have the absolute latest version. I believe rsync 2 doesn't handle such huge numbers of files acceptably, but I hear rsync 3 is supposed to be much better. That sounds like a better direction to look in than Bittorrent; nobody's going to want to redownload the same files constantly to get an up-to-date set.

> Unless there are legal reasons for not allowing images to be downloaded, I think the wikipedia image files should be made available for efficient download again.

I'm pretty sure the reason there's no image dump is purely that not enough resources have been devoted to getting it working acceptably.
Re: [Wikitech-l] downloading wikipedia database dumps
On Fri, Jan 8, 2010 at 10:56 AM, Aryeh Gregor simetrical+wikil...@gmail.com wrote:
> I think we just don't have infrastructure set up to dump images. I'm very sure bandwidth is not an issue -- the number of people with a terabyte (or is it more?) handy that they want to download a Wikipedia image dump to will be vanishingly small compared to normal users.

Correct. The space wasn't available for the required intermediate cop(y|ies). And s/terabyte/several terabytes/ -- my copy is not up to date, but it's not smaller than 4.

> Licensing wouldn't be an issue for Commons, at least, as long as it's easy to link the images up to their license pages. (I imagine it would technically violate some licenses, but no one would probably worry about it.)

We also dump the licensing information. If we can lawfully put the images on the website then we can also distribute them in dump form. There is and can be no licensing problem.

> Wikipedia uses an average of multiple gigabits per second of bandwidth, as I recall.

http://www.nedworks.org/~mark/reqstats/trafficstats-daily.png

Though only this part is paid for: http://www.nedworks.org/~mark/reqstats/transitstats-daily.png -- the rest is peering, etc., which is only paid for in the form of equipment, port fees, and operational costs.

> The sensible bandwidth-saving way to do it would be to set up an rsync daemon on the image servers, and let people use that.

This was how I maintained a running mirror for a considerable time. Unfortunately the process broke when WMF ran out of space and needed to switch servers.

On Fri, Jan 8, 2010 at 10:31 AM, Jamie Morken jmor...@shaw.ca wrote:
> Bittorrent is simply a more efficient method to distribute files,

No.
In a very real, absolute sense, bittorrent is considerably less efficient than other means. Bittorrent moves more of the outbound traffic to the edges of the network, where the real cost per gbit/sec is much greater than at major datacenters: a megabit on a low-speed link is more costly than a megabit on a high-speed link, and a megabit on 1 mile of fiber is more expensive than a megabit on 10 feet of fiber. Moreover, bittorrent is topology-unaware, so the path length tends to approach the internet's average mean path length. Datacenters tend to be more centrally located topology-wise, and topology-aware distribution is easily applied to centralized stores. (E.g. WMF satisfies requests from Europe in Europe, though not for the dump downloads, as there simply isn't enough traffic to justify it.)

Bittorrent is also a more complicated, higher-overhead service which requires more memory and more disk IO than traditional transfer mechanisms. There are certainly cases where bittorrent is valuable, such as the flash-mob case of a new OS release. This really isn't one of those cases.

On Thu, Jan 7, 2010 at 11:52 AM, William Pietri will...@scissor.com wrote:
> On 01/07/2010 01:40 AM, Jamie Morken wrote:
> > I have a suggestion for wikipedia!! I think that the database dumps including the image files should be made available by a wikipedia bittorrent tracker so that people would be able to download the wikipedia backups including the images (which currently they can't do) and also so that wikipedia's bandwidth costs would be reduced. [...]
>
> Is the bandwidth used really a big problem? Bandwidth is pretty cheap these days, and given Wikipedia's total draw, I suspect the occasional dump download isn't much of a problem. Bittorrent's real strength is when a lot of people want to download the same thing at once. E.g., when a new Ubuntu release comes out. Since Bittorrent requires all downloaders to be uploaders, it turns the flood of users into a benefit.
> But unless somebody has stats otherwise, I'd guess that isn't the problem here.

We tried BT for the Commons POTY (Picture of the Year) archive once while I was watching, and we never had a downloader stay connected long enough to help another downloader... and that was only 500 MB, much easier to seed. BT also makes the server costs a lot higher: it has more cpu/memory overhead, and creates a lot of random disk IO. For low-volume large files it's often not much of a win.

I haven't seen the numbers for a long time, but when I last looked download.wikimedia.org was producing fairly little traffic... and much of what it was producing was outside of the peak busy hour for the sites. Since the transit is paid for on the 95th percentile, and the WMF still has a decent day/night swing, out-of-peak traffic is effectively free. The bandwidth is nothing to worry about.
Re: [Wikitech-l] downloading wikipedia database dumps
William Pietri wrote:
> On 01/07/2010 01:40 AM, Jamie Morken wrote:
> > I have a suggestion for wikipedia!! I think that the database dumps including the image files should be made available by a wikipedia bittorrent tracker so that people would be able to download the wikipedia backups including the images (which currently they can't do) and also so that wikipedia's bandwidth costs would be reduced. [...]
>
> Is the bandwidth used really a big problem? Bandwidth is pretty cheap these days, and given Wikipedia's total draw, I suspect the occasional dump download isn't much of a problem.

No, bandwidth is not really the problem here. I think the core issue is having bulk access to images. There have been a number of these requests in the past, and after talking back and forth it has usually been the case that a smaller subset of the data works just as well. A good example of this was the Deutsche Fotothek archive made late last year: http://download.wikipedia.org/images/Deutsche_Fotothek.tar (11 GB). This provided an easily retrievable, high-quality subset of our image data which researchers could use.

Now, if we were to snapshot image data and store it per project, the amount of duplicate image data would become significant. That's because we re-use a ton of image data between projects, and rightfully so. If instead we package all of Commons into a tarball, then we get roughly 6 TB of image data, which after numerous conversations has been a bit more than most people want to process.

So what does everyone think of going down the collections route? If we provide enough different and up-to-date ones, then we could easily give people a large but manageable amount of data to work with. If there is a page already for this then please feel free to point me to it; otherwise I'll create one.

--tomasz
Re: [Wikitech-l] downloading wikipedia database dumps
On Fri, Jan 8, 2010 at 8:24 AM, Gregory Maxwell gmaxw...@gmail.com wrote:
> s/terabyte/several terabytes/ My copy is not up to date, but it's not smaller than 4.

The topmost versions of Commons files are about 4.9 TB; files on enwiki but not Commons add another 200 GB or so.

-Robert Rohde
Re: [Wikitech-l] downloading wikipedia database dumps
I think having access to them on the Commons repository is much easier to handle, and a subset should be good enough. Having 11 TB of images requires huge research capabilities in order to handle and work with all of them. Maybe a special API, or advanced API functions, would allow people enough access while saving the bandwidth and the hassle of handling this behemoth collection.

bilal
--
Verily, with hardship comes ease.

On Fri, Jan 8, 2010 at 1:57 PM, Tomasz Finc tf...@wikimedia.org wrote:
> No, bandwidth is not really the problem here. I think the core issue is to have bulk access to images. [...] So what does everyone think of going down the collections route? If we provide enough different and up to date ones then we could easily give people a large but manageable amount of data to work with. If there is a page already for this then please feel free to point me to it otherwise I'll create one.
Re: [Wikitech-l] downloading wikipedia database dumps
On Fri, Jan 8, 2010 at 3:28 PM, Bilal Abdul Kader bila...@gmail.com wrote:
> Maybe a special API or advanced API functions would allow people enough access and at the same time save the bandwidth and the hassle to handle this behemoth collection.

Well, if there were an rsyncd you could just fetch the ones you wanted arbitrarily.
Re: [Wikitech-l] downloading wikipedia database dumps
> Well, if there were an rsyncd you could just fetch the ones you wanted arbitrarily.

rsyncd is fail for large-file mass delivery, and it is fail when exposed to the masses.

Domas
Re: [Wikitech-l] downloading wikipedia database dumps
Can someone articulate what the use case is? Is there someone out there who could use a 5 TB image archive but is disappointed it doesn't exist? That seems rather implausible. If not, then I assume that everyone is really after only some subset of the files. If that's the case, we should try to figure out what kinds of subsets, and the best way to handle them.

-Robert Rohde
Re: [Wikitech-l] downloading wikipedia database dumps
Gregory Maxwell wrote:
> Er. I've maintained a non-WMF disaster recovery archive for a long time, though it's no longer completely current since the rsync went away and web fetching is lossy.

And the box ran out of disk space. We could try until it fills again, though. A sysadmin fixing images with wrong hashes would also be nice: https://bugzilla.wikimedia.org/show_bug.cgi?id=17057#c3

> It saved our rear a number of times, saving thousands of images from irreparable loss. Moreover, it allowed things like image hashing before we had that in the database, and it would allow perceptual lossy hash matching if I ever got around to implementing tools to access the output.

IMHO the problem is not accessing it, but hashing those terabytes of images.

> There really are use cases. Moreover, making complete copies of the public data available as dumps to the public is a WMF board-supported initiative.
Re: [Wikitech-l] downloading wikipedia database dumps
On Fri, Jan 8, 2010 at 10:56 AM, Aryeh Gregor simetrical+wikil...@gmail.com wrote:
> The sensible bandwidth-saving way to do it would be to set up an rsync daemon on the image servers, and let people use that.

The bandwidth-saving way to do things would be to just allow mirrors to use hotlinking. Requiring a middle man to temporarily store images (many, and possibly even most, of which will never even be downloaded by end users) just wastes bandwidth.
Re: [Wikitech-l] downloading wikipedia database dumps
On Fri, Jan 8, 2010 at 9:06 PM, Gregory Maxwell gmaxw...@gmail.com wrote:
> Yea, well, you can't easily eliminate all the internal points of failure. "Someone with root loses control of their access and someone nasty wipes everything" is really hard to protect against with online systems.

Isn't that what the system immutable flag is for? It's easy, as long as you're willing to put up with a bit of whining from the person with root access.
[Wikitech-l] downloading wikipedia database dumps
Hi,

I have a suggestion for wikipedia!! I think that the database dumps, including the image files, should be made available by a wikipedia bittorrent tracker, so that people would be able to download the wikipedia backups including the images (which currently they can't do), and also so that wikipedia's bandwidth costs would be reduced. I think it is important that people can download wikipedia for offline use, now and in the future.

best regards,
Jamie Morken
Re: [Wikitech-l] downloading wikipedia database dumps
Jamie Morken wrote:
> Hi, I have a suggestion for wikipedia!! I think that the database dumps including the image files should be made available by a wikipedia bittorrent tracker so that people would be able to download the wikipedia backups including the images (which currently they can't do) and also so that wikipedia's bandwidth costs would be reduced. I think it is important that wikipedia can be downloaded for using it offline now and in the future for people.

Has been tried before (when they were smaller). How many people do you think will have the necessary space and be willing to download it?
Re: [Wikitech-l] downloading wikipedia database dumps
I have been using the dumps for a few months, and I think this kind of dump is much better than a torrent. Yes, bandwidth can be saved, but I do not think the cost of bandwidth is higher than the cost of maintaining the torrents. If people are not hosting the files, then the value of torrents is limited. I think regular mirroring is much better, but it all depends on the willingness of people to host the files.

bilal
--
Verily, with hardship comes ease.

On Thu, Jan 7, 2010 at 11:30 AM, Platonides platoni...@gmail.com wrote:
> Has been tried before (when they were smaller). How many people do you think will have the necessary space and be willing to download it?
Re: [Wikitech-l] downloading wikipedia database dumps
On 01/07/2010 01:40 AM, Jamie Morken wrote:
> I have a suggestion for wikipedia!! I think that the database dumps including the image files should be made available by a wikipedia bittorrent tracker so that people would be able to download the wikipedia backups including the images (which currently they can't do) and also so that wikipedia's bandwidth costs would be reduced. [...]

Is the bandwidth used really a big problem? Bandwidth is pretty cheap these days, and given Wikipedia's total draw, I suspect the occasional dump download isn't much of a problem. Bittorrent's real strength is when a lot of people want to download the same thing at once, e.g. when a new Ubuntu release comes out. Since Bittorrent requires all downloaders to be uploaders, it turns the flood of users into a benefit. But unless somebody has stats otherwise, I'd guess that isn't the problem here.

William