[Wikitech-l] dataset1, xml dumps

2010-12-14 Thread Ariel T. Glenn
For folks who have not been following the saga on http://wikitech.wikimedia.org/view/Dataset1 we were able to get the raid array back in service last night on the XML data dumps server, and we are now busily copying data off of it to another host. There's about 11T of dumps to copy over; once tha
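A transfer of that size is typically driven by rsync so that an interrupted copy can resume where it left off. A minimal sketch of such a copy; the host name and paths below are hypothetical, not taken from the thread:

    # Resumable bulk copy of the dump tree to a second host
    # ("backuphost" and /data/xmldumps are illustrative names only)
    rsync -av --partial --progress /data/xmldumps/ backuphost:/data/xmldumps/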

Re: [Wikitech-l] dataset1, xml dumps

2010-12-14 Thread Brion Vibber
Great news! Thanks for the update and thanks for all you guys' work getting it beaten back into shape. Keeping fingers crossed for all going well on the transfer... -- brion On Dec 14, 2010 1:12 AM, "Ariel T. Glenn" wrote: > For folks who have not been following the saga on > http://wikitech.wiki

Re: [Wikitech-l] dataset1, xml dumps

2010-12-14 Thread Diederik van Liere
+1 Diederik On 2010-12-14, at 12:02, Brion Vibber wrote: > Great news! Thanks for the update and thanks for all you guys' work getting > it beaten back into shape. Keeping fingers crossed for all going well on the > transfer... > > -- brion > On Dec 14, 2010 1:12 AM, "Ariel T. Glenn" wrote: >

Re: [Wikitech-l] dataset1, xml dumps

2010-12-14 Thread emijrp
Thanks. Double good news: http://lists.wikimedia.org/pipermail/foundation-l/2010-December/063088.html 2010/12/14 Ariel T. Glenn > For folks who have not been following the saga on > http://wikitech.wikimedia.org/view/Dataset1 > we were able to get the raid array back in service last night on th

Re: [Wikitech-l] dataset1, xml dumps

2010-12-15 Thread Ariel T. Glenn
We now have a copy of the dumps on a backup host. Although we are still resolving hardware issues on the XML dumps server, we think it is safe enough to serve the existing dumps read-only. DNS was updated to that effect already; people should see the dumps within the hour. Ariel

Re: [Wikitech-l] dataset1, xml dumps

2010-12-15 Thread masti
Good news, but looking from a professional point of view, having them on just one array will lead to outages like this. Any plans for a tape backup or a mirror? masti On 12/15/2010 08:57 PM, Ariel T. Glenn wrote: > We now have a copy of the dumps on a backup host. Although we are still > resolving

Re: [Wikitech-l] dataset1, xml dumps

2010-12-15 Thread Ariel T. Glenn
Currently the files have been copied off of the server onto a backup host, which is the only reason we feel safe about serving them again. We will be getting a new host (it is due to be shipped soon) which will host the live data. The current server will have a backup copy. That is the short term

Re: [Wikitech-l] dataset1, xml dumps

2010-12-15 Thread Anthony
On Wed, Dec 15, 2010 at 3:30 PM, Ariel T. Glenn wrote: > We are interested in other mirrors of the dumps; see > > http://meta.wikimedia.org/wiki/Mirroring_Wikimedia_project_XML_dumps On the talk page, it says "torrents are useful to save bandwidth, which is not our problem". If bandwidth is not

Re: [Wikitech-l] dataset1, xml dumps

2010-12-15 Thread Ariel T. Glenn
On 15-12-2010, Wed, at 15:57 -0500, Anthony wrote: > On Wed, Dec 15, 2010 at 3:30 PM, Ariel T. Glenn wrote: > > We are interested in other mirrors of the dumps; see > > > > http://meta.wikimedia.org/wiki/Mirroring_Wikimedia_project_XML_dumps > > On the talk page, it says "torren

Re: [Wikitech-l] dataset1, xml dumps

2010-12-15 Thread Bryan Tong Minh
On Wed, Dec 15, 2010 at 10:03 PM, Ariel T. Glenn wrote: > > We certainly want people to host it as well.  It's not a matter of > bandwidth but of protection: if someone can't get to our copy for > whatever reason, another copy is accessible. > Is there a copy in Amsterdam? Seems like that would be

Re: [Wikitech-l] dataset1, xml dumps

2010-12-15 Thread Ariel T. Glenn
On 15-12-2010, Wed, at 22:50 +0100, Bryan Tong Minh wrote: > On Wed, Dec 15, 2010 at 10:03 PM, Ariel T. Glenn wrote: > > > > We certainly want people to host it as well. It's not a matter of > > bandwidth but of protection: if someone can't get to our copy for > > whatever reaso

Re: [Wikitech-l] dataset1, xml dumps

2010-12-15 Thread Lars Aronsson
On 12/15/2010 09:30 PM, Ariel T. Glenn wrote: > We are interested in other mirrors of the dumps; see > > http://meta.wikimedia.org/wiki/Mirroring_Wikimedia_project_XML_dumps Just as a small-scale experiment, I tried to mirror the Faroese (fowiki) and Sami (sewiki) language projects. But "wget -m"
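For a small per-project mirror of the kind Lars describes, wget can fetch one project's directory recursively while staying inside that path. A rough sketch, assuming the server lays dumps out under a per-project directory (the exact paths on download.wikimedia.org may differ):

    # Mirror the fowiki dump directory without climbing to the parent
    # -m: mirror (recursion + timestamps), -np: no parent, -nH: no host directory
    wget -m -np -nH http://download.wikimedia.org/fowiki/latest/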

Re: [Wikitech-l] dataset1, xml dumps

2010-12-16 Thread Gabriel Weinberg
Ariel T. Glenn writes: > > We now have a copy of the dumps on a backup host. Although we are still > resolving hardware issues on the XML dumps server, we think it is safe > enough to serve the existing dumps read-only. DNS was updated to that > effect already; people should see

Re: [Wikitech-l] dataset1, xml dumps

2010-12-16 Thread emijrp
Have you checked the md5sum? 2010/12/16 Gabriel Weinberg > Ariel T. Glenn writes: > > > > We now have a copy of the dumps on a backup host. Although we are still > > resolving hardware issues on the XML dumps server, we think it is safe > > enough to serve the existing dumps r

Re: [Wikitech-l] dataset1, xml dumps

2010-12-16 Thread Gabriel Weinberg
md5sum doesn't match. I get e74170eaaedc65e02249e1a54b1087cb (as opposed to 7a4805475bba1599933b3acd5150bd4d on http://download.wikimedia.org/enwiki/20101011/enwiki-20101011-md5sums.txt ). I've downloaded it twice now and have gotten the same md5sum. Can anyone else confirm? On Thu, Dec 16, 2010
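For reference, a download can be checked against the published list mechanically rather than by eye. A sketch, assuming the file in question is the pages-articles dump (the thread does not name the exact file) and that it sits in the current directory:

    # Fetch the published checksums and verify the one matching line
    wget http://download.wikimedia.org/enwiki/20101011/enwiki-20101011-md5sums.txt
    grep pages-articles.xml.bz2 enwiki-20101011-md5sums.txt | md5sum -c -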

Re: [Wikitech-l] dataset1, xml dumps

2010-12-16 Thread emijrp
If the md5s don't match, the files are obviously different; that is, one of them is corrupt. What is the size of your local file? I usually download dumps with the wget UNIX command and I don't get errors. If you are using FAT32, file size is limited to 4 GB and larger files are silently truncated. Is that your case?
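One quick way to test the truncation theory is to compare the local size with what the server reports; the file name here is assumed for illustration, as above:

    # Local size in bytes
    ls -l enwiki-20101011-pages-articles.xml.bz2
    # Remote size: --spider fetches headers only and prints the Length
    wget --spider http://download.wikimedia.org/enwiki/20101011/enwiki-20101011-pages-articles.xml.bz2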

Re: [Wikitech-l] dataset1, xml dumps

2010-12-16 Thread Gabriel Weinberg
I've been downloading this file (using wget on ubuntu or fetch on FreeBSD) with no issues for years. The current one is 6.2GB as it should be. On Thu, Dec 16, 2010 at 5:53 PM, emijrp wrote: > If the md5s don't match, the files are obviously different, I mean, one of > them is corrupt. > > What i

Re: [Wikitech-l] dataset1, xml dumps

2010-12-16 Thread Ariel T. Glenn
I was able to unzip a copy of the file on another host (taken from the same location) without problems. On the download host itself I get the correct md5sum: 7a4805475bba1599933b3acd5150bd4d Ariel On 16-12-2010, Thu, at 17:48 -0500, Gabriel Weinberg wrote: > md5sum doesn't match

Re: [Wikitech-l] dataset1, xml dumps

2010-12-16 Thread Gabriel Weinberg
Thx--I guess I'll try again--third time's the charm I suppose :) Sorry to waste your time, Gabriel On Thu, Dec 16, 2010 at 6:13 PM, Ariel T. Glenn wrote: > I was able to unzip a copy of the file on another host (taken from the > same location) without problems. On the download host itself I g

Re: [Wikitech-l] dataset1, xml dumps

2010-12-16 Thread Platonides
Gabriel Weinberg wrote: > md5sum doesn't match. I get e74170eaaedc65e02249e1a54b1087cb (as > opposed to 7a4805475bba1599933b3acd5150bd4d > on http://download.wikimedia.org/enwiki/20101011/enwiki-20101011-md5sums.txt > ). > > I've downloaded it twice now and have gotten the same md5sum. Can anyone

Re: [Wikitech-l] dataset1, xml dumps

2010-12-20 Thread Ariel T. Glenn
Google donated storage space for backups of the XML dumps. Accordingly, a copy of the latest complete dump for each project is being copied over (public files only). We expect to run similar copies once every two weeks, keeping the four latest copies as well as one permanent copy at every six month
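The "keep the four latest" part of that policy amounts to a simple pruning step; a hypothetical sketch, not the actual copy job (directory names and layout are invented for illustration):

    # Prune date-named biweekly snapshots, keeping only the four newest
    # (permanent six-month copies would live outside this directory)
    cd /backups/xmldumps
    ls -d 20*/ | sort | head -n -4 | xargs -r rm -rf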

Re: [Wikitech-l] dataset1, xml dumps

2010-12-28 Thread Ed Summers
On Wed, Dec 15, 2010 at 4:56 PM, Ariel T. Glenn wrote: > We want people besides us to host it.  We expect to put a copy at the > new data center (at least), as well. Does anyone know if the Wikipedia XML Data AWS Public Dataset [1] is being routinely updated? It's showing a last update of "Septem