Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008
Brion Vibber wrote:
>>> Decompression takes as long as compression with bzip2
>
>> I think decompression is *faster* than compression
>> http://tukaani.org/lzma/benchmarks
>
> LZMA is nice and fast to decompress... but *insanely* slower to
> compress, and doesn't seem as parallelizable. :(
>
> -- brion

I used the lzma benchmark as evidence to support that decompressing
bzip2 is faster than compressing.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
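The bzip2 claim is easy to sanity-check locally. A rough, self-contained timing sketch (the corpus size and contents are arbitrary placeholders, and this is not a rigorous benchmark):

```python
# Compare bzip2 compression vs decompression time on the same buffer.
import bz2
import time

data = b"the quick brown fox jumps over the lazy dog " * 200_000  # ~8.8 MB

t0 = time.perf_counter()
compressed = bz2.compress(data, compresslevel=9)
t_compress = time.perf_counter() - t0

t0 = time.perf_counter()
restored = bz2.decompress(compressed)
t_decompress = time.perf_counter() - t0

assert restored == data
print(f"compress:   {t_compress:.3f}s")
print(f"decompress: {t_decompress:.3f}s")
print(f"ratio: {len(compressed) / len(data):.4f}")
```

On typical text, decompression finishes well ahead of compression, which is the point being argued above.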
Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008
On Thu, Mar 26, 2009 at 8:51 PM, ERSEK Laszlo wrote:
> On 03/27/09 01:14, Brion Vibber wrote:
>
> > LZMA is nice and fast to decompress... but *insanely* slower to
> > compress, and doesn't seem as parallelizable. :(
>
> The xz file format should allow for "easy" parallelization, both when
> compressing and decompressing; see
>
> http://tukaani.org/xz/xz-file-format.txt
>
>     3. Block
>        3.1. Block Header
>             3.1.1. Block Header Size
>             3.1.3. Compressed Size
>             3.1.4. Uncompressed Size
>             3.1.6. Header Padding
>        3.3. Block Padding
>
> At least in theory, this "length-prefixing" should make it fairly
> straightforward to write a multi-threaded decompressor with a splitter
> that can work from a pipe and is input-bound. I reckon the xz
> structure will eventually prove useful even for distributed
> compression/decompression.
>
> lacos

It includes an index for random access too. Cool. I wonder what kind of
block size you'd need to get a compression ratio approaching that of 7z.
Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008
On 03/27/09 01:14, Brion Vibber wrote:
> LZMA is nice and fast to decompress... but *insanely* slower to
> compress, and doesn't seem as parallelizable. :(

The xz file format should allow for "easy" parallelization, both when
compressing and decompressing; see

http://tukaani.org/xz/xz-file-format.txt

    3. Block
       3.1. Block Header
            3.1.1. Block Header Size
            3.1.3. Compressed Size
            3.1.4. Uncompressed Size
            3.1.6. Header Padding
       3.3. Block Padding

At least in theory, this "length-prefixing" should make it fairly
straightforward to write a multi-threaded decompressor with a splitter
that can work from a pipe and is input-bound. I reckon the xz structure
will eventually prove useful even for distributed
compression/decompression.

lacos
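The scheme lacos describes can be illustrated with a toy codec: compress fixed-size blocks independently and length-prefix each compressed block, so both compression and decompression parallelize across workers. The block size and the 4-byte framing below are our own assumptions for illustration, not the actual xz container format, and bz2 stands in for LZMA:

```python
# Toy block-parallel codec: independent blocks + length prefixes.
import bz2
import struct
from concurrent.futures import ProcessPoolExecutor

BLOCK_SIZE = 256 * 1024  # arbitrary; larger blocks compress better


def compress_blocks(data: bytes) -> bytes:
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    with ProcessPoolExecutor() as pool:
        compressed = list(pool.map(bz2.compress, blocks))
    # 4-byte big-endian length prefix per block: the "length-prefixing"
    # that lets a splitter hand blocks to workers straight off a pipe.
    return b"".join(struct.pack(">I", len(c)) + c for c in compressed)


def decompress_blocks(stream: bytes) -> bytes:
    blocks, pos = [], 0
    while pos < len(stream):
        (n,) = struct.unpack_from(">I", stream, pos)
        blocks.append(stream[pos + 4:pos + 4 + n])
        pos += 4 + n
    with ProcessPoolExecutor() as pool:
        return b"".join(pool.map(bz2.decompress, blocks))


if __name__ == "__main__":
    data = b"<page><title>Example</title></page>\n" * 100_000
    assert decompress_blocks(compress_blocks(data)) == data
```

The catch, relevant to the 7z-ratio question above: each block resets the compressor's state, so smaller blocks mean worse compression. Block size trades parallelism and random access against ratio.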
Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008
On 3/26/09 3:25 PM, Keisial wrote:
> Quite interesting. Can the images at office.wikimedia.org be moved to
> somewhere public?

I've copied those two to the public wiki. :)

>> Decompression takes as long as compression with bzip2
> I think decompression is *faster* than compression
> http://tukaani.org/lzma/benchmarks

LZMA is nice and fast to decompress... but *insanely* slower to
compress, and doesn't seem as parallelizable. :(

-- brion
Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008
Tomasz Finc wrote:
> I've started drafting some new ideas at
> http://wikitech.wikimedia.org/view/Data_dump_redesign
>
> of the various problems that we're facing and what kind of job
> management we can put around it. We're taking this on as a full
> "should have been done 2 years ago" project and I'm going to be
> shepherding this along.
>
> Right now I'm collecting stats about the throughput of the components
> to see how much in parallel this could be farmed out in a job
> management system.
>
> This is a large project that has some distinct problem areas that
> we'll be isolating and welcoming help on.
>
> --tomasz

Quite interesting. Can the images at office.wikimedia.org be moved to
somewhere public?

> Decompression takes as long as compression with bzip2

I think decompression is *faster* than compression
http://tukaani.org/lzma/benchmarks

Let me know if I can help with anything.
Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008
On 3/25/09 10:08 AM, Christian Storm wrote:
> Thanks to everyone who got the enwiki dumps going again! Should we
> expect more regular dumps now? What was the final solution of fixing
> this?

Lots of love and upkeep by everyone :) But really it needs to be more
automated and parallelized so that we can spot issues faster, validate
inconsistencies, and finish quicker. Brion and I have met about this,
and we've even brought it into the Wikimedia dev meetings to brainstorm
how the system could change for the better.

I've started drafting some new ideas at

http://wikitech.wikimedia.org/view/Data_dump_redesign

of the various problems that we're facing and what kind of job
management we can put around it. We're taking this on as a full "should
have been done 2 years ago" project and I'm going to be shepherding this
along.

Right now I'm collecting stats about the throughput of the components to
see how much in parallel this could be farmed out in a job management
system.

This is a large project that has some distinct problem areas that we'll
be isolating and welcoming help on.

--tomasz
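The "farm it out in a job management system" idea can be sketched as independent per-page-range jobs in a worker pool, so one stuck job stalls only its own range rather than the whole dump. The job boundaries and worker body below are placeholders, not the real dump pipeline:

```python
# Toy job-management sketch: split the dump into independent jobs and
# run them in a process pool, collecting results as they finish.
from concurrent.futures import ProcessPoolExecutor, as_completed


def dump_range(job):
    start, end = job
    # Placeholder for "dump pages start..end to its own output file".
    return (start, end, end - start)


if __name__ == "__main__":
    jobs = [(i, i + 1000) for i in range(0, 10_000, 1000)]
    with ProcessPoolExecutor(max_workers=4) as pool:
        futures = {pool.submit(dump_range, j): j for j in jobs}
        for fut in as_completed(futures):
            start, end, n = fut.result()
            print(f"pages {start}-{end}: {n} dumped")
```

A failed job here can simply be resubmitted, which is exactly the "don't start the whole dump over" property the redesign is after.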
Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008
Toolserver users don't have access to text.

On Wed, Mar 25, 2009 at 7:05 PM, Brian wrote:
> Perhaps the toolserver can make you a current dump of current en?
>
> [snip]
Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008
Perhaps the toolserver can make you a current dump of current en?

On Wed, Mar 25, 2009 at 11:08 AM, Christian Storm wrote:
> Thanks to everyone who got the enwiki dumps going again! Should we
> expect more regular dumps now? What was the final solution of fixing
> this?
>
> [snip]
Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008
Thanks to everyone who got the enwiki dumps going again! Should we
expect more regular dumps now? What was the final solution of fixing
this?

> We are having to resort to crawling en.wikipedia.org while we await
> regular dumps.
>
> [snip]
>
> Christian
Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008
Brion,

We are having to resort to crawling en.wikipedia.org while we await
regular dumps. What is the minimum crawling delay we can get away with?
I figure if we have a 1-second delay then we'd be able to crawl the 2+
million articles in a month.

I know crawling is discouraged, but it seems a lot of parties still do
so after looking at robots.txt. I have to assume that is how Google et
al. are able to keep up to date.

Are there private data feeds? I noticed a wg_enwiki dump listed.

Christian

On Jan 28, 2009, at 10:47 AM, Christian Storm wrote:
> That would be great. I second this notion wholeheartedly.
>
> [snip]
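For reference, the back-of-the-envelope crawl estimate above works out as follows (the 2-million figure is the rough article count from the thread):

```python
# Crawl duration at a fixed politeness delay, one request per article.
articles = 2_000_000
delay_s = 1.0
days = articles * delay_s / 86_400  # seconds per day
print(f"{days:.1f} days")  # → 23.1 days, i.e. just under a month
assert 20 < days < 31
```

So a 1-second delay does put a full pass within a month, ignoring request latency and retries.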
Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008
On Thu, Jan 29, 2009 at 11:20 AM, Brion Vibber wrote:
> On 1/28/09 8:32 AM, Brion Vibber wrote:
>> Probably wise to poke in a hack to skip the history first. :)
>
> Done in r46545.
>
> Updated dump scripts and canceled the old enwiki dump.
>
> New dumps also will be attempting to generate log output as XML which
> correctly handles the deletion/oversighting options; we'll see how
> that goes. :)

Is there somewhere that explains (or at least gives an example of) the
new logging format and what has changed?

-Robert Rohde
Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008
On 1/28/09 8:32 AM, Brion Vibber wrote:
> Probably wise to poke in a hack to skip the history first. :)

Done in r46545.

Updated dump scripts and canceled the old enwiki dump.

New dumps also will be attempting to generate log output as XML which
correctly handles the deletion/oversighting options; we'll see how that
goes. :)

-- brion
Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008
On Thu, Jan 29, 2009 at 1:52 AM, Gerard Meijssen wrote:
> Hoi,
> Two things:
>
> - if we abort the backup now, we do not know if we WILL have something
>   at the time it would have ended
> - if the toolserver data can provide a service as a stopgap measure,
>   why not provide that in the meantime

If you want to play the optimist and believe this dump might eventually
accomplish something, then the right stopgap would be to hack the dumper
so that it periodically regenerates the other files even while the big
dump is still running. Such a thing, though definitely a hack, would not
be hard to do.

-Robert Rohde
Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008
Hoi,
Two things:

- if we abort the backup now, we do not know if we WILL have something
  at the time it would have ended
- if the toolserver data can provide a service as a stopgap measure, why
  not provide that in the meantime

Thanks,
GerardM

2009/1/29 Alai
> Russell Blau writes:
>> FWIW, I'll add my vote for aborting the current dump *now* if we
>> don't expect it ever to actually be finished, so we can at least get
>> a fresh dump of the current pages.
>
> [snip]
Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008
Russell Blau writes:
> FWIW, I'll add my vote for aborting the current dump *now* if we don't
> expect it ever to actually be finished, so we can at least get a fresh
> dump of the current pages.

I'd like to third/fourth/(other ordinal) this idea too. I've been using
the (in comparison tiny) SQL dumps for various purposes, and it's most
vexing that these have to wait until the end (or lack of any end...) of
the larger XML dumps. (The same data is replicated on the toolserver, of
course, but I'd get beaten to death if I tried to run some of the data
collection scripts I've been running offline, there.)

Cheers,
Alai.
Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008
That would be great. I second this notion wholeheartedly.

On Jan 28, 2009, at 7:34 AM, Russell Blau wrote:
> [snip]
>
> FWIW, I'll add my vote for aborting the current dump *now* if we don't
> expect it ever to actually be finished, so we can at least get a fresh
> dump of the current pages.
>
> Russ
Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008
Probably wise to poke in a hack to skip the history first. :)

-- brion vibber (brion @ wikimedia.org)

On Jan 28, 2009, at 7:34, "Russell Blau" wrote:
> [snip]
>
> FWIW, I'll add my vote for aborting the current dump *now* if we don't
> expect it ever to actually be finished, so we can at least get a fresh
> dump of the current pages.
>
> Russ
Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008
"Brion Vibber" wrote in message news:497f9c35.9050...@wikimedia.org... > On 1/27/09 2:55 PM, Robert Rohde wrote: >> On Tue, Jan 27, 2009 at 2:42 PM, Brion Vibber >> wrote: >>> On 1/27/09 2:35 PM, Thomas Dalton wrote: The way I see it, what we need is to get a really powerful server >>> Nope, it's a software architecture issue. We'll restart it with the new >>> arch when it's ready to go. >> The simplest solution is just to kill the current dump job if you have >> faith that a new architecture can be put in place in less than a year. > > We'll probably do that. > > -- brion FWIW, I'll add my vote for aborting the current dump *now* if we don't expect it ever to actually be finished, so we can at least get a fresh dump of the current pages. Russ ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008
On 1/27/09 2:55 PM, Robert Rohde wrote:
> On Tue, Jan 27, 2009 at 2:42 PM, Brion Vibber wrote:
>> On 1/27/09 2:35 PM, Thomas Dalton wrote:
>>> The way I see it, what we need is to get a really powerful server
>>
>> Nope, it's a software architecture issue. We'll restart it with the
>> new arch when it's ready to go.
>
> I don't know what your timetable is, but what about doing something to
> address the other aspects of the dump (logs, stubs, etc.) that are in
> limbo while full history chugs along. All the other enwiki files are
> now 3 months old and that is already enough to inconvenience some
> people.
>
> The simplest solution is just to kill the current dump job if you have
> faith that a new architecture can be put in place in less than a year.

We'll probably do that.

-- brion
Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008
On Tue, Jan 27, 2009 at 2:42 PM, Brion Vibber wrote:
> On 1/27/09 2:35 PM, Thomas Dalton wrote:
>> The way I see it, what we need is to get a really powerful server
>
> Nope, it's a software architecture issue. We'll restart it with the
> new arch when it's ready to go.

I don't know what your timetable is, but what about doing something to
address the other aspects of the dump (logs, stubs, etc.) that are in
limbo while full history chugs along? All the other enwiki files are now
3 months old, and that is already enough to inconvenience some people.

The simplest solution is just to kill the current dump job if you have
faith that a new architecture can be put in place in less than a year.

-Robert Rohde
Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008
On 1/27/09 2:35 PM, Thomas Dalton wrote:
> The way I see it, what we need is to get a really powerful server

Nope, it's a software architecture issue. We'll restart it with the new
arch when it's ready to go.

-- brion
Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008
> Whether we want to let the current process continue to try and finish
> or not, I would seriously suggest someone look into redumping the rest
> of the enwiki files (i.e. logs, current pages, etc.). I am also among
> the people that care about having reasonably fresh dumps and it really
> is a problem that the other dumps (e.g. stubs-meta-history) are frozen
> while we wait to see if the full history dump can run to completion.

Even if we do let it finish, I'm not sure a dump of what Wikipedia was
like 13 months ago is much use... The way I see it, what we need is to
get a really powerful server to do the dump just once at a reasonable
speed; then we'll have a previous dump to build on, so future ones would
be more reasonable.
Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008
The problem, as I understand it (and Brion may come by to correct me),
is essentially that the current dump process is designed in a way that
can't be sustained given the size of enwiki. It really needs to be
re-engineered, which means that developer time is needed to create a new
approach to dumping. The main target for improvement is almost certainly
parallelizing the process so that there wouldn't be a single monolithic
dump process, but rather a lot of little processes working in parallel.
That would also ensure that if a single process gets stuck and dies, the
entire dump doesn't need to start over.

By way of observation, dewiki's full history dumps in 26 hours with 96%
prefetched (i.e. loaded from previous dumps). That suggests that even
starting from scratch (prefetch = 0%) it should dump in ~25 days under
the current process. enwiki is perhaps 3-6 times larger than dewiki
depending on how you do the accounting, which implies dumping the whole
thing from scratch would take ~5 months if the process scaled linearly.
Of course it doesn't scale linearly, and we end up with a prediction for
completion that is currently 10 months away (which amounts to a 13-month
total execution). And of course, if there is any serious error in the
next ten months the entire process could die with no result.

Whether we want to let the current process continue to try and finish or
not, I would seriously suggest someone look into redumping the rest of
the enwiki files (i.e. logs, current pages, etc.). I am also among the
people that care about having reasonably fresh dumps, and it really is a
problem that the other dumps (e.g. stubs-meta-history) are frozen while
we wait to see if the full history dump can run to completion.

-Robert Rohde

On Tue, Jan 27, 2009 at 11:24 AM, Christian Storm wrote:
> [snip]
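Robert's extrapolation can be reproduced numerically. This assumes dump time is dominated by the non-prefetched fraction, which is a simplification:

```python
# Extrapolate from-scratch dump time from dewiki's 26 h at 96% prefetch,
# then scale to enwiki's estimated 3-6x size (linear-scaling assumption).
dewiki_hours = 26.0
fresh_fraction = 1 - 0.96  # only 4% of revisions were built from scratch
from_scratch_days = dewiki_hours / fresh_fraction / 24
print(f"dewiki from scratch: ~{from_scratch_days:.0f} days")  # ~27 days

for scale in (3, 6):
    months = from_scratch_days * scale / 30
    print(f"enwiki at {scale}x: ~{months:.1f} months")  # ~2.7 to ~5.4
```

That reproduces both the "~25 days" and "~5 months" figures in the message, and makes clear how far the actual 13-month trajectory departs from linear scaling.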
Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008
I have a decent server that is dedicated to a Wikipedia project that
depends on the fresh dumps. Can this be used in any way to speed up the
process of generating the dumps?

bilal

On Tue, Jan 27, 2009 at 2:24 PM, Christian Storm wrote:
> [snip]
[Wikitech-l] Enwiki dump crawling since 10/15/2008
>> On 1/4/09 6:20 AM, yegg at alum.mit.edu wrote:
>> The current enwiki database dump
>> (http://download.wikimedia.org/enwiki/20081008/) has been crawling
>> along since 10/15/2008.
>
> The current dump system is not sustainable on very large wikis and is
> being replaced. You'll hear about it when we have the new one in
> place. :)
>
> -- brion

Following up on this thread:
http://lists.wikimedia.org/pipermail/wikitech-l/2009-January/040841.html

Brion,

Can you offer any general timeline estimates (weeks, months, 1/2 year)?
Are there any alternatives to retrieving the article data beyond
directly crawling the site? I know this is verboten, but we are in dire
need of retrieving this data and don't know of any alternatives. The
current estimate of end of year is too long for us to wait.
Unfortunately, Wikipedia is a favored source for students to plagiarize
from, which makes out-of-date content a real issue.

Is there any way to help this process along? We can donate disk drives,
developer time, ...? There is another possibility that we could offer,
but I would need to talk with someone at the Wikimedia Foundation
offline. Is there anyone I could contact?

Thanks for any information and/or direction you can give.

Christian
Re: [Wikitech-l] Enwiki Dump Crawling since 10/15/2008
Understood, thank you. Any time frame for when this might be launched?

On Mon, Jan 5, 2009 at 1:47 PM, Brion Vibber wrote:
> On 1/4/09 6:20 AM, y...@alum.mit.edu wrote:
>> The current enwiki database dump
>> (http://download.wikimedia.org/enwiki/20081008/) has been crawling
>> along since 10/15/2008.
>
> The current dump system is not sustainable on very large wikis and is
> being replaced. You'll hear about it when we have the new one in
> place. :)
>
> -- brion
Re: [Wikitech-l] Enwiki Dump Crawling since 10/15/2008
On 1/4/09 6:20 AM, y...@alum.mit.edu wrote:
> The current enwiki database dump
> (http://download.wikimedia.org/enwiki/20081008/) has been crawling
> along since 10/15/2008.

The current dump system is not sustainable on very large wikis and is
being replaced. You'll hear about it when we have the new one in place. :)

-- brion
Re: [Wikitech-l] Enwiki Dump Crawling since 10/15/2008
I realize that. I'm looking forward to the next dump :) I had been
getting a dump of that part about every two months; it's been about
three now, and the way it is headed, it will be twelve before I see
another!

On Mon, Jan 5, 2009 at 9:58 AM, Russell Blau wrote:
> wrote in message
> news:1c624fe40901040620g1c69d070q9f830da33e84f...@mail.gmail.com...
>> The current enwiki database dump
>> (http://download.wikimedia.org/enwiki/20081008/) has been crawling
>> along since 10/15/2008.
> ...
>> Is this purposeful? And is there anything I (or other community
>> members) can do about it? I personally just need the pages-articles
>> part. Would it be possible to dump up to that part on a different
>> thread?
>
> That portion of the dump is already done, and available at
> http://download.wikimedia.org/enwiki/20081008/enwiki-20081008-pages-articles.xml.bz2
>
> Russ
Re: [Wikitech-l] Enwiki Dump Crawling since 10/15/2008
wrote in message
news:1c624fe40901040620g1c69d070q9f830da33e84f...@mail.gmail.com...
> The current enwiki database dump
> (http://download.wikimedia.org/enwiki/20081008/) has been crawling
> along since 10/15/2008.
...
> Is this purposeful? And is there anything I (or other community
> members) can do about it? I personally just need the pages-articles
> part. Would it be possible to dump up to that part on a different
> thread?

That portion of the dump is already done, and available at
http://download.wikimedia.org/enwiki/20081008/enwiki-20081008-pages-articles.xml.bz2

Russ
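Since pages-articles ships as a single bzip2 stream, it can be consumed
incrementally rather than decompressed to disk first. A minimal sketch
using only the Python standard library; the tiny in-memory document here
is an illustrative stand-in for the real dump, whose tags additionally
carry an XML namespace:

```python
# Stream-parse a bzip2-compressed MediaWiki-style XML export without
# decompressing the whole file. The sample document below stands in for
# enwiki-*-pages-articles.xml.bz2 (the real dump's tags are namespaced,
# e.g. '{http://www.mediawiki.org/xml/export-0.3/}page').
import bz2
import io
import xml.etree.ElementTree as ET

sample_xml = b"""<mediawiki>
  <page><title>Foo</title><revision><text>Hello</text></revision></page>
  <page><title>Bar</title><revision><text>World</text></revision></page>
</mediawiki>"""

# Stand-in for an on-disk .bz2 dump file.
compressed = bz2.compress(sample_xml)

def iter_titles(stream):
    """Yield page titles as they are parsed, discarding each page after use."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "page":
            yield elem.findtext("title")
            elem.clear()  # keep memory flat on a multi-gigabyte dump

# bz2.open accepts a filename or an existing binary file object.
with bz2.open(io.BytesIO(compressed)) as f:
    titles = list(iter_titles(f))

print(titles)  # -> ['Foo', 'Bar']
```

For the actual dump, replacing `io.BytesIO(compressed)` with the path to
the downloaded `.bz2` file is enough; decompression happens on the fly.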
[Wikitech-l] Enwiki Dump Crawling since 10/15/2008
The current enwiki database dump
(http://download.wikimedia.org/enwiki/20081008/) has been crawling
along since 10/15/2008. I realize that dumps can appear stalled during
normal processing (http://meta.wikimedia.org/wiki/Data_dumps#Schedule),
but in the recent past (as far as I know) they have not been stalled
this long without something actually being wrong. The estimated
completion date for "All pages with complete page edit history" (where
the dump is currently stuck) fluctuates within the latter half of 2009.

Is this purposeful? And is there anything I (or other community
members) can do about it? I personally just need the pages-articles
part. Would it be possible to dump up to that part on a different
thread?

Thank you for your time.

Gabriel Weinberg