Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008

2009-03-29 Thread Keisial
Brion Vibber wrote: >>> Decompression takes as long as compression with bzip2 >> I think decompression is *faster* than compression >> http://tukaani.org/lzma/benchmarks > > LZMA is nice and fast to decompress... but *insanely* slower to > compress, and doesn't seem as parallelizable. :( > > --

Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008

2009-03-26 Thread Anthony
On Thu, Mar 26, 2009 at 8:51 PM, ERSEK Laszlo wrote: > On 03/27/09 01:14, Brion Vibber wrote: > > > LZMA is nice and fast to decompress... but *insanely* slower to > > compress, and doesn't seem as parallelizable. :( > > The xz file format should allow for "easy" parallelization, both when > comp

Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008

2009-03-26 Thread ERSEK Laszlo
On 03/27/09 01:14, Brion Vibber wrote: > LZMA is nice and fast to decompress... but *insanely* slower to > compress, and doesn't seem as parallelizable. :( The xz file format should allow for "easy" parallelization, both when compressing and decompressing; see http://tukaani.org/xz/xz-file-for
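[Editor's sketch, not part of the original message.] One way to picture the parallelization idea above is to compress independent chunks concurrently and concatenate the results. This is a minimal sketch using Python's standard lzma module. The function names, chunk size, and worker count are hypothetical. It relies on concatenated .xz streams being decodable in one pass, and it is not necessarily the block mechanism the linked xz specification describes.

```python
import lzma
from concurrent.futures import ThreadPoolExecutor


def compress_chunks(data: bytes, chunk_size: int = 1 << 20, workers: int = 4) -> bytes:
    """Compress fixed-size chunks independently and concatenate the .xz streams."""
    # Hypothetical chunking scheme; fall back to one empty chunk so that
    # empty input still yields a valid stream.
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)] or [b""]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each chunk becomes a complete, self-contained .xz stream.
        streams = list(pool.map(lzma.compress, chunks))
    return b"".join(streams)


def decompress_all(blob: bytes) -> bytes:
    """lzma.decompress decodes concatenated streams and joins the results."""
    return lzma.decompress(blob)
```

Because each chunk is compressed without knowledge of the others, the ratio is likely somewhat worse than a single-stream run; that is the usual trade-off for independence between chunks.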

Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008

2009-03-26 Thread Brion Vibber
On 3/26/09 3:25 PM, Keisial wrote: > Quite interesting. Can the images at office.wikimedia.org be moved to > somewhere public? I've copied those two to the public wiki. :) >> Decompression takes as long as compression with bzip2 > I think decompression is *faster* than compression > http://tukaan

Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008

2009-03-26 Thread Keisial
Tomasz Finc wrote: > I've started drafting some new ideas at > http://wikitech.wikimedia.org/view/Data_dump_redesign > > of the various problems that we're facing and what kind of job management > we can put around it. We're taking this on as a full "should have been > done 2 years ago" project a

Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008

2009-03-25 Thread Tomasz Finc
On 3/25/09 10:08 AM, Christian Storm wrote: > Thanks to everyone who got the enwiki dumps going again! Should we expect > more regular dumps now? What was the final solution of fixing this? > > Lots of love and upkeep by everyone :) But really it needs to be more automated and made parallelise

Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008

2009-03-25 Thread John Doe
Toolserver users don't have access to text On Wed, Mar 25, 2009 at 7:05 PM, Brian wrote: > Perhaps the toolserver can make you a current dump of current en? > > On Wed, Mar 25, 2009 at 11:08 AM, Christian Storm wrote: > > > Thanks to everyone who got the enwiki dumps going again! Should we > e

Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008

2009-03-25 Thread Brian
Perhaps the toolserver can make you a current dump of current en? On Wed, Mar 25, 2009 at 11:08 AM, Christian Storm wrote: > Thanks to everyone who got the enwiki dumps going again! Should we expect > more regular dumps now? What was the final solution of fixing this? > > > We are havin

Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008

2009-03-25 Thread Christian Storm
Thanks to everyone who got the enwiki dumps going again! Should we expect more regular dumps now? What was the final solution of fixing this? > > We are having to resort to crawling en.wikipedia.org while we await > for regular dumps. > What is the minimum crawling delay we can get away with?

Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008

2009-02-10 Thread Christian Storm
Brion, We are having to resort to crawling en.wikipedia.org while we await for regular dumps. What is the minimum crawling delay we can get away with? I figure if we have 1 second delay then we'd be able to crawl the 2+ million articles in a month. I know crawling is discouraged but it seems
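[Editor's sketch, not part of the original message.] The crawl-time estimate above can be checked with a little arithmetic. The function name is hypothetical, and the model assumes strictly sequential requests where the fixed delay dominates; time spent actually fetching each page is ignored.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def crawl_days(pages: int, delay_seconds: float) -> float:
    """Days needed to request `pages` one at a time with a fixed delay between requests."""
    return pages * delay_seconds / SECONDS_PER_DAY


# 2 million pages at a 1 second delay comes to roughly 23 days.
print(round(crawl_days(2_000_000, 1.0), 1))
```

Under these assumptions the "in a month" figure holds with about a week to spare; any per-request latency on top of the delay would eat into that margin.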

Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008

2009-01-29 Thread Robert Rohde
On Thu, Jan 29, 2009 at 11:20 AM, Brion Vibber wrote: > On 1/28/09 8:32 AM, Brion Vibber wrote: >> Probably wise to poke in a hack to skip the history first. :) > > Done in r46545. > > Updated dump scripts and canceled the old enwiki dump. > > New dumps also will be attempting to generate log outp

Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008

2009-01-29 Thread Brion Vibber
On 1/28/09 8:32 AM, Brion Vibber wrote: > Probably wise to poke in a hack to skip the history first. :) Done in r46545. Updated dump scripts and canceled the old enwiki dump. New dumps also will be attempting to generate log output as XML which correctly handles the deletion/oversighting option

Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008

2009-01-29 Thread Robert Rohde
On Thu, Jan 29, 2009 at 1:52 AM, Gerard Meijssen wrote: > Hoi, > Two things: > > - if we abort the backup now, we do not know if we WILL have something at > the time it would have ended > - if the toolserver data can provide a service as a stop gap measure why > not provide that in the mea

Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008

2009-01-29 Thread Gerard Meijssen
Hoi, Two things: - if we abort the backup now, we do not know if we WILL have something at the time it would have ended - if the toolserver data can provide a service as a stop gap measure why not provide that in the mean time Thanks, GerardM 2009/1/29 Alai > Russell Blau ho

Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008

2009-01-29 Thread Alai
Russell Blau hotmail.com> writes: > FWIW, I'll add my vote for aborting the current dump *now* if we don't > expect it ever to actually be finished, so we can at least get a fresh dump > of the current pages. I'd like to third/fourth/(other ordinal) this idea too. I've been using the (in compa

Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008

2009-01-28 Thread Christian Storm
That would be great. I second this notion wholeheartedly. On Jan 28, 2009, at 7:34 AM, Russell Blau wrote: > "Brion Vibber" wrote in message > news:497f9c35.9050...@wikimedia.org... >> On 1/27/09 2:55 PM, Robert Rohde wrote: >>> On Tue, Jan 27, 2009 at 2:42 PM, Brion Vibber >>> wrote: On

Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008

2009-01-28 Thread Brion Vibber
Probably wise to poke in a hack to skip the history first. :) -- brion vibber (brion @ wikimedia.org) On Jan 28, 2009, at 7:34, "Russell Blau" wrote: > "Brion Vibber" wrote in message > news:497f9c35.9050...@wikimedia.org... >> On 1/27/09 2:55 PM, Robert Rohde wrote: >>> On Tue, Jan 27, 2009 a

Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008

2009-01-28 Thread Russell Blau
"Brion Vibber" wrote in message news:497f9c35.9050...@wikimedia.org... > On 1/27/09 2:55 PM, Robert Rohde wrote: >> On Tue, Jan 27, 2009 at 2:42 PM, Brion Vibber >> wrote: >>> On 1/27/09 2:35 PM, Thomas Dalton wrote: The way I see it, what we need is to get a really powerful server >>> Nope

Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008

2009-01-27 Thread Brion Vibber
On 1/27/09 2:55 PM, Robert Rohde wrote: > On Tue, Jan 27, 2009 at 2:42 PM, Brion Vibber wrote: >> On 1/27/09 2:35 PM, Thomas Dalton wrote: >>> The way I see it, what we need is to get a really powerful server >> Nope, it's a software architecture issue. We'll restart it with the new >> arch when i

Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008

2009-01-27 Thread Robert Rohde
On Tue, Jan 27, 2009 at 2:42 PM, Brion Vibber wrote: > On 1/27/09 2:35 PM, Thomas Dalton wrote: >> The way I see it, what we need is to get a really powerful server > > Nope, it's a software architecture issue. We'll restart it with the new > arch when it's ready to go. I don't know what your tim

Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008

2009-01-27 Thread Brion Vibber
On 1/27/09 2:35 PM, Thomas Dalton wrote: > The way I see it, what we need is to get a really powerful server Nope, it's a software architecture issue. We'll restart it with the new arch when it's ready to go. -- brion

Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008

2009-01-27 Thread Thomas Dalton
> Whether we want to let the current process continue to try and finish > or not, I would seriously suggest someone look into redumping the rest > of the enwiki files (i.e. logs, current pages, etc.). I am also among > the people that care about having reasonably fresh dumps and it really > is a p

Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008

2009-01-27 Thread Robert Rohde
The problem, as I understand it (and Brion may come by to correct me) is essentially that the current dump process is designed in a way that can't be sustained given the size of enwiki. It really needs to be re-engineered, which means that developer time is needed to create a new approach to dumpi

Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008

2009-01-27 Thread Bilal Abdul Kader
I have a decent server that is dedicated to a Wikipedia project that depends on the fresh dumps. Can this be used in any way to speed up the process of generating the dumps? bilal On Tue, Jan 27, 2009 at 2:24 PM, Christian Storm wrote: > >> On 1/4/09 6:20 AM, yegg at alum.mit.edu wrote: > >> The c

[Wikitech-l] Enwiki dump crawling since 10/15/2008

2009-01-27 Thread Christian Storm
>> On 1/4/09 6:20 AM, yegg at alum.mit.edu wrote: >> The current enwiki database dump >> (http://download.wikimedia.org/enwiki/20081008/ >> ) has been crawling along since 10/15/2008. > The current dump system is not sustainable on very large wikis and > is being replaced. You'll hear about it

Re: [Wikitech-l] Enwiki Dump Crawling since 10/15/2008

2009-01-05 Thread yegg
Understood--thank you. Any time-frame for when this might be launched? On Mon, Jan 5, 2009 at 1:47 PM, Brion Vibber wrote: > On 1/4/09 6:20 AM, y...@alum.mit.edu wrote: >> The current enwiki database dump >> (http://download.wikimedia.org/enwiki/20081008/) has been crawling >> along since 10/15/

Re: [Wikitech-l] Enwiki Dump Crawling since 10/15/2008

2009-01-05 Thread Brion Vibber
On 1/4/09 6:20 AM, y...@alum.mit.edu wrote: > The current enwiki database dump > (http://download.wikimedia.org/enwiki/20081008/) has been crawling > along since 10/15/2008. The current dump system is not sustainable on very large wikis and is being replaced. You'll hear about it when we have the

Re: [Wikitech-l] Enwiki Dump Crawling since 10/15/2008

2009-01-05 Thread yegg
I realize that. I'm looking forward to the next dump :) I had been used to a dump of that part about every 2 months, and it's been about 3 now and the way it is headed it will be 12 before I see another! On Mon, Jan 5, 2009 at 9:58 AM, Russell Blau wrote: > wrote in message > news:1c624fe4

Re: [Wikitech-l] Enwiki Dump Crawling since 10/15/2008

2009-01-05 Thread Russell Blau
wrote in message news:1c624fe40901040620g1c69d070q9f830da33e84f...@mail.gmail.com... > The current enwiki database dump > (http://download.wikimedia.org/enwiki/20081008/) has been crawling > along since 10/15/2008. ... > Is this purposeful? And is there anything I (or other community > members)

[Wikitech-l] Enwiki Dump Crawling since 10/15/2008

2009-01-04 Thread yegg
The current enwiki database dump (http://download.wikimedia.org/enwiki/20081008/) has been crawling along since 10/15/2008. I realize that dumps can appear stalled in their normal processing (http://meta.wikimedia.org/wiki/Data_dumps#Schedule), but in the recent past (as far as I know) they have n