Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008

2009-03-29 Thread Keisial
Brion Vibber wrote:
>> Decompression takes as long as compression with bzip2
> I think decompression is *faster* than compression
http://tukaani.org/lzma/benchmarks

> LZMA is nice and fast to decompress... but *insanely* slower to compress, and doesn't seem as parallelizable. :( -- brion
I
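The bzip2-versus-LZMA trade-off debated in this thread can be sanity-checked with Python's standard `bz2` and `lzma` modules. A rough sketch, with a small synthetic payload standing in for the multi-gigabyte XML dump (absolute timings will vary by machine and data):

```python
import bz2
import lzma
import time

# Small synthetic stand-in for the enwiki XML dump discussed in the thread.
data = b"<page><title>Example</title><text>wiki text</text></page>" * 50_000

for name, mod in (("bzip2", bz2), ("lzma", lzma)):
    t0 = time.perf_counter()
    packed = mod.compress(data)
    t1 = time.perf_counter()
    unpacked = mod.decompress(packed)
    t2 = time.perf_counter()
    assert unpacked == data  # round trip must be lossless
    print(f"{name}: compress {t1 - t0:.3f}s, decompress {t2 - t1:.3f}s, "
          f"ratio {len(packed) / len(data):.3f}")
```

On typical hardware this reproduces the pattern described in the thread: both codecs decompress far faster than they compress, and LZMA's compression step is by far the slowest.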

Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008

2009-03-26 Thread Brion Vibber
On 3/26/09 3:25 PM, Keisial wrote:
> Quite interesting. Can the images at office.wikimedia.org be moved to somewhere public?
I've copied those two to the public wiki. :)
> Decompression takes as long as compression with bzip2
I think decompression is *faster* than compression

Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008

2009-03-26 Thread ERSEK Laszlo
On 03/27/09 01:14, Brion Vibber wrote:
> LZMA is nice and fast to decompress... but *insanely* slower to compress, and doesn't seem as parallelizable. :(
The xz file format should allow for easy parallelization, both when compressing and decompressing; see
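The parallelization point applies to bzip2 as well: independently compressed bzip2 streams can simply be concatenated and still decompress as one file, which is what parallel compressors such as pbzip2 exploit. A minimal stdlib sketch (the 64 KiB chunk size here is an arbitrary illustration; real tools work in larger blocks):

```python
import bz2
from multiprocessing import Pool

def compress_chunk(chunk: bytes) -> bytes:
    # Each chunk becomes a complete, self-contained bzip2 stream.
    return bz2.compress(chunk)

def parallel_bzip2(data: bytes, chunk_size: int = 64 * 1024) -> bytes:
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with Pool() as pool:
        streams = pool.map(compress_chunk, chunks)
    # Concatenated bzip2 streams form a valid multi-stream file.
    return b"".join(streams)

if __name__ == "__main__":
    data = b"enwiki dump text " * 100_000
    packed = parallel_bzip2(data)
    # Python's bz2.decompress handles multi-stream input transparently.
    assert bz2.decompress(packed) == data
```

The same multi-stream trick works for decompression only if the stream boundaries are indexed somewhere, which is the gap the xz container format mentioned above is designed to close.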

Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008

2009-03-25 Thread Brian
Perhaps the toolserver can make you a current dump of current en?

On Wed, Mar 25, 2009 at 11:08 AM, Christian Storm <st...@iparadigms.com> wrote:
> Thanks to everyone who got the enwiki dumps going again! Should we expect more regular dumps now? What was the final solution of fixing this?

Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008

2009-03-25 Thread John Doe
toolserver users don't have access to text

On Wed, Mar 25, 2009 at 7:05 PM, Brian <brian.min...@colorado.edu> wrote:
> Perhaps the toolserver can make you a current dump of current en?
> On Wed, Mar 25, 2009 at 11:08 AM, Christian Storm <st...@iparadigms.com> wrote:
>> Thanks to everyone who got the

Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008

2009-02-10 Thread Christian Storm
Brion,

We are having to resort to crawling en.wikipedia.org while we wait for regular dumps. What is the minimum crawling delay we can get away with? I figure if we have a 1 second delay then we'd be able to crawl the 2+ million articles in a month. I know crawling is discouraged but it seems
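For what it's worth, the arithmetic in the message checks out. A quick sketch, assuming roughly 2.3 million articles (an approximation of the enwiki count in early 2009):

```python
# Back-of-the-envelope check of the crawl estimate above.
articles = 2_300_000   # assumed approximate enwiki article count, early 2009
delay_s = 1.0          # one request per second, as proposed
days = articles * delay_s / 86_400   # 86,400 seconds per day
print(f"{days:.1f} days")  # ~26.6 days, i.e. roughly a month
```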

Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008

2009-01-29 Thread Alai
Russell Blau <russblau at hotmail.com> writes:
> FWIW, I'll add my vote for aborting the current dump *now* if we don't expect it ever to actually be finished, so we can at least get a fresh dump of the current pages.
I'd like to third/fourth/(other ordinal) this idea too. I've been using the

Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008

2009-01-29 Thread Gerard Meijssen
Hoi,
Two things:
- if we abort the backup now, we do not know if we WILL have something at the time it would have ended
- if the toolserver data can provide a service as a stop gap measure, why not provide that in the mean time
Thanks,
GerardM

2009/1/29 Alai <alaiw...@gmail.com>

Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008

2009-01-29 Thread Robert Rohde
On Thu, Jan 29, 2009 at 1:52 AM, Gerard Meijssen <gerard.meijs...@gmail.com> wrote:
> Hoi,
> Two things:
> - if we abort the backup now, we do not know if we WILL have something at the time it would have ended
> - if the toolserver data can provide a service as a stop gap measure why not

Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008

2009-01-29 Thread Brion Vibber
On 1/28/09 8:32 AM, Brion Vibber wrote:
> Probably wise to poke in a hack to skip the history first. :)
Done in r46545. Updated dump scripts and canceled the old enwiki dump. New dumps also will be attempting to generate log output as XML which correctly handles the deletion/oversighting

Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008

2009-01-29 Thread Robert Rohde
On Thu, Jan 29, 2009 at 11:20 AM, Brion Vibber <br...@wikimedia.org> wrote:
> On 1/28/09 8:32 AM, Brion Vibber wrote:
>> Probably wise to poke in a hack to skip the history first. :)
> Done in r46545. Updated dump scripts and canceled the old enwiki dump. New dumps also will be attempting to

Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008

2009-01-28 Thread Russell Blau
Brion Vibber <br...@wikimedia.org> wrote in message news:497f9c35.9050...@wikimedia.org...
> On 1/27/09 2:55 PM, Robert Rohde wrote:
>> On Tue, Jan 27, 2009 at 2:42 PM, Brion Vibber <br...@wikimedia.org> wrote:
>>> On 1/27/09 2:35 PM, Thomas Dalton wrote:
>>>> The way I see it, what we need is to get a really

Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008

2009-01-28 Thread Brion Vibber
Probably wise to poke in a hack to skip the history first. :)

-- brion vibber (brion @ wikimedia.org)

On Jan 28, 2009, at 7:34, Russell Blau <russb...@hotmail.com> wrote:
> Brion Vibber <br...@wikimedia.org> wrote in message news:497f9c35.9050...@wikimedia.org...
>> On 1/27/09 2:55 PM, Robert Rohde

Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008

2009-01-28 Thread Christian Storm
That would be great. I second this notion wholeheartedly.

On Jan 28, 2009, at 7:34 AM, Russell Blau wrote:
> Brion Vibber <br...@wikimedia.org> wrote in message news:497f9c35.9050...@wikimedia.org...
>> On 1/27/09 2:55 PM, Robert Rohde wrote:
>>> On Tue, Jan 27, 2009 at 2:42 PM, Brion

[Wikitech-l] Enwiki dump crawling since 10/15/2008

2009-01-27 Thread Christian Storm
> On 1/4/09 6:20 AM, yegg at alum.mit.edu wrote:
>> The current enwiki database dump (http://download.wikimedia.org/enwiki/20081008/) has been crawling along since 10/15/2008.
> The current dump system is not sustainable on very large wikis and is being replaced. You'll hear about it when we

Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008

2009-01-27 Thread Bilal Abdul Kader
I have a decent server that is dedicated for a Wikipedia project that depends on the fresh dumps. Can this be used in any way to speed up the process of generating the dumps?

bilal

On Tue, Jan 27, 2009 at 2:24 PM, Christian Storm <st...@iparadigms.com> wrote:
> On 1/4/09 6:20 AM, yegg at alum.mit.edu

Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008

2009-01-27 Thread Robert Rohde
The problem, as I understand it (and Brion may come by to correct me) is essentially that the current dump process is designed in a way that can't be sustained given the size of enwiki. It really needs to be re-engineered, which means that developer time is needed to create a new approach to

Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008

2009-01-27 Thread Thomas Dalton
> Whether we want to let the current process continue to try and finish or not, I would seriously suggest someone look into redumping the rest of the enwiki files (i.e. logs, current pages, etc.).
I am also among the people that care about having reasonably fresh dumps and it really is a

Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008

2009-01-27 Thread Brion Vibber
On 1/27/09 2:35 PM, Thomas Dalton wrote:
> The way I see it, what we need is to get a really powerful server
Nope, it's a software architecture issue. We'll restart it with the new arch when it's ready to go.

-- brion

Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008

2009-01-27 Thread Robert Rohde
On Tue, Jan 27, 2009 at 2:42 PM, Brion Vibber <br...@wikimedia.org> wrote:
> On 1/27/09 2:35 PM, Thomas Dalton wrote:
>> The way I see it, what we need is to get a really powerful server
> Nope, it's a software architecture issue. We'll restart it with the new arch when it's ready to go.
I don't know

Re: [Wikitech-l] Enwiki Dump Crawling since 10/15/2008

2009-01-05 Thread Brion Vibber
On 1/4/09 6:20 AM, y...@alum.mit.edu wrote:
> The current enwiki database dump (http://download.wikimedia.org/enwiki/20081008/) has been crawling along since 10/15/2008.
The current dump system is not sustainable on very large wikis and is being replaced. You'll hear about it when we have the

Re: [Wikitech-l] Enwiki Dump Crawling since 10/15/2008

2009-01-05 Thread yegg
Understood -- thank you. Any time-frame for when this might be launched?

On Mon, Jan 5, 2009 at 1:47 PM, Brion Vibber <br...@wikimedia.org> wrote:
> On 1/4/09 6:20 AM, y...@alum.mit.edu wrote:
>> The current enwiki database dump (http://download.wikimedia.org/enwiki/20081008/) has been crawling along