Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008

2009-03-29 Thread Keisial
Brion Vibber wrote:
>>> Decompression takes as long as compression with bzip2
>> I think decompression is *faster* than compression
>> http://tukaani.org/lzma/benchmarks
> 
> LZMA is nice and fast to decompress... but *insanely* slower to 
> compress, and doesn't seem as parallelizable. :(
> 
> -- brion

I used the lzma benchmark as evidence that decompressing bzip2 is
faster than compressing it.
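
(For anyone who wants to verify the bzip2 asymmetry locally, a rough and
unscientific sketch in Python; "sample.xml" is a placeholder input file:

import bz2
import time

with open("sample.xml", "rb") as f:
    data = f.read()

t0 = time.time()
compressed = bz2.compress(data, 9)   # compresslevel 9
t1 = time.time()
bz2.decompress(compressed)
t2 = time.time()

print("compress:   %.2f s" % (t1 - t0))
print("decompress: %.2f s" % (t2 - t1))

On typical text the decompression time should come out well under the
compression time.)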

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008

2009-03-26 Thread Anthony
On Thu, Mar 26, 2009 at 8:51 PM, ERSEK Laszlo  wrote:

> On 03/27/09 01:14, Brion Vibber wrote:
>
> > LZMA is nice and fast to decompress... but *insanely* slower to
> > compress, and doesn't seem as parallelizable. :(
>
> The xz file format should allow for "easy" parallelization, both when
> compressing and decompressing; see
>
> http://tukaani.org/xz/xz-file-format.txt
>
> 3. Block
> 3.1. Block Header
> 3.1.1. Block Header Size
> 3.1.3. Compressed Size
> 3.1.4. Uncompressed Size
> 3.1.6. Header Padding
> 3.3. Block Padding
>
> At least in theory, this "length-prefixing" should make it fairly
> straightforward to write a multi-threaded decompressor with a splitter
> that can work from a pipe and is input-bound. I reckon the xz structure
> will eventually prove useful even for distributed
> compression/decompression.
>
> lacos


It includes an index for random access too.  Cool.  I wonder what kind of
block size you'd need to get a compression ratio approaching that of 7z.
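
(One way to get a rough feel for that with a modern Python (the 3.3+ lzma
module): compress fixed-size chunks as independent xz streams and compare
the ratios. This only approximates per-block xz compression, and
"pages-articles.xml" is a placeholder input:

import lzma

def ratio_for_block_size(path, block_size):
    total_in = total_out = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(block_size)
            if not chunk:
                break
            total_in += len(chunk)
            total_out += len(lzma.compress(chunk, format=lzma.FORMAT_XZ, preset=9))
    return total_out / float(total_in)

for mb in (1, 8, 64, 256):
    ratio = ratio_for_block_size("pages-articles.xml", mb * 2 ** 20)
    print("%4d MiB blocks: %.4f" % (mb, ratio))

Larger blocks should approach the single-stream ratio, since less
cross-block redundancy is lost at the boundaries.)
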
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008

2009-03-26 Thread ERSEK Laszlo
On 03/27/09 01:14, Brion Vibber wrote:

> LZMA is nice and fast to decompress... but *insanely* slower to 
> compress, and doesn't seem as parallelizable. :(

The xz file format should allow for "easy" parallelization, both when 
compressing and decompressing; see

http://tukaani.org/xz/xz-file-format.txt

3. Block
3.1. Block Header
3.1.1. Block Header Size
3.1.3. Compressed Size
3.1.4. Uncompressed Size
3.1.6. Header Padding
3.3. Block Padding

At least in theory, this "length-prefixing" should make it fairly 
straightforward to write a multi-threaded decompressor with a splitter 
that can work from a pipe and is input-bound. I reckon the xz structure 
will eventually prove useful even for distributed compression/decompression.
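
(A toy illustration of the length-prefixing idea, not an xz container
parser: each block is compressed independently and stored behind an
explicit length prefix, so a reader can split the file and hand blocks to
a pool of workers. File names here are placeholders:

import lzma
import struct
from concurrent.futures import ProcessPoolExecutor

def write_blocks(blocks, path):
    with open(path, "wb") as out:
        for block in blocks:
            comp = lzma.compress(block, format=lzma.FORMAT_XZ)
            out.write(struct.pack("<Q", len(comp)))   # 8-byte length prefix
            out.write(comp)

def read_compressed_blocks(path):
    with open(path, "rb") as f:
        while True:
            header = f.read(8)
            if not header:
                break
            (length,) = struct.unpack("<Q", header)
            yield f.read(length)

def parallel_decompress(path, workers=4):
    with ProcessPoolExecutor(workers) as pool:
        return b"".join(pool.map(lzma.decompress, read_compressed_blocks(path)))

if __name__ == "__main__":
    write_blocks([b"chunk one " * 1000, b"chunk two " * 1000], "blocks.bin")
    print(len(parallel_decompress("blocks.bin")))

A real xz-aware splitter would read the sizes from the block headers or
the stream index instead of a home-grown prefix.)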

lacos

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008

2009-03-26 Thread Brion Vibber
On 3/26/09 3:25 PM, Keisial wrote:
> Quite interesting. Can the images at office.wikimedia.org be moved to
> somewhere public?

I've copied those two to the public wiki. :)

>> Decompression takes as long as compression with bzip2
> I think decompression is *faster* than compression
> http://tukaani.org/lzma/benchmarks

LZMA is nice and fast to decompress... but *insanely* slower to 
compress, and doesn't seem as parallelizable. :(

-- brion

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008

2009-03-26 Thread Keisial
Tomasz Finc wrote:
> I've started drafting some new ideas at 
> http://wikitech.wikimedia.org/view/Data_dump_redesign
> 
> of the various problems we're facing and what kind of job management 
> we can put around it. We're taking this on as a full "should have been 
> done 2 years ago" project and I'm going to be shepherding this along.
> 
> Right now I'm collecting stats about the throughput of the components to 
> see how much of this could be farmed out in parallel in a job management 
> system.
> 
> This is a large project that has some distinct problem areas that we'll 
> be isolating and welcoming help on.
> 
> --tomasz

Quite interesting. Can the images at office.wikimedia.org be moved to
somewhere public?

>Decompression takes as long as compression with bzip2
I think decompression is *faster* than compression
http://tukaani.org/lzma/benchmarks

Let me know if I can help with anything.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008

2009-03-25 Thread Tomasz Finc
On 3/25/09 10:08 AM, Christian Storm wrote:
> Thanks to everyone who got the enwiki dumps going again!  Should we expect
> more regular dumps now?  What was the final solution of fixing this?
>
>

Lots of love and upkeep by everyone :)

But really it needs to be more automated and parallelised so that 
we can spot issues faster, validate inconsistencies, and finish quicker.

Brion and I have met about this and we've even brought it into the 
Wikimedia dev meetings to brainstorm how the system could change for the 
better.

I've started drafting some new ideas at 
http://wikitech.wikimedia.org/view/Data_dump_redesign

of the various problems we're facing and what kind of job management 
we can put around it. We're taking this on as a full "should have been 
done 2 years ago" project and I'm going to be shepherding this along.

Right now I'm collecting stats about the throughput of the components to 
see how much of this could be farmed out in parallel in a job management 
system.

This is a large project that has some distinct problem areas that we'll 
be isolating and welcoming help on.

--tomasz



___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008

2009-03-25 Thread John Doe
Toolserver users don't have access to article text.

On Wed, Mar 25, 2009 at 7:05 PM, Brian  wrote:

> Perhaps the toolserver can make you a current dump of current en?
>
> On Wed, Mar 25, 2009 at 11:08 AM, Christian Storm wrote:
>
> > Thanks to everyone who got the enwiki dumps going again!  Should we
> expect
> > more regular dumps now?  What was the final solution of fixing this?
> >
> >
> >
> > >
> > > We are having to resort to crawling en.wikipedia.org while we wait
> > > for regular dumps.
> > > What is the minimum crawling delay we can get away with? I figure with
> > > a 1-second delay we'd be able to crawl the 2+ million articles
> > > in a month.
> > >
> > > I know crawling is discouraged, but judging from robots.txt it seems a
> > > lot of parties still do it. I have to assume that is how Google et al.
> > > are able to keep up to date.
> > >
> > > Are there private data feeds?  I noticed a wg_enwiki dump listed.
> > >
> > > Christian
> > >
> > > On Jan 28, 2009, at 10:47 AM, Christian Storm wrote:
> > >
> > > > That would be great.  I second this notion wholeheartedly.
> > > >
> > > >
> > > > On Jan 28, 2009, at 7:34 AM, Russell Blau wrote:
> > > >
> > > >> "Brion Vibber"  wrote in message
> > > >> news:497f9c35.9050...@wikimedia.org...
> > > >>> On 1/27/09 2:55 PM, Robert Rohde wrote:
> > >  On Tue, Jan 27, 2009 at 2:42 PM, Brion Vibber >
> > >  wrote:
> > > > On 1/27/09 2:35 PM, Thomas Dalton wrote:
> > > >> The way I see it, what we need is to get a really powerful
> server
> > > > Nope, it's a software architecture issue. We'll restart it with
> > > > the new
> > > > arch when it's ready to go.
> > >  The simplest solution is just to kill the current dump job if you
> > >  have
> > >  faith that a new architecture can be put in place in less than a
> > >  year.
> > > >>>
> > > >>> We'll probably do that.
> > > >>>
> > > >>> -- brion
> > > >>
> > > >> FWIW, I'll add my vote for aborting the current dump *now* if we
> > > >> don't
> > > >> expect it ever to actually be finished, so we can at least get a
> > > >> fresh dump
> > > >> of the current pages.
> > > >>
> > > >> Russ
> > > >>
> > > >>
> > > >>
> > > >>
> > > >> ___
> > > >> Wikitech-l mailing list
> > > >> Wikitech-l@lists.wikimedia.org
> > > >> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
> > > >
> > > >
> > > > ___
> > > > Wikitech-l mailing list
> > > > Wikitech-l@lists.wikimedia.org
> > > > https://lists.wikimedia.org/mailman/listinfo/wikitech-l
> > >
> > >
> > > ___
> > > Wikitech-l mailing list
> > > Wikitech-l@lists.wikimedia.org
> > > https://lists.wikimedia.org/mailman/listinfo/wikitech-l
> > >
> > ___
> > Wikitech-l mailing list
> > Wikitech-l@lists.wikimedia.org
> > https://lists.wikimedia.org/mailman/listinfo/wikitech-l
> >
> ___
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008

2009-03-25 Thread Brian
Perhaps the toolserver can make you a current dump of current en?

On Wed, Mar 25, 2009 at 11:08 AM, Christian Storm wrote:

> Thanks to everyone who got the enwiki dumps going again!  Should we expect
> more regular dumps now?  What was the final solution of fixing this?
>
>
>
> >
> > We are having to resort to crawling en.wikipedia.org while we wait
> > for regular dumps.
> > What is the minimum crawling delay we can get away with? I figure with
> > a 1-second delay we'd be able to crawl the 2+ million articles
> > in a month.
> >
> > I know crawling is discouraged, but judging from robots.txt it seems a
> > lot of parties still do it. I have to assume that is how Google et al.
> > are able to keep up to date.
> >
> > Are there private data feeds?  I noticed a wg_enwiki dump listed.
> >
> > Christian
> >
> > On Jan 28, 2009, at 10:47 AM, Christian Storm wrote:
> >
> > > That would be great.  I second this notion wholeheartedly.
> > >
> > >
> > > On Jan 28, 2009, at 7:34 AM, Russell Blau wrote:
> > >
> > >> "Brion Vibber"  wrote in message
> > >> news:497f9c35.9050...@wikimedia.org...
> > >>> On 1/27/09 2:55 PM, Robert Rohde wrote:
> >  On Tue, Jan 27, 2009 at 2:42 PM, Brion Vibber
> >  wrote:
> > > On 1/27/09 2:35 PM, Thomas Dalton wrote:
> > >> The way I see it, what we need is to get a really powerful server
> > > Nope, it's a software architecture issue. We'll restart it with
> > > the new
> > > arch when it's ready to go.
> >  The simplest solution is just to kill the current dump job if you
> >  have
> >  faith that a new architecture can be put in place in less than a
> >  year.
> > >>>
> > >>> We'll probably do that.
> > >>>
> > >>> -- brion
> > >>
> > >> FWIW, I'll add my vote for aborting the current dump *now* if we
> > >> don't
> > >> expect it ever to actually be finished, so we can at least get a
> > >> fresh dump
> > >> of the current pages.
> > >>
> > >> Russ
> > >>
> > >>
> > >>
> > >>
> > >> ___
> > >> Wikitech-l mailing list
> > >> Wikitech-l@lists.wikimedia.org
> > >> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
> > >
> > >
> > > ___
> > > Wikitech-l mailing list
> > > Wikitech-l@lists.wikimedia.org
> > > https://lists.wikimedia.org/mailman/listinfo/wikitech-l
> >
> >
> > ___
> > Wikitech-l mailing list
> > Wikitech-l@lists.wikimedia.org
> > https://lists.wikimedia.org/mailman/listinfo/wikitech-l
> >
> ___
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008

2009-03-25 Thread Christian Storm
Thanks to everyone who got the enwiki dumps going again!  Should we expect
more regular dumps now?  What was the final solution of fixing this?



>
> We are having to resort to crawling en.wikipedia.org while we wait
> for regular dumps.
> What is the minimum crawling delay we can get away with? I figure with
> a 1-second delay we'd be able to crawl the 2+ million articles
> in a month.
>
> I know crawling is discouraged, but judging from robots.txt it seems a
> lot of parties still do it. I have to assume that is how Google et al.
> are able to keep up to date.
>
> Are there private data feeds?  I noticed a wg_enwiki dump listed.
>
> Christian
>
> On Jan 28, 2009, at 10:47 AM, Christian Storm wrote:
>
> > That would be great.  I second this notion wholeheartedly.
> >
> >
> > On Jan 28, 2009, at 7:34 AM, Russell Blau wrote:
> >
> >> "Brion Vibber"  wrote in message
> >> news:497f9c35.9050...@wikimedia.org...
> >>> On 1/27/09 2:55 PM, Robert Rohde wrote:
>  On Tue, Jan 27, 2009 at 2:42 PM, Brion Vibber
>  wrote:
> > On 1/27/09 2:35 PM, Thomas Dalton wrote:
> >> The way I see it, what we need is to get a really powerful server
> > Nope, it's a software architecture issue. We'll restart it with
> > the new
> > arch when it's ready to go.
>  The simplest solution is just to kill the current dump job if you
>  have
>  faith that a new architecture can be put in place in less than a
>  year.
> >>>
> >>> We'll probably do that.
> >>>
> >>> -- brion
> >>
> >> FWIW, I'll add my vote for aborting the current dump *now* if we
> >> don't
> >> expect it ever to actually be finished, so we can at least get a
> >> fresh dump
> >> of the current pages.
> >>
> >> Russ
> >>
> >>
> >>
> >>
> >> ___
> >> Wikitech-l mailing list
> >> Wikitech-l@lists.wikimedia.org
> >> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
> >
> >
> > ___
> > Wikitech-l mailing list
> > Wikitech-l@lists.wikimedia.org
> > https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>
>
> ___
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008

2009-02-10 Thread Christian Storm
Brion,

We are having to resort to crawling en.wikipedia.org while we wait
for regular dumps.
What is the minimum crawling delay we can get away with? I figure with
a 1-second delay we'd be able to crawl the 2+ million articles
in a month.
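
(As a back-of-the-envelope check, a few lines of Python; the article count
of 2.5 million is an assumption standing in for "2+ million":

articles = 2500000                      # assumed article count
for delay in (1.0, 2.0, 5.0):
    days = articles * delay / 86400.0   # seconds per request -> days of crawling
    print("%.0fs delay -> %.0f days" % (delay, days))

This prints roughly 29, 58 and 145 days, so a 1-second delay does come out
at about a month.)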

I know crawling is discouraged, but judging from robots.txt it seems a
lot of parties still do it. I have to assume that is how Google et al.
are able to keep up to date.

Are there private data feeds?  I noticed a wg_enwiki dump listed.

Christian

On Jan 28, 2009, at 10:47 AM, Christian Storm wrote:

> That would be great.  I second this notion wholeheartedly.
>
>
> On Jan 28, 2009, at 7:34 AM, Russell Blau wrote:
>
>> "Brion Vibber"  wrote in message
>> news:497f9c35.9050...@wikimedia.org...
>>> On 1/27/09 2:55 PM, Robert Rohde wrote:
 On Tue, Jan 27, 2009 at 2:42 PM, Brion Vibber
 wrote:
> On 1/27/09 2:35 PM, Thomas Dalton wrote:
>> The way I see it, what we need is to get a really powerful server
> Nope, it's a software architecture issue. We'll restart it with
> the new
> arch when it's ready to go.
 The simplest solution is just to kill the current dump job if you
 have
 faith that a new architecture can be put in place in less than a
 year.
>>>
>>> We'll probably do that.
>>>
>>> -- brion
>>
>> FWIW, I'll add my vote for aborting the current dump *now* if we  
>> don't
>> expect it ever to actually be finished, so we can at least get a
>> fresh dump
>> of the current pages.
>>
>> Russ
>>
>>
>>
>>
>> ___
>> Wikitech-l mailing list
>> Wikitech-l@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>
>
> ___
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l


___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008

2009-01-29 Thread Robert Rohde
On Thu, Jan 29, 2009 at 11:20 AM, Brion Vibber  wrote:
> On 1/28/09 8:32 AM, Brion Vibber wrote:
>> Probably wise to poke in a hack to skip the history first. :)
>
> Done in r46545.
>
> Updated dump scripts and canceled the old enwiki dump.
>
> New dumps also will be attempting to generate log output as XML which
> correctly handles the deletion/oversighting options; we'll see hwo that
> goes. :)

Is there somewhere that explains (or at least gives an example of) the
new logging format and what has changed?

-Robert Rohde

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008

2009-01-29 Thread Brion Vibber
On 1/28/09 8:32 AM, Brion Vibber wrote:
> Probably wise to poke in a hack to skip the history first. :)

Done in r46545.

Updated dump scripts and canceled the old enwiki dump.

New dumps also will be attempting to generate log output as XML which 
correctly handles the deletion/oversighting options; we'll see how that 
goes. :)

-- brion

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008

2009-01-29 Thread Robert Rohde
On Thu, Jan 29, 2009 at 1:52 AM, Gerard Meijssen
 wrote:
> Hoi,
> Two things:
>
>   - if we abort the backup now, we do not know if we WILL have something at
>   the time it would have ended
>   - if the toolserver data can provide a service as a stop gap measure why
>   not provide that in the mean time

If you want to play the optimist and believe this dump might
eventually accomplish something, then the right stopgap would be to
hack the dumper so that it periodically regenerates the other files
even while the big dump is still running.  Such a thing, though
definitely a hack, would not be hard to do.
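
(Sketched very loosely, the stopgap is little more than a scheduler loop;
the command below is a hypothetical wrapper, not an actual WMF script:

import subprocess
import time

LIGHT_STEPS = [
    ["./regenerate-light-dumps.sh", "enwiki"],   # hypothetical wrapper script
]

while True:
    for cmd in LIGHT_STEPS:
        subprocess.call(cmd)                     # stubs, logs, current pages, ...
    time.sleep(7 * 24 * 3600)                    # roughly weekly

The real work is making the existing dump steps runnable independently of
the monolithic full-history job.)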

-Robert Rohde

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008

2009-01-29 Thread Gerard Meijssen
Hoi,
Two things:

   - if we abort the backup now, we do not know if we WILL have something at
   the time it would have ended
   - if the toolserver data can provide a service as a stop gap measure why
   not provide that in the mean time

Thanks,
  GerardM

2009/1/29 Alai 

> Russell Blau writes:
> > FWIW, I'll add my vote for aborting the current dump *now* if we don't
> > expect it ever to actually be finished, so we can at least get a fresh
> dump
> > of the current pages.
>
> I'd like to third/fourth/(other ordinal) this idea too.  I've been using
> the
> (in comparison tiny) SQL dumps for various purposes, and it's most vexing
> that these have to wait until the end (or lack of any end...) of the larger
> XML dumps.  (The same data is replicated on the toolserver, of course, but
> I'd get beaten to death if I tried to run some of the data collection
> scripts
> I've been running offline, there.)
>
> Cheers,
> Alai.
>
>
> ___
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008

2009-01-29 Thread Alai
Russell Blau writes:
> FWIW, I'll add my vote for aborting the current dump *now* if we don't 
> expect it ever to actually be finished, so we can at least get a fresh dump 
> of the current pages.

I'd like to third/fourth/(other ordinal) this idea too.  I've been using the
(in comparison tiny) SQL dumps for various purposes, and it's most vexing
that these have to wait until the end (or lack of any end...) of the larger
XML dumps.  (The same data is replicated on the toolserver, of course, but
I'd get beaten to death if I tried to run some of the data collection scripts
I've been running offline, there.)

Cheers,
Alai.


___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008

2009-01-28 Thread Christian Storm
That would be great.  I second this notion wholeheartedly.


On Jan 28, 2009, at 7:34 AM, Russell Blau wrote:

> "Brion Vibber"  wrote in message
> news:497f9c35.9050...@wikimedia.org...
>> On 1/27/09 2:55 PM, Robert Rohde wrote:
>>> On Tue, Jan 27, 2009 at 2:42 PM, Brion Vibber
>>> wrote:
 On 1/27/09 2:35 PM, Thomas Dalton wrote:
> The way I see it, what we need is to get a really powerful server
 Nope, it's a software architecture issue. We'll restart it with  
 the new
 arch when it's ready to go.
>>> The simplest solution is just to kill the current dump job if you  
>>> have
>>> faith that a new architecture can be put in place in less than a  
>>> year.
>>
>> We'll probably do that.
>>
>> -- brion
>
> FWIW, I'll add my vote for aborting the current dump *now* if we don't
> expect it ever to actually be finished, so we can at least get a  
> fresh dump
> of the current pages.
>
> Russ
>
>
>
>
> ___
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l


___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008

2009-01-28 Thread Brion Vibber
Probably wise to poke in a hack to skip the history first. :)

-- brion vibber (brion @ wikimedia.org)

On Jan 28, 2009, at 7:34, "Russell Blau"  wrote:

> "Brion Vibber"  wrote in message
> news:497f9c35.9050...@wikimedia.org...
>> On 1/27/09 2:55 PM, Robert Rohde wrote:
>>> On Tue, Jan 27, 2009 at 2:42 PM, Brion Vibber
>>> wrote:
 On 1/27/09 2:35 PM, Thomas Dalton wrote:
> The way I see it, what we need is to get a really powerful server
 Nope, it's a software architecture issue. We'll restart it with  
 the new
 arch when it's ready to go.
>>> The simplest solution is just to kill the current dump job if you  
>>> have
>>> faith that a new architecture can be put in place in less than a  
>>> year.
>>
>> We'll probably do that.
>>
>> -- brion
>
> FWIW, I'll add my vote for aborting the current dump *now* if we don't
> expect it ever to actually be finished, so we can at least get a  
> fresh dump
> of the current pages.
>
> Russ
>
>
>
>
> ___
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008

2009-01-28 Thread Russell Blau
"Brion Vibber"  wrote in message 
news:497f9c35.9050...@wikimedia.org...
> On 1/27/09 2:55 PM, Robert Rohde wrote:
>> On Tue, Jan 27, 2009 at 2:42 PM, Brion Vibber 
>> wrote:
>>> On 1/27/09 2:35 PM, Thomas Dalton wrote:
 The way I see it, what we need is to get a really powerful server
>>> Nope, it's a software architecture issue. We'll restart it with the new
>>> arch when it's ready to go.
>> The simplest solution is just to kill the current dump job if you have
>> faith that a new architecture can be put in place in less than a year.
>
> We'll probably do that.
>
> -- brion

FWIW, I'll add my vote for aborting the current dump *now* if we don't 
expect it ever to actually be finished, so we can at least get a fresh dump 
of the current pages.

Russ




___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008

2009-01-27 Thread Brion Vibber
On 1/27/09 2:55 PM, Robert Rohde wrote:
> On Tue, Jan 27, 2009 at 2:42 PM, Brion Vibber  wrote:
>> On 1/27/09 2:35 PM, Thomas Dalton wrote:
>>> The way I see it, what we need is to get a really powerful server
>> Nope, it's a software architecture issue. We'll restart it with the new
>> arch when it's ready to go.
>
> I don't know what your timetable is, but what about doing something to
> address the other aspects of the dump (logs, stubs, etc.) that are in
> limbo while full history chugs along.  All the other enwiki files are
> now 3 months old and that is already enough to inconvenience some
> people.
>
> The simplest solution is just to kill the current dump job if you have
> faith that a new architecture can be put in place in less than a year.

We'll probably do that.

-- brion

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008

2009-01-27 Thread Robert Rohde
On Tue, Jan 27, 2009 at 2:42 PM, Brion Vibber  wrote:
> On 1/27/09 2:35 PM, Thomas Dalton wrote:
>> The way I see it, what we need is to get a really powerful server
>
> Nope, it's a software architecture issue. We'll restart it with the new
> arch when it's ready to go.

I don't know what your timetable is, but what about doing something to
address the other aspects of the dump (logs, stubs, etc.) that are in
limbo while full history chugs along.  All the other enwiki files are
now 3 months old and that is already enough to inconvenience some
people.

The simplest solution is just to kill the current dump job if you have
faith that a new architecture can be put in place in less than a year.

-Robert Rohde

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008

2009-01-27 Thread Brion Vibber
On 1/27/09 2:35 PM, Thomas Dalton wrote:
> The way I see it, what we need is to get a really powerful server

Nope, it's a software architecture issue. We'll restart it with the new 
arch when it's ready to go.

-- brion

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008

2009-01-27 Thread Thomas Dalton
> Whether we want to let the current process continue to try and finish
> or not, I would seriously suggest someone look into redumping the rest
> of the enwiki files (i.e. logs, current pages, etc.).  I am also among
> the people that care about having reasonably fresh dumps and it really
> is a problem that the other dumps (e.g. stubs-meta-history) are frozen
> while we wait to see if the full history dump can run to completion.

Even if we do let it finish, I'm not sure a dump of what Wikipedia was
like 13 months ago is much use... The way I see it, what we need is to
get a really powerful server to do the dump just once at a reasonable
speed and then we'll have a previous dump to build on so future ones
would be more reasonable.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008

2009-01-27 Thread Robert Rohde
The problem, as I understand it (and Brion may come by to correct me)
is essentially that the current dump process is designed in a way that
can't be sustained given the size of enwiki.  It really needs to be
re-engineered, which means that developer time is needed to create a
new approach to dumping.

The main target for improvement is almost certainly parallelizing the
process so that there wouldn't be a single monolithic dump process, but
rather a lot of little processes working in parallel.  That would also
ensure that if a single process gets stuck and dies, the entire dump
doesn't need to start over.
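
(A hedged sketch of that shape, with dump_range() standing in for the real
per-range dump machinery:

from concurrent.futures import ProcessPoolExecutor, as_completed

def dump_range(start_id, end_id):
    # Hypothetical: dump pages with start_id <= page_id < end_id and
    # return the name of the part file produced.
    return "part-%09d-%09d.xml.bz2" % (start_id, end_id)

def parallel_dump(max_page_id, chunk=100000, workers=8):
    ranges = [(s, min(s + chunk, max_page_id)) for s in range(0, max_page_id, chunk)]
    done, failed = [], []
    with ProcessPoolExecutor(workers) as pool:
        futures = {pool.submit(dump_range, s, e): (s, e) for s, e in ranges}
        for fut in as_completed(futures):
            try:
                done.append(fut.result())
            except Exception:
                failed.append(futures[fut])   # only these ranges need a re-run
    return done, failed

if __name__ == "__main__":
    parts, to_retry = parallel_dump(2500000)
    print(len(parts), "parts,", len(to_retry), "to retry")

A failed range just goes back on the queue; the parts already written stay
on disk.)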


By way of observation, the dewiki's full history dumps in 26 hours
with 96% prefetched (i.e. loaded from previous dumps).  That suggests
that even starting from scratch (prefetch = 0%) it should dump in ~25
days under the current process.  enwiki is perhaps 3-6 times larger
than dewiki depending on how you do the accounting, which implies
dumping the whole thing from scratch would take ~5 months if the
process scaled linearly.  Of course it doesn't scale linearly, and we
end up with a prediction for completion that is currently 10 months
away (which amounts to a 13 month total execution).  And of course, if
there is any serious error in the next ten months the entire process
could die with no result.
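
(Rough check of that arithmetic, assuming the prefetched 96% costs
essentially nothing:

dewiki_hours = 26.0
prefetched = 0.96
from_scratch_days = dewiki_hours / (1 - prefetched) / 24   # ~27 days
for factor in (3, 6):
    print("enwiki x%d: ~%.0f days" % (factor, factor * from_scratch_days))

which gives roughly 81-162 days, i.e. about 3-5 months under linear
scaling.)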


Whether we want to let the current process continue to try and finish
or not, I would seriously suggest someone look into redumping the rest
of the enwiki files (i.e. logs, current pages, etc.).  I am also among
the people that care about having reasonably fresh dumps and it really
is a problem that the other dumps (e.g. stubs-meta-history) are frozen
while we wait to see if the full history dump can run to completion.

-Robert Rohde


On Tue, Jan 27, 2009 at 11:24 AM, Christian Storm  wrote:
>>> On 1/4/09 6:20 AM, yegg at alum.mit.edu wrote:
>>> The current enwiki database dump 
>>> (http://download.wikimedia.org/enwiki/20081008/
>>> ) has been crawling along since 10/15/2008.
>> The current dump system is not sustainable on very large wikis and
>> is being replaced. You'll hear about it when we have the new one in
>> place. :)
>> -- brion
>
> Following up on this thread:  
> http://lists.wikimedia.org/pipermail/wikitech-l/2009-January/040841.html
>
> Brion,
>
> Can you offer any general timeline estimates (weeks, months, 1/2
> year)?  Are there any alternatives to retrieving the article data
> beyond directly crawling
> the site?  I know this is verboten but we are in dire need of
> retrieving this data and don't know of any alternatives.  The current
> estimate of end of year is
> too long for us to wait.  Unfortunately, Wikipedia is a favored source
> for students to plagiarize from, which makes out-of-date content a real
> issue.
>
> Is there any way to help this process along?  We can donate disk
> drives, developer time, ...?  There is another possibility
> that we could offer but I would need to talk with someone at the
> wikimedia foundation offline.  Is there anyone I could
> contact?
>
> Thanks for any information and/or direction you can give.
>
> Christian
>
>
> ___
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008

2009-01-27 Thread Bilal Abdul Kader
I have a decent server that is dedicated to a Wikipedia project that
depends on the fresh dumps. Can this be used in any way to speed up the process
of generating the dumps?

bilal


On Tue, Jan 27, 2009 at 2:24 PM, Christian Storm wrote:

> >> On 1/4/09 6:20 AM, yegg at alum.mit.edu wrote:
> >> The current enwiki database dump (
> http://download.wikimedia.org/enwiki/20081008/
> >> ) has been crawling along since 10/15/2008.
> > The current dump system is not sustainable on very large wikis and
> > is being replaced. You'll hear about it when we have the new one in
> > place. :)
> > -- brion
>
> Following up on this thread:
> http://lists.wikimedia.org/pipermail/wikitech-l/2009-January/040841.html
>
> Brion,
>
> Can you offer any general timeline estimates (weeks, months, 1/2
> year)?  Are there any alternatives to retrieving the article data
> beyond directly crawling
> the site?  I know this is verboten but we are in dire need of
> retrieving this data and don't know of any alternatives.  The current
> estimate of end of year is
> too long for us to wait.  Unfortunately, Wikipedia is a favored source
> for students to plagiarize from, which makes out-of-date content a real
> issue.
>
> Is there any way to help this process along?  We can donate disk
> drives, developer time, ...?  There is another possibility
> that we could offer but I would need to talk with someone at the
> wikimedia foundation offline.  Is there anyone I could
> contact?
>
> Thanks for any information and/or direction you can give.
>
> Christian
>
>
> ___
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


[Wikitech-l] Enwiki dump crawling since 10/15/2008

2009-01-27 Thread Christian Storm
>> On 1/4/09 6:20 AM, yegg at alum.mit.edu wrote:
>> The current enwiki database dump 
>> (http://download.wikimedia.org/enwiki/20081008/ 
>> ) has been crawling along since 10/15/2008.
> The current dump system is not sustainable on very large wikis and  
> is being replaced. You'll hear about it when we have the new one in  
> place. :)
> -- brion

Following up on this thread:  
http://lists.wikimedia.org/pipermail/wikitech-l/2009-January/040841.html

Brion,

Can you offer any general timeline estimates (weeks, months, 1/2  
year)?  Are there any alternatives to retrieving the article data  
beyond directly crawling
the site?  I know this is verboten but we are in dire need of  
retrieving this data and don't know of any alternatives.  The current  
estimate of end of year is
too long for us to wait.  Unfortunately, Wikipedia is a favored source
for students to plagiarize from, which makes out-of-date content a real
issue.

Is there any way to help this process along?  We can donate disk  
drives, developer time, ...?  There is another possibility
that we could offer but I would need to talk with someone at the  
wikimedia foundation offline.  Is there anyone I could
contact?

Thanks for any information and/or direction you can give.

Christian


___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Enwiki Dump Crawling since 10/15/2008

2009-01-05 Thread yegg
Understood--thank you.  Any time-frame for when this might be launched?

On Mon, Jan 5, 2009 at 1:47 PM, Brion Vibber  wrote:
> On 1/4/09 6:20 AM, y...@alum.mit.edu wrote:
>> The current enwiki database dump
>> (http://download.wikimedia.org/enwiki/20081008/) has been crawling
>> along since 10/15/2008.
>
> The current dump system is not sustainable on very large wikis and is
> being replaced. You'll hear about it when we have the new one in place. :)
>
> -- brion
>
> ___
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Enwiki Dump Crawling since 10/15/2008

2009-01-05 Thread Brion Vibber
On 1/4/09 6:20 AM, y...@alum.mit.edu wrote:
> The current enwiki database dump
> (http://download.wikimedia.org/enwiki/20081008/) has been crawling
> along since 10/15/2008.

The current dump system is not sustainable on very large wikis and is 
being replaced. You'll hear about it when we have the new one in place. :)

-- brion

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Enwiki Dump Crawling since 10/15/2008

2009-01-05 Thread yegg
I realize that.  I'm looking forward to the next dump :)

I had gotten used to a dump of that part about every 2 months; it's
been about 3 now, and the way this one is headed it will be 12 before I
see another!

On Mon, Jan 5, 2009 at 9:58 AM, Russell Blau  wrote:
>  wrote in message
> news:1c624fe40901040620g1c69d070q9f830da33e84f...@mail.gmail.com...
>> The current enwiki database dump
>> (http://download.wikimedia.org/enwiki/20081008/) has been crawling
>> along since 10/15/2008.
> ...
>> Is this purposeful?  And is there anything I (or other community
>> members) can do about it?  I personally just need the pages-articles
>> part.  Would it be possible to dump up to that part on a different
>> thread?
>
> That portion of the dump is already done, and available at
> http://download.wikimedia.org/enwiki/20081008/enwiki-20081008-pages-articles.xml.bz2
>
> Russ
>
>
>
>
> ___
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Enwiki Dump Crawling since 10/15/2008

2009-01-05 Thread Russell Blau
 wrote in message 
news:1c624fe40901040620g1c69d070q9f830da33e84f...@mail.gmail.com...
> The current enwiki database dump
> (http://download.wikimedia.org/enwiki/20081008/) has been crawling
> along since 10/15/2008.
...
> Is this purposeful?  And is there anything I (or other community
> members) can do about it?  I personally just need the pages-articles
> part.  Would it be possible to dump up to that part on a different
> thread?

That portion of the dump is already done, and available at 
http://download.wikimedia.org/enwiki/20081008/enwiki-20081008-pages-articles.xml.bz2

Russ




___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


[Wikitech-l] Enwiki Dump Crawling since 10/15/2008

2009-01-04 Thread yegg
The current enwiki database dump
(http://download.wikimedia.org/enwiki/20081008/) has been crawling
along since 10/15/2008.

I realize that dumps can appear stalled in their normal processing
(http://meta.wikimedia.org/wiki/Data_dumps#Schedule), but in the
recent past (as far as I know) they have not been stalled this long
without there being something actually wrong.  The completion date for
"All pages with complete page edit history" (where it is currently)
fluctuates within the latter half of 2009.

Is this purposeful?  And is there anything I (or other community
members) can do about it?  I personally just need the pages-articles
part.  Would it be possible to dump up to that part on a different
thread?

Thank you for your time.

Gabriel Weinberg

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l