Re: [Wikitech-l] Dump throughput

2009-05-14 Thread Brion Vibber
On 5/14/09 2:17 PM, Alex wrote:
> Brion Vibber wrote:
>> I believe that refers to yesterday's replication lag on the machine
>> running watchlist queries; the abstract dump process that was hitting
>> that particular server was aborted yesterday.
>>
>
> Is Yahoo still using those? Looking at the last successful one for
> enwiki, it looks like it took a little more than a day to generate.
> Combined with all the other projects, that seems like an awful lot of
> processing time spent for something of questionable utility to anyone
> but Yahoo.

Actually, yes. :) I've occasionally heard from other folks using them 
for stuff, but indeed Yahoo is still grabbing them.

The script needs some cleaning up, and it wouldn't hurt to rearchitect 
how it's generated in general. (First-sentence summary extraction is 
also being done in OpenSearchXml for IE 8's search support, and I've 
improved the implementation there. It should get merged back, and 
probably merged into core so we can make the extracts more generally 
available for other uses.)
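
[A rough illustration of the kind of first-sentence extraction being discussed; this is not the OpenSearchXml or abstract-dump code, and the function name and regexes below are invented purely for the sketch.]

import re

def first_sentence(wikitext: str) -> str:
    """Crude first-sentence extraction from wikitext (illustration only)."""
    text = re.sub(r"\{\{[^{}]*\}\}", "", wikitext)                 # drop simple {{templates}}
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)  # unwrap [[links|labels]]
    text = re.sub(r"'{2,}", "", text)                              # strip ''italic''/'''bold''' quotes
    text = re.sub(r"<[^>]+>", "", text)                            # strip HTML-ish tags
    for line in text.splitlines():
        line = line.strip()
        if not line or line[0] in "{|=*#:;":                       # skip tables, headings, lists
            continue
        match = re.match(r"(.+?[.!?])(?:\s|$)", line)              # cut at the first sentence boundary
        return match.group(1) if match else line
    return ""

# first_sentence("'''Python''' is a [[programming language|language]]. It is popular.")
# -> "Python is a language."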

-- brion



Re: [Wikitech-l] Dump throughput

2009-05-14 Thread Alex
Brion Vibber wrote:
> I believe that refers to yesterday's replication lag on the machine 
> running watchlist queries; the abstract dump process that was hitting 
> that particular server was aborted yesterday.
> 

Is Yahoo still using those? Looking at the last successful one for
enwiki, it looks like it took a little more than a day to generate.
Combined with all the other projects, that seems like an awful lot of
processing time spent for something of questionable utility to anyone
but Yahoo.

-- 
Alex (wikipedia:en:User:Mr.Z-man)



Re: [Wikitech-l] Dump throughput

2009-05-13 Thread Brion Vibber
On 5/13/09 9:06 PM, Robert Rohde wrote:
> There is a thread on enwiki WP:VPT [1] speculating that the sluggish
> server performance that some people are seeing is being caused by the
> dumper working on enwiki.
>
> This strikes me as implausible, but I thought I'd mention it here in
> case it could be true.  I suppose it is at least possible to expand
> the dumper enough to have a noticeable effect on other aspects of site
> performance, but I wouldn't expect it to be likely.

I believe that refers to yesterday's replication lag on the machine 
running watchlist queries; the abstract dump process that was hitting 
that particular server was aborted yesterday.

-- brion

>
> -Robert Rohde
>
> [1] 
> http://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#Slow_Server
>




Re: [Wikitech-l] Dump throughput

2009-05-13 Thread Daniel Kinzler
Brion Vibber wrote:
>> On a related note: I noticed that the meta-info dumps like
>> stub-meta-history.xml.gz etc appear to be generated from the full history dump -
>> and thus fail if the full history dump fails, and get delayed if the full
>> history dump gets delayed.
> 
> Quite the opposite; the full history dump is generated from the stub  
> skeleton.

Good to know, thanks for clarifying.

-- daniel



Re: [Wikitech-l] Dump throughput

2009-05-13 Thread Robert Rohde
There is a thread on enwiki WP:VPT [1] speculating that the sluggish
server performance that some people are seeing is being caused by the
dumper working on enwiki.

This strikes me as implausible, but I thought I'd mention it here in
case it could be true.  I suppose it is at least possible to expand
the dumper enough to have a noticeable effect on other aspects of site
performance, but I wouldn't expect it to be likely.

-Robert Rohde

[1] http://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#Slow_Server



Re: [Wikitech-l] Dump throughput

2009-05-13 Thread Brion Vibber

On May 13, 2009, at 7:13, Daniel Kinzler wrote:

>> Now, to be even more useful, database dumps should be produced on
>> *regular* intervals.  That way, we can compare various measures
>> such as article growth, link counts or usage of certain words,
>> without having to introduce the exact dump time in the count.
>
> On a related note: I noticed that the meta-info dumps like
> stub-meta-history.xml.gz etc appear to be generated from the full history dump -
> and thus fail if the full history dump fails, and get delayed if the full
> history dump gets delayed.

Quite the opposite; the full history dump is generated from the stub  
skeleton.

-- brion
>
>
> There are a lot of things that can be done with the meta-info alone, and that
> dump seems like it should be easy and fast to generate. So I propose to generate
> it from the database directly, instead of making it depend on the full history
> dump, which is slow and the most likely to break.
>
> -- daniel
>


Re: [Wikitech-l] Dump throughput

2009-05-13 Thread Anthony
On Wed, May 13, 2009 at 10:13 AM, Daniel Kinzler wrote:

> > Now, to be even more useful, database dumps should be produced on
> > *regular* intervals.  That way, we can compare various measures
> > such as article growth, link counts or usage of certain words,
> > without having to introduce the exact dump time in the count.
>
> On a related note: I noticed that the meta-info dumps like
> stub-meta-history.xml.gz etc appear to be generated from the full history
> dump -
> and thus fail if the full history dump fails, and get delayed if the full
> history dump gets delayed.
>

Is that something that changed?  It used to be the other way around.
pages-meta-history.xml.bz2 was generated from stub-meta-history.xml.gz.


Re: [Wikitech-l] Dump throughput

2009-05-13 Thread Daniel Kinzler
> Now, to be even more useful, database dumps should be produced on
> *regular* intervals.  That way, we can compare various measures
> such as article growth, link counts or usage of certain words,
> without having to introduce the exact dump time in the count.

On a related note: I noticed that the meta-info dumps like
stub-meta-history.xml.gz etc appear to be generated from the full history dump -
and thus fail if the full history dump fails, and get delayed if the full
history dump gets delayed.

There are a lot of things that can be done with the meta-info alone, and that
dump seems like it should be easy and fast to generate. So I propose to generate
it from the database directly, instead of making it depend on the full history
dump, which is slow and the most likely to break.

-- daniel
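
[Brion's reply, above in this archive, clarifies that the dependency actually runs the other way around: the stub dump is written straight from the database first, and the full-history dump is then built by filling revision text into that stub skeleton. Below is a conceptual sketch of that two-pass shape; the function names and data layout are invented for illustration, and this is not the actual MediaWiki dump code.]

import xml.etree.ElementTree as ET

def write_stub_dump(db_pages, stub_path):
    """Pass 1: write page/revision metadata straight from the database."""
    root = ET.Element("mediawiki")
    for page_id, title, revisions in db_pages:
        page = ET.SubElement(root, "page")
        ET.SubElement(page, "title").text = title
        ET.SubElement(page, "id").text = str(page_id)
        for rev_id, timestamp, contributor in revisions:
            rev = ET.SubElement(page, "revision")
            ET.SubElement(rev, "id").text = str(rev_id)
            ET.SubElement(rev, "timestamp").text = timestamp
            ET.SubElement(rev, "contributor").text = contributor
            # no <text> element here; that is what makes it a "stub"
    ET.ElementTree(root).write(stub_path, encoding="utf-8")

def write_full_history_dump(stub_path, fetch_text, full_path):
    """Pass 2: re-read the stub skeleton and fill in revision text."""
    tree = ET.parse(stub_path)
    for rev in tree.iter("revision"):
        rev_id = int(rev.findtext("id"))
        # the slow, failure-prone part: fetching text from storage
        ET.SubElement(rev, "text").text = fetch_text(rev_id)
    tree.write(full_path, encoding="utf-8")

[In this shape the stub pass only touches the metadata tables, so it can complete and be published even when the slow text-fetching pass breaks, which is essentially the independence being asked for here.]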



Re: [Wikitech-l] Dump throughput

2009-05-12 Thread Tomasz Finc
Aryeh Gregor wrote:
> On Mon, May 11, 2009 at 3:27 PM, Brian  wrote:
>> In my opinion, fragmentation of conversations onto ever more mailing lists
>> discourages contribution.
> 
> I have to agree that I don't think the dump discussion traffic seemed
> large enough to warrant a whole new mailing list.

If we find that doesn't work then we'll steer the conversation back to 
wikitech.

But here is my reasoning:

The admin list was meant to receive any and all automated mails from the 
backup system, and I didn't want to burden the readers of wikitech-l with 
that noise. Previously there was a single recipient for any failures, 
which was neither very scalable nor transparent.

The discussion list was meant to capture consumers who have approached 
me: people who are active users of the dumps but have no direct involvement 
with MediaWiki and are not regular participants of this list. This covers 
researchers, search engines, etc., who are not concerned with all the other 
conversations on wikitech-l and simply want updates on any changes within 
the dumps system.

--tomasz



Re: [Wikitech-l] Dump throughput

2009-05-12 Thread Aryeh Gregor
On Mon, May 11, 2009 at 3:27 PM, Brian  wrote:
> In my opinion, fragmentation of conversations onto ever more mailing lists
> discourages contribution.

I have to agree that I don't think the dump discussion traffic seemed
large enough to warrant a whole new mailing list.



Re: [Wikitech-l] Dump throughput

2009-05-11 Thread Brian
In my opinion, fragmentation of conversations onto ever more mailing lists
discourages contribution.

On Mon, May 11, 2009 at 1:04 PM, Tomasz Finc  wrote:

> Andreas Meier wrote:
> > Tomasz Finc wrote:
> >> Tomasz Finc wrote:
> >>> Russell Blau wrote:
> >>>> "Erik Zachte" wrote in message
> >>>> news:002d01c9cd8d$3355beb0$9a013c...@com...
> >>>>> Tomasz, the amount of dump power that you managed to activate is
> >>>>> impressive.
> >>>>> 136 dumps yesterday, today already 110 :-) Out of 760 total.
> >>>>> Of course there are small and large dumps, but this is very encouraging.
> >>>>>
> >>>> Yes, thank you Tomasz for your attention to this.  The commonswiki process
> >>>> looks like it *might* be dead, by the way.
> >>> Don't think so, as I actively see it being updated. It's currently set
> >>> to finish its second-to-last step on 2009-05-06 02:53:21.
> >>>
> >>> No one touch anything while it's still going ;)
> >> Commons finished just fine along with every single one of the other
> >> small & mid-size wikis waiting to be picked up. Now we're just left with
> >> the big wikis to finish.
> >>
> >
> > First, many thanks to Tomasz for the now-running system.
> >
> > Now there are two running dumps of frwiki at the same time:
> > http://download.wikipedia.org/frwiki/20090509/ and
> > http://download.wikipedia.org/frwiki/20090506/
> > I don't know if this was intended. Usually this should not happen.
> >
>
> This has been dealt with. Let's take any further operations
> conversations over to xmldatadumps-admi...@lists.wikimedia.org
>
> --tomasz
>
>


Re: [Wikitech-l] Dump throughput

2009-05-11 Thread Tomasz Finc
Andreas Meier wrote:
> Tomasz Finc wrote:
>> Tomasz Finc wrote:
>>> Russell Blau wrote:
>>>> "Erik Zachte" wrote in message
>>>> news:002d01c9cd8d$3355beb0$9a013c...@com...
>>>>> Tomasz, the amount of dump power that you managed to activate is
>>>>> impressive.
>>>>> 136 dumps yesterday, today already 110 :-) Out of 760 total.
>>>>> Of course there are small and large dumps, but this is very encouraging.
>>>>>
>>>> Yes, thank you Tomasz for your attention to this.  The commonswiki process
>>>> looks like it *might* be dead, by the way.
>>> Don't think so, as I actively see it being updated. It's currently set
>>> to finish its second-to-last step on 2009-05-06 02:53:21.
>>>
>>> No one touch anything while it's still going ;)
>> Commons finished just fine along with every single one of the other
>> small & mid-size wikis waiting to be picked up. Now we're just left with
>> the big wikis to finish.
>>
> 
> First, many thanks to Tomasz for the now-running system.
> 
> Now there are two running dumps of frwiki at the same time: 
> http://download.wikipedia.org/frwiki/20090509/ and
> http://download.wikipedia.org/frwiki/20090506/
> I don't know if this was intended. Usually this should not happen.
> 

This has been dealt with. Let's take any further operations 
conversations over to xmldatadumps-admi...@lists.wikimedia.org

--tomasz




Re: [Wikitech-l] Dump throughput

2009-05-09 Thread Andreas Meier
Tomasz Finc wrote:
> Tomasz Finc wrote:
>> Russell Blau wrote:
>>> "Erik Zachte" wrote in message
>>> news:002d01c9cd8d$3355beb0$9a013c...@com...
>>>> Tomasz, the amount of dump power that you managed to activate is
>>>> impressive.
>>>> 136 dumps yesterday, today already 110 :-) Out of 760 total.
>>>> Of course there are small and large dumps, but this is very encouraging.
>>>>
>>> Yes, thank you Tomasz for your attention to this.  The commonswiki process
>>> looks like it *might* be dead, by the way.
>> Don't think so, as I actively see it being updated. It's currently set
>> to finish its second-to-last step on 2009-05-06 02:53:21.
>>
>> No one touch anything while it's still going ;)
> 
> Commons finished just fine along with every single one of the other 
> small & mid-size wikis waiting to be picked up. Now we're just left with 
> the big wikis to finish.
> 

First, many thanks to Tomasz for the now-running system.

Now there are two running dumps of frwiki at the same time: 
http://download.wikipedia.org/frwiki/20090509/ and
http://download.wikipedia.org/frwiki/20090506/
I don't know if this was intended. Usually this should not happen.

Best regards

Andim




[Wikitech-l] Dump throughput

2009-05-08 Thread Erik Zachte
Lars wrote

> Now, to be even more useful, database dumps should be produced on
> *regular* intervals.  That way, we can compare various measures
> such as article growth, link counts or usage of certain words,
> without having to introduce the exact dump time in the count.

That would complicate matters further, though.
Also, as each new dump for a given wiki takes a little longer than the last,
this would mean lots of slack in the schedule and
would force servers to be idle part of the time.

I hate to distract Tomasz from optimizing the dump process,
so I'll postpone new feature requests, but an option to order 
gift wrappings for the dumps would be neat :-)

Erik Zachte







Re: [Wikitech-l] Dump throughput

2009-05-07 Thread Lars Aronsson
Tomasz Finc wrote:

> Commons finished just fine along with every single one of the 
> other small & mid-size wikis waiting to be picked up. Now we're 
> just left with the big wikis to finish.

The new dump processes started on May 1 and sped up to twelve 
processes on May 4.  As of yesterday May 7, dumps have started on 
all databases.  While the big ones (enwiki, dewiki, ...) are still 
running, tokiponawiktionary is the first to have its second dump 
in this round.  They were produced on May 1 and 7.  Soon, all 
small and medium sized databases will have multiple dumps, with 
roughly 4 day intervals.  This is a real improvement over the 
previous 12 months, and I really hope we don't fall down again.

Now, to be even more useful, database dumps should be produced on 
*regular* intervals.  That way, we can compare various measures 
such as article growth, link counts or usage of certain words, 
without having to introduce the exact dump time in the count.

An easy way to implement this is to delay the next dump of a 
database to exactly one week after the previous dump started.

For example, the last dump of svwiki (Swedish Wikipedia) started 
at 20:48 (UTC) on Tuesday May 5.  So let this time of week (20:48 
on Tuesdays) be the timeslot for svwiki. If its turn comes up any 
earlier, the next dump should be delayed until 20:48 on May 12.
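
[A minimal sketch of the weekly-timeslot rule described above, assuming dump start times are tracked as UTC datetimes; the function name and calling convention are invented for illustration.]

from datetime import datetime, timedelta, timezone

def next_allowed_start(previous_start, turn_comes_up):
    """Hold a wiki to its weekly timeslot: exactly one week after the previous
    dump started, or as soon as its turn arrives if that is already later."""
    slot = previous_start + timedelta(days=7)
    return max(slot, turn_comes_up)

# svwiki example from above: the previous dump started Tuesday 2009-05-05 20:48 UTC,
# so even if its turn comes up earlier, it waits until 20:48 UTC on May 12.
prev = datetime(2009, 5, 5, 20, 48, tzinfo=timezone.utc)
turn = datetime(2009, 5, 10, 9, 0, tzinfo=timezone.utc)
print(next_allowed_start(prev, turn))   # 2009-05-12 20:48:00+00:00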

That way, the number of mentions of "EU parliament" (elections are 
due on June 7) can be compared on a weekly (7 day) basis, rather 
than on a 5-and-a-half day basis.  The 7 day interval removes any 
measurement bias from weekday/weekend variations.

Another advantage is that we can expect new dumps of svwiki by 
Wednesday lunch, and can plan our weekly projects accordingly.

This plan does not help the larger projects, which take many days 
to dump.  They would still benefit from optimizations of the dump 
process itself.  Right now the enwiki is extracting "page 
abstracts for Yahoo" and will continue to do so until May 21.  I 
really hope Yahoo appreciates this, or else the current dump 
should be advanced to its next stage to save days and weeks. Maybe 
the pages-articles.xml part of the dump can be produced on a 
regular weekly (or fortnightly) basis even for the larger 
projects, while the other parts are produced less often.



-- 
  Lars Aronsson (l...@aronsson.se)
  Aronsson Datateknik - http://aronsson.se



Re: [Wikitech-l] Dump throughput

2009-05-07 Thread Tomasz Finc
Russell Blau wrote:
> "Tomasz Finc"  wrote in message 
> news:4a032be3.60...@wikimedia.org...
>> Commons finished just fine along with every single one of the other
>> small & mid-size wikis waiting to be picked up. Now we're just left with
>> the big wikis to finish.
> 
> This is probably a stupid question (because it depends on umpteen different 
> variables), but would the remaining "big sized wiki's" finish any faster if 
> you stopped the dump processes for the smaller wikis that have already had a 
> dump complete within the past week and are now starting on their second 
> rounds?
> 

Not a bad question at all. I've actually been turning down the amount of 
work to see if it improves any of the larger ones. No increase in 
processing just yet.

--tomasz



Re: [Wikitech-l] Dump throughput

2009-05-07 Thread Russell Blau

"Tomasz Finc"  wrote in message 
news:4a032be3.60...@wikimedia.org...
>
> Commons finished just fine along with every single one of the other
> small & mid-size wikis waiting to be picked up. Now we're just left with
> the big wikis to finish.

This is probably a stupid question (because it depends on umpteen different 
variables), but would the remaining "big wikis" finish any faster if 
you stopped the dump processes for the smaller wikis that have already had a 
dump complete within the past week and are now starting on their second 
rounds?

Russ






Re: [Wikitech-l] Dump throughput

2009-05-07 Thread Tomasz Finc
Tomasz Finc wrote:
> Russell Blau wrote:
>> "Erik Zachte"  wrote in message 
>> news:002d01c9cd8d$3355beb0$9a013c...@com...
>>> Tomasz, the amount of dump power that you managed to activate is 
>>> impressive.
>>> 136 dumps yesterday, today already 110 :-) Out of 760 total.
>>> Of course there are small and large dumps, but this is very encouraging.
>>>
>> Yes, thank you Tomasz for your attention to this.  The commonswiki process 
>> looks like it *might* be dead, by the way.
> 
> Don't think so, as I actively see it being updated. It's currently set
> to finish its second-to-last step on 2009-05-06 02:53:21.
> 
> No one touch anything while it's still going ;)

Commons finished just fine along with every single one of the other 
small & mid-size wikis waiting to be picked up. Now we're just left with 
the big wikis to finish.

--tomasz



Re: [Wikitech-l] Dump throughput

2009-05-05 Thread Tomasz Finc
Russell Blau wrote:
> "Erik Zachte"  wrote in message 
> news:002d01c9cd8d$3355beb0$9a013c...@com...
>> Tomasz, the amount of dump power that you managed to activate is 
>> impressive.
>> 136 dumps yesterday, today already 110 :-) Out of 760 total.
>> Of course there are small and large dumps, but this is very encouraging.
>>
> 
> Yes, thank you Tomasz for your attention to this.  The commonswiki process 
> looks like it *might* be dead, by the way.

Don't think so, as I actively see it being updated. It's currently set
to finish its second-to-last step on 2009-05-06 02:53:21.

No one touch anything while it's still going ;)


--tomasz




Re: [Wikitech-l] Dump throughput

2009-05-05 Thread Bilal Abdul Kader
Hi Tomasz,
Any ideas about a fresher dump of enwiki pages-meta-history?

bilal


>
> 2009/5/5 Erik Zachte 
>
> > Tomasz, the amount of dump power that you managed to activate is
> > impressive.
> > 136 dumps yesterday, today already 110 :-) Out of 760 total.
> > Of course there are small and large dumps, but this is very encouraging.
> >
> > Erik Zachte
> >
> >
>


Re: [Wikitech-l] Dump throughput

2009-05-05 Thread Gerard Meijssen
Hi,
This is the kind of news that will make many people happy. Obviously what
everyone is waiting for is the en.wp dump to finish... :) But it is great to
have many moments to be happy.
thanks,
   GerardM

2009/5/5 Erik Zachte 

> Tomasz, the amount of dump power that you managed to activate is
> impressive.
> 136 dumps yesterday, today already 110 :-) Out of 760 total.
> Of course there are small and large dumps, but this is very encouraging.
>
> Erik Zachte
>
>
>


Re: [Wikitech-l] Dump throughput

2009-05-05 Thread Russell Blau
"Erik Zachte"  wrote in message 
news:002d01c9cd8d$3355beb0$9a013c...@com...
> Tomasz, the amount of dump power that you managed to activate is 
> impressive.
> 136 dumps yesterday, today already 110 :-) Out of 760 total.
> Of course there are small and large dumps, but this is very encouraging.
>

Yes, thank you Tomasz for your attention to this.  The commonswiki process 
looks like it *might* be dead, by the way.

Russ






[Wikitech-l] Dump throughput

2009-05-05 Thread Erik Zachte
Tomasz, the amount of dump power that you managed to activate is impressive.
136 dumps yesterday, today already 110 :-) Out of 760 total.
Of course there are small and large dumps, but this is very encouraging.

Erik Zachte


