Re: [Wikitech-l] Dump throughput

2009-05-14 Thread Alex
Brion Vibber wrote:
 I believe that refers to yesterday's replication lag on the machine 
 running watchlist queries; the abstract dump process that was hitting 
 that particular server was aborted yesterday.
 

Is Yahoo still using those? Looking at the last successful one for
enwiki, it looks like it took a little more than a day to generate.
Combined with all the other projects, that seems like an awful lot of
processing time spent on something of questionable utility to anyone
but Yahoo.

-- 
Alex (wikipedia:en:User:Mr.Z-man)



Re: [Wikitech-l] Dump throughput

2009-05-13 Thread Anthony
On Wed, May 13, 2009 at 10:13 AM, Daniel Kinzler dan...@brightbyte.de wrote:

  Now, to be even more useful, database dumps should be produced at
  *regular* intervals.  That way, we can compare various measures
  such as article growth, link counts or usage of certain words,
  without having to factor the exact dump time into the calculation.

 On a related note: I noticed that the meta-info dumps like
 stub-meta-history.xml.gz etc appear to be generated from the full history
 dump - and thus fail if the full history dump fails, and get delayed if the
 full history dump gets delayed.


Is that something that changed?  It used to be the other way around:
pages-meta-history.xml.bz2 was generated from stub-meta-history.xml.gz.
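
For context on the dependency being discussed, here is a minimal sketch of the
stub-first approach: the stub dump carries page and revision metadata without
text, and a second pass streams it and attaches the revision text. The schema
namespace and fetch_revision_text() below are illustrative assumptions, not
the actual dump worker's code.

# Minimal sketch of a stub-driven full-history pass, as discussed in this
# thread: the stub dump holds page/revision metadata only, and the text is
# attached in a second pass. Schema namespace and fetch_revision_text() are
# illustrative assumptions, not the real dump worker's internals.
import gzip
import xml.etree.ElementTree as ET

NS = "{http://www.mediawiki.org/xml/export-0.3/}"  # version varies by dump

def fetch_revision_text(rev_id):
    """Hypothetical lookup of revision text by ID (e.g. from text storage)."""
    return "[text of revision %d]" % rev_id

def expand_stub(stub_path, out_path):
    """Stream the stub skeleton and emit a (simplified) full-history file."""
    with gzip.open(stub_path, "rb") as stub, open(out_path, "w") as out:
        for _event, elem in ET.iterparse(stub):  # default: 'end' events only
            if elem.tag == NS + "revision":
                rev_id = int(elem.find(NS + "id").text)
                out.write('<revision id="%d">%s</revision>\n'
                          % (rev_id, fetch_revision_text(rev_id)))
                elem.clear()  # keep memory flat on multi-gigabyte stubs

# expand_stub("stub-meta-history.xml.gz", "pages-meta-history.xml")

The direction of the dependency is the point: the stub file can be produced
without the text pass, but a full-history pass needs the stub (or equivalent
metadata) to drive it.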


Re: [Wikitech-l] Dump throughput

2009-05-13 Thread Daniel Kinzler
Brion Vibber wrote:
 On a related note: I noticed that the meta-info dumps like
 stub-meta-history.xml.gz etc appear to be generated from the full
 history dump - and thus fail if the full history dump fails, and get
 delayed if the full history dump gets delayed.
 
 Quite the opposite; the full history dump is generated from the stub
 skeleton.

Good to know, thanks for clarifying.

-- daniel



Re: [Wikitech-l] Dump throughput

2009-05-12 Thread Aryeh Gregor
On Mon, May 11, 2009 at 3:27 PM, Brian brian.min...@colorado.edu wrote:
 In my opinion, fragmentation of conversations onto ever more mailing lists
 discourages contribution.

I have to agree; the dump discussion traffic didn't seem large enough
to warrant a whole new mailing list.



Re: [Wikitech-l] Dump throughput

2009-05-12 Thread Tomasz Finc
Aryeh Gregor wrote:
 On Mon, May 11, 2009 at 3:27 PM, Brian brian.min...@colorado.edu wrote:
 In my opinion, fragmentation of conversations onto ever more mailing lists
 discourages contribution.
 
 I have to agree; the dump discussion traffic didn't seem large enough
 to warrant a whole new mailing list.

If we find that doesn't work then we'll steer the conversation back to 
wikitech.

But here is my reasoning:

The admin list was meant to receive any and all automated mail from the 
backup system, and I didn't want to burden the readers of wikitech-l with 
that noise. Previously there was a single recipient for any failures, 
which was not very scalable or transparent.

The discussion list was meant to capture consumers who have approached 
me: active users of the dumps who have no direct involvement with 
MediaWiki and are not regular participants on this list. That covers 
researchers, search engines, etc., who are not concerned with all the 
other conversations on wikitech and simply want updates on any changes 
within the dumps system.

--tomasz



Re: [Wikitech-l] Dump throughput

2009-05-11 Thread Brian
In my opinion, fragmentation of conversations onto ever more mailing lists
discourages contribution.

On Mon, May 11, 2009 at 1:04 PM, Tomasz Finc tf...@wikimedia.org wrote:

 Andreas Meier wrote:
  Tomasz Finc wrote:
  Tomasz Finc wrote:
  Russell Blau wrote:
  Erik Zachte erikzac...@infodisiac.com wrote in message
  news:002d01c9cd8d$3355beb0$9a013c...@com...
  Tomasz, the amount of dump power that you managed to activate is
  impressive.
  136 dumps yesterday, today already 110 :-) Out of 760 total.
  Of course there are small and large dumps, but this is very
  encouraging.
 
  Yes, thank you Tomasz for your attention to this.  The commonswiki
 process
  looks like it *might* be dead, by the way.
  Don't think so, as I actively see it being updated. It's currently set
  to finish its second-to-last step on 2009-05-06 02:53:21.
 
  No one touch anything while it's still going ;)
  Commons finished just fine, along with every single one of the other
  small & mid-size wikis waiting to be picked up. Now we're just left with
  the big wikis to finish.
 
 
  First, many thanks to Tomasz for the now-running system.
 
  Now there are two running dumps of frwiki at the same time:
  http://download.wikipedia.org/frwiki/20090509/ and
  http://download.wikipedia.org/frwiki/20090506/
  I don't know if this was intended. Usually this should not happen.
 

 This has been dealt with. Let's take any further operations
 conversations over to xmldatadumps-admi...@lists.wikimedia.org

 --tomasz




Re: [Wikitech-l] Dump throughput

2009-05-11 Thread Tomasz Finc
Andreas Meier wrote:
 Tomasz Finc wrote:
 Tomasz Finc wrote:
 Russell Blau wrote:
 Erik Zachte erikzac...@infodisiac.com wrote in message 
 news:002d01c9cd8d$3355beb0$9a013c...@com...
 Tomasz, the amount of dump power that you managed to activate is 
 impressive.
 136 dumps yesterday, today already 110 :-) Out of 760 total.
 Of course there are small and large dumps, but this is very encouraging.

 Yes, thank you Tomasz for your attention to this.  The commonswiki process 
 looks like it *might* be dead, by the way.
 Don't think so, as I actively see it being updated. It's currently set
 to finish its second-to-last step on 2009-05-06 02:53:21.

 No one touch anything while it's still going ;)
 Commons finished just fine, along with every single one of the other
 small & mid-size wikis waiting to be picked up. Now we're just left with
 the big wikis to finish.

 
 First, many thanks to Tomasz for the now-running system.
 
 Now there are two running dumps of frwiki at the same time: 
 http://download.wikipedia.org/frwiki/20090509/ and
 http://download.wikipedia.org/frwiki/20090506/
 I don't know if this was intended. Usually this should not happen.
 

This has been dealt with. Let's take any further operations 
conversations over to xmldatadumps-admi...@lists.wikimedia.org

--tomasz




Re: [Wikitech-l] Dump throughput

2009-05-09 Thread Andreas Meier
Tomasz Finc wrote:
 Tomasz Finc wrote:
 Russell Blau wrote:
 Erik Zachte erikzac...@infodisiac.com wrote in message 
 news:002d01c9cd8d$3355beb0$9a013c...@com...
 Tomasz, the amount of dump power that you managed to activate is 
 impressive.
 136 dumps yesterday, today already 110 :-) Out of 760 total.
 Of course there are small and large dumps, but this is very encouraging.

 Yes, thank you Tomasz for your attention to this.  The commonswiki process 
 looks like it *might* be dead, by the way.
 Don't think so, as I actively see it being updated. It's currently set
 to finish its second-to-last step on 2009-05-06 02:53:21.

 No one touch anything while it's still going ;)
 
 Commons finished just fine, along with every single one of the other
 small & mid-size wikis waiting to be picked up. Now we're just left with
 the big wikis to finish.
 

First, many thanks to Tomasz for the now-running system.

Now there are two running dumps of frwiki at the same time: 
http://download.wikipedia.org/frwiki/20090509/ and
http://download.wikipedia.org/frwiki/20090506/
I don't know if this was intended. Usually this should not happen.

Best regards

Andim




Re: [Wikitech-l] Dump throughput

2009-05-08 Thread Lars Aronsson
Tomasz Finc wrote:

 Commons finished just fine, along with every single one of the
 other small & mid-size wikis waiting to be picked up. Now we're
 just left with the big wikis to finish.

The new dump processes started on May 1 and sped up to twelve 
processes on May 4.  As of yesterday May 7, dumps have started on 
all databases.  While the big ones (enwiki, dewiki, ...) are still 
running, tokiponawiktionary is the first to have its second dump 
in this round.  They were produced on May 1 and 7.  Soon, all 
small and medium sized databases will have multiple dumps, with 
roughly 4 day intervals.  This is a real improvement over the 
previous 12 months, and I really hope we don't fall down again.

Now, to be even more useful, database dumps should be produced at 
*regular* intervals.  That way, we can compare various measures 
such as article growth, link counts or usage of certain words, 
without having to factor the exact dump time into the calculation.
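
To make that concrete with a small illustration (the numbers below are
invented purely for the example, and this is not part of any dump tooling):
with irregular dump dates, a growth figure has to be normalized by the exact
elapsed time between dumps, whereas with a fixed weekly cadence the raw
difference between consecutive dumps is already a per-week figure.

# Invented numbers, purely to illustrate the point: with irregular dump
# intervals, growth figures must be normalized by the exact elapsed time;
# with fixed weekly dumps the raw delta is already a per-week figure.
from datetime import datetime

dumps = [  # (dump start time, article count) -- illustrative values only
    (datetime(2009, 4, 29, 14, 3), 312000),
    (datetime(2009, 5, 5, 20, 48), 313100),
    (datetime(2009, 5, 12, 20, 48), 314500),
]

for (t0, n0), (t1, n1) in zip(dumps, dumps[1:]):
    days = (t1 - t0).total_seconds() / 86400.0
    growth_per_week = (n1 - n0) / days * 7
    print("%.1f-day interval: %+d articles, %.0f per week normalized"
          % (days, n1 - n0, growth_per_week))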

An easy way to implement this is to delay the next dump of a 
database to exactly one week after the previous dump started.

For example, the last dump of svwiki (Swedish Wikipedia) started 
at 20:48 (UTC) on Tuesday May 5.  So let this time of week (20:48 
on Tuesdays) be the timeslot for svwiki. If its turn comes up any 
earlier, the next dump should be delayed until 20:48 on May 12.
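
That rule fits in a few lines. The sketch below is only an illustration of 
the proposed policy, not code from the actual dump scheduler, reusing the 
svwiki slot as the example:

# Sketch of the "same time-of-week" rule proposed above: a wiki's next dump
# may not start before exactly one week after its previous dump started.
# This is only an illustration of the policy, not the actual scheduler.
from datetime import datetime, timedelta

def next_allowed_start(last_start, turn_comes_up):
    """Delay the dump to its weekly slot if its turn in the queue comes up
    early; otherwise start as soon as the turn arrives."""
    weekly_slot = last_start + timedelta(weeks=1)
    return max(weekly_slot, turn_comes_up)

# svwiki's previous dump started Tuesday 2009-05-05 at 20:48 UTC:
last = datetime(2009, 5, 5, 20, 48)
print(next_allowed_start(last, turn_comes_up=datetime(2009, 5, 10, 3, 0)))
# -> 2009-05-12 20:48:00, i.e. the following Tuesday in the same timeslot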

That way, the number of mentions of the EU Parliament (elections are 
due on June 7) can be compared on a weekly (7-day) basis, rather 
than on a 5-and-a-half-day basis.  The 7-day interval removes any 
measurement bias from weekday/weekend variations.

Another advantage is that we can expect new dumps of svwiki by 
Wednesday lunch, and can plan our weekly projects accordingly.

This plan does not help the larger projects, which take many days 
to dump.  They would still benefit from optimizations of the dump 
process itself.  Right now enwiki is extracting page 
abstracts for Yahoo and will continue to do so until May 21.  I 
really hope Yahoo appreciates this, or else the current dump 
should be advanced to its next stage to save days and weeks. Maybe 
the pages-articles.xml part of the dump can be produced on a 
regular weekly (or fortnightly) basis even for the larger 
projects, while the other parts are produced less often.
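
Expressed as configuration, such a split cadence might look like the sketch 
below. The file names are the familiar dump parts, but the intervals and the 
schedule structure are purely hypothetical and not taken from the real dump 
system.

# Hypothetical per-part cadence for a large wiki: regenerate the lighter
# parts weekly while the expensive full-history part runs less often.
# File names are the familiar dump parts; intervals are invented for
# illustration and do not come from the real dump configuration.
from datetime import timedelta

LARGE_WIKI_SCHEDULE = {
    "stub-meta-history.xml.gz":   timedelta(weeks=1),
    "pages-articles.xml.bz2":     timedelta(weeks=1),   # or fortnightly
    "abstract.xml":               timedelta(weeks=2),
    "pages-meta-history.xml.bz2": timedelta(weeks=8),   # the multi-week job
}

def is_due(part, last_finished, now):
    """A dump part is due again once its configured interval has elapsed."""
    return now - last_finished >= LARGE_WIKI_SCHEDULE[part]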



-- 
  Lars Aronsson (l...@aronsson.se)
  Aronsson Datateknik - http://aronsson.se



[Wikitech-l] Dump throughput

2009-05-08 Thread Erik Zachte
Lars wrote:

 Now, to be even more useful, database dumps should be produced at
 *regular* intervals.  That way, we can compare various measures
 such as article growth, link counts or usage of certain words,
 without having to factor the exact dump time into the calculation.

That would complicate matters further, though. 
Also, as each new dump of a given wiki takes a little longer than 
the previous one, this would mean lots of slack in the schedule 
and would force servers to sit idle part of the time.

I hate to distract Tomasz from optimizing the dump process,
so I'll postpone new feature requests, but an option to order 
gift wrappings for the dumps would be neat :-)

Erik Zachte







Re: [Wikitech-l] Dump throughput

2009-05-07 Thread Tomasz Finc
Tomasz Finc wrote:
 Russell Blau wrote:
 Erik Zachte erikzac...@infodisiac.com wrote in message 
 news:002d01c9cd8d$3355beb0$9a013c...@com...
 Tomasz, the amount of dump power that you managed to activate is 
 impressive.
 136 dumps yesterday, today already 110 :-) Out of 760 total.
  Of course there are small and large dumps, but this is very encouraging.

 Yes, thank you Tomasz for your attention to this.  The commonswiki process 
 looks like it *might* be dead, by the way.
 
  Don't think so, as I actively see it being updated. It's currently set
  to finish its second-to-last step on 2009-05-06 02:53:21.
  
  No one touch anything while it's still going ;)

Commons finished just fine, along with every single one of the other 
small & mid-size wikis waiting to be picked up. Now we're just left with 
the big wikis to finish.

--tomasz



Re: [Wikitech-l] Dump throughput

2009-05-07 Thread Russell Blau

Tomasz Finc tf...@wikimedia.org wrote in message 
news:4a032be3.60...@wikimedia.org...

 Commons finished just fine, along with every single one of the other
 small & mid-size wikis waiting to be picked up. Now we're just left with
 the big wikis to finish.

This is probably a stupid question (because it depends on umpteen different 
variables), but would the remaining big wikis finish any faster if 
you stopped the dump processes for the smaller wikis that have already 
completed a dump within the past week and are now starting on their second 
rounds?

Russ






[Wikitech-l] Dump throughput

2009-05-05 Thread Erik Zachte
Tomasz, the amount of dump power that you managed to activate is impressive.
136 dumps yesterday, today already 110 :-) Out of 760 total.
Of course there are small and large dumps, but this is very encouraging.

Erik Zachte


