Re: [Wikitech-l] Dump throughput
On 5/14/09 2:17 PM, Alex wrote:
> Brion Vibber wrote:
>> I believe that refers to yesterday's replication lag on the machine
>> running watchlist queries; the abstract dump process that was hitting
>> that particular server was aborted yesterday.
>
> Is Yahoo still using those? Looking at the last successful one for
> enwiki, it looks like it took a little more than a day to generate.
> Combined with all the other projects, that seems like an awful lot of
> processing time spent for something of questionable utility to anyone
> but Yahoo.

Actually, yes. :) I've occasionally heard from other folks using them
for stuff, but indeed Yahoo is still grabbing them.

The script needs some cleaning up, and it wouldn't hurt to rearchitect
how it's generated in general. (First-sentence summary extraction is
also being done in OpenSearchXml for IE 8's search support, and I've
improved the implementation there. It should get merged back, and
probably merged into core so we can make the extracts more generally
available for other uses.)

-- brion

_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
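The first-sentence summary extraction Brion mentions could, in its
simplest form, look like the toy sketch below. This is purely
illustrative: the real OpenSearchXml code has to cope with wikitext
markup, abbreviations, and per-language sentence rules, none of which
this naive regex attempts.

```python
import re

def first_sentence(text: str) -> str:
    """Naive first-sentence extraction from plain text.

    Splits at the first sentence-ending punctuation mark followed by
    whitespace; falls back to the whole string if none is found.
    Abbreviations like "e.g." would fool this toy version.
    """
    m = re.search(r"(.+?[.!?])\s", text + " ")
    return m.group(1) if m else text

print(first_sentence("MediaWiki is a wiki engine. It powers Wikipedia."))
# → MediaWiki is a wiki engine.
```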
Re: [Wikitech-l] Dump throughput
Brion Vibber wrote:
> I believe that refers to yesterday's replication lag on the machine
> running watchlist queries; the abstract dump process that was hitting
> that particular server was aborted yesterday.

Is Yahoo still using those? Looking at the last successful one for
enwiki, it looks like it took a little more than a day to generate.
Combined with all the other projects, that seems like an awful lot of
processing time spent for something of questionable utility to anyone
but Yahoo.

-- Alex (wikipedia:en:User:Mr.Z-man)
Re: [Wikitech-l] Dump throughput
On 5/13/09 9:06 PM, Robert Rohde wrote:
> There is a thread on enwiki WP:VPT [1] speculating that the sluggish
> server performance that some people are seeing is being caused by the
> dumper working on enwiki.
>
> This strikes me as implausible, but I thought I'd mention it here in
> case it could be true. I suppose it is at least possible to expand
> the dumper enough to have a noticeable effect on other aspects of site
> performance, but I wouldn't expect it to be likely.

I believe that refers to yesterday's replication lag on the machine
running watchlist queries; the abstract dump process that was hitting
that particular server was aborted yesterday.

-- brion

> -Robert Rohde
>
> [1]
> http://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#Slow_Server
Re: [Wikitech-l] Dump throughput
Brion Vibber wrote:
>> On a related note: I noticed that the meta-info dumps like
>> stub-meta-history.xml.gz etc appear to be generated from the full
>> history dump - and thus fail if the full history dump fails, and get
>> delayed if the full history dump gets delayed.
>
> Quite the opposite; the full history dump is generated from the stub
> skeleton.

Good to know, thanks for clarifying.

-- daniel
Re: [Wikitech-l] Dump throughput
There is a thread on enwiki WP:VPT [1] speculating that the sluggish
server performance that some people are seeing is being caused by the
dumper working on enwiki.

This strikes me as implausible, but I thought I'd mention it here in
case it could be true. I suppose it is at least possible to expand the
dumper enough to have a noticeable effect on other aspects of site
performance, but I wouldn't expect it to be likely.

-Robert Rohde

[1] http://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#Slow_Server
Re: [Wikitech-l] Dump throughput
On May 13, 2009, at 7:13, Daniel Kinzler wrote:
>> Now, to be even more useful, database dumps should be produced on
>> *regular* intervals. That way, we can compare various measures
>> such as article growth, link counts or usage of certain words,
>> without having to introduce the exact dump time in the count.
>
> On a related note: I noticed that the meta-info dumps like
> stub-meta-history.xml.gz etc appear to be generated from the full
> history dump - and thus fail if the full history dump fails, and get
> delayed if the full history dump gets delayed.

Quite the opposite; the full history dump is generated from the stub
skeleton.

-- brion

> There are a lot of things that can be done with the meta-info alone,
> and it seems that dump should be easy and fast to generate. So I
> propose to generate it from the database directly, instead of making
> it depend on the full history dump, which is slow and the most likely
> to break.
>
> -- daniel
Re: [Wikitech-l] Dump throughput
On Wed, May 13, 2009 at 10:13 AM, Daniel Kinzler wrote:
>> Now, to be even more useful, database dumps should be produced on
>> *regular* intervals. That way, we can compare various measures
>> such as article growth, link counts or usage of certain words,
>> without having to introduce the exact dump time in the count.
>
> On a related note: I noticed that the meta-info dumps like
> stub-meta-history.xml.gz etc appear to be generated from the full
> history dump - and thus fail if the full history dump fails, and get
> delayed if the full history dump gets delayed.

Is that something that changed? It used to be the other way around:
pages-meta-history.xml.bz2 was generated from stub-meta-history.xml.gz.
Re: [Wikitech-l] Dump throughput
> Now, to be even more useful, database dumps should be produced on
> *regular* intervals. That way, we can compare various measures
> such as article growth, link counts or usage of certain words,
> without having to introduce the exact dump time in the count.

On a related note: I noticed that the meta-info dumps like
stub-meta-history.xml.gz etc appear to be generated from the full
history dump - and thus fail if the full history dump fails, and get
delayed if the full history dump gets delayed.

There are a lot of things that can be done with the meta-info alone,
and that dump seems like it should be easy and fast to generate. So I
propose to generate it from the database directly, instead of making it
depend on the full history dump, which is slow and the most likely to
break.

-- daniel
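Daniel's idea of building the stub skeleton straight from page/revision
metadata could be sketched roughly as below. Everything here is
illustrative, not actual dumper code: the function name and row shapes
are invented, a real implementation would stream rows from MediaWiki's
page and revision tables, and it would also escape XML properly.

```python
def stub_page_xml(title, page_id, revisions):
    """Build a stub-meta-history-style <page> skeleton (no revision text).

    revisions: iterable of (rev_id, timestamp, contributor) tuples,
    standing in for rows from the revision table. XML escaping is
    omitted for brevity.
    """
    out = ["  <page>",
           f"    <title>{title}</title>",
           f"    <id>{page_id}</id>"]
    for rev_id, ts, who in revisions:
        out += ["    <revision>",
                f"      <id>{rev_id}</id>",
                f"      <timestamp>{ts}</timestamp>",
                f"      <contributor><username>{who}</username></contributor>",
                "    </revision>"]
    out.append("  </page>")
    return "\n".join(out)

print(stub_page_xml("Example", 42,
                    [(1, "2009-05-01T00:00:00Z", "Alice")]))
```

Because only metadata is touched, such a pass avoids the slow text
retrieval that makes the full history dump fragile.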
Re: [Wikitech-l] Dump throughput
Aryeh Gregor wrote:
> On Mon, May 11, 2009 at 3:27 PM, Brian wrote:
>> In my opinion fragmentation of conversations onto evermore mailing
>> lists discourages contribution.
>
> I have to agree that I don't think the dump discussion traffic seemed
> large enough to warrant a whole new mailing list.

If we find that doesn't work, then we'll steer the conversation back to
wikitech. But here is my reasoning:

The admin list was meant to receive any and all automated mails from
the backup system, and I didn't want to busy the readers of wikitech-l
with that noise. Previously there was a single recipient of any
failures, which was not very scalable or transparent.

The discussion list was meant to capture consumers who have approached
me that are active users of the dumps but have no direct involvement
with MediaWiki and are not regular participants of this list. This
covers researchers, search engines, etc., who are not concerned with
all the other conversations that go on on wikitech and simply want
updates on any changes within the dumps system.

--tomasz
Re: [Wikitech-l] Dump throughput
On Mon, May 11, 2009 at 3:27 PM, Brian wrote:
> In my opinion fragmentation of conversations onto evermore mailing
> lists discourages contribution.

I have to agree that I don't think the dump discussion traffic seemed
large enough to warrant a whole new mailing list.
Re: [Wikitech-l] Dump throughput
In my opinion fragmentation of conversations onto evermore mailing
lists discourages contribution.

On Mon, May 11, 2009 at 1:04 PM, Tomasz Finc wrote:
> Andreas Meier wrote:
>> Tomasz Finc wrote:
>>> Tomasz Finc wrote:
>>>> Russell Blau wrote:
>>>>> "Erik Zachte" wrote in message
>>>>> news:002d01c9cd8d$3355beb0$9a013c...@com...
>>>>>> Tomasz, the amount of dump power that you managed to activate is
>>>>>> impressive.
>>>>>> 136 dumps yesterday, today already 110 :-) Out of 760 total.
>>>>>> Of course there are small and large dumps, but this is very
>>>>>> encouraging.
>>>>>
>>>>> Yes, thank you Tomasz for your attention to this. The commonswiki
>>>>> process looks like it *might* be dead, by the way.
>>>>
>>>> Don't think so, as I actively see it being updated. It's currently
>>>> set to finish its second-to-last step on 2009-05-06 02:53:21.
>>>>
>>>> No one touch anything while it's still going ;)
>>>
>>> Commons finished just fine along with every single one of the other
>>> small & mid size wikis waiting to be picked up. Now we're just left
>>> with the big sized wikis to finish.
>>
>> First, many thanks to Tomasz for the now running system.
>>
>> Now there are two running dumps of frwiki at the same time:
>> http://download.wikipedia.org/frwiki/20090509/ and
>> http://download.wikipedia.org/frwiki/20090506/
>> I don't know if this was intended. Usually this should not happen.
>
> This has been dealt with. Let's take any further operations
> conversations over to xmldatadumps-admi...@lists.wikimedia.org
>
> --tomasz
Re: [Wikitech-l] Dump throughput
Andreas Meier wrote:
> Tomasz Finc wrote:
>> Tomasz Finc wrote:
>>> Russell Blau wrote:
>>>> "Erik Zachte" wrote in message
>>>> news:002d01c9cd8d$3355beb0$9a013c...@com...
>>>>> Tomasz, the amount of dump power that you managed to activate is
>>>>> impressive.
>>>>> 136 dumps yesterday, today already 110 :-) Out of 760 total.
>>>>> Of course there are small and large dumps, but this is very
>>>>> encouraging.
>>>>
>>>> Yes, thank you Tomasz for your attention to this. The commonswiki
>>>> process looks like it *might* be dead, by the way.
>>>
>>> Don't think so, as I actively see it being updated. It's currently
>>> set to finish its second-to-last step on 2009-05-06 02:53:21.
>>>
>>> No one touch anything while it's still going ;)
>>
>> Commons finished just fine along with every single one of the other
>> small & mid size wikis waiting to be picked up. Now we're just left
>> with the big sized wikis to finish.
>
> First, many thanks to Tomasz for the now running system.
>
> Now there are two running dumps of frwiki at the same time:
> http://download.wikipedia.org/frwiki/20090509/ and
> http://download.wikipedia.org/frwiki/20090506/
> I don't know if this was intended. Usually this should not happen.

This has been dealt with. Let's take any further operations
conversations over to xmldatadumps-admi...@lists.wikimedia.org

--tomasz
Re: [Wikitech-l] Dump throughput
Tomasz Finc wrote:
> Tomasz Finc wrote:
>> Russell Blau wrote:
>>> "Erik Zachte" wrote in message
>>> news:002d01c9cd8d$3355beb0$9a013c...@com...
>>>> Tomasz, the amount of dump power that you managed to activate is
>>>> impressive.
>>>> 136 dumps yesterday, today already 110 :-) Out of 760 total.
>>>> Of course there are small and large dumps, but this is very
>>>> encouraging.
>>>
>>> Yes, thank you Tomasz for your attention to this. The commonswiki
>>> process looks like it *might* be dead, by the way.
>>
>> Don't think so, as I actively see it being updated. It's currently
>> set to finish its second-to-last step on 2009-05-06 02:53:21.
>>
>> No one touch anything while it's still going ;)
>
> Commons finished just fine along with every single one of the other
> small & mid size wikis waiting to be picked up. Now we're just left
> with the big sized wikis to finish.

First, many thanks to Tomasz for the now running system.

Now there are two running dumps of frwiki at the same time:
http://download.wikipedia.org/frwiki/20090509/ and
http://download.wikipedia.org/frwiki/20090506/
I don't know if this was intended. Usually this should not happen.

Best regards
Andim
[Wikitech-l] Dump throughput
Lars wrote:
> Now, to be even more useful, database dumps should be produced on
> *regular* intervals. That way, we can compare various measures
> such as article growth, link counts or usage of certain words,
> without having to introduce the exact dump time in the count.

That would complicate matters further, though. Also, as each new dump
for some wiki takes a little longer, this would mean lots of slack in
the schedule and force servers to be idle part of the time.

I hate to distract Tomasz from optimizing the dump process, so I'll
postpone new feature requests, but an option to order gift wrappings
for the dumps would be neat :-)

Erik Zachte
Re: [Wikitech-l] Dump throughput
Tomasz Finc wrote:
> Commons finished just fine along with every single one of the other
> small & mid size wikis waiting to be picked up. Now we're just left
> with the big sized wikis to finish.

The new dump processes started on May 1 and sped up to twelve processes
on May 4. As of yesterday, May 7, dumps have started on all databases.
While the big ones (enwiki, dewiki, ...) are still running,
tokiponawiktionary is the first to have its second dump in this round.
They were produced on May 1 and 7. Soon, all small and medium sized
databases will have multiple dumps, with roughly 4 day intervals. This
is a real improvement over the previous 12 months, and I really hope we
don't fall down again.

Now, to be even more useful, database dumps should be produced on
*regular* intervals. That way, we can compare various measures such as
article growth, link counts or usage of certain words, without having
to introduce the exact dump time in the count.

An easy way to implement this is to delay the next dump of a database
to exactly one week after the previous dump started. For example, the
last dump of svwiki (Swedish Wikipedia) started at 20:48 (UTC) on
Tuesday, May 5. So let this time of week (20:48 on Tuesdays) be the
timeslot for svwiki. If its turn comes up any earlier, the next dump
should be delayed until 20:48 on May 12.

That way, the number of mentions of "EU parliament" (elections are due
on June 7) can be compared on a weekly (7 day) basis, rather than on a
5-and-a-half day basis. The 7 day interval removes any measurement bias
from weekday/weekend variations. Another advantage is that we can
expect new dumps of svwiki by Wednesday lunch, and can plan our weekly
projects accordingly.

This plan does not help the larger projects, which take many days to
dump. They would still benefit from optimizations of the dump process
itself. Right now enwiki is extracting "page abstracts for Yahoo" and
will continue to do so until May 21. I really hope Yahoo appreciates
this, or else the current dump should be advanced to its next stage to
save days and weeks.

Maybe the pages-articles.xml part of the dump can be produced on a
regular weekly (or fortnightly) basis even for the larger projects,
while the other parts are produced more seldom.

-- Lars Aronsson (l...@aronsson.se)
Aronsson Datateknik - http://aronsson.se
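Lars's fixed-timeslot scheme (pin each wiki to the weekday and time of
day of its previous dump start, delaying in whole weeks if the queue
reaches it early) can be sketched as follows. The function name is
hypothetical and everything is assumed to be in UTC; this is an
illustration of the proposal, not actual scheduler code.

```python
from datetime import datetime, timedelta

def next_timeslot(previous_start: datetime, earliest_ready: datetime) -> datetime:
    """Return the next dump start time for a wiki.

    The slot keeps the same weekday and time of day as previous_start,
    and is pushed back in whole-week steps until it is no earlier than
    the time the dump queue could actually reach this wiki.
    """
    slot = previous_start + timedelta(weeks=1)
    while slot < earliest_ready:
        slot += timedelta(weeks=1)
    return slot

# svwiki's last dump started Tuesday 2009-05-05 at 20:48 UTC.
prev = datetime(2009, 5, 5, 20, 48)
# Suppose the queue would otherwise reach svwiki again on May 11:
ready = datetime(2009, 5, 11, 6, 0)
print(next_timeslot(prev, ready))  # → 2009-05-12 20:48:00, the Tuesday slot
```

Keeping the interval an exact multiple of seven days is what removes
the weekday/weekend bias Lars describes.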
Re: [Wikitech-l] Dump throughput
Russell Blau wrote:
> "Tomasz Finc" wrote in message
> news:4a032be3.60...@wikimedia.org...
>> Commons finished just fine along with every single one of the other
>> small & mid size wikis waiting to be picked up. Now we're just left
>> with the big sized wikis to finish.
>
> This is probably a stupid question (because it depends on umpteen
> different variables), but would the remaining "big sized wikis" finish
> any faster if you stopped the dump processes for the smaller wikis
> that have already had a dump complete within the past week and are now
> starting on their second rounds?

Not a bad question at all. I've actually been turning down the amount
of work to see if it improves any of the larger ones. No increase in
processing speed just yet.

--tomasz
Re: [Wikitech-l] Dump throughput
"Tomasz Finc" wrote in message
news:4a032be3.60...@wikimedia.org...
> Commons finished just fine along with every single one of the other
> small & mid size wikis waiting to be picked up. Now we're just left
> with the big sized wikis to finish.

This is probably a stupid question (because it depends on umpteen
different variables), but would the remaining "big sized wikis" finish
any faster if you stopped the dump processes for the smaller wikis that
have already had a dump complete within the past week and are now
starting on their second rounds?

Russ
Re: [Wikitech-l] Dump throughput
Tomasz Finc wrote:
> Russell Blau wrote:
>> "Erik Zachte" wrote in message
>> news:002d01c9cd8d$3355beb0$9a013c...@com...
>>> Tomasz, the amount of dump power that you managed to activate is
>>> impressive.
>>> 136 dumps yesterday, today already 110 :-) Out of 760 total.
>>> Of course there are small and large dumps, but this is very
>>> encouraging.
>>
>> Yes, thank you Tomasz for your attention to this. The commonswiki
>> process looks like it *might* be dead, by the way.
>
> Don't think so, as I actively see it being updated. It's currently set
> to finish its second-to-last step on 2009-05-06 02:53:21.
>
> No one touch anything while it's still going ;)

Commons finished just fine along with every single one of the other
small & mid size wikis waiting to be picked up. Now we're just left
with the big sized wikis to finish.

--tomasz
Re: [Wikitech-l] Dump throughput
Russell Blau wrote:
> "Erik Zachte" wrote in message
> news:002d01c9cd8d$3355beb0$9a013c...@com...
>> Tomasz, the amount of dump power that you managed to activate is
>> impressive.
>> 136 dumps yesterday, today already 110 :-) Out of 760 total.
>> Of course there are small and large dumps, but this is very
>> encouraging.
>
> Yes, thank you Tomasz for your attention to this. The commonswiki
> process looks like it *might* be dead, by the way.

Don't think so, as I actively see it being updated. It's currently set
to finish its second-to-last step on 2009-05-06 02:53:21.

No one touch anything while it's still going ;)

--tomasz
Re: [Wikitech-l] Dump throughput
Hi Tomasz,

Any ideas about a fresher dump of enwiki-meta-pages-history?

bilal

2009/5/5 Erik Zachte
> Tomasz, the amount of dump power that you managed to activate is
> impressive.
> 136 dumps yesterday, today already 110 :-) Out of 760 total.
> Of course there are small and large dumps, but this is very
> encouraging.
>
> Erik Zachte
Re: [Wikitech-l] Dump throughput
Hoi,

This is the kind of news that will make many people happy. Obviously
what everyone is waiting for is the en.wp dump to finish .. :) But it
is great to have many moments to be happy.

thanks,
GerardM

2009/5/5 Erik Zachte
> Tomasz, the amount of dump power that you managed to activate is
> impressive.
> 136 dumps yesterday, today already 110 :-) Out of 760 total.
> Of course there are small and large dumps, but this is very
> encouraging.
>
> Erik Zachte
Re: [Wikitech-l] Dump throughput
"Erik Zachte" wrote in message
news:002d01c9cd8d$3355beb0$9a013c...@com...
> Tomasz, the amount of dump power that you managed to activate is
> impressive.
> 136 dumps yesterday, today already 110 :-) Out of 760 total.
> Of course there are small and large dumps, but this is very
> encouraging.

Yes, thank you Tomasz for your attention to this. The commonswiki
process looks like it *might* be dead, by the way.

Russ
[Wikitech-l] Dump throughput
Tomasz, the amount of dump power that you managed to activate is
impressive.
136 dumps yesterday, today already 110 :-) Out of 760 total.
Of course there are small and large dumps, but this is very
encouraging.

Erik Zachte