Re: [Wikitech-l] Dump processes seem to be dead
On Thu, Feb 26, 2009 at 4:48 PM, Platonides wrote:
> Not only do you need to keep them in the same block. You also need to
> keep them inside the compression window. Unless you are going to reorder
> those 1M revisions to keep revisions to the same article together.

He already said that should be done (each block clustered by page id).

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] Dump processes seem to be dead
Robert Ullmann wrote:
> look at the first three digits of the revid; when they are the same,
> they would be in the same "block" (this is assuming 1M revs/block as I
> suggested). You can check any title you like (remember _ for space,
> and % escapes for a lot of characters, but a good browser will do that
> for you in a lot of cases). Since the majority of edits are for a
> minority of titles (some version of the 80/20 rule applies), most
> edits/revisions will be in the same block as a number of others for
> that page.

Not only do you need to keep them in the same block. You also need to keep them inside the compression window. Unless you are going to reorder those 1M revisions to keep revisions to the same article together.
Re: [Wikitech-l] Dump processes seem to be dead
Hi,

On Thu, Feb 26, 2009 at 2:29 AM, Andrew Garrett wrote:
> On Thu, Feb 26, 2009 at 5:08 AM, John Doe wrote:
>> But the server space saved by compression would be compensated by the
>> stability and flexibility provided by this method. This would allow
>> whatever server is controlling the dump process to designate and
>> delegate parallel processes for the same dump.
>
> Not nearly -- we're talking about a 100-fold decrease in compression
> ratio if we don't compress revisions of the same page adjacent to one
> another.
>
> --
> Andrew Garrett

No, not nearly that bad. Keep in mind that ~10x of the compression is just from having English text and repeated XML tags, etc. (Note the compression ratio of the all-articles dump, which has only one revision of each article.)

If the revisions in each "block" are sorted by pageid, so that the revs of the same article are together, you'll get a very large part of the other 10x factor. Revisions to pages tend to cluster in time (think edits and reverts :-) as one or more people work on an article, or it is of news interest (see "Slumdog Millionaire" ;-) or whatever. You can see this for any given article, like this:

http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvlimit=max&titles=Snail

Look at the first three digits of the revid; when they are the same, they would be in the same "block" (this is assuming 1M revs/block as I suggested). You can check any title you like (remember _ for space, and % escapes for a lot of characters, but a good browser will do that for you in a lot of cases). Since the majority of edits are for a minority of titles (some version of the 80/20 rule applies), most edits/revisions will be in the same block as a number of others for that page.

So we will get most, but not all, of the other 10x compression ratio. But even if the compressed blocks are (say) 20% bigger, the win is that once they are some weeks old, they NEVER need to be re-built. Each dump (which should then be about weekly, with the same compute resource, as the queue runs faster ;-) need only build or re-build a few blocks. (And there is no need at all to parallelize any given dump; just run 3-5 different ones in parallel as now.)

best,
Robert
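The clustering described above can be sketched in a few lines of Python. This is an editor's illustration of the idea only, not code from any actual dump tool; the function names and the (rev_id, page_id) tuple layout are assumptions made for the example.

```python
BLOCK_SIZE = 1_000_000  # 1M revisions per block, as suggested in the thread

def block_id(rev_id):
    """Fixed mapping: revision IDs 0-999,999 -> block 0, 1M-2M-1 -> block 1, ..."""
    return rev_id // BLOCK_SIZE

def partition_revisions(revisions):
    """Group (rev_id, page_id) pairs into blocks, then sort each block by
    page_id so revisions of the same page sit next to each other -- which
    is what lets the compressor exploit their similarity."""
    blocks = {}
    for rev_id, page_id in revisions:
        blocks.setdefault(block_id(rev_id), []).append((rev_id, page_id))
    for block in blocks.values():
        block.sort(key=lambda r: (r[1], r[0]))  # by page_id, then rev_id
    return blocks
```

Because the mapping from rev_id to block is fixed, a block's membership never changes once all its revision IDs have been allocated; only the in-block ordering is a dump-time choice.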
Re: [Wikitech-l] Dump processes seem to be dead
On Thu, Feb 26, 2009 at 5:08 AM, John Doe wrote:
> But the server space saved by compression would be compensated by the
> stability and flexibility provided by this method. This would allow
> whatever server is controlling the dump process to designate and
> delegate parallel processes for the same dump.

Not nearly -- we're talking about a 100-fold decrease in compression ratio if we don't compress revisions of the same page adjacent to one another.

--
Andrew Garrett
Re: [Wikitech-l] Dump processes seem to be dead
--- On Wed, 25/2/09, Robert Ullmann wrote:
> From: Robert Ullmann
> Subject: Re: [Wikitech-l] Dump processes seem to be dead
> To: "Wikimedia developers"
> Date: Wednesday, 25 February 2009, 2:09
>
> you yourself suggested page id.
>
> I suggest the history be partitioned into
> "blocks" by *revision ID*

I've checked some alternatives for slicing the huge dump files into chunks of a more manageable size. I first thought about dividing the blocks by rev_id, as you suggest. Then I realized that it can pose some problems for parsers recovering information, since revisions corresponding to the same page may fall in different dump files. Once you have passed the page_id tag, you cannot recover it if the process stops due to some error, unless you save breakpoint information to resume from when you restart the process.

Partitioning by page_id, you can keep all revs of the same page in the same block, while not disturbing algorithms looking for individual revisions. Yes, the chunks would be slightly bigger, but the difference is not that much with either 7zip or bzip2, and you favor simplicity of the recovery tools.

Best,
F.
Re: [Wikitech-l] Dump processes seem to be dead
Marco Schuster wrote:
> Another idea: If $revision is
> deleted/oversighted/whateverhowmadeinvisible, then find out the block
> ID for the dump so that only this specific block needs to be
> re-created in the next dump run. Or, better: do not recreate the dump
> block, but only remove the offending revision(s) from it. That should
> save a lot of dump preparation time, IMO.
>
> Marco

That's already done. New dumps insert the content from the previous ones (when available; enwiki has a hard time on it).
Re: [Wikitech-l] Dump processes seem to be dead
2009/2/25 John Doe :
> I'd recommend either 10M or 10% of the database, whichever is larger,
> for new dumps to screen out a majority of the deletions. What are your
> thoughts on this process, Brion (and the rest of the tech team)?

Another idea: If $revision is deleted/oversighted/whateverhowmadeinvisible, then find out the block ID for the dump so that only this specific block needs to be re-created in the next dump run. Or, better: do not recreate the dump block, but only remove the offending revision(s) from it. That should save a lot of dump preparation time, IMO.

Marco

--
VMSoft GbR
Nabburger Str. 15
81737 München
Geschäftsführer: Marco Schuster, Volker Hemmert
http://vmsoft-gbr.de
Re: [Wikitech-l] Dump processes seem to be dead
2009/2/25 John Doe :
> But the server space saved by compression would be compensated by the
> stability and flexibility provided by this method.

True, I didn't mean to say it was a bad idea, I was just pointing out one disadvantage you may not have considered.
Re: [Wikitech-l] Dump processes seem to be dead
On Tue, Feb 24, 2009 at 5:09 PM, Robert Ullmann wrote:
> I suggest the history be partitioned into "blocks" by *revision ID*
>
> Like this: revision IDs (0)-999,999 go in "block 0", 1M to 2M-1 in
> "block 1", and so on. The English Wiktionary at the moment would have
> 7 blocks; the English Wikipedia would have 273.

Though there are arguments in favor of this, I think they are outweighed by the fact that one would need to go through every block in order to reconstruct the history of even a single page. In my opinion, partitioning on page id is a much better idea, since it would keep each page's history in a single place.

-Robert Rohde
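The lookup argument made here can be made concrete with a small sketch. Under page-id partitioning, a page's entire history lives in exactly one chunk, so finding it is a single range lookup; under rev-id partitioning every block would have to be scanned. The chunk boundaries below are invented for illustration; no real dump uses these numbers.

```python
import bisect

# Hypothetical chunk boundaries: chunk i holds pages with
# page_id in [starts[i], starts[i+1]).
starts = [0, 50_000, 200_000, 1_000_000]

def chunk_for_page(page_id):
    """Return the index of the single chunk holding this page's full
    history. bisect keeps this O(log n) in the number of chunks."""
    return bisect.bisect_right(starts, page_id) - 1
```

The same one-line lookup is impossible with rev-id blocks, because a long-lived page's revisions are scattered across nearly every block.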
Re: [Wikitech-l] Dump processes seem to be dead
But the server space saved by compression would be compensated by the stability and flexibility provided by this method. This would allow whatever server is controlling the dump process to designate and delegate parallel processes for the same dump. So block 1 could be on server 1 and block 2 could be on server 3. That would give the flexibility to use as many servers as are available for this task more efficiently. If block 200 of en.wp breaks for some reason, you don't have to rebuild the previous 199 blocks; you can just delegate a server to rebuild that single block. That would make the dump process a little more crash-friendly (even though I know we don't want to admit crashes happen :) ). This also enables the dump time in future dumps to be cut drastically.

I'd recommend either 10M or 10% of the database, whichever is larger, for new dumps to screen out a majority of the deletions. What are your thoughts on this process, Brion (and the rest of the tech team)?

Betacommand

On Wed, Feb 25, 2009 at 9:00 AM, Thomas Dalton wrote:
> 2009/2/25 Robert Ullmann :
>> I suggest the history be partitioned into "blocks" by *revision ID*
>>
>> Like this: revision IDs (0)-999,999 go in "block 0", 1M to 2M-1 in
>> "block 1", and so on. The English Wiktionary at the moment would have
>> 7 blocks; the English Wikipedia would have 273.
>
> One problem with that is that you won't get such good compression
> ratios. Most of the revisions of a single article are very similar to
> the revisions before and after it, so they compress down very small.
> If you break up the articles between different blocks you don't get
> that advantage (at least, not to the same extent).
Re: [Wikitech-l] Dump processes seem to be dead
2009/2/25 Robert Ullmann :
> I suggest the history be partitioned into "blocks" by *revision ID*
>
> Like this: revision IDs (0)-999,999 go in "block 0", 1M to 2M-1 in
> "block 1", and so on. The English Wiktionary at the moment would have
> 7 blocks; the English Wikipedia would have 273.

One problem with that is that you won't get such good compression ratios. Most of the revisions of a single article are very similar to the revisions before and after it, so they compress down very small. If you break up the articles between different blocks you don't get that advantage (at least, not to the same extent).
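The adjacency effect described here is easy to demonstrate. The experiment below (an editor's illustration, using synthetic data rather than real dump content) compares compressing two near-identical "revisions" in one stream versus separately: bzip2 can exploit the shared 50 KB only when both copies are in the same stream.

```python
import bz2
import random

# Two "revisions" of the same page: a large shared body plus a small edit.
random.seed(0)
body = bytes(random.randrange(256) for _ in range(50_000))
rev1 = body + b" first version of the closing sentence."
rev2 = body + b" second version of the closing sentence."

# Adjacent in one stream: the compressor sees the repetition.
together = len(bz2.compress(rev1 + rev2))

# Compressed in separate streams: the shared body is paid for twice.
separate = len(bz2.compress(rev1)) + len(bz2.compress(rev2))

assert together < separate
```

With real wiki revisions the gap is far larger than this toy shows, because consecutive revisions typically differ by a few lines out of many kilobytes; the same logic is why bzip2's ~900 KB block window matters, as the follow-up message notes.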
Re: [Wikitech-l] Dump processes seem to be dead
AFAIK there are "hands" in Amsterdam that can be called upon to do stuff as necessary in the centre, like any other hosting customer, but the need is not quite of the same level as Tampa due to size, servers there, etc. Seoul no longer operates, so this is not an issue.

regards,
mark

On Tue, Feb 24, 2009 at 2:55 PM, Gerard Meijssen wrote:
> Hoi,
> Is there also a "Rob" in Amsterdam and Seoul?
> Thanks,
> GerardM
>
> 2009/2/24 Aryeh Gregor
>> On Tue, Feb 24, 2009 at 9:42 AM, Thomas Dalton wrote:
>>> Is there anyone within minutes of the servers at all times? Aren't
>>> they at a remote data centre?
>>
>> Isn't Rob on-site?
Re: [Wikitech-l] Dump processes seem to be dead
> The worrying bit is that it seems srv136 will now work as an apache.
> So, where will dumps be done?

I'm not sure where (or if it has changed), but they are running now (:-)

To Ariel Glenn: on getting them to work better in the future, this is what I would suggest:

First, note that everything except the "all history" dumps presents no problem. It isn't perfect, but it is workable. The biggest "all pages current" dump is enwiki, which takes about a day and a half, and the compressed output file (bz2) still fits neatly on a DVD.

As to the history files, these are the problem; each contains all of the preceding history, and they just grow and grow. They must be partitioned somehow. Suggestions have been made concerning alphabetical partitions (very traditional for encyclopaedias ;-); you yourself suggested page id.

I suggest the history be partitioned into "blocks" by *revision ID*.

Like this: revision IDs 0-999,999 go in "block 0", 1M to 2M-1 in "block 1", and so on. The English Wiktionary at the moment would have 7 blocks; the English Wikipedia would have 273.

The dumps would continue as now up to "all pages current", including the split-stub dump for the history (very important, as it provides the "snapshot" of the DB state). But then when it gets to history, it re-builds the last block done (possibly completing it), and then writes 0-n new ones as needed.

Note that (to pick a random number) "block 71" of the enwiki defined this way *has not changed* in a long time; only the current block(s) need to be (re-)written. The history stays the same. (Of course?!)

If someone somewhere needs a copy of the wiki with all history as of a given date, they can start with the split-stub for that date and read in all the required blocks. But that isn't your problem any more. (;-) They can do that with their disks and servers.

It would probably be best to still sort by page-id order within each block, as they will compress much better that way.

One reason to rebuild the last block (or two) is to filter out deleted and oversighted revisions. Deleted and oversighted revisions older than some specific time (a small number of weeks) would remain. But note that that is true *anyway*, as someone can always look at a 3-month-old dump under any method.

With my best regards,
Robert
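The incremental part of this proposal reduces to very little code. The sketch below is an editor's formalization of the scheme as described, not an actual dump-tool implementation: given the highest revision ID in the previous dump and the current one, only the last (possibly partial) block plus any new blocks need building; everything earlier is reused verbatim.

```python
BLOCK_SIZE = 1_000_000  # revision IDs per block, per the proposal

def blocks_to_build(prev_max_rev, current_max_rev):
    """Blocks wholly below the previous dump's last block are immutable
    and reused as-is. The previous last block is rebuilt (it may have
    been incomplete, and rebuilding filters newly deleted/oversighted
    revisions), and new blocks are added up to the current max rev ID."""
    first = prev_max_rev // BLOCK_SIZE
    last = current_max_rev // BLOCK_SIZE
    return list(range(first, last + 1))
```

So a weekly enwiki run at the volumes quoted in the thread would rewrite only a couple of blocks out of ~273, which is the whole point: the per-dump cost stops growing with total history size.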
Re: [Wikitech-l] Dump processes seem to be dead
Hoi,
Is there also a "Rob" in Amsterdam and Seoul?
Thanks,
GerardM

2009/2/24 Aryeh Gregor
> On Tue, Feb 24, 2009 at 9:42 AM, Thomas Dalton wrote:
>> Is there anyone within minutes of the servers at all times? Aren't
>> they at a remote data centre?
>
> Isn't Rob on-site?
Re: [Wikitech-l] Dump processes seem to be dead
2009/2/24 Aryeh Gregor :
> On Tue, Feb 24, 2009 at 9:42 AM, Thomas Dalton wrote:
>> Is there anyone within minutes of the servers at all times? Aren't
>> they at a remote data centre?
>
> Isn't Rob on-site?

He's based somewhere near the data centre, but I'm not sure he's actually there unless there is something which needs his attention. He's certainly not there 24/7 (regrettably, WMF is still using human sysadmins...).
Re: [Wikitech-l] Dump processes seem to be dead
On Tue, Feb 24, 2009 at 9:42 AM, Thomas Dalton wrote:
> Is there anyone within minutes of the servers at all times? Aren't
> they at a remote data centre?

Isn't Rob on-site?
Re: [Wikitech-l] Dump processes seem to be dead
2009/2/24 Robert Ullmann :
> When a server is reported down (in this case hard; won't reply to
> ping) it should be physically looked at within minutes.

Is there anyone within minutes of the servers at all times? Aren't they at a remote data centre?
Re: [Wikitech-l] Dump processes seem to be dead
Robert Ullmann wrote:
> All servers should be monitored, on several levels (ping, various
> queries, checking processes)

Nagios should have been monitoring them.

> Someone should be "watching" the monitor 24x7. (being right there, or
> by SMS, whatever ;)

I don't know if there can be a silent Nagios failure, where it doesn't get disconnected from IRC.

> When restarted, the things it was doing should be restarted (this has
> not been done yet at this writing).

The worrying bit is that it seems srv136 will now work as an apache. So, where will dumps be done?
Re: [Wikitech-l] Dump processes seem to be dead
Let me ask a separate question (Ariel may be interested in this):

What if we took the regular permanent media backups, and WMF filtered them in house just to remove the classified stuff (;-), and then put them somewhere where others could convert them to the desired format(s)? (Build all-history files, whatever.)

What is the standard backup procedure? (I ask as I haven't seen any description or reference to it ... :-)

Robert
Re: [Wikitech-l] Dump processes seem to be dead
On Tue, Feb 24, 2009 at 6:49 AM, Andrew Garrett wrote:
> On Tue, Feb 24, 2009 at 1:07 PM, Robert Ullmann wrote:
>> Really? I mean, is this for real?
>>
>> The sequence ought to be something like: breaker trips, monitor shows
>> within a minute or two that 4 servers are offline, and not scheduled
>> to be. In the next 5 minutes someone looks at the server(s), notes
>> that there is no AC power, walks directly to the panel and resets the
>> breaker. How is this *not* done? I'm sorry, I just don't get it. I've
>> run data centres, and it just is not possible to have servers down for
>> AC power for more than a few minutes unless there is a fault one can't
>> locate. (Or grid down, and running a subset on the generators ;-)
>>
>> Can someone explain all this? Is the whole thing just completely
>> beyond the resources available to manage it?
>
> Constructive suggestions for improvement are far more welcome than
> complaints and outrage.
>
> If you have no suggestions for improvement, it is perhaps more prudent
> to express concern that dumps are not working and to wait for a
> response. This is admittedly less fun than piecing together
> information and "lining up" those responsible for something not being
> operational.

Andrew: this is NOT FUN AT ALL. Do you think it is "fun" to have to complain bitterly and repeatedly because simply reporting critical-down problems elicits little or no reply and no corrective action for days and weeks? Fun? Fun?

Okay, I'll put it this way: the following should be done:

All servers should be monitored, on several levels (ping, various queries, checking processes).

Someone should be "watching" the monitor 24x7. (being right there, or by SMS, whatever ;)

When a server is reported down (in this case hard; won't reply to ping) it should be physically looked at within minutes. If it has no AC power, the circuit breaker is the first thing to check.

When restarted, the things it was doing should be restarted (this has not been done yet at this writing).

Now I can say these things as "constructive suggestions", but they are not, of course: they are fundamental operational procedure for a data centre. Please explain to me why I should have to "suggest" them? Eh?

I am confused (seriously! I am not being snarky here). What is going on?

best,
Robert
Re: [Wikitech-l] Dump processes seem to be dead
On Tue, Feb 24, 2009 at 1:07 PM, Robert Ullmann wrote:
> Really? I mean, is this for real?
>
> The sequence ought to be something like: breaker trips, monitor shows
> within a minute or two that 4 servers are offline, and not scheduled
> to be. In the next 5 minutes someone looks at the server(s), notes
> that there is no AC power, walks directly to the panel and resets the
> breaker. How is this *not* done? I'm sorry, I just don't get it. I've
> run data centres, and it just is not possible to have servers down for
> AC power for more than a few minutes unless there is a fault one can't
> locate. (Or grid down, and running a subset on the generators ;-)
>
> Can someone explain all this? Is the whole thing just completely
> beyond the resources available to manage it?

Constructive suggestions for improvement are far more welcome than complaints and outrage.

If you have no suggestions for improvement, it is perhaps more prudent to express concern that dumps are not working and to wait for a response. This is admittedly less fun than piecing together information and "lining up" those responsible for something not being operational.

--
Andrew Garrett
Re: [Wikitech-l] Dump processes seem to be dead
Hmm:

On Mon, Feb 23, 2009 at 9:04 PM, Russell Blau wrote:
> 2) Within the last hour, the server log at
> http://wikitech.wikimedia.org/wiki/Server_admin_log indicates that Rob
> found and fixed the cause of srv31 (and srv32-34) being down -- a
> circuit breaker was tripped in the data center.

So we conclude that:

Feb 12th: a breaker trips, taking four servers offline
(8 days go by, with a number of reports)
Feb 20th: it is noted that srv31 is down (noted that AC is off?)
(3 days go by)
Feb 23rd: the tripped breaker is found, srv31 restarted
(and 8+ hours later, the dumps have not resumed)

Really? I mean, is this for real?

The sequence ought to be something like: breaker trips, monitor shows within a minute or two that 4 servers are offline, and not scheduled to be. In the next 5 minutes someone looks at the server(s), notes that there is no AC power, walks directly to the panel and resets the breaker. How is this *not* done? I'm sorry, I just don't get it. I've run data centres, and it just is not possible to have servers down for AC power for more than a few minutes unless there is a fault one can't locate. (Or grid down, and running a subset on the generators ;-)

Can someone explain all this? Is the whole thing just completely beyond the resources available to manage it?

Best regards,
Robert
Re: [Wikitech-l] Dump processes seem to be dead
Robert Rohde wrote:
> The largest gains are almost certainly going to be in parallelization
> though. A single monolithic dumper is impractical for enwiki.
>
> -Robert Rohde

Using dumps compressed per block, like the ones I used for
http://lists.wikimedia.org/pipermail/wikitech-l/2009-January/040812.html
would allow several processes/computers to write the same dump at different offsets, and to read from the last one at different positions as well. As sharing a transaction between different servers would be tricky, they should probably dump from the previously dumped page.sql.gz.

Patches on bugs 16082 and 16176 to add Export features are awaiting review.
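What makes per-block compression composable is that bz2 is a multi-stream format: blocks compressed independently (say, by different workers) can simply be concatenated, and the result decompresses as one file. A small demonstration (editor's illustration; the `<page>` placeholder strings are invented, not real dump content):

```python
import bz2

# Two blocks compressed independently, e.g. by two different workers
# handling different page-id ranges.
block_a = bz2.compress(b"<page>revisions for pages 1-1000</page>")
block_b = bz2.compress(b"<page>revisions for pages 1001-2000</page>")

# Plain byte concatenation yields a valid multi-stream bz2 file;
# bz2.decompress handles multiple streams transparently.
combined = block_a + block_b
assert bz2.decompress(combined) == (
    b"<page>revisions for pages 1-1000</page>"
    b"<page>revisions for pages 1001-2000</page>"
)
```

This is also what allows writing blocks at precomputed offsets: once each worker knows its block's compressed size, the pieces can be stitched together without recompressing anything.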
Re: [Wikitech-l] Dump processes seem to be dead
On 2/23/09 12:13 PM, Ariel T. Glenn wrote:
> I asked for it, and that's why it was assigned to me. I should have
> recognized much sooner that I could not actually get it done and should
> have brought this to Brion's attention instead of continuing to hang on
> to it after he brought it to my attention.

I've been needing to reprioritize resources for this for a while; all of us having many other things to do at the same time and lots of folks being out sick during cold/flu season may not sound like a good excuse for this dragging on longer than I'd like the last few weeks, but I'm afraid it's the best I can offer at the moment.

Anyway, rest assured that this remains very much on my mind -- we haven't forgotten that the current dump process sucks and needs to be fixed up.

-- brion
Re: [Wikitech-l] Dump processes seem to be dead
On 2/23/09 3:08 AM, Marco Schuster wrote:
> Even if you had the dumps, you have another problem: They're
> incredibly big and so a bit difficult to parse. So, a small suggestion
> if the dumps will ever be workin' again: Split the history and current
> db stuff by alphabet, please.

Define alphabet -- how should Chinese and Japanese texts be broken up? We're much more likely to break them up simply by page ID.

> PS: Are there any measurements what traffic is generated by ppl who
> download the dumps?

Not currently.

> Have there been any attempts to distribute them
> via BitTorrent?

By third parties, with AFAIK very little usage.

-- brion
Re: [Wikitech-l] Dump processes seem to be dead
On Mon, 23-02-2009 at 19:02 +0000, Thomas Dalton wrote:
> 2009/2/23 Ariel T. Glenn :
>> The reason these dumps are not rewritten more efficiently is that this
>> job was handed to me (at my request) and I have not been able to get to
>> it, even though it is the first thing on my list for development work.
>> So, if there are going to be rants, they can be directed at me, not at
>> the whole team.
>>
>> The work was started already by a volunteer. As I am the blocking
>> factor, someone else should probably take it on and get it done, though
>> it will make me sad. Brion discussed this with me about a week and a
>> half ago and I still wanted to keep it then but it doesn't make sense.
>> The in-office needs that I am also responsible for take virtually all of
>> my time. Perhaps they shouldn't, but that is how it has worked out.
>
> In that case, it seems the mistake was assigning what should have been
> a top-priority task to someone that couldn't actually make it their
> top priority due to other commitments. If someone is unable to
> guarantee that they'll have time to do something, they shouldn't be
> assigned something so time critical.

I asked for it, and that's why it was assigned to me. I should have recognized much sooner that I could not actually get it done and should have brought this to Brion's attention instead of continuing to hang on to it after he brought it to my attention.

Ariel
Re: [Wikitech-l] Dump processes seem to be dead
On Mon, Feb 23, 2009 at 11:08 AM, Alex wrote:
> Most of that hasn't been touched in years, and it seems to be mainly a
> Python wrapper around the dump scripts in /phase3/maintenance/ which
> also don't seem to have had significant changes recently. Has anything
> been done recently (in a very broad sense of the word)? Or at least, has
> anything been written down about what the plans are?

In a "very broad sense" (and not directly connected to the main problems), I wrote a compressor [1] that converts full-text history dumps into an "edit syntax" that provides ~95% compression on the larger dumps while keeping it in a plain text format that could still be searched and processed without needing a full decompression. That's one of several ways to modify the way the dump process operates in order to make the output easier to work with (if it takes ~2 TB to expand enwiki's full history, then that is not practical for most users even if we solve the problem of generating it). It is not necessarily true that my specific technology is the right answer, but various changes in formatting to aid distribution, generation, and use are one of the areas that ought to be considered when reimplementing the dump process.

The largest gains are almost certainly going to be in parallelization though. A single monolithic dumper is impractical for enwiki.

-Robert Rohde

[1] http://svn.wikimedia.org/viewvc/mediawiki/trunk/tools/editsyntax/
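The general idea behind an "edit syntax" -- store the first revision in full and each later one as a plain-text delta against its predecessor -- can be sketched with the standard library's difflib. To be clear, this is an editor's toy illustration of delta encoding in general, not Rohde's actual format or tool, and unified diffs are only one possible delta representation.

```python
import difflib

def to_deltas(revisions):
    """Store revisions[0] in full and each later revision as a unified
    diff against its predecessor. The result stays plain text, so it
    can be grepped without reconstructing the full texts."""
    deltas = [revisions[0]]
    for prev, cur in zip(revisions, revisions[1:]):
        diff = "".join(difflib.unified_diff(
            prev.splitlines(keepends=True),
            cur.splitlines(keepends=True)))
        deltas.append(diff)
    return deltas
```

For typical wiki histories, where consecutive revisions share almost all their text, each delta is a tiny fraction of the full revision, which is where compression ratios on the order quoted above come from.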
Re: [Wikitech-l] Dump processes seem to be dead
Most of that hasn't been touched in years, and it seems to be mainly a Python wrapper around the dump scripts in /phase3/maintenance/ which also don't seem to have had significant changes recently. Has anything been done recently (in a very broad sense of the word)? Or at least, has anything been written down about what the plans are?

Nicolas Dumazet wrote:
> yep, http://svn.wikimedia.org/viewvc/mediawiki/trunk/backup/ :)
>
> 2009/2/23 Alex :
>> Ariel T. Glenn wrote:
>>> The reason these dumps are not rewritten more efficiently is that this
>>> job was handed to me (at my request) and I have not been able to get to
>>> it, even though it is the first thing on my list for development work.
>>> So, if there are going to be rants, they can be directed at me, not at
>>> the whole team.
>>>
>>> The work was started already by a volunteer. As I am the blocking
>>> factor, someone else should probably take it on and get it done, though
>>> it will make me sad. Brion discussed this with me about a week and a
>>> half ago and I still wanted to keep it then but it doesn't make sense.
>>> The in-office needs that I am also responsible for take virtually all of
>>> my time. Perhaps they shouldn't, but that is how it has worked out.
>>>
>>> So, I am very sorry for having needlessly held things up. (I also have
>>> a crawler that requests pages changed since the latest xml dump, so
>>> that projects I am on can keep a current xml file; we've been running
>>> that way for at least a year.)
>>
>> Is the source for the new dump system on SVN somewhere?
>>
>> --
>> Alex (wikipedia:en:User:Mr.Z-man)

--
Alex (wikipedia:en:User:Mr.Z-man)
Re: [Wikitech-l] Dump processes seem to be dead
2009/2/23 Ariel T. Glenn :
> The reason these dumps are not rewritten more efficiently is that this
> job was handed to me (at my request) and I have not been able to get to
> it, even though it is the first thing on my list for development work.
> So, if there are going to be rants, they can be directed at me, not at
> the whole team.
>
> The work was started already by a volunteer. As I am the blocking
> factor, someone else should probably take it on and get it done, though
> it will make me sad. Brion discussed this with me about a week and a
> half ago and I still wanted to keep it then but it doesn't make sense.
> The in-office needs that I am also responsible for take virtually all of
> my time. Perhaps they shouldn't, but that is how it has worked out.

In that case, it seems the mistake was assigning what should have been a top-priority task to someone that couldn't actually make it their top priority due to other commitments. If someone is unable to guarantee that they'll have time to do something, they shouldn't be assigned something so time critical.
Re: [Wikitech-l] Dump processes seem to be dead
Ariel, Thank you for giving some insight into what has been going on behind the scenes. I have a few questions that will hopefully get some answers to those of us eager to help out in any way we can. What are the planned code changes to speed the process up? Can we help this volunteer with the coding or architectural decisions? How much time do they have to dedicate to it? Some visibility into the fix and timeline would benefit a lot of us. It would also help us know how we can help out! Thanks again for shedding some light on the issue. On Feb 22, 2009, at 8:12 PM, Ariel T. Glenn wrote: > The reason these dumps are not rewritten more efficiently is that this > job was handed to me (at my request) and I have not been able to get > to > it, even though it is the first thing on my list for development work. > So, if there are going to be rants, they can be directed at me, not at > the whole team. > > The work was started already by a volunteer. As I am the blocking > factor, someone else should probably take it on and get it done, > though > it will make me sad. Brion discussed this with me about a week and a > half ago and I still wanted to keep it then but it doesn't make sense. > The in-office needs that I am also responsible for take virtually > all of > my time. Perhaps they shouldn't, but that is how it has worked out. > > So, I am very sorry for having needlessly held things up. (I also > have > aa crawler that requests pages changed since the latest xml dump, so > that projects I am on can keep a current xml file; we've been running > that way for at least a year.) > > Ariel > > > Στις 23-02-2009, ημέρα Δευ, και ώρα 00:37 +0100, ο/ > η Gerard Meijssen > έγραψε: >> Hoi, >> There have been previous offers for developer time and for >> hardware... >> Thanks, >> GerardM >> >> 2009/2/23 Platonides >> >>> Robert Ullmann wrote: Hi, Maybe I should offer a constructive suggestion? 
>>> >>> They are better than rants :) >>> Clearly, trying to do these dumps (particularly "history" dumps) as it is being done from the servers is proving hard to manage I also realize that you can't just put the set of daily permanent-media backups on line, as they contain lots of user info, plus deleted and oversighted revs, etc. But would it be possible to put each backup disc (before sending one of the several copies off to its secure storage) in a machine that would filter all the content into a public file (or files)? Then someone else could download each disc (i.e. a 10-15 GB chunk of updates) and sort it into the useful files for general download? >>> >>> I don't think they move backup copies off to secure storage. They >>> have >>> the db replicated and the backup discs would be copies of that same >>> dumps. (Some sysadmin to confirm?) >>> Then someone can produce a current (for example) English 'pedia XML file; and with more work the cumulative history files (if we want that as one file). There would be delays, each of your permanent media backup discs has to be (probably manually, but changers are available) loaded on the "filter" system, and I don't know how many discs WMF generates per day. (;-) and then it has to filter all the revision data etc. But it still would easily be available for others in 48-72 hours, which beats the present ~6 weeks when the dumps are working. No shortage of people with a box or two and any number of Tbyte hard drives that might be willing to help, if they can get the raw backups. >>> >>> The problem is that WMF can't provide that raw unfiltered >>> information. >>> Perhaps you could donate a box on the condition that it could only >>> be >>> used for dump processing, but giving out unfiltered data would be >>> too >>> risky. 
Re: [Wikitech-l] Dump processes seem to be dead
Thanks for the update Russell! On Feb 23, 2009, at 10:04 AM, Russell Blau wrote: > "Russell Blau" wrote in message > news:gnuacf$hf...@ger.gmane.org... >> >> I have to second this. I tried to report this outage several times >> last >> week - on IRC, on this mailing list, and on Bugzilla. All reports >> -- NOT >> COMPLAINTS, JUST REPORTS -- were met with absolute silence. > > Two updates on this. > > 1) Brion did respond to the Bugzilla report (albeit two+ days after > it was > posted), which I overlooked when posting earlier. He said "The box > they > were running on (srv31) is dead. We'll reassign them over the > weekend if we > can't bring the box back up." > > 2) Within the last hour, the server log at > http://wikitech.wikimedia.org/wiki/Server_admin_log indicates that > Rob found > and fixed the cause of srv31 (and srv32-34) being down -- a circuit > breaker > was tripped in the data center. > > Russ
Re: [Wikitech-l] Dump processes seem to be dead
"Russell Blau" wrote in message news:gnuacf$hf...@ger.gmane.org... > > I have to second this. I tried to report this outage several times last > week - on IRC, on this mailing list, and on Bugzilla. All reports -- NOT > COMPLAINTS, JUST REPORTS -- were met with absolute silence. Two updates on this. 1) Brion did respond to the Bugzilla report (albeit two+ days after it was posted), which I overlooked when posting earlier. He said "The box they were running on (srv31) is dead. We'll reassign them over the weekend if we can't bring the box back up." 2) Within the last hour, the server log at http://wikitech.wikimedia.org/wiki/Server_admin_log indicates that Rob found and fixed the cause of srv31 (and srv32-34) being down -- a circuit breaker was tripped in the data center. Russ ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] Dump processes seem to be dead
"Lars Aronsson" wrote in message news:pine.lnx.4.64.0902231202140.1...@localhost.localdomain... > > However, quite independent of your development work, the current > system for dumps seems to have stopped on February 12. That's the > impression I get from looking at > http://download.wikimedia.org/backup-index.html > > Despite all its shortcomings (3-4 weeks between dumps, no history > dumps for en.wikipedia), the current dump system is very useful. > What's not useful is that it was out of service from July to > October 2008 and now again appears to be broken since February 12. > ... > Still today, February 23, no explanation has been posted on that > dump website or on these mailing lists. That's the real surprise. I have to second this. I tried to report this outage several times last week - on IRC, on this mailing list, and on Bugzilla. All reports -- NOT COMPLAINTS, JUST REPORTS -- were met with absolute silence. I fully understand that time and resources are limited, and not everything can be fixed immediately, but at least some acknowledgement of the reports would be appreciated. It is extremely disheartening to members of the user community of what is supposed to be a collaborative project when attempts to contribute by reporting a service outage are ignored. Russ ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] Dump processes seem to be dead
Ariel T. Glenn wrote: >> The reason these dumps are not rewritten more efficiently is >> that this job was handed to me (at my request) and I have not >> been able to get to it, even though it is the first thing on my >> list for development work. >> [...] >> The in-office needs that I am also responsible for take >> virtually all of my time. Perhaps they shouldn't, but that is >> how it has worked out. Hi Ariel, I hope you find the time and peace you need for this development. It might be a bit worrying if this was handed to you (by Brion? when?) without also handing you the necessary resources. But the internal organization there is not my task. However, quite independent of your development work, the current system for dumps seems to have stopped on February 12. That's the impression I get from looking at http://download.wikimedia.org/backup-index.html Despite all its shortcomings (3-4 weeks between dumps, no history dumps for en.wikipedia), the current dump system is very useful. What's not useful is that it was out of service from July to October 2008 and now again appears to be broken since February 12. Certainly, things do fail. But when they do, and I ask about this on #wikimedia-tech on February 20, a week after things stopped, I don't expect Brion to say "oops". I want him to know about it 12 hours after it happened and to have a plan. Apparently (I'm just guessing from what I hear), srv31 is broken and srv31 was not in the Nagios watchdog system. OK, will this be fixed? When? Still today, February 23, no explanation has been posted on that dump website or on these mailing lists. That's the real surprise. I have other issues I want to deal with: mapping extensions, new visionary solutions, new ways to involve new people in creating free knowledge. But if basic planning, routines and resource allocation don't work inside the WMF, then we have to start with the basics. What's wrong there? How can it be helped? 
-- Lars Aronsson (l...@aronsson.se) Aronsson Datateknik - http://aronsson.se
Re: [Wikitech-l] Dump processes seem to be dead
2009/2/22 Robert Ullmann : > Want everyone to just dynamically crawl the live DB, with whatever > screwy lousy inefficiency? FIne, just continue as you are, where that > is all that can be relied upon! Even if you had the dumps, you have another problem: They're incredibly big and so a bit difficult to parse. So, a small suggestion if the dumps will ever be workin' again: Split the history and current db stuff by alphabet, please. Marco PS: Are there any measurements what traffic is generated by ppl who download the dumps? Have there been any attempts to distribute them via BitTorrent? -- VMSoft GbR Nabburger Str. 15 81737 München Geschäftsführer: Marco Schuster, Volker Hemmert http://vmsoft-gbr.de
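Marco's parsing concern is real but manageable without splitting the files: the dump XML can be read as a stream instead of being loaded into memory at once. A minimal sketch, using a toy in-memory stand-in for a dump (real dump files put every tag in the MediaWiki export namespace, so the tag comparison would need that prefix):

```python
import io
import xml.etree.ElementTree as ET

# Tiny stand-in for a pages-articles dump. Real dumps are many gigabytes,
# which is exactly why streaming beats building the whole tree.
SAMPLE_DUMP = b"""<mediawiki>
  <page><title>Snail</title><id>1</id></page>
  <page><title>Slumdog Millionaire</title><id>2</id></page>
</mediawiki>"""

def iter_titles(fileobj):
    """Yield page titles from a dump stream, discarding each <page> as we go."""
    for event, elem in ET.iterparse(fileobj, events=("end",)):
        if elem.tag == "page":
            yield elem.findtext("title")
            elem.clear()  # free the finished subtree so memory use stays flat

titles = list(iter_titles(io.BytesIO(SAMPLE_DUMP)))
print(titles)  # → ['Snail', 'Slumdog Millionaire']
```

Because only one `<page>` subtree is alive at a time, the same loop handles a full-history dump on a machine with modest RAM.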
Re: [Wikitech-l] Dump processes seem to be dead
yep, http://svn.wikimedia.org/viewvc/mediawiki/trunk/backup/ +) 2009/2/23 Alex : > Ariel T. Glenn wrote: >> The reason these dumps are not rewritten more efficiently is that this >> job was handed to me (at my request) and I have not been able to get to >> it, even though it is the first thing on my list for development work. >> So, if there are going to be rants, they can be directed at me, not at >> the whole team. >> >> The work was started already by a volunteer. As I am the blocking >> factor, someone else should probably take it on and get it done, though >> it will make me sad. Brion discussed this with me about a week and a >> half ago and I still wanted to keep it then but it doesn't make sense. >> The in-office needs that I am also responsible for take virtually all of >> my time. Perhaps they shouldn't, but that is how it has worked out. >> >> So, I am very sorry for having needlessly held things up. (I also have >> aa crawler that requests pages changed since the latest xml dump, so >> that projects I am on can keep a current xml file; we've been running >> that way for at least a year.) >> > > Is the source for the new dump system on SVN somewhere? > > -- > Alex (wikipedia:en:User:Mr.Z-man) > > ___ > Wikitech-l mailing list > Wikitech-l@lists.wikimedia.org > https://lists.wikimedia.org/mailman/listinfo/wikitech-l > -- Nicolas Dumazet — NicDumZ [ nɪk.d̪ymz ] ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] Dump processes seem to be dead
Ariel T. Glenn wrote: > The reason these dumps are not rewritten more efficiently is that this > job was handed to me (at my request) and I have not been able to get to > it, even though it is the first thing on my list for development work. > So, if there are going to be rants, they can be directed at me, not at > the whole team. > > The work was started already by a volunteer. As I am the blocking > factor, someone else should probably take it on and get it done, though > it will make me sad. Brion discussed this with me about a week and a > half ago and I still wanted to keep it then but it doesn't make sense. > The in-office needs that I am also responsible for take virtually all of > my time. Perhaps they shouldn't, but that is how it has worked out. > > So, I am very sorry for having needlessly held things up. (I also have > aa crawler that requests pages changed since the latest xml dump, so > that projects I am on can keep a current xml file; we've been running > that way for at least a year.) > Is the source for the new dump system on SVN somewhere? -- Alex (wikipedia:en:User:Mr.Z-man) ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] Dump processes seem to be dead
The reason these dumps are not rewritten more efficiently is that this job was handed to me (at my request) and I have not been able to get to it, even though it is the first thing on my list for development work. So, if there are going to be rants, they can be directed at me, not at the whole team. The work was started already by a volunteer. As I am the blocking factor, someone else should probably take it on and get it done, though it will make me sad. Brion discussed this with me about a week and a half ago and I still wanted to keep it then but it doesn't make sense. The in-office needs that I am also responsible for take virtually all of my time. Perhaps they shouldn't, but that is how it has worked out. So, I am very sorry for having needlessly held things up. (I also have aa crawler that requests pages changed since the latest xml dump, so that projects I am on can keep a current xml file; we've been running that way for at least a year.) Ariel Στις 23-02-2009, ημέρα Δευ, και ώρα 00:37 +0100, ο/η Gerard Meijssen έγραψε: > Hoi, > There have been previous offers for developer time and for hardware... > Thanks, >GerardM > > 2009/2/23 Platonides > > > Robert Ullmann wrote: > > > Hi, > > > > > > Maybe I should offer a constructive suggestion? > > > > They are better than rants :) > > > > > Clearly, trying to do these dumps (particularly "history" dumps) as it > > > is being done from the servers is proving hard to manage > > > > > > I also realize that you can't just put the set of daily > > > permanent-media backups on line, as they contain lots of user info, > > > plus deleted and oversighted revs, etc. > > > > > > But would it be possible to put each backup disc (before sending one > > > of the several copies off to its secure storage) in a machine that > > > would filter all the content into a public file (or files)? Then > > > someone else could download each disc (i.e. 
a 10-15 GB chunk of > > > updates) and sort it into the useful files for general download? > > > > I don't think they move backup copies off to secure storage. They have > > the db replicated and the backup discs would be copies of that same > > dumps. (Some sysadmin to confirm?) > > > > > Then someone can produce a current (for example) English 'pedia XML > > > file; and with more work the cumulative history files (if we want that > > > as one file). > > > > > > There would be delays, each of your permanent media backup discs has > > > to be (probably manually, but changers are available) loaded on the > > > "filter" system, and I don't know how many discs WMF generates per > > > day. (;-) and then it has to filter all the revision data etc. But it > > > still would easily be available for others in 48-72 hours, which beats > > > the present ~6 weeks when the dumps are working. > > > > > > No shortage of people with a box or two and any number of Tbyte hard > > > drives that might be willing to help, if they can get the raw backups. > > > > The problem is that WMF can't provide that raw unfiltered information. > > Perhaps you could donate a box on the condition that it could only be > > used for dump processing, but giving out unfiltered data would be too > > risky. > > > > > > > > ___ > > Wikitech-l mailing list > > Wikitech-l@lists.wikimedia.org > > https://lists.wikimedia.org/mailman/listinfo/wikitech-l > > > ___ > Wikitech-l mailing list > Wikitech-l@lists.wikimedia.org > https://lists.wikimedia.org/mailman/listinfo/wikitech-l ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] Dump processes seem to be dead
Robert Ullmann: > What is with this? wrong list. the Foundation needs to allocate the resources to fix dumps. it hasn't done so, therefore dumps are still broken. perhaps you might ask the Foundation why dumps have such a low priority. - river.
Re: [Wikitech-l] Dump processes seem to be dead
Hoi, There have been previous offers for developer time and for hardware... Thanks, GerardM 2009/2/23 Platonides > Robert Ullmann wrote: > > Hi, > > > > Maybe I should offer a constructive suggestion? > > They are better than rants :) > > > Clearly, trying to do these dumps (particularly "history" dumps) as it > > is being done from the servers is proving hard to manage > > > > I also realize that you can't just put the set of daily > > permanent-media backups on line, as they contain lots of user info, > > plus deleted and oversighted revs, etc. > > > > But would it be possible to put each backup disc (before sending one > > of the several copies off to its secure storage) in a machine that > > would filter all the content into a public file (or files)? Then > > someone else could download each disc (i.e. a 10-15 GB chunk of > > updates) and sort it into the useful files for general download? > > I don't think they move backup copies off to secure storage. They have > the db replicated and the backup discs would be copies of that same > dumps. (Some sysadmin to confirm?) > > > Then someone can produce a current (for example) English 'pedia XML > > file; and with more work the cumulative history files (if we want that > > as one file). > > > > There would be delays, each of your permanent media backup discs has > > to be (probably manually, but changers are available) loaded on the > > "filter" system, and I don't know how many discs WMF generates per > > day. (;-) and then it has to filter all the revision data etc. But it > > still would easily be available for others in 48-72 hours, which beats > > the present ~6 weeks when the dumps are working. > > > > No shortage of people with a box or two and any number of Tbyte hard > > drives that might be willing to help, if they can get the raw backups. > > The problem is that WMF can't provide that raw unfiltered information. 
> Perhaps you could donate a box on the condition that it could only be > used for dump processing, but giving out unfiltered data would be too > risky.
Re: [Wikitech-l] Dump processes seem to be dead
Robert Ullmann wrote: > Hi, > > Maybe I should offer a constructive suggestion? They are better than rants :) > Clearly, trying to do these dumps (particularly "history" dumps) as it > is being done from the servers is proving hard to manage > > I also realize that you can't just put the set of daily > permanent-media backups on line, as they contain lots of user info, > plus deleted and oversighted revs, etc. > > But would it be possible to put each backup disc (before sending one > of the several copies off to its secure storage) in a machine that > would filter all the content into a public file (or files)? Then > someone else could download each disc (i.e. a 10-15 GB chunk of > updates) and sort it into the useful files for general download? I don't think they move backup copies off to secure storage. They have the db replicated and the backup discs would be copies of that same dumps. (Some sysadmin to confirm?) > Then someone can produce a current (for example) English 'pedia XML > file; and with more work the cumulative history files (if we want that > as one file). > > There would be delays, each of your permanent media backup discs has > to be (probably manually, but changers are available) loaded on the > "filter" system, and I don't know how many discs WMF generates per > day. (;-) and then it has to filter all the revision data etc. But it > still would easily be available for others in 48-72 hours, which beats > the present ~6 weeks when the dumps are working. > > No shortage of people with a box or two and any number of Tbyte hard > drives that might be willing to help, if they can get the raw backups. The problem is that WMF can't provide that raw unfiltered information. Perhaps you could donate a box on the condition that it could only be used for dump processing, but giving out unfiltered data would be too risky.
Re: [Wikitech-l] Dump processes seem to be dead
Hi, Maybe I should offer a constructive suggestion? Clearly, trying to do these dumps (particularly "history" dumps) as it is being done from the servers is proving hard to manage. I also realize that you can't just put the set of daily permanent-media backups on line, as they contain lots of user info, plus deleted and oversighted revs, etc. But would it be possible to put each backup disc (before sending one of the several copies off to its secure storage) in a machine that would filter all the content into a public file (or files)? Then someone else could download each disc (i.e. a 10-15 GB chunk of updates) and sort it into the useful files for general download? Then someone can produce a current (for example) English 'pedia XML file; and with more work the cumulative history files (if we want that as one file). There would be delays, each of your permanent media backup discs has to be (probably manually, but changers are available) loaded on the "filter" system, and I don't know how many discs WMF generates per day. (;-) and then it has to filter all the revision data etc. But it still would easily be available for others in 48-72 hours, which beats the present ~6 weeks when the dumps are working. No shortage of people with a box or two and any number of Tbyte hard drives that might be willing to help, if they can get the raw backups. Best, Robert
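The filtering step Robert proposes could look roughly like this. The internal backup record format is not public, so every field name here (`deleted`, `oversighted`, `user_ip`, `user_email`) is purely hypothetical; the sketch only illustrates the idea of dropping non-public revisions and stripping private user data before anything leaves the "filter" box:

```python
def filter_public(revisions):
    """Drop deleted/oversighted revisions and strip private user fields.

    `revisions` is an iterable of dicts with a hypothetical schema; this is
    an illustration of the filtering concept, not the real backup format.
    """
    PRIVATE_FIELDS = ("user_ip", "user_email")
    for rev in revisions:
        if rev.get("deleted") or rev.get("oversighted"):
            continue  # never publish suppressed content
        yield {k: v for k, v in rev.items() if k not in PRIVATE_FIELDS}

# Two made-up records: one public, one deleted.
raw = [
    {"revid": 101, "page_id": 7, "text": "ok", "user_ip": "192.0.2.5"},
    {"revid": 102, "page_id": 7, "text": "bad", "deleted": True},
]
public = list(filter_public(raw))
print(public)  # → [{'revid': 101, 'page_id': 7, 'text': 'ok'}]
```

The real difficulty, as Platonides notes, is trust: this code would have to run on WMF-controlled hardware, since the unfiltered input can never be handed out.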
Re: [Wikitech-l] Dump processes seem to be dead
What is with this? Why are the XML dumps (the primary product of the projects: re-usable content) the absolute effing lowest possible effing priority? Why? I just finished (I thought) putting together some new software to update iwikis on the wiktionaries. It is set up to read the "langlinks" and "all-titles" part of the dumps. Just as I do that, the dumps fall down. Again. And no-one cares one whit; not even a reply here. (The bug was replied to after 4 days, and *might* be fixed presently, after 9 days?) My course of action now is to write new code to use thousands of API calls to get the information, albeit as efficiently as I can. When I do that, the chance that it will ever go back to using the dumps is a very close approximation to zero. After all, it will work somewhat better that way. Other people, *many*, *many*, other people are being *forced* to do the same, to maintain their apps and functions based on the WMF data. And there is no chance in hell they will go back to the dump "service" either. Brion, Tim, et al: you are worried about overall server load? Get the dumps working. This morning. And make it crystal clear that they will not break, and you will be checking them n times a day and they can be utterly, totally, absolutely relied upon. It's like that. People will use what *works*. Want people to use the dumps? Make them WORK. Want everyone to just dynamically crawl the live DB, with whatever screwy lousy inefficiency? FIne, just continue as you are, where that is all that can be relied upon! Look at the other threads: people asking if they can crawl the English WP at one per second, or maybe what? Is that what you want? That is what you are telling people to do, when the dump "service" says "2009-02-12 06:52:16 pswiki: Dump in progress" at the top on the 22nd of February. FYI for all others: if you want content dumps of the English Wiktionary, they are available in the usual XML format at http://devtionary.info/w/dump/xmlu/ at ~ 09;00 UTC. 
Every day. With my best regards, Robert On Tue, Feb 17, 2009 at 7:35 PM, Russell Blau wrote: > "Andreas Meier" wrote in message > news:4997d645.8050...@gmx.de... >> Hello, >> >> the current dump building seem to be dead and perhaps should be killed >> by hand. >> > > Reported: https://bugzilla.wikimedia.org/show_bug.cgi?id=17535
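For comparison, the API route Robert is switching his iwiki software to can be sketched as follows. It uses the real `action=query&prop=langlinks` module; the canned response below mimics the shape the API returned at the time (the `*` key carries the target title), so the parsing can be tried without network access, and the live call is left commented out:

```python
from urllib.parse import urlencode
# from urllib.request import urlopen  # uncomment (plus json.loads) for a live call

API = "https://en.wiktionary.org/w/api.php"

def langlinks_url(title):
    """Build an API query URL for all interwiki language links of one title."""
    params = {"action": "query", "prop": "langlinks",
              "titles": title, "lllimit": "max", "format": "json"}
    return API + "?" + urlencode(params)

def parse_langlinks(payload):
    """Extract (lang, target_title) pairs from an action=query JSON response."""
    pairs = []
    for page in payload["query"]["pages"].values():
        for ll in page.get("langlinks", []):
            pairs.append((ll["lang"], ll["*"]))
    return pairs

# Canned sample response; a live run would instead fetch langlinks_url("water").
sample = {"query": {"pages": {"123": {"title": "water",
          "langlinks": [{"lang": "fr", "*": "eau"}]}}}}
print(parse_langlinks(sample))  # → [('fr', 'eau')]
```

One request per title is far less efficient than one bulk `langlinks` dump file, which is Robert's point: thousands of these calls replace a single download when the dumps are down.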
Re: [Wikitech-l] Dump processes seem to be dead
"Andreas Meier" wrote in message news:4997d645.8050...@gmx.de... > Hello, > > the current dump building seem to be dead and perhaps should be killed > by hand. > Reported: https://bugzilla.wikimedia.org/show_bug.cgi?id=17535 ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
[Wikitech-l] Dump processes seem to be dead
Hello, the current dump building seems to be dead and perhaps should be killed by hand. Best regards Andim