Re: [Wikitech-l] Dump processes seem to be dead

2009-02-26 Thread Aryeh Gregor
On Thu, Feb 26, 2009 at 4:48 PM, Platonides  wrote:
> Not only do you need to keep them in the same block. You also need to
> keep them inside the compression window. Unless you are going to reorder
> those 1M revisions to keep revisions to the same article together.

He already said that should be done (each block clustered by page id).

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Dump processes seem to be dead

2009-02-26 Thread Platonides
Robert Ullmann wrote:
> Look at the first three digits of the revid; when they are the same,
> they would be in the same "block" (this is assuming 1M revs/block, as I
> suggested). You can check any title you like (remember _ for space
> and % escapes for a lot of characters, but a good browser will do that
> for you in most cases). Since the majority of edits are for a
> minority of titles (some version of the 80/20 rule applies), most
> edits/revisions will be in the same block as a number of others for
> that page.

Not only do you need to keep them in the same block. You also need to
keep them inside the compression window. Unless you are going to reorder
those 1M revisions to keep revisions to the same article together.
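
As a rough, minimal illustration of the window effect (a Python sketch, not
anything from the dump tooling; the only real assumption is bzip2's block size
of at most ~900 KB, beyond which it cannot exploit redundancy):

    import bz2
    import os
    import random

    random.seed(0)
    # A synthetic ~350 KB "revision": wiki-like text that is not trivially self-repetitive.
    words = ["".join(random.choices("abcdefghij", k=6)) for _ in range(200)]
    revision = " ".join(random.choices(words, k=50_000)).encode()
    filler = os.urandom(2_000_000)   # ~2 MB of unrelated data

    # Same total content, different ordering: only the second input pushes the two
    # copies of the revision into different bzip2 blocks.
    adjacent = len(bz2.compress(revision + revision + filler))
    separated = len(bz2.compress(revision + filler + revision))
    print(adjacent, separated)   # adjacent should come out noticeably smaller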


___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Dump processes seem to be dead

2009-02-25 Thread Robert Ullmann
Hi,

On Thu, Feb 26, 2009 at 2:29 AM, Andrew Garrett  wrote:
> On Thu, Feb 26, 2009 at 5:08 AM, John Doe  wrote:
>> But the server space saved by compression would be compensated by the
>> stability and flexibility provided by this method. This would allow
>> whatever server is controlling the dump process to designate and delegate
>> parallel processes for the same dump.
>
> Not nearly -- we're talking about a 100-fold decrease in compression
> ratio if we don't compress revisions of the same page adjacent to one
> another.
>
> --
> Andrew Garrett

No, not nearly that bad. Keep in mind that ~10x of the compression is
just from having English text and repeated XML tags, etc. (Note the
compression ratio of the all-articles dump, which has only one
revision of each article.)

If the revisions in each "block" are sorted by pageid, so that the
revs of the same article are together, you'll get a very large part of
the other 10x factor. Revisions to pages tend to cluster in time
(think edits and reverts :-) as one or more people work on an article,
or it is of news interest (see "Slumdog Millionaire" ;-) or whatever.
You can see this for any given article, like this:

http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvlimit=max&titles=Snail

Look at the first three digits of the revid; when they are the same,
they would be in the same "block" (this is assuming 1M revs/block, as I
suggested). You can check any title you like (remember _ for space
and % escapes for a lot of characters, but a good browser will do that
for you in most cases). Since the majority of edits are for a
minority of titles (some version of the 80/20 rule applies), most
edits/revisions will be in the same block as a number of others for
that page.

So we will get most, but not all, of the other 10X compression ratio.
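
For anyone who wants to reproduce that check, a small sketch against the API
query shown above (the 1M-revisions-per-block figure comes from this thread;
the function itself is invented for illustration):

    import json
    import urllib.parse
    import urllib.request
    from collections import Counter

    def revisions_per_block(title, block_size=1_000_000):
        """Count how a page's most recent revision IDs spread across 1M-rev 'blocks'."""
        params = urllib.parse.urlencode({
            "action": "query", "prop": "revisions", "rvprop": "ids",
            "rvlimit": "max", "titles": title, "format": "json",
        })
        req = urllib.request.Request(
            "https://en.wikipedia.org/w/api.php?" + params,
            headers={"User-Agent": "block-demo/0.1 (illustrative)"})
        with urllib.request.urlopen(req) as resp:
            data = json.load(resp)
        page = next(iter(data["query"]["pages"].values()))
        return Counter(rev["revid"] // block_size for rev in page.get("revisions", []))

    print(revisions_per_block("Snail"))   # e.g. most of the recent revisions share a few block numbers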

But even if the compressed blocks are (say) 20% bigger, the win is
that once they are some weeks old, they NEVER need to be re-built.
Each dump (which should then be about weekly, with the same compute
resource, as the queue runs faster ;-) need only build or re-build a
few blocks. (And there is no need at all to parallelize any given
dump, just run 3-5 different ones in parallel as now.)

best, Robert

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Dump processes seem to be dead

2009-02-25 Thread Andrew Garrett
On Thu, Feb 26, 2009 at 5:08 AM, John Doe  wrote:
> But the server space saved by compression would be compensated by the
> stability and flexibility provided by this method. This would allow
> whatever server is controlling the dump process to designate and delegate
> parallel processes for the same dump.

Not nearly -- we're talking about a 100-fold decrease in compression
ratio if we don't compress revisions of the same page adjacent to one
another.

--
Andrew Garrett

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Dump processes seem to be dead

2009-02-25 Thread Felipe Ortega
--- On Wed, 25/2/09, Robert Ullmann wrote:

> From: Robert Ullmann
> Subject: Re: [Wikitech-l] Dump processes seem to be dead
> To: "Wikimedia developers"
> Date: Wednesday, 25 February 2009, 2:09
> you
> yourself suggested page id.
> 
> I suggest the history be partitioned into
> "blocks" by *revision ID*

I've checked some alternatives for slicing the huge dump files into chunks of a 
more manageable size. I first thought about dividing the blocks by rev_id, as 
you suggest. Then I realized that it can pose some problems for parsers 
recovering information, since revisions corresponding to the same page may fall 
into different dump files.

Once you have passed the page_id tag, you cannot recover it if the process 
stops due to some error, unless you save breakpoint information so you can 
pick up from that point when you restart the process.

Partitioning by page_id, you can keep all revs of the same page in the same 
block without disturbing algorithms looking for individual revisions.

Yes, the chunks would be slightly bigger, but the difference is not that large 
with either 7zip or bzip2, and you favor simplicity in the recovery tools.
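
A minimal sketch of that combination (page_id-keyed chunks plus a saved
breakpoint so an interrupted run can resume); the chunk size, file layout and
names here are invented for illustration, not taken from the dump code:

    import json
    import os

    PAGES_PER_CHUNK = 100_000   # illustrative chunk size

    def chunk_for(page_id):
        # All revisions of a page land in the same chunk file.
        return page_id // PAGES_PER_CHUNK

    def append_page(page_id, revisions_xml):
        with open(f"chunk-{chunk_for(page_id):05d}.xml", "a", encoding="utf-8") as f:
            f.write(f'<page id="{page_id}">\n{revisions_xml}</page>\n')

    def write_dump(pages, checkpoint="dump.checkpoint"):
        """pages: iterable of (page_id, revisions_xml) in page_id order.
        A real tool would write the page and the checkpoint atomically."""
        done = -1
        if os.path.exists(checkpoint):
            with open(checkpoint) as f:
                done = json.load(f)["last_page_id"]
        for page_id, revisions_xml in pages:
            if page_id <= done:          # already written before the interruption
                continue
            append_page(page_id, revisions_xml)
            with open(checkpoint, "w") as f:
                json.dump({"last_page_id": page_id}, f)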

Best,

F.

> 
> ___
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l


  

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Dump processes seem to be dead

2009-02-25 Thread Platonides
Marco Schuster wrote:
> Another idea: If $revision is
> deleted/oversighted/whateverhowmadeinvisible, then find out the block
> ID for the dump so that only this specific block needs to be
> re-created in the next dump run. Or, better: do not recreate the dump
> block, but only remove the offending revision(s) from it. That should save
> a lot of dump preparation time, IMO.
> 
> Marco

That's already done. New dumps reuse the content from the previous ones
(when available; enwiki has a hard time with it).


___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Dump processes seem to be dead

2009-02-25 Thread Marco Schuster
2009/2/25 John Doe :
> I'd recommend either 10M or 10% of
> the database, whichever is larger, for new dumps to screen out a majority of
> the deletions. What are your thoughts on this process, Brion (and the rest of
> the tech team)?
Another idea: If $revision is
deleted/oversighted/whateverhowmadeinvisible, then find out the block
ID for the dump so that only this specific block needs to be
re-created in the next dump run. Or, better: do not recreate the dump
block, but only remove the offending revision(s) from it. That should save
a lot of dump preparation time, IMO.
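
A sketch of the bookkeeping that idea needs, assuming the 1M-revisions-per-block
scheme discussed earlier in the thread (names and data shapes are made up for
illustration):

    REVS_PER_BLOCK = 1_000_000

    def blocks_to_touch(suppressed_rev_ids):
        """Only these block files need rewriting when revisions get deleted/oversighted."""
        return sorted({rev_id // REVS_PER_BLOCK for rev_id in suppressed_rev_ids})

    def scrub_block(block_revisions, suppressed_rev_ids):
        """block_revisions: iterable of (rev_id, xml_fragment) pairs for one block.
        Yields only the revisions that may stay public."""
        for rev_id, xml in block_revisions:
            if rev_id not in suppressed_rev_ids:
                yield rev_id, xml

    print(blocks_to_touch({271_234_567, 271_999_999, 12_345}))   # -> [0, 271]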

Marco

-- 
VMSoft GbR
Nabburger Str. 15
81737 München
Geschäftsführer: Marco Schuster, Volker Hemmert
http://vmsoft-gbr.de

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Dump processes seem to be dead

2009-02-25 Thread Thomas Dalton
2009/2/25 John Doe :
> But the server space saved by compression would be compensated by the
> stability and flexibility provided by this method.

True, I didn't mean to say it was a bad idea, I was just pointing out
one disadvantage you may not have considered.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Dump processes seem to be dead

2009-02-25 Thread Robert Rohde
On Tue, Feb 24, 2009 at 5:09 PM, Robert Ullmann  wrote:

> I suggest the history be partitioned into "blocks" by *revision ID*
>
> Like this: revision IDs (0)-999,999 go in "block 0", 1M to 2M-1 in
> "block 1", and so on. The English Wiktionary at the moment would have
> 7 blocks; the English Wikipedia would have 273.


Though there are arguments in favor of this, I think they are
outweighed by the fact that one would need to go through every block
in order to reconstruct the history of even a single page.  In my
opinion partitioning on page id is a much better idea since it would
keep each page's history in a single place.

-Robert Rohde

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Dump processes seem to be dead

2009-02-25 Thread John Doe
But the server space saved by compression would be compensated by the
stability and flexibility provided by this method. This would allow
whatever server is controlling the dump process to designate and delegate
parallel processes for the same dump, so block 1 could be on server 1 and
block 2 could be on server 3. That would give the flexibility to use as many
servers as are available for this task more efficiently. If block 200 of
en.wp breaks for some reason, you don't have to rebuild the previous 199
blocks; you can just delegate a server to rebuild that single block. That
would allow the dump process to be a little more crash-friendly (even though
I know we don't want to admit crashes happen :) ). This also enables the dump
time of future dumps to be cut drastically. I'd recommend either 10M or 10% of
the database, whichever is larger, for new dumps, to screen out a majority of
the deletions. What are your thoughts on this process, Brion (and the rest of
the tech team)?

Betacommand

On Wed, Feb 25, 2009 at 9:00 AM, Thomas Dalton wrote:

> 2009/2/25 Robert Ullmann :
> > I suggest the history be partitioned into "blocks" by *revision ID*
> >
> > Like this: revision IDs (0)-999,999 go in "block 0", 1M to 2M-1 in
> > "block 1", and so on. The English Wiktionary at the moment would have
> > 7 blocks; the English Wikipedia would have 273.
>
> One problem with that is that you won't get such good compression
> ratios. Most of the revisions of a single article are very similar to
> the revisions before and after it, so they compress down very small.
> If you break up the articles between different blocks you don't get
> that advantage (at least, not to the same extent).
>
> ___
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Dump processes seem to be dead

2009-02-25 Thread Thomas Dalton
2009/2/25 Robert Ullmann :
> I suggest the history be partitioned into "blocks" by *revision ID*
>
> Like this: revision IDs (0)-999,999 go in "block 0", 1M to 2M-1 in
> "block 1", and so on. The English Wiktionary at the moment would have
> 7 blocks; the English Wikipedia would have 273.

One problem with that is that you won't get such good compression
ratios. Most of the revisions of a single article are very similar to
the revisions before and after it, so they compress down very small.
If you break up the articles between different blocks you don't get
that advantage (at least, not to the same extent).

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Dump processes seem to be dead

2009-02-25 Thread Mark (Markie)
AFAIK there are "hands" in Amsterdam that can be called upon to do things as
necessary in the centre, like for any other hosting customer, but the need is not
quite at the same level as Tampa due to the size, number of servers there, etc.
Seoul no longer operates, so this is not an issue.

regards

mark

On Tue, Feb 24, 2009 at 2:55 PM, Gerard Meijssen
wrote:

> Hoi,
> Is there also a "Rob" in Amsterdam and Seoul ?
> Thanks,
>   GerardM
>
> 2009/2/24 Aryeh Gregor <simetrical%2bwikil...@gmail.com>:
> >
>
> > On Tue, Feb 24, 2009 at 9:42 AM, Thomas Dalton 
> > wrote:
> > > Is there anyone within minutes of the servers at all times? Aren't
> > > they at a remote data centre?
> >
> > Isn't Rob on-site?
> >
> > ___
> > Wikitech-l mailing list
> > Wikitech-l@lists.wikimedia.org
> > https://lists.wikimedia.org/mailman/listinfo/wikitech-l
> >
> ___
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Dump processes seem to be dead

2009-02-24 Thread Robert Ullmann
> The worrying bit is that it seems srv136 will now work as an Apache server.
> So, where will dumps be done?

I'm not sure where (or if it has changed), but they are running now (:-)

To Ariel Glenn:

On getting them to work better in the future, this is what I would suggest:

First, note that everything except the "all history" dumps presents no
problem. It isn't perfect, but it is workable. The biggest "all pages
current" dump is enwiki, which takes about a day and a half, and the
compressed output file (bz2) still fits neatly on a DVD.

As to the history files, these are the problem; each contains all of
the preceding history and they just grow and grow. They must be
partitioned somehow. Suggestions have been made concerning
alphabetical partitions (very traditional for encyclopaedias ;-); you
yourself suggested page id.

I suggest the history be partitioned into "blocks" by *revision ID*

Like this: revision IDs (0)-999,999 go in "block 0", 1M to 2M-1 in
"block 1", and so on. The English Wiktionary at the moment would have
7 blocks; the English Wikipedia would have 273.

The dumps would continue as now up to "all pages current", including
the split-stub dump for the history (very important, as it provides
the "snapshot" of the DB state). But then when it gets to history, it
re-builds the last block done (possibly completing it), and then
writes 0-n new ones as needed.
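
The incremental part of that proposal is tiny in code; a hedged sketch (the
block size is the number from this thread, the function itself is invented):

    REVS_PER_BLOCK = 1_000_000

    def blocks_to_build(prev_max_rev_id, current_max_rev_id):
        """A new dump run re-builds the last block of the previous run (it may have
        been incomplete) and writes any blocks added since; older blocks are left
        untouched."""
        first = prev_max_rev_id // REVS_PER_BLOCK
        last = current_max_rev_id // REVS_PER_BLOCK
        return list(range(first, last + 1))

    # e.g. a previous dump that stopped inside block 271, on a wiki now into block 273:
    print(blocks_to_build(271_400_000, 273_900_000))   # -> [271, 272, 273]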

Note that (to pick a random number) "block 71" of enwiki, defined
this way, *has not changed* in a long time; only the current block(s)
need to be (re-)written. The history stays the same. (Of course?!)

If someone somewhere needs a copy of the wiki with all history as of a
given date, they can start with the split-stub for that date and read
in all the required blocks. But that isn't your problem any more. (;-)
They can do that with their disk and servers.

It would probably be best to still sort by page-id order within each
block, as they will compress much better that way.
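
As a sketch, the re-ordering itself is a one-liner once a block's revisions are
in hand (the tuple layout is just an assumption for illustration):

    def order_block_for_compression(block_revisions):
        """block_revisions: list of (rev_id, page_id, text) tuples for one block.
        Grouping by page keeps successive versions of the same article adjacent,
        which is what lets bzip2/7zip see the redundancy."""
        return sorted(block_revisions, key=lambda rev: (rev[1], rev[0]))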

One reason to rebuild the last block (or two) is to filter out deleted
and oversighted revisions. Deleted and oversighted revisions older
than some specific time (a small number of weeks) would remain. But
note that that is true *anyway*, as someone can always look at a
3-month old dump under any method.

With my best regards,
Robert

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Dump processes seem to be dead

2009-02-24 Thread Gerard Meijssen
Hoi,
Is there also a "Rob" in Amsterdam and Seoul ?
Thanks,
   GerardM

2009/2/24 Aryeh Gregor:

> On Tue, Feb 24, 2009 at 9:42 AM, Thomas Dalton 
> wrote:
> > Is there anyone within minutes of the servers at all times? Aren't
> > they at a remote data centre?
>
> Isn't Rob on-site?
>
> ___
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Dump processes seem to be dead

2009-02-24 Thread Thomas Dalton
2009/2/24 Aryeh Gregor :
> On Tue, Feb 24, 2009 at 9:42 AM, Thomas Dalton  
> wrote:
>> Is there anyone within minutes of the servers at all times? Aren't
>> they at a remote data centre?
>
> Isn't Rob on-site?

He's based somewhere near the data centre, but I'm not sure he's
actually there unless there is something which needs his attention.
He's certainly not there 24/7 (regrettably, WMF is still using human
sysadmins...).

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Dump processes seem to be dead

2009-02-24 Thread Aryeh Gregor
On Tue, Feb 24, 2009 at 9:42 AM, Thomas Dalton  wrote:
> Is there anyone within minutes of the servers at all times? Aren't
> they at a remote data centre?

Isn't Rob on-site?

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Dump processes seem to be dead

2009-02-24 Thread Thomas Dalton
2009/2/24 Robert Ullmann :
> When a server is reported down (in this case hard; won't reply to
> ping) it should be physically looked at within minutes.

Is there anyone within minutes of the servers at all times? Aren't
they at a remote data centre?

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Dump processes seem to be dead

2009-02-24 Thread Platonides
Robert Ullmann wrote:
> All servers should be monitored, on several levels (ping, various
> queries, checking processes)

Nagios should have been monitoring them.


> Someone should be "watching" the monitor 24x7. (being right there, or
> by SMS, whatever ;)

I don't know if there can be a Nagios "silent" failure, where it doesn't
get disconnected from IRC.


> When restarted, the things it was doing should be restarted (this has
> not been done yet at this writing).

The worrying bit is that it seems srv136 will now work as an Apache server.
So, where will dumps be done?


___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Dump processes seem to be dead

2009-02-23 Thread Robert Ullmann
Let me ask a separate question (Ariel may be interested in this):

What if we took the regular permanent media backups, and WMF filtered
them in house just to remove the classified stuff (;-), and then put
them somewhere where others could convert them to the desired
format(s)? (Build all-history files, whatever.)

What is the standard backup procedure?

(I ask as I haven't seen any description or reference to it ... :-)

Robert

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Dump processes seem to be dead

2009-02-23 Thread Robert Ullmann
On Tue, Feb 24, 2009 at 6:49 AM, Andrew Garrett  wrote:
> On Tue, Feb 24, 2009 at 1:07 PM, Robert Ullmann  wrote:
>> Really? I mean is this for real?
>>
>> The sequence ought to be something like: breaker trips, monitor shows
>> within a minute or two that 4 servers are offline, and not scheduled
>> to be. In the next 5 minutes someone looks at the server(s), notes
>> that there is no AC power, walks directly to the panel and resets the
>> breaker. How is this *not* done? I'm sorry, I just don't get it. I've
>> run data centres, and it just is not possible to have servers down for
>> AC power for more than a few minutes unless there is a fault one can't
>> locate. (Or grid down, and running a subset on the generators ;-)
>>
>> Can someone explain all this? Is the whole thing just completely
>> beyond the resource available to manage it?
>
> Constructive suggestions for improvement are far more welcome than
> complaints and outrage.
>
> If you have no suggestions for improvement, it is perhaps more prudent
> to express concern that dumps are not working and to wait for a
> response. This is admittedly less fun than piecing together
> information and "lining up" those responsible for something not being
> operational.

Andrew: this is NOT FUN AT ALL. Do you think it is "fun" to have to
complain bitterly and repeatedly because simply reporting
critical-down problems elicits little or no reply and no corrective
action for days and weeks? Fun? Fun?

Okay, I'll put it this way: the following should be done:

All servers should be monitored, on several levels (ping, various
queries, checking processes)

Someone should be "watching" the monitor 24x7. (being right there, or
by SMS, whatever ;)

When a server is reported down (in this case hard; won't reply to
ping) it should be physically looked at within minutes.

If it has no AC power, the circuit breaker is the first thing to check.

When restarted, the things it was doing should be restarted (this has
not been done yet at this writing).

Now I can say these things as "constructive suggestions", but they
are not, of course: they are fundamental operational procedure for a
data centre. Please explain to me why I should have to "suggest" them?
Eh? I am confused (seriously! I am not being snarky here). What is
going on?

best,
Robert

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Dump processes seem to be dead

2009-02-23 Thread Andrew Garrett
On Tue, Feb 24, 2009 at 1:07 PM, Robert Ullmann  wrote:
> Really? I mean is this for real?
>
> The sequence ought to be something like: breaker trips, monitor shows
> within a minute or two that 4 servers are offline, and not scheduled
> to be. In the next 5 minutes someone looks at the server(s), notes
> that there is no AC power, walks directly to the panel and resets the
> breaker. How is this *not* done? I'm sorry, I just don't get it. I've
> run data centres, and it just is not possible to have servers down for
> AC power for more than a few minutes unless there is a fault one can't
> locate. (Or grid down, and running a subset on the generators ;-)
>
> Can someone explain all this? Is the whole thing just completely
> beyond the resource available to manage it?

Constructive suggestions for improvement are far more welcome than
complaints and outrage.

If you have no suggestions for improvement, it is perhaps more prudent
to express concern that dumps are not working and to wait for a
response. This is admittedly less fun than piecing together
information and "lining up" those responsible for something not being
operational.

-- 
Andrew Garrett

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Dump processes seem to be dead

2009-02-23 Thread Robert Ullmann
Hmm:

On Mon, Feb 23, 2009 at 9:04 PM, Russell Blau  wrote:

> 2)  Within the last hour, the server log at
> http://wikitech.wikimedia.org/wiki/Server_admin_log indicates that Rob found
> and fixed the cause of srv31 (and srv32-34) being down -- a circuit breaker
> was tripped in the data center.

So we conclude that

Feb 12th: a breaker trips, taking four servers offline

(8 days go by, with a number of reports)

Feb 20th: it is noted that srv31 is down, (noted that AC is off?)

(3 days go by)

Feb 23rd: the tripped breaker is found, srv31 restarted (and 8+ hours
later, the dumps have not resumed)

Really? I mean is this for real?

The sequence ought to be something like: breaker trips, monitor shows
within a minute or two that 4 servers are offline, and not scheduled
to be. In the next 5 minutes someone looks at the server(s), notes
that there is no AC power, walks directly to the panel and resets the
breaker. How is this *not* done? I'm sorry, I just don't get it. I've
run data centres, and it just is not possible to have servers down for
AC power for more than a few minutes unless there is a fault one can't
locate. (Or grid down, and running a subset on the generators ;-)

Can someone explain all this? Is the whole thing just completely
beyond the resource available to manage it?

Best regards,
Robert

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Dump processes seem to be dead

2009-02-23 Thread Ángel
Robert Rohde wrote:
> The largest gains are almost certainly going to be in parallelization
> though.  A single monolithic dumper is impractical for enwiki.
> 
> -Robert Rohde

Using dumps compressed in per-block chunks, like the ones I used for
http://lists.wikimedia.org/pipermail/wikitech-l/2009-January/040812.html
would allow several processes/computers to write the same dump at
different offsets, and to read from the last one at different positions
as well.

As sharing a transaction between different servers would be tricky, they
should probably dump from the previously dumped page.sql.gz.
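
One concrete way to get that property (a sketch; the file name and index format
are invented here): let each worker compress its own slice of pages as an
independent gzip member, concatenate the members, and record each member's byte
offset so readers can seek straight to a block.

    import gzip

    def compress_block(pages):
        """One worker turns its slice of (page_id, text) pairs into a standalone gzip member."""
        xml = "".join(f'<page id="{pid}">{text}</page>\n' for pid, text in pages)
        return gzip.compress(xml.encode("utf-8"))

    # Concatenated gzip members still form one valid gzip file; the offset index
    # tells a reader (or a later writer) where each block starts.
    blocks = [[(1, "a")], [(2, "b")], [(3, "c")]]      # stand-ins for real page batches
    members = [compress_block(b) for b in blocks]
    offsets, pos = [], 0
    with open("pages-blocks.xml.gz", "wb") as out:
        for member in members:
            offsets.append(pos)
            out.write(member)
            pos += len(member)
    print(offsets)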


Patches on bugs 16082 and 16176 to add Export features are
awaiting review

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Dump processes seem to be dead

2009-02-23 Thread Brion Vibber
On 2/23/09 12:13 PM, Ariel T. Glenn wrote:
> I asked for it, and that's why it was assigned to me.  I should have
> recognized much sooner that I could not actually get it done and should
> have brought this to Brion's attention instead of continuing to hang on
> to it after he brought it to my attention.

I've been needing to reprioritize resources for this for a while; all of 
us having many other things to do at the same time and lots of folks 
being out sick during cold/flu season may not sound like a good excuse 
for this dragging on longer than I'd like the last few weeks, but I'm 
afraid it's the best I can offer at the moment.

Anyway, rest assured that this remains very much on my mind -- we 
haven't forgotten that the current dump process sucks and needs to be 
fixed up.

-- brion

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Dump processes seem to be dead

2009-02-23 Thread Brion Vibber
On 2/23/09 3:08 AM, Marco Schuster wrote:
> Even if you had the dumps, you have another problem: They're
> incredibly big and so a bit difficult to parse. So, a small suggestion
> if the dumps will ever be workin' again: Split the history and current
> db stuff by alphabet, please.

Define alphabet -- how should Chinese and Japanese texts be broken up?

We're much more likely to break them up simply by page ID.

> PS: Are there any measurements what traffic is generated by ppl who
> download the dumps?

Not currently.

> Have there been any attempts to distribute them
> via BitTorrent?

By third parties, with AFAIK very little usage.

-- brion

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Dump processes seem to be dead

2009-02-23 Thread Ariel T. Glenn
On Monday 23-02-2009, at 19:02, Thomas Dalton wrote:
> 2009/2/23 Ariel T. Glenn :
> > The reason these dumps are not rewritten more efficiently is that this
> > job was handed to me (at my request) and I have not been able to get to
> > it, even though it is the first thing on my list for development work.
> > So, if there are going to be rants, they can be directed at me, not at
> > the whole team.
> >
> > The work was started already by a volunteer.  As I am the blocking
> > factor, someone else should probably take it on and get it done, though
> > it will make me sad.  Brion discussed this with me about a week and a
> > half ago and I still wanted to keep it then but it doesn't make sense.
> > The in-office needs that I am also responsible for take virtually all of
> > my time.  Perhaps they shouldn't, but that is how it has worked out.
> 
> In that case, it seems the mistake was assigning what should have been
> a top-priority task to someone that couldn't actually make it their
> top priority due to other commitments. If someone is unable to
> guarantee that they'll have time to do something, they shouldn't be
> assigned something so time critical.

I asked for it, and that's why it was assigned to me.  I should have
recognized much sooner that I could not actually get it done and should
have brought this to Brion's attention instead of continuing to hang on
to it after he brought it to my attention.  

Ariel


___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Dump processes seem to be dead

2009-02-23 Thread Robert Rohde
On Mon, Feb 23, 2009 at 11:08 AM, Alex  wrote:
> Most of that hasn't been touched in years, and it seems to be mainly a
> Python wrapper around the dump scripts in /phase3/maintenance/ which
> also don't seem to have had significant changes recently. Has anything
> been done recently (in a very broad sense of the word)? Or at least, has
> anything been written down about what the plans are?

In a "very broad sense" (and not directly connected to main problems),
I wrote a compressor [1] that converts full-text history dumps into an
"edit syntax" that provides ~95% compression on the larger dumps while
keeping it in a plain text format that could still be searched and
processed without needing a full decompression.

That's one of several ways to modify the way the dump process operates in
order to make the output easier to work with (if it takes ~2 TB to
expand enwiki's full history, then that is not practical for most
users even if we solve the problem of generating it).  It is not
necessarily true that my specific technology is the right answer, but
various changes in formatting to aid distribution, generation, and use
are one of the areas that ought to be considered when reimplementing
the dump process.

The largest gains are almost certainly going to be in parallelization
though.  A single monolithic dumper is impractical for enwiki.

-Robert Rohde

[1] http://svn.wikimedia.org/viewvc/mediawiki/trunk/tools/editsyntax/

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Dump processes seem to be dead

2009-02-23 Thread Alex
Most of that hasn't been touched in years, and it seems to be mainly a
Python wrapper around the dump scripts in /phase3/maintenance/ which
also don't seem to have had significant changes recently. Has anything
been done recently (in a very broad sense of the word)? Or at least, has
anything been written down about what the plans are?


Nicolas Dumazet wrote:
> yep, http://svn.wikimedia.org/viewvc/mediawiki/trunk/backup/ +)
> 
> 2009/2/23 Alex :
>> Ariel T. Glenn wrote:
>>> The reason these dumps are not rewritten more efficiently is that this
>>> job was handed to me (at my request) and I have not been able to get to
>>> it, even though it is the first thing on my list for development work.
>>> So, if there are going to be rants, they can be directed at me, not at
>>> the whole team.
>>>
>>> The work was started already by a volunteer.  As I am the blocking
>>> factor, someone else should probably take it on and get it done, though
>>> it will make me sad.  Brion discussed this with me about a week and a
>>> half ago and I still wanted to keep it then but it doesn't make sense.
>>> The in-office needs that I am also responsible for take virtually all of
>>> my time.  Perhaps they shouldn't, but that is how it has worked out.
>>>
>>> So, I am very sorry for having needlessly held things up.  (I also have
>>> aa crawler that requests pages changed since the latest xml dump, so
>>> that projects I am on can keep a current xml file; we've been running
>>> that way for at least a year.)
>>>
>> Is the source for the new dump system on SVN somewhere?
>>
>> --
>> Alex (wikipedia:en:User:Mr.Z-man)
>>

-- 
Alex (wikipedia:en:User:Mr.Z-man)

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Dump processes seem to be dead

2009-02-23 Thread Thomas Dalton
2009/2/23 Ariel T. Glenn :
> The reason these dumps are not rewritten more efficiently is that this
> job was handed to me (at my request) and I have not been able to get to
> it, even though it is the first thing on my list for development work.
> So, if there are going to be rants, they can be directed at me, not at
> the whole team.
>
> The work was started already by a volunteer.  As I am the blocking
> factor, someone else should probably take it on and get it done, though
> it will make me sad.  Brion discussed this with me about a week and a
> half ago and I still wanted to keep it then but it doesn't make sense.
> The in-office needs that I am also responsible for take virtually all of
> my time.  Perhaps they shouldn't, but that is how it has worked out.

In that case, it seems the mistake was assigning what should have been
a top-priority task to someone that couldn't actually make it their
top priority due to other commitments. If someone is unable to
guarantee that they'll have time to do something, they shouldn't be
assigned something so time critical.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Dump processes seem to be dead

2009-02-23 Thread Christian Storm
Ariel,

Thank you for giving some insight into what has been going on behind
the scenes. I have a few questions that will hopefully get some answers
to those of us eager to help out in any way we can.

What are the planned code changes to speed the process up? Can we help
this volunteer with the coding or architectural decisions? How much time
do they have to dedicate to it? Some visibility into the fix and timeline
would benefit a lot of us. It would also help us know how we can help out!

Thanks again for shedding some light on the issue.

On Feb 22, 2009, at 8:12 PM, Ariel T. Glenn wrote:

> The reason these dumps are not rewritten more efficiently is that this
> job was handed to me (at my request) and I have not been able to get  
> to
> it, even though it is the first thing on my list for development work.
> So, if there are going to be rants, they can be directed at me, not at
> the whole team.
>
> The work was started already by a volunteer.  As I am the blocking
> factor, someone else should probably take it on and get it done,  
> though
> it will make me sad.  Brion discussed this with me about a week and a
> half ago and I still wanted to keep it then but it doesn't make sense.
> The in-office needs that I am also responsible for take virtually  
> all of
> my time.  Perhaps they shouldn't, but that is how it has worked out.
>
> So, I am very sorry for having needlessly held things up.  (I also  
> have
> aa crawler that requests pages changed since the latest xml dump, so
> that projects I am on can keep a current xml file; we've been running
> that way for at least a year.)
>
> Ariel
>
>
>> On Monday 23-02-2009, at 00:37 +0100, Gerard Meijssen wrote:
>> Hoi,
>> There have been previous offers for developer time and for  
>> hardware...
>> Thanks,
>>   GerardM
>>
>> 2009/2/23 Platonides 
>>
>>> Robert Ullmann wrote:
>>>> Hi,
>>>>
>>>> Maybe I should offer a constructive suggestion?
>>>
>>> They are better than rants :)
>>>
>>>> Clearly, trying to do these dumps (particularly "history" dumps) as it
>>>> is being done from the servers is proving hard to manage
>>>>
>>>> I also realize that you can't just put the set of daily
>>>> permanent-media backups on line, as they contain lots of user info,
>>>> plus deleted and oversighted revs, etc.
>>>>
>>>> But would it be possible to put each backup disc (before sending one
>>>> of the several copies off to its secure storage) in a machine that
>>>> would filter all the content into a public file (or files)? Then
>>>> someone else could download each disc (i.e. a 10-15 GB chunk of
>>>> updates) and sort it into the useful files for general download?
 updates) and sort it into the useful files for general download?
>>>
>>> I don't think they move backup copies off to secure storage. They  
>>> have
>>> the db replicated and the backup discs would be copies of that same
>>> dumps. (Some sysadmin to confirm?)
>>>
>>>> Then someone can produce a current (for example) English 'pedia XML
>>>> file; and with more work the cumulative history files (if we want that
>>>> as one file).
>>>>
>>>> There would be delays, each of your permanent media backup discs has
>>>> to be (probably manually, but changers are available) loaded on the
>>>> "filter" system, and I don't know how many discs WMF generates per
>>>> day. (;-) and then it has to filter all the revision data etc. But it
>>>> still would easily be available for others in 48-72 hours, which beats
>>>> the present ~6 weeks when the dumps are working.
>>>>
>>>> No shortage of people with a box or two and any number of Tbyte hard
>>>> drives that might be willing to help, if they can get the raw backups.
>>>
>>> The problem is that WMF can't provide that raw unfiltered  
>>> information.
>>> Perhaps you could donate a box on the condition that it could only  
>>> be
>>> used for dump processing, but giving out unfiltered data would be  
>>> too
>>> risky.
>>>
>>>
>>>
>>> ___
>>> Wikitech-l mailing list
>>> Wikitech-l@lists.wikimedia.org
>>> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>>>
>> ___
>> Wikitech-l mailing list
>> Wikitech-l@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>
>
> ___
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l


___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Dump processes seem to be dead

2009-02-23 Thread Christian Storm
Thanks for the update Russell!

On Feb 23, 2009, at 10:04 AM, Russell Blau wrote:

> "Russell Blau"  wrote in message
> news:gnuacf$hf...@ger.gmane.org...
>>
>> I have to second this.  I tried to report this outage several times  
>> last
>> week - on IRC, on this mailing list, and on Bugzilla.  All reports  
>> -- NOT
>> COMPLAINTS, JUST REPORTS -- were met with absolute silence.
>
> Two updates on this.
>
> 1)  Brion did respond to the Bugzilla report (albeit two+ days after  
> it was
> posted), which I overlooked when posting earlier.  He said "The box  
> they
> were running on (srv31) is dead. We'll reassign them over the  
> weekend if we
> can't bring the box back up."
>
> 2)  Within the last hour, the server log at
> http://wikitech.wikimedia.org/wiki/Server_admin_log indicates that  
> Rob found
> and fixed the cause of srv31 (and srv32-34) being down -- a circuit  
> breaker
> was tripped in the data center.
>
> Russ
>
>
>
>
>
> ___
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l


___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Dump processes seem to be dead

2009-02-23 Thread Russell Blau
"Russell Blau"  wrote in message 
news:gnuacf$hf...@ger.gmane.org...
>
> I have to second this.  I tried to report this outage several times last 
> week - on IRC, on this mailing list, and on Bugzilla.  All reports -- NOT 
> COMPLAINTS, JUST REPORTS -- were met with absolute silence.

Two updates on this.

1)  Brion did respond to the Bugzilla report (albeit two+ days after it was 
posted), which I overlooked when posting earlier.  He said "The box they 
were running on (srv31) is dead. We'll reassign them over the weekend if we 
can't bring the box back up."

2)  Within the last hour, the server log at 
http://wikitech.wikimedia.org/wiki/Server_admin_log indicates that Rob found 
and fixed the cause of srv31 (and srv32-34) being down -- a circuit breaker 
was tripped in the data center.

Russ





___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Dump processes seem to be dead

2009-02-23 Thread Russell Blau
"Lars Aronsson"  wrote in message 
news:pine.lnx.4.64.0902231202140.1...@localhost.localdomain...
>
> However, quite independent of your development work, the current
> system for dumps seems to have stopped on February 12. That's the
> impression I get from looking at
> http://download.wikimedia.org/backup-index.html
>
> Despite all its shortcomings (3-4 weeks between dumps, no history
> dumps for en.wikipedia), the current dump system is very useful.
> What's not useful is that it was out of service from July to
> October 2008 and now again appears to be broken since February 12.
>
...
> Still today, February 23, no explanation has been posted on that
> dump website or on these mailing lists. That's the real surprise.

I have to second this.  I tried to report this outage several times last 
week - on IRC, on this mailing list, and on Bugzilla.  All reports -- NOT 
COMPLAINTS, JUST REPORTS -- were met with absolute silence.  I fully 
understand that time and resources are limited, and not everything can be 
fixed immediately, but at least some acknowledgement of the reports would be 
appreciated.  It is extremely disheartening to members of the user community 
of what is supposed to be a collaborative project when attempts to 
contribute by reporting a service outage are ignored.

Russ




___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Dump processes seem to be dead

2009-02-23 Thread Lars Aronsson

Ariel T. Glenn wrote:

>> The reason these dumps are not rewritten more efficiently is 
>> that this job was handed to me (at my request) and I have not 
>> been able to get to it, even though it is the first thing on my 
>> list for development work.
>> [...]
>> The in-office needs that I am also responsible for take 
>> virtually all of my time.  Perhaps they shouldn't, but that is 
>> how it has worked out.

Hi Ariel, I hope you find the time and peace you need for this 
development. It might be a bit worrying if this was handed to you 
(by Brion? when?) without also handing you the necessary 
resources. But the internal organization there is not my task.

However, quite independent of your development work, the current 
system for dumps seems to have stopped on February 12. That's the 
impression I get from looking at 
http://download.wikimedia.org/backup-index.html

Despite all its shortcomings (3-4 weeks between dumps, no history 
dumps for en.wikipedia), the current dump system is very useful.  
What's not useful is that it was out of service from July to 
October 2008 and now again appears to be broken since February 12.

Certainly, things do fail. But when they do, and I ask about this 
on #wikimedia-tech on February 20, a week after things stopped, I 
don't expect Brion to say "oops".  I want him to know about it 12 
hours after it happened and to have a plan.  Apparently (I'm just 
guessing from what I hear), srv31 is broken and srv31 was not in 
the Nagios watchdog system.  OK, will this be fixed?  When?

Still today, February 23, no explanation has been posted on that 
dump website or on these mailing lists. That's the real surprise.

I have other issues I want to deal with: mapping extensions, new 
visionsary solutions, new ways to involve new people in creating 
free knowledge. But if basic planning, routines and resource 
allocation don't work inside the WMF, then we have to start with 
the basics.  What's wrong there?  How can it be helped?


-- 
  Lars Aronsson (l...@aronsson.se)
  Aronsson Datateknik - http://aronsson.se

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Dump processes seem to be dead

2009-02-23 Thread Marco Schuster
2009/2/22 Robert Ullmann :
> Want everyone to just dynamically crawl the live DB, with whatever
> screwy lousy inefficiency? FIne, just continue as you are, where that
> is all that can be relied upon!

Even if you had the dumps, you have another problem: they're
incredibly big and so a bit difficult to parse. So, a small suggestion
if the dumps are ever working again: split the history and current
db stuff by alphabet, please.

Marco

PS: Are there any measurements of what traffic is generated by people who
download the dumps? Have there been any attempts to distribute them
via BitTorrent?
-- 
VMSoft GbR
Nabburger Str. 15
81737 München
Geschäftsführer: Marco Schuster, Volker Hemmert
http://vmsoft-gbr.de

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Dump processes seem to be dead

2009-02-23 Thread Nicolas Dumazet
yep, http://svn.wikimedia.org/viewvc/mediawiki/trunk/backup/ +)

2009/2/23 Alex :
> Ariel T. Glenn wrote:
>> The reason these dumps are not rewritten more efficiently is that this
>> job was handed to me (at my request) and I have not been able to get to
>> it, even though it is the first thing on my list for development work.
>> So, if there are going to be rants, they can be directed at me, not at
>> the whole team.
>>
>> The work was started already by a volunteer.  As I am the blocking
>> factor, someone else should probably take it on and get it done, though
>> it will make me sad.  Brion discussed this with me about a week and a
>> half ago and I still wanted to keep it then but it doesn't make sense.
>> The in-office needs that I am also responsible for take virtually all of
>> my time.  Perhaps they shouldn't, but that is how it has worked out.
>>
>> So, I am very sorry for having needlessly held things up.  (I also have
>> aa crawler that requests pages changed since the latest xml dump, so
>> that projects I am on can keep a current xml file; we've been running
>> that way for at least a year.)
>>
>
> Is the source for the new dump system on SVN somewhere?
>
> --
> Alex (wikipedia:en:User:Mr.Z-man)
>
> ___
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>



-- 
Nicolas Dumazet — NicDumZ [ nɪk.d̪ymz ]

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Dump processes seem to be dead

2009-02-22 Thread Alex
Ariel T. Glenn wrote:
> The reason these dumps are not rewritten more efficiently is that this
> job was handed to me (at my request) and I have not been able to get to
> it, even though it is the first thing on my list for development work.
> So, if there are going to be rants, they can be directed at me, not at
> the whole team. 
> 
> The work was started already by a volunteer.  As I am the blocking
> factor, someone else should probably take it on and get it done, though
> it will make me sad.  Brion discussed this with me about a week and a
> half ago and I still wanted to keep it then but it doesn't make sense.
> The in-office needs that I am also responsible for take virtually all of
> my time.  Perhaps they shouldn't, but that is how it has worked out.  
> 
> So, I am very sorry for having needlessly held things up.  (I also have
> aa crawler that requests pages changed since the latest xml dump, so
> that projects I am on can keep a current xml file; we've been running
> that way for at least a year.)
> 

Is the source for the new dump system on SVN somewhere?

-- 
Alex (wikipedia:en:User:Mr.Z-man)

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Dump processes seem to be dead

2009-02-22 Thread Ariel T. Glenn
The reason these dumps are not rewritten more efficiently is that this
job was handed to me (at my request) and I have not been able to get to
it, even though it is the first thing on my list for development work.
So, if there are going to be rants, they can be directed at me, not at
the whole team. 

The work was started already by a volunteer.  As I am the blocking
factor, someone else should probably take it on and get it done, though
it will make me sad.  Brion discussed this with me about a week and a
half ago and I still wanted to keep it then but it doesn't make sense.
The in-office needs that I am also responsible for take virtually all of
my time.  Perhaps they shouldn't, but that is how it has worked out.  

So, I am very sorry for having needlessly held things up.  (I also have
a crawler that requests pages changed since the latest xml dump, so
that projects I am on can keep a current xml file; we've been running
that way for at least a year.)

Ariel


On Monday 23-02-2009, at 00:37 +0100, Gerard Meijssen wrote:
> Hoi,
> There have been previous offers for developer time and for hardware...
> Thanks,
>GerardM
> 
> 2009/2/23 Platonides 
> 
> > Robert Ullmann wrote:
> > > Hi,
> > >
> > > Maybe I should offer a constructive suggestion?
> >
> > They are better than rants :)
> >
> > > Clearly, trying to do these dumps (particularly "history" dumps) as it
> > > is being done from the servers is proving hard to manage
> > >
> > > I also realize that you can't just put the set of daily
> > > permanent-media backups on line, as they contain lots of user info,
> > > plus deleted and oversighted revs, etc.
> > >
> > > But would it be possible to put each backup disc (before sending one
> > > of the several copies off to its secure storage) in a machine that
> > > would filter all the content into a public file (or files)? Then
> > > someone else could download each disc (i.e. a 10-15 GB chunk of
> > > updates) and sort it into the useful files for general download?
> >
> > I don't think they move backup copies off to secure storage. They have
> > the db replicated and the backup discs would be copies of that same
> > dumps. (Some sysadmin to confirm?)
> >
> > > Then someone can produce a current (for example) English 'pedia XML
> > > file; and with more work the cumulative history files (if we want that
> > > as one file).
> > >
> > > There would be delays, each of your permanent media backup discs has
> > > to be (probably manually, but changers are available) loaded on the
> > > "filter" system, and I don't know how many discs WMF generates per
> > > day. (;-) and then it has to filter all the revision data etc. But it
> > > still would easily be available for others in 48-72 hours, which beats
> > > the present ~6 weeks when the dumps are working.
> > >
> > > No shortage of people with a box or two and any number of Tbyte hard
> > > drives that might be willing to help, if they can get the raw backups.
> >
> > The problem is that WMF can't provide that raw unfiltered information.
> > Perhaps you could donate a box on the condition that it could only be
> > used for dump processing, but giving out unfiltered data would be too
> > risky.
> >
> >
> >
> > ___
> > Wikitech-l mailing list
> > Wikitech-l@lists.wikimedia.org
> > https://lists.wikimedia.org/mailman/listinfo/wikitech-l
> >
> ___
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l


___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Dump processes seem to be dead

2009-02-22 Thread River Tarnell

Robert Ullmann:
> What is with this?

wrong list.  the Foundation needs to allocate the resources to fix dumps.  it
hasn't done so, therefore dumps are still broken.  perhaps you might ask the
Foundation why dumps have such a low priority.

- river.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Dump processes seem to be dead

2009-02-22 Thread Gerard Meijssen
Hoi,
There have been previous offers for developer time and for hardware...
Thanks,
   GerardM

2009/2/23 Platonides 

> Robert Ullmann wrote:
> > Hi,
> >
> > Maybe I should offer a constructive suggestion?
>
> They are better than rants :)
>
> > Clearly, trying to do these dumps (particularly "history" dumps) as it
> > is being done from the servers is proving hard to manage
> >
> > I also realize that you can't just put the set of daily
> > permanent-media backups on line, as they contain lots of user info,
> > plus deleted and oversighted revs, etc.
> >
> > But would it be possible to put each backup disc (before sending one
> > of the several copies off to its secure storage) in a machine that
> > would filter all the content into a public file (or files)? Then
> > someone else could download each disc (i.e. a 10-15 GB chunk of
> > updates) and sort it into the useful files for general download?
>
> I don't think they move backup copies off to secure storage. They have
> the db replicated and the backup discs would be copies of that same
> dumps. (Some sysadmin to confirm?)
>
> > Then someone can produce a current (for example) English 'pedia XML
> > file; and with more work the cumulative history files (if we want that
> > as one file).
> >
> > There would be delays, each of your permanent media backup discs has
> > to be (probably manually, but changers are available) loaded on the
> > "filter" system, and I don't know how many discs WMF generates per
> > day. (;-) and then it has to filter all the revision data etc. But it
> > still would easily be available for others in 48-72 hours, which beats
> > the present ~6 weeks when the dumps are working.
> >
> > No shortage of people with a box or two and any number of Tbyte hard
> > drives that might be willing to help, if they can get the raw backups.
>
> The problem is that WMF can't provide that raw unfiltered information.
> Perhaps you could donate a box on the condition that it could only be
> used for dump processing, but giving out unfiltered data would be too
> risky.
>
>
>
> ___
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Dump processes seem to be dead

2009-02-22 Thread Platonides
Robert Ullmann wrote:
> Hi,
> 
> Maybe I should offer a constructive suggestion?

They are better than rants :)

> Clearly, trying to do these dumps (particularly "history" dumps) as it
> is being done from the servers is proving hard to manage
> 
> I also realize that you can't just put the set of daily
> permanent-media backups on line, as they contain lots of user info,
> plus deleted and oversighted revs, etc.
> 
> But would it be possible to put each backup disc (before sending one
> of the several copies off to its secure storage) in a machine that
> would filter all the content into a public file (or files)? Then
> someone else could download each disc (i.e. a 10-15 GB chunk of
> updates) and sort it into the useful files for general download?

I don't think they move backup copies off to secure storage. They have
the db replicated and the backup discs would be copies of that same
dumps. (Some sysadmin to confirm?)

> Then someone can produce a current (for example) English 'pedia XML
> file; and with more work the cumulative history files (if we want that
> as one file).
> 
> There would be delays, each of your permanent media backup discs has
> to be (probably manually, but changers are available) loaded on the
> "filter" system, and I don't know how many discs WMF generates per
> day. (;-) and then it has to filter all the revision data etc. But it
> still would easily be available for others in 48-72 hours, which beats
> the present ~6 weeks when the dumps are working.
> 
> No shortage of people with a box or two and any number of Tbyte hard
> drives that might be willing to help, if they can get the raw backups.

The problem is that WMF can't provide that raw unfiltered information.
Perhaps you could donate a box on the condition that it could only be
used for dump processing, but giving out unfiltered data would be too risky.



___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Dump processes seem to be dead

2009-02-22 Thread Robert Ullmann
Hi,

Maybe I should offer a constructive suggestion?

Clearly, trying to do these dumps (particularly "history" dumps) as it
is being done from the servers is proving hard to manage

I also realize that you can't just put the set of daily
permanent-media backups on line, as they contain lots of user info,
plus deleted and oversighted revs, etc.

But would it be possible to put each backup disc (before sending one
of the several copies off to its secure storage) in a machine that
would filter all the content into a public file (or files)? Then
someone else could download each disc (i.e. a 10-15 GB chunk of
updates) and sort it into the useful files for general download?

Then someone can produce a current (for example) English 'pedia XML
file; and with more work the cumulative history files (if we want that
as one file).

There would be delays, each of your permanent media backup discs has
to be (probably manually, but changers are available) loaded on the
"filter" system, and I don't know how many discs WMF generates per
day. (;-) and then it has to filter all the revision data etc. But it
still would easily be available for others in 48-72 hours, which beats
the present ~6 weeks when the dumps are working.

No shortage of people with a box or two and any number of Tbyte hard
drives that might be willing to help, if they can get the raw backups.

Best,
Robert

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Dump processes seem to be dead

2009-02-22 Thread Robert Ullmann
What is with this? Why are the XML dumps (the primary product of the
projects: re-usable content) the absolute effing lowest possible
effing priority? Why?

I just finished (I thought) putting together some new software to
update iwikis on the wiktionaries. It is set up to read the
"langlinks" and "all-titles" part of the dumps. Just as I do that, the
dumps fall down. Again. And no-one cares one whit; not even a reply
here. (The bug was replied to after 4 days, and *might* be fixed
presently, after 9 days?)

My course of action now is to write new code to use thousands of API
calls to get the information, albeit as efficiently as I can. When I
do that, the chance that it will ever go back to using the dumps is a
very close approximation to zero. After all, it will work somewhat
better that way.

Other people, *many*, *many*, other people are being *forced* to do
the same, to maintain their apps and functions based on the WMF data.
And there is no chance in hell they will go back to the dump "service"
either.

Brion, Tim, et al: you are worried about overall server load? Get the
dumps working. This morning. And make it crystal clear that they will
not break, and you will be checking them n times a day and they can be
utterly, totally, absolutely relied upon.

It's like that. People will use what *works*.

Want people to use the dumps? Make them WORK.

Want everyone to just dynamically crawl the live DB, with whatever
screwy lousy inefficiency? FIne, just continue as you are, where that
is all that can be relied upon!

Look at the other threads: people asking if they can crawl the English
WP at one per second, or maybe what?  Is that what you want? That is
what you are telling people to do, when the dump "service" says
"2009-02-12 06:52:16 pswiki: Dump in progress" at the top on the 22nd
of February.

FYI for all others: if you want content dumps of the English
Wiktionary, they are available in the usual XML format at

http://devtionary.info/w/dump/xmlu/

at ~ 09;00 UTC. Every day.

With my best regards,
Robert

On Tue, Feb 17, 2009 at 7:35 PM, Russell Blau  wrote:
> "Andreas Meier"  wrote in message
> news:4997d645.8050...@gmx.de...
>> Hello,
>>
>> the current dump building seems to be dead and perhaps should be killed
>> by hand.
>>
>
> Reported: https://bugzilla.wikimedia.org/show_bug.cgi?id=17535
>
>
>
>
> ___
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Dump processes seem to be dead

2009-02-17 Thread Russell Blau
"Andreas Meier"  wrote in message 
news:4997d645.8050...@gmx.de...
> Hello,
>
> the current dump building seems to be dead and perhaps should be killed
> by hand.
>

Reported: https://bugzilla.wikimedia.org/show_bug.cgi?id=17535




___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


[Wikitech-l] Dump processes seem to be dead

2009-02-15 Thread Andreas Meier
Hello,

the current dump building seems to be dead and perhaps should be killed 
by hand.

Best regards

Andim

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l