Re: [Wikitech-l] You too can clean out the tons of database Default Messages

2009-03-26 Thread Marco Schuster
On Thu, Mar 26, 2009 at 7:02 AM, Gerard Meijssen wrote: > Hoi, > I admire your wish for cleaning up. My question is what are we talking > about. Is this about cluttering up disk space or are these messages in > memory. Apparently they are in the MySQL database... and the less data is present in it

[Wikitech-l] URL extraction from database

2009-03-26 Thread Andreas Rindler
Hi, we are trying to extract all URLs in wiki articles from our MediaWiki installation. We have tried grep, Perl and sed on MySQL dumps, but it is very difficult to get the URLs only, without some garbage/text/comments before or after them. Does anyone know of a better way to achieve this? Thanks
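
[Editor's note: the replies below are truncated before their actual suggestions, so the following is an editorial sketch only, not the thread's answer. A common alternative to regexing dumps is to read MediaWiki's externallinks table, which stores one row per external link. Host, credentials and database name are placeholders; adjust for your installation, including any $wgDBprefix.]

<?php
// Sketch (assumed approach): list every external URL MediaWiki has
// recorded, one per line, with no surrounding wikitext to strip off.
$db = new mysqli( 'localhost', 'wikiuser', 'secret', 'wikidb' );
if ( $db->connect_error ) {
    die( 'Connection failed: ' . $db->connect_error );
}
$res = $db->query( 'SELECT DISTINCT el_to FROM externallinks' );
while ( $row = $res->fetch_assoc() ) {
    echo $row['el_to'], "\n";
}
$db->close();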

Re: [Wikitech-l] URL extraction from database

2009-03-26 Thread Bryan Tong Minh
On Thu, Mar 26, 2009 at 12:54 PM, Andreas Rindler wrote: > Hi, > we are trying to extract all URLs in wiki articles from our MediaWiki > installation. We have tried grep, Perl and sed on MySQL dumps, but it > is very difficult to get the URLs only, without some > garbage/text/comments before or af

Re: [Wikitech-l] Help with...different...Wiki request solutions

2009-03-26 Thread Christensen, Courtney
On Wed, Mar 25, 2009 at 8:43 PM, David Di Biase wrote: > I might not have been too articulate in my question. I got it to work with > the organisation, I'm just wondering how do I get it to display as > "Einstein, Albert" on the category name? Hi Dave, How do you feel about writing PHP? You cou

Re: [Wikitech-l] Mailing lists problems

2009-03-26 Thread Tim Landscheidt
Brion Vibber wrote: >> Perhaps the auto-rejection text message could be edited to add a >> suggestion to check this? >> Most people will not check it anyway, but if this is normal, it seems >> a good idea. > On most lists we discard, not reject, for unsubscribed messages -- > sending a mail bac

Re: [Wikitech-l] DBMS where join+limit works?

2009-03-26 Thread Tim Landscheidt
Domas Mituzas wrote: > [...] >> Does anyone know a DBMS where joining with limits actually works? > *shrug*, maybe PG does it properly, maybe MySQL 5.x does it properly, > maybe sqlite does it properly. > [...] BTW, is it feasible to replicate from MySQL to PostgreSQL (or SQLite :-)) in a conti
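
[Editor's note: for readers without context, the query shape under discussion looks roughly like the sketch below -- an illustrative example against MediaWiki's page and revision tables, not a query taken from the thread. The complaint is that many engines materialize the whole join before applying the limit instead of stopping after 50 rows.]

<?php
// Illustrative join+limit pattern (assumed example):
$sql = "SELECT p.page_title, r.rev_timestamp
        FROM page AS p
        JOIN revision AS r ON r.rev_page = p.page_id
        ORDER BY r.rev_timestamp DESC
        LIMIT 50";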

Re: [Wikitech-l] You too can clean out the tons of database Default Messages

2009-03-26 Thread Tim Landscheidt
Marco Schuster wrote: >> I admire your wish for cleaning up. My question is what are we talking >> about. Is this about cluttering up disk space or are these messages in >> memory. > Apparently they are in the MySQL database... and the less data is > present in it, the better, I think. First rul

Re: [Wikitech-l] You too can clean out the tons of database Default Messages

2009-03-26 Thread Aryeh Gregor
On Thu, Mar 26, 2009 at 2:02 AM, Gerard Meijssen wrote: > I admire your wish for cleaning up. My question is what are we talking > about. Is this about cluttering up disk space or are these messages in > memory. They use some disk space, which should be negligible for most people.

Re: [Wikitech-l] URL extraction from database

2009-03-26 Thread Roan Kattouw
Andreas Rindler wrote: > Hi, > we are trying to extract all URLs in wiki articles from our MediaWiki > installation. We have tried grep, Perl and sed on MySQL dumps, but it > is very difficult to get the URLs only, without some > garbage/text/comments before or after them. > > Does anyone know o

Re: [Wikitech-l] On extension SVN revisions in Special:Version

2009-03-26 Thread Brion Vibber
On 3/25/09 5:16 PM, Roan Kattouw wrote: > I'm guessing this may be because the new file was added after r37404, > but the file registering the extension (and providing the revision > number) wasn't changed at that time, which means the most recent > revision of *that file* is still r37404. Special:
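
[Editor's note: for context, the mechanism under discussion worked roughly like the sketch below -- a hedged reconstruction, not code from any specific extension. The setup file carries an SVN keyword that the server expands on commit, so the reported revision only moves when that one file changes.]

<?php
// Era-typical SVN revision stamping (assumed illustration). With the
// svn:keywords property set to "LastChangedRevision" on this file,
// SVN rewrites the string below on commit, e.g. to
// '$LastChangedRevision: 37404 $'. The pitfall described above: the
// number tracks only commits touching *this file*, so changes to the
// extension's other files leave it stale.
$wgExtensionCredits['other'][] = array(
    'name'    => 'ExampleExtension',   // hypothetical name
    'version' => '$LastChangedRevision$',
);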

Re: [Wikitech-l] On extension SVN revisions in Special:Version

2009-03-26 Thread Chad
On Thu, Mar 26, 2009 at 12:57 PM, Brion Vibber wrote: > On 3/25/09 5:16 PM, Roan Kattouw wrote: >> I'm guessing this may be because the new file was added after r37404, >> but the file registering the extension (and providing the revision >> number) wasn't changed at that time, which means the mos

Re: [Wikitech-l] On extension SVN revisions in Special:Version

2009-03-26 Thread Brion Vibber
On 3/26/09 10:40 AM, Chad wrote: > I've gone ahead and removed them from extensions in r48889 and removed > support for them in r48890. Developers should instead use version numbers > that make sense. We've had the 'version' parameter in $wgExtensionCredits > since pretty much forever, use that in
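
[Editor's note: the 'version' parameter Chad refers to is set in the extension's credits entry; a minimal sketch, with all names hypothetical:]

<?php
// Hand-maintained version number in $wgExtensionCredits
// (placeholder extension, not a real one):
$wgExtensionCredits['other'][] = array(
    'name'        => 'ExampleExtension',
    'author'      => 'A. Developer',
    'version'     => '1.2.0',
    'url'         => 'http://www.mediawiki.org/wiki/Extension:ExampleExtension',
    'description' => 'Illustrates the version parameter.',
);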

Re: [Wikitech-l] On extension SVN revisions in Special:Version

2009-03-26 Thread Gerard Meijssen
Hoi, I understand how revisions work. How do you use version numbers with SVN, how are extensions to be supported in combination with SVN? Thanks, GerardM 2009/3/26 Chad > On Thu, Mar 26, 2009 at 12:57 PM, Brion Vibber > wrote: > > On 3/25/09 5:16 PM, Roan Kattouw wrote: > >> I'm guessing

[Wikitech-l] parallel bzip2 (de)compression of the dump

2009-03-26 Thread ERSEK Laszlo
Hi, after reading the following sections: http://wikitech.wikimedia.org/view/Data_dump_redesign#Follow_up http://en.wikipedia.org/wiki/Wikipedia_database#Dealing_with_compressed_files http://meta.wikimedia.org/wiki/Data_dumps#bzip2 http://www.mediawiki.org/wiki/Mwdumper#Usage http://www.mediawiki

Re: [Wikitech-l] On extension SVN revisions in Special:Version

2009-03-26 Thread Brion Vibber
On 3/26/09 10:46 AM, Gerard Meijssen wrote: > Hoi, > I understand how revisions work. How do you use version numbers with SVN, > how are extensions to be supported in combination with SVN ? To get useful SVN version information on the extensions you need two things: 1) The branch (trunk, or a bra

Re: [Wikitech-l] On extension SVN revisions in Special:Version

2009-03-26 Thread Chad
A standardized "version" file could go a long way to solving this issue. Could maybe make them auto-generated with an on-commit hook? -Chad On Mar 26, 2009 2:17 PM, "Brion Vibber" wrote: On 3/26/09 10:46 AM, Gerard Meijssen wrote: > Hoi, > I understand how revisions work. How do you use... To g

Re: [Wikitech-l] On extension SVN revisions in Special:Version

2009-03-26 Thread Brion Vibber
On 3/26/09 11:20 AM, Chad wrote: > A standardized "version" file could go a long way to > solving this issue. Could maybe make them > auto-generated with an on-commit hook? Hmmm, possible. -- brion

Re: [Wikitech-l] parallel bzip2 (de)compression of the dump

2009-03-26 Thread Ilmari Karonen
ERSEK Laszlo wrote: > ** 4. Thanassis Tsiodras' offline reader, available under > > http://users.softlab.ece.ntua.gr/~ttsiod/buildWikipediaOffline.html > > uses, according to section "Seeking in the dump file", bzip2recover to > split the bzip2 blocks out of the single bzip2 stream. The page sta

Re: [Wikitech-l] Help with...different...Wiki request solutions

2009-03-26 Thread Ilmari Karonen
Christensen, Courtney wrote: > On Wed, Mar 25, 2009 at 8:43 PM, David Di Biase > wrote: >> I might not have been too articulate in my question. I got it to work with >> the organisation, I'm just wondering how do I get it to display as >> "Einstein, Albert" on the category name? > > How do you f

Re: [Wikitech-l] parallel bzip2 (de)compression of the dump

2009-03-26 Thread Robert Rohde
On Thu, Mar 26, 2009 at 12:09 PM, Ilmari Karonen wrote: > ERSEK Laszlo wrote: >> ** 4. Thanassis Tsiodras' offline reader, available under >> >> http://users.softlab.ece.ntua.gr/~ttsiod/buildWikipediaOffline.html >> >> uses, according to section "Seeking in the dump file", bzip2recover to >> split

Re: [Wikitech-l] parallel bzip2 (de)compression of the dump

2009-03-26 Thread Ilmari Karonen
Robert Rohde wrote: > On Thu, Mar 26, 2009 at 12:09 PM, Ilmari Karonen wrote: >> Hmm? Admittedly, I don't know the bzip2 format very well, but as far as >> I understand it, there should be no bit-shifting involved: each block in >> the stream is a completely independent, self-contained sequence o

Re: [Wikitech-l] On extension SVN revisions in Special:Version

2009-03-26 Thread Roan Kattouw
Brion Vibber wrote: > On 3/26/09 11:20 AM, Chad wrote: >> A standardized "version" file could go a long way to >> solving this issue. Could maybe make them >> auto-generated with an on-commit hook? > > Hmmm, possible. > That'd be preferable over not providing any version information at all, wh

Re: [Wikitech-l] Help with...different...Wiki request solutions

2009-03-26 Thread Aryeh Gregor
On Thu, Mar 26, 2009 at 3:24 PM, Ilmari Karonen wrote: > --- includes/CategoryPage.php   (revision 48416) > +++ includes/CategoryPage.php   (working copy) > @@ -189,7 +189,7 @@ >         */ >        function addPage( $title, $sortkey, $pageLength, $isRedirect = false ) > { >                global

[Wikitech-l] Google Summer of Code needs you... to mentor student projects!

2009-03-26 Thread Brion Vibber
We're a mentoring organization for the Google Summer of Code again this year, and we're dead set on making it our awesomest summer ever! One key thing though is making sure that students and potential students have access to a mentor who can answer their questions and just help steer them into

Re: [Wikitech-l] On extension SVN revisions in Special:Version

2009-03-26 Thread Sergey Chernyshev
Don't know if it'll solve the issue at hand, but we also have extension tags for a brave few: http://svn.wikimedia.org/svnroot/mediawiki/tags/extensions/ It definitely helped me to manage my installations (rolling back releases and so on). One thing I thought it might be helpful for is for ExtensionD

Re: [Wikitech-l] On extension SVN revisions in Special:Version

2009-03-26 Thread Chad
Was talking with Roan a bit earlier on IRC about this, and we got to thinking that an external version.php file could be defined per-extension. A new 'versionfile' param could be added to $wgExtensionCredits which points to the file. The version file could contain the numeric version, the svn versi
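
[Editor's note: nothing like this was standardized at the time; purely as a sketch of the proposal's shape, such a version.php might look like the following.]

<?php
// Hypothetical per-extension version file; neither this layout nor
// the proposed 'versionfile' credits key existed when this was written.
return array(
    'version' => '1.2.0',  // human-readable release number
    'svn'     => 48890,    // last SVN revision touching the extension
);

[The extension's setup file would then point at it with something like 'versionfile' => dirname( __FILE__ ) . '/version.php' in its $wgExtensionCredits entry.]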

Re: [Wikitech-l] parallel bzip2 (de)compression of the dump

2009-03-26 Thread Keisial
Ilmari Karonen wrote: > Robert Rohde wrote: >> On Thu, Mar 26, 2009 at 12:09 PM, Ilmari Karonen wrote: >>> Hmm? Admittedly, I don't know the bzip2 format very well, but as far as >>> I understand it, there should be no bit-shifting involved: each block in >>> the stream is a completely independen

Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008

2009-03-26 Thread Keisial
Tomasz Finc wrote: > I've started drafting some new ideas at > http://wikitech.wikimedia.org/view/Data_dump_redesign > > of the various problems that we're facing and what kind of job management > we can put around it. We're taking this on as a full "should have been > done 2 years ago" project a

Re: [Wikitech-l] DBMS where join+limit works?

2009-03-26 Thread Platonides
Tim Landscheidt wrote: > BTW, is it feasible to replicate from MySQL to PostgreSQL > (or SQLite :-)) in a continuous way so that one or two > cuckoo's eggs could be planted in the database farm? Though > probably undesirable from an operations viewpoint, a real-time performance benchmark would b

Re: [Wikitech-l] Mailing lists problems

2009-03-26 Thread Platonides
It could be rejected during the SMTP transaction. That would avoid backscatter while giving back meaningful messages.

Re: [Wikitech-l] You too can clean out the tons of database Default Messages

2009-03-26 Thread Platonides
Aryeh Gregor wrote: > On Thu, Mar 26, 2009 at 2:02 AM, Gerard Meijssen > wrote: >> I admire your wish for cleaning up. My question is what are we talking >> about. Is this about cluttering up disk space or are these messages in >> memory. > > They use some disk space, which should be negligible f

Re: [Wikitech-l] You too can clean out the tons of database Default Messages

2009-03-26 Thread Aryeh Gregor
On Thu, Mar 26, 2009 at 7:05 PM, Platonides wrote: > All entries at NS_MEDIAWIKI are loaded from db (or memcached) on each > page request. It's a tiny amount, but I'd be wary of things moved so often. Only pages that exist will be loaded. The pages in question have been deleted and therefore wil

Re: [Wikitech-l] parallel bzip2 (de)compression of the dump

2009-03-26 Thread ERSEK Laszlo
On 03/26/09 20:30, Ilmari Karonen wrote: > The Wikipedia article (what else?) on the format says the blocks are > padded to byte boundaries, and some quick testing seems to support that. http://en.wikipedia.org/wiki/Bzip2#File_format The compressed blocks are bit-aligned and no padding
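
[Editor's note: to make the bit-alignment point concrete, here is a toy scanner -- written for this digest, not code from the thread -- that hunts for the 48-bit block magic 0x314159265359 at every bit offset. It shares bzip2recover's caveat: the pattern can also occur by chance inside compressed data, so hits are only candidates.]

<?php
// Toy bzip2 block finder (illustrative; memory-hungry on real dumps).
// Blocks are bit-aligned, so the magic must be searched for at every
// bit offset, not just on byte boundaries.
$raw = file_get_contents( 'sample.bz2' );

// Expand the file into a string of '0'/'1' characters.
$bits = '';
for ( $i = 0, $n = strlen( $raw ); $i < $n; $i++ ) {
    $bits .= str_pad( decbin( ord( $raw[$i] ) ), 8, '0', STR_PAD_LEFT );
}

// 0x314159265359 (the block-header magic), bit by bit.
$magic = '001100010100000101011001001001100101001101011001';

$pos = 0;
while ( ( $pos = strpos( $bits, $magic, $pos ) ) !== false ) {
    printf( "possible block start at bit %d (byte %d + %d bits)\n",
        $pos, (int)( $pos / 8 ), $pos % 8 );
    $pos++;  // keep scanning past this candidate
}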

Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008

2009-03-26 Thread Brion Vibber
On 3/26/09 3:25 PM, Keisial wrote: > Quite interesting. Can the images at office.wikimedia.org be moved to > somewhere public? I've copied those two to the public wiki. :) >> Decompression takes as long as compression with bzip2 > I think decompression is *faster* than compression > http://tukaan

Re: [Wikitech-l] parallel bzip2 (de)compression of the dump

2009-03-26 Thread Brion Vibber
On 3/26/09 12:30 PM, Ilmari Karonen wrote: > Robert Rohde wrote: >> On Thu, Mar 26, 2009 at 12:09 PM, Ilmari Karonen wrote: >>> Hmm? Admittedly, I don't know the bzip2 format very well, but as far as >>> I understand it, there should be no bit-shifting involved: each block in >>> the stream is a

Re: [Wikitech-l] parallel bzip2 (de)compression of the dump

2009-03-26 Thread Brion Vibber
On 3/26/09 10:58 AM, ERSEK Laszlo wrote: > ** 1. If the export process uses dbzip2 to compress the dump, and dbzip2's > MO is to compress input blocks independently, then to bit-shift the > resulting compressed blocks (= single-block bzip2 streams) back into a > single multi-block bzip2 stream, so

Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008

2009-03-26 Thread ERSEK Laszlo
On 03/27/09 01:14, Brion Vibber wrote: > LZMA is nice and fast to decompress... but *insanely* slower to > compress, and doesn't seem as parallelizable. :( The xz file format should allow for "easy" parallelization, both when compressing and decompressing; see http://tukaani.org/xz/xz-file-for

Re: [Wikitech-l] Help with...different...Wiki request solutions

2009-03-26 Thread Ilmari Karonen
Aryeh Gregor wrote: > On Thu, Mar 26, 2009 at 3:24 PM, Ilmari Karonen wrote: >> --- includes/CategoryPage.php (revision 48416) >> +++ includes/CategoryPage.php (working copy) >> @@ -189,7 +189,7 @@ >> */ >>function addPage( $title, $sortkey, $pageLength, $isRedirect = false >>

Re: [Wikitech-l] parallel bzip2 (de)compression of the dump

2009-03-26 Thread Ilmari Karonen
Brion Vibber wrote: > On 3/26/09 12:30 PM, Ilmari Karonen wrote: >> The Wikipedia article (what else?) on the format says the blocks are >> padded to byte boundaries, and some quick testing seems to support that. > > That is a filthy lie. :) > > There is indeed no byte padding between blocks; it

Re: [Wikitech-l] Help with...different...Wiki request solutions

2009-03-26 Thread Aryeh Gregor
On Thu, Mar 26, 2009 at 9:15 PM, Ilmari Karonen wrote: > Hmm, you're right, it does -- I didn't realize the title was used > unescaped.  That looks uncomfortably close to an XSS vulnerability > anyway.  I'd feel a lot more comfortable with a htmlspecialchars() in > there.  (Didn't we use to allow
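
[Editor's note: piecing the truncated diff and this reply together, the change under discussion amounts to something like the following inside CategoryPage.php -- a hedged reconstruction against the old Skin API, not the verbatim patch.]

<?php
// Hedged sketch: show the raw sortkey ("Einstein, Albert") as the
// category link text, escaped per the XSS concern raised above.
function addPage( $title, $sortkey, $pageLength, $isRedirect = false ) {
    global $wgContLang;
    $this->articles[] = $this->getSkin()->makeKnownLinkObj(
        $title,
        htmlspecialchars( $wgContLang->convert( $sortkey ) )
    );
}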

Re: [Wikitech-l] parallel bzip2 (de)compression of the dump

2009-03-26 Thread ERSEK Laszlo
On 03/27/09 02:21, Ilmari Karonen wrote: > Brion Vibber wrote: >> On 3/26/09 12:30 PM, Ilmari Karonen wrote: >>> The Wikipedia article (what else?) on the format says the blocks are >>> padded to byte boundaries, and some quick testing seems to support that. >> That is a filthy lie. :) >> >> There

Re: [Wikitech-l] Enwiki dump crawling since 10/15/2008

2009-03-26 Thread Anthony
On Thu, Mar 26, 2009 at 8:51 PM, ERSEK Laszlo wrote: > On 03/27/09 01:14, Brion Vibber wrote: > > > LZMA is nice and fast to decompress... but *insanely* slower to > > compress, and doesn't seem as parallelizable. :( > > The xz file format should allow for "easy" parallelization, both when > comp

Re: [Wikitech-l] parallel bzip2 (de)compression of the dump

2009-03-26 Thread Anthony
On Thu, Mar 26, 2009 at 3:09 PM, Ilmari Karonen wrote: > ERSEK Laszlo wrote: > > ** 4. Thanassis Tsiodras' offline reader, available under > > > > http://users.softlab.ece.ntua.gr/~ttsiod/buildWikipediaOffline.html > > > > use