Re: [Wikitech-l] Offline wiki tools

2010-12-16 Thread emmanuel
On Thu, 16 Dec 2010 07:50:56 +0200, Andrew Dunbar wrote: > On 15 December 2010 20:24, Manuel Schneider wrote: >> Hi Andrew, maybe you'd like to check out ZIM: this is a standardized file format for compressed HTML dumps, focused on Wikimedia content at the moment. The

Re: [Wikitech-l] How to find the version of a dump

2010-12-16 Thread Monica shu
Sorry Andrew, I just noticed this reply. Can you give me the URL of this search page? Thanks! Shu On Tue, Dec 14, 2010 at 5:04 PM, Andrew Dunbar wrote: > On 14 December 2010 01:57, Monica shu wrote: > > Thanks Diederik and Waksman, It seems that I need to parse the dump for articl

Re: [Wikitech-l] How to find the version of a dump

2010-12-16 Thread Monica shu
Totally agree! I also think an info page listing all past versions would be helpful :) Monica On Tue, Dec 14, 2010 at 5:11 PM, Andrew Dunbar wrote: > On 14 December 2010 20:04, Andrew Dunbar wrote: > > On 14 December 2010 01:57, Monica shu wrote: > >> Thanks Diederik and Waksman,

[Wikitech-l] Search engine improvements for transcluded text?

2010-12-16 Thread Billinghurst
At bugzilla:18861 https://bugzilla.wikimedia.org/show_bug.cgi?id=18861 there is a discussion about how transcluded pages are not seen by the search engine, and I have made the assumption that this is Wikisource's issue, where pages transcluded across from the Page: namespace don't make

Re: [Wikitech-l] Search engine improvements for transcluded text?

2010-12-16 Thread Maarten Dammers
On 16-12-2010 13:38, Billinghurst wrote: > At bugzilla:18861 https://bugzilla.wikimedia.org/show_bug.cgi?id=18861 there is a discussion about how transcluded pages are not seen by the search engine, and I have made the assumption that this is Wikisource's issue, where pages that are t

Re: [Wikitech-l] Offline wiki tools

2010-12-16 Thread Manuel Schneider
Hi, On 16.12.2010 06:50, Andrew Dunbar wrote: > This is very interesting and I'll be watching it. Where do the HTML dumps come from? I'm pretty sure I've only seen "static" for Wikipedia and not for Wiktionary, for example. I am also looking at adapting the parser for offline use to generat

Re: [Wikitech-l] How to find the version of a dump

2010-12-16 Thread emijrp
Hi James; download.wikimedia.org is available again, so you can download that file (6.2 GB) from http://download.wikimedia.org/enwiki/20101011/enwiki-20101011-pages-articles.xml.bz2. Regards, emijrp 2010/12/14 James Linden > On Mon, Dec 13, 2010 at 7:09 PM, Michael Gurlitz wrote: > > I grabbe

Re: [Wikitech-l] How to find the version of a dump

2010-12-16 Thread emijrp
Hi Monica; your dump is this one, with date 2010-03-12:[1][2] a3a5ee062abc16a79d111273d4a1a99a enwiki-20100312-pages-articles.xml.bz2 There are some old English Wikipedia dumps and md5sum files in a directory called "archive"[3]. Regards, emijrp [1] http://download.wikimedia.org/archive/enwiki

Re: [Wikitech-l] How to find the version of a dump

2010-12-16 Thread Ariel T. Glenn
The dumps in the archive are there because they are incomplete, by the way. Ariel On 16-12-2010, Thu, at 16:50 +0100, emijrp wrote: > Hi Monica; your dump is this one, with date 2010-03-12:[1][2] a3a5ee062abc16a79d111273d4a1a99a enwiki-20100312-pages-articles.xml.bz2

Re: [Wikitech-l] How to find the version of a dump

2010-12-16 Thread emijrp
All? The 2006 one too? 2010/12/16 Ariel T. Glenn > The dumps in the archive are there because they are incomplete, by the way. Ariel > On 16-12-2010, Thu, at 16:50 +0100, emijrp wrote: > > Hi Monica; your dump is this one, with date 2010-03-12:[1][2] a3a5

Re: [Wikitech-l] How to find the version of a dump

2010-12-16 Thread Ariel T. Glenn
I have no idea about the 2006 one; the other ones I know to be incomplete one way or another. Working with the Jan and March 2010 runs, in conjunction with the earlier dumps, you can get complete info; see http://techblog.wikimedia.org/2010/05/ In addition, the September 2010 run http://dumps.wikim

Re: [Wikitech-l] dataset1, xml dumps

2010-12-16 Thread Gabriel Weinberg
Ariel T. Glenn <...@wikimedia.org> writes: > We now have a copy of the dumps on a backup host. Although we are still resolving hardware issues on the XML dumps server, we think it is safe enough to serve the existing dumps read-only. DNS was updated to that effect already; people should see

Re: [Wikitech-l] dataset1, xml dumps

2010-12-16 Thread emijrp
Have you checked the md5sum? 2010/12/16 Gabriel Weinberg > Ariel T. Glenn <...@wikimedia.org> writes: > > We now have a copy of the dumps on a backup host. Although we are still resolving hardware issues on the XML dumps server, we think it is safe enough to serve the existing dumps r

Re: [Wikitech-l] dataset1, xml dumps

2010-12-16 Thread Gabriel Weinberg
md5sum doesn't match. I get e74170eaaedc65e02249e1a54b1087cb (as opposed to 7a4805475bba1599933b3acd5150bd4d on http://download.wikimedia.org/enwiki/20101011/enwiki-20101011-md5sums.txt). I've downloaded it twice now and have gotten the same md5sum. Can anyone else confirm? On Thu, Dec 16, 2010

Re: [Wikitech-l] dataset1, xml dumps

2010-12-16 Thread emijrp
If the md5s don't match, the files are obviously different; I mean, one of them is corrupt. What is the size of your local file? I usually download dumps with the wget UNIX command and I don't get errors. If you are using FAT32, the maximum file size is 4 GB, so the file gets truncated. Is that your case?
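
For anyone wanting to run this check locally, here is a minimal sketch assuming Python 3; the file name and expected checksum below are taken from the enwiki-20101011 example in this thread and would need to be adjusted for your own download. It compares the file size and an md5 computed in chunks, so the multi-GB dump never has to fit in memory:

import hashlib
import os

# Assumed local file name and the checksum published in
# enwiki-20101011-md5sums.txt; adjust both for your own dump.
DUMP_PATH = "enwiki-20101011-pages-articles.xml.bz2"
EXPECTED_MD5 = "7a4805475bba1599933b3acd5150bd4d"

def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the md5 of a large file without loading it all into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

if __name__ == "__main__":
    size_gb = os.path.getsize(DUMP_PATH) / 1024 ** 3
    print("size: %.2f GB" % size_gb)  # a truncated download usually shows up here first
    local_md5 = md5_of_file(DUMP_PATH)
    print("md5:  " + local_md5)
    print("match" if local_md5 == EXPECTED_MD5 else "MISMATCH - re-download the file")

If the size looks right but the md5 still differs, re-downloading is the usual next step.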

Re: [Wikitech-l] dataset1, xml dumps

2010-12-16 Thread Gabriel Weinberg
I've been downloading this file (using wget on Ubuntu or fetch on FreeBSD) with no issues for years. The current one is 6.2 GB, as it should be. On Thu, Dec 16, 2010 at 5:53 PM, emijrp wrote: > If the md5s don't match, the files are obviously different; I mean, one of them is corrupt. What i

Re: [Wikitech-l] dataset1, xml dumps

2010-12-16 Thread Ariel T. Glenn
I was able to unzip a copy of the file on another host (taken from the same location) without problems. On the download host itself I get the correct md5sum: 7a4805475bba1599933b3acd5150bd4d Ariel On 16-12-2010, Thu, at 17:48 -0500, Gabriel Weinberg wrote: > md5sum doesn't match
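
The "unzip it and see" check can also be reproduced without writing the uncompressed XML to disk, by stream-decompressing the archive and discarding the output, roughly what bunzip2 -t does. This is only a sketch under the same assumptions as above (Python 3, hypothetical local file name), not the exact command used on the download host:

import bz2

# Assumed local file name for the 20101011 pages-articles dump.
DUMP_PATH = "enwiki-20101011-pages-articles.xml.bz2"

def bz2_test(path, chunk_size=1024 * 1024):
    """Stream-decompress the whole file, discarding the output.
    Raises EOFError if the archive is truncated and OSError if the
    bzip2 data is otherwise damaged."""
    total = 0
    with bz2.open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total

if __name__ == "__main__":
    uncompressed = bz2_test(DUMP_PATH)
    print("decompressed OK: %.1f GB of XML" % (uncompressed / 1024 ** 3))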

Re: [Wikitech-l] dataset1, xml dumps

2010-12-16 Thread Gabriel Weinberg
Thanks, I guess I'll try again; third time's the charm, I suppose :) Sorry to waste your time, Gabriel On Thu, Dec 16, 2010 at 6:13 PM, Ariel T. Glenn wrote: > I was able to unzip a copy of the file on another host (taken from the same location) without problems. On the download host itself I g

Re: [Wikitech-l] dataset1, xml dumps

2010-12-16 Thread Platonides
Gabriel Weinberg wrote: > md5sum doesn't match. I get e74170eaaedc65e02249e1a54b1087cb (as opposed to 7a4805475bba1599933b3acd5150bd4d on http://download.wikimedia.org/enwiki/20101011/enwiki-20101011-md5sums.txt). I've downloaded it twice now and have gotten the same md5sum. Can anyone

Re: [Wikitech-l] Parallelizing export dump (bug 24630)

2010-12-16 Thread Platonides
Roan Kattouw wrote: > I'm not sure how hard this would be to achieve (you'd have to correlate blob parts with revisions manually using the text table; there might be gaps for deleted revs because ES is append-only) or how much it would help (my impression is ES is one of the slower parts of

Re: [Wikitech-l] Parallelizing export dump (bug 24630)

2010-12-16 Thread Ariel T. Glenn
On 17-12-2010, Fri, at 00:52 +0100, Platonides wrote: > Roan Kattouw wrote: > > I'm not sure how hard this would be to achieve (you'd have to correlate blob parts with revisions manually using the text table; there might be gaps for deleted revs because ES is append-only)