On Thu, 16 Dec 2010 07:50:56 +0200, Andrew Dunbar wrote:
> On 15 December 2010 20:24, Manuel Schneider wrote:
>> Hi Andrew,
>>
>> maybe you'd like to check out ZIM: This is a standardized file
>> format
>> for compressed HTML dumps, focused on Wikimedia content at the
>> moment.
>>
>> The
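[For readers unfamiliar with the format: per the openZIM header spec, a ZIM file begins with the magic number 72173914 stored as a little-endian uint32. A minimal Python sketch, with a hypothetical file name, for recognizing one:]

    import struct

    def looks_like_zim(path):
        # The openZIM header starts with magicNumber = 72173914 (uint32, little-endian).
        with open(path, "rb") as f:
            magic = f.read(4)
        return len(magic) == 4 and struct.unpack("<I", magic)[0] == 72173914

    print(looks_like_zim("wikipedia.zim"))  # hypothetical file name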
Sorry Andrew, I just noticed this reply.
Can you give me the URL of this search page?
Thanks!
Shu
On Tue, Dec 14, 2010 at 5:04 PM, Andrew Dunbar wrote:
> On 14 December 2010 01:57, Monica shu wrote:
> > Thanks Diederik and Waksman,
> >
> > It seems that I need to parse the dump for articl
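[A minimal sketch of the kind of streaming parse being described, assuming the MediaWiki export-0.4 XML schema used by 2010-era dumps; the namespace string may differ from dump to dump:]

    import bz2
    import xml.etree.ElementTree as ET

    NS = "{http://www.mediawiki.org/xml/export-0.4/}"  # assumed schema version

    with bz2.open("enwiki-20100312-pages-articles.xml.bz2", "rb") as dump:
        for _event, elem in ET.iterparse(dump):
            if elem.tag == NS + "page":
                print(elem.findtext(NS + "title"))
                elem.clear()  # the dump is far larger than RAM; free pages as we go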
Totally agree!
And I think an info page listing all past versions would also be
helpful :)
Monica
On Tue, Dec 14, 2010 at 5:11 PM, Andrew Dunbar wrote:
> On 14 December 2010 20:04, Andrew Dunbar wrote:
> > On 14 December 2010 01:57, Monica shu wrote:
> >> Thanks Diederik and Waksman,
> >>
At bugzilla:18861
https://bugzilla.wikimedia.org/show_bug.cgi?id=18861
there is a discussion about how transcluded pages are not seen by the search
engine, and I have made an assumption that it is Wikisource's issue, where
pages that are transcluded across from the Page: namespace don't make
On 16-12-2010 13:38, Billinghurst wrote:
> At bugzilla:18861
>https://bugzilla.wikimedia.org/show_bug.cgi?id=18861
> there is a discussion about how transcluded pages are not seen by the search
> engine, and I
> have made an assumption that it is Wikisource's issue, where pages that are
> t
Hi,
On 16.12.2010 06:50, Andrew Dunbar wrote:
> This is very interesting and I'll be watching it. Where do the HTML
> dumps come from? I'm pretty sure I've only seen "static" for Wikipedia
> and not for Wiktionary for example. I am also looking at adapting the
> parser for offline use to generat
Hi James;
download.wikimedia.org is available again, so you can download that file
from
http://download.wikimedia.org/enwiki/20101011/enwiki-20101011-pages-articles.xml.bz2
(6.2 GB).
Regards,
emijrp
2010/12/14 James Linden
> On Mon, Dec 13, 2010 at 7:09 PM, Michael Gurlitz wrote:
> > I grabbe
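[For a file this size, a streaming copy avoids holding anything in memory; a minimal standard-library sketch using the URL from the message above:]

    import shutil
    import urllib.request

    URL = ("http://download.wikimedia.org/enwiki/20101011/"
           "enwiki-20101011-pages-articles.xml.bz2")

    with urllib.request.urlopen(URL) as resp, \
         open("enwiki-20101011-pages-articles.xml.bz2", "wb") as out:
        shutil.copyfileobj(resp, out, length=1 << 20)  # copy in 1 MiB chunks

[In practice wget, with its built-in retry and resume behaviour, is the simpler choice, as later messages in the thread note.]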
Hi Monica;
Your dump is this one, with date 2010-03-12: [1][2]
a3a5ee062abc16a79d111273d4a1a99a enwiki-20100312-pages-articles.xml.bz2
There are some old English Wikipedia dumps and md5sum files in a directory
called "archive"[3].
Regards,
emijrp
[1]
http://download.wikimedia.org/archive/enwiki
The dumps in the archive are there because they are incomplete, by the
way.
Ariel
On 16-12-2010 (Thu), at 16:50 +0100, emijrp wrote:
> Hi Monica;
>
> Your dump is this one, with date 2010-03-12: [1][2]
>
> a3a5ee062abc16a79d111273d4a1a99a enwiki-20100312-pages-articles.xml.bz2
>
All? The 2006 one too?
2010/12/16 Ariel T. Glenn
> The dumps in the archive are there because they are incomplete, by the
> way.
>
> Ariel
>
> On 16-12-2010 (Thu), at 16:50 +0100, emijrp wrote:
> > Hi Monica;
> >
> > Your dump is this one, with date 2010-03-12: [1][2]
> >
> > a3a5
I have no idea about the 2006 one; the other ones I know to be
incomplete one way or another. Working with the Jan and March 2010 runs,
in conjunction with the earlier dumps, you can get complete info; see
http://techblog.wikimedia.org/2010/05/
In addition, the September 2010 run
http://dumps.wikim
Ariel T. Glenn writes:
>
> We now have a copy of the dumps on a backup host. Although we are still
> resolving hardware issues on the XML dumps server, we think it is safe
> enough to serve the existing dumps read-only. DNS was updated to that
> effect already; people should see
Have you checked the md5sum?
2010/12/16 Gabriel Weinberg
> Ariel T. Glenn writes:
>
> >
> > We now have a copy of the dumps on a backup host. Although we are still
> > resolving hardware issues on the XML dumps server, we think it is safe
> > enough to serve the existing dumps r
md5sum doesn't match. I get e74170eaaedc65e02249e1a54b1087cb (as
opposed to 7a4805475bba1599933b3acd5150bd4d
on http://download.wikimedia.org/enwiki/20101011/enwiki-20101011-md5sums.txt
).
I've downloaded it twice now and have gotten the same md5sum. Can anyone
else confirm?
On Thu, Dec 16, 2010
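[A minimal sketch of the check under discussion: compute the MD5 in chunks and compare it against the published md5sums.txt, whose lines have the form "<md5>  <filename>":]

    import hashlib

    def file_md5(path, chunk_size=1 << 20):
        # Hash a multi-GB file without reading it into memory at once.
        h = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()

    expected = {}
    with open("enwiki-20101011-md5sums.txt") as f:
        for line in f:
            parts = line.split()
            if len(parts) == 2:
                expected[parts[1]] = parts[0]

    name = "enwiki-20101011-pages-articles.xml.bz2"
    print(file_md5(name) == expected[name])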
If the md5s don't match, the files are obviously different; one of
them is corrupt.
What is the size of your local file? I usually download dumps with the wget
UNIX command and I don't get errors. If you are using FAT32, the file size is
limited to 4 GB and the file gets truncated. Is that your case?
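[A hedged sketch of that size check: compare the local file against the server-reported Content-Length, which catches a download truncated by a filesystem limit (assumes the server sends a Content-Length header):]

    import os
    import urllib.request

    URL = ("http://download.wikimedia.org/enwiki/20101011/"
           "enwiki-20101011-pages-articles.xml.bz2")
    LOCAL = "enwiki-20101011-pages-articles.xml.bz2"

    # HEAD request: fetch only the headers, not the 6.2 GB body.
    req = urllib.request.Request(URL, method="HEAD")
    with urllib.request.urlopen(req) as resp:
        remote_size = int(resp.headers["Content-Length"])

    local_size = os.path.getsize(LOCAL)
    print("truncated" if local_size < remote_size else "size matches")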
I've been downloading this file (using wget on Ubuntu or fetch on FreeBSD)
with no issues for years. The current one is 6.2 GB, as it should be.
On Thu, Dec 16, 2010 at 5:53 PM, emijrp wrote:
> If the md5s don't match, the files are obviously different, I mean, one of
> them is corrupt.
>
> What i
I was able to unzip a copy of the file on another host (taken from the
same location) without problems. On the download host itself I get the
correct md5sum: 7a4805475bba1599933b3acd5150bd4d
Ariel
On 16-12-2010 (Thu), at 17:48 -0500, Gabriel Weinberg wrote:
> md5sum doesn't match
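[An equivalent of Ariel's unzip test, in the spirit of bzip2 -t: stream-decompress the archive and discard the output; a corrupt or truncated stream raises an error:]

    import bz2

    def bz2_ok(path, chunk_size=1 << 20):
        try:
            with bz2.open(path, "rb") as f:
                while f.read(chunk_size):
                    pass
            return True
        except (OSError, EOFError):  # corrupt data or premature end of stream
            return False

    print(bz2_ok("enwiki-20101011-pages-articles.xml.bz2"))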
Thx--I guess I'll try again--third time's the charm I suppose :)
Sorry to waste your time,
Gabriel
On Thu, Dec 16, 2010 at 6:13 PM, Ariel T. Glenn wrote:
> I was able to unzip a copy of the file on another host (taken from the
> same location) without problems. On the download host itself I g
Gabriel Weinberg wrote:
> md5sum doesn't match. I get e74170eaaedc65e02249e1a54b1087cb (as
> opposed to 7a4805475bba1599933b3acd5150bd4d
> on http://download.wikimedia.org/enwiki/20101011/enwiki-20101011-md5sums.txt
> ).
>
> I've downloaded it twice now and have gotten the same md5sum. Can anyone
Roan Kattouw wrote:
> I'm not sure how hard this would be to achieve (you'd have to
> correlate blob parts with revisions manually using the text table;
> there might be gaps for deleted revs because ES [External Storage] is append-only) or how
> much it would help (my impression is ES is one of the slower parts of
>
On 17-12-2010 (Fri), at 00:52 +0100, Platonides wrote:
> Roan Kattouw wrote:
> > I'm not sure how hard this would be to achieve (you'd have to
> > correlate blob parts with revisions manually using the text table;
> > there might be gaps for deleted revs because ES is append-only)