[Wikitech-l] Crawling deWP

2009-01-27 Thread Marco Schuster

Hi all,

I want to crawl around 800.000 flagged revisions from the German
Wikipedia, in order to make a dump containing only flagged revisions.
For this, I obviously need to spider Wikipedia.
What are the limits (rate!) here, what UA should I use and what
caveats do I have to take care of?
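
A minimal sketch of what a polite crawl along these lines could look like, assuming the revisions are fetched through the MediaWiki API (action=query, prop=revisions) rather than by scraping rendered pages. The thread never states an actual rate limit, so the one-second delay is a placeholder assumption, and the User-Agent string, contact address and function names are hypothetical. The batch size of 50 assumes the API's usual per-request cap on revision IDs for ordinary clients.

```python
import time
import urllib.parse
import urllib.request

API_URL = "https://de.wikipedia.org/w/api.php"
# Hypothetical User-Agent: a tool name plus a way to contact the operator.
USER_AGENT = "FlaggedRevsDumper/0.1 (contact: you@example.org)"
BATCH_SIZE = 50        # assumed per-request cap on revids for ordinary clients
DELAY_SECONDS = 1.0    # assumed pause between requests, not a documented limit


def batches(rev_ids, size=BATCH_SIZE):
    """Yield consecutive slices of rev_ids with at most `size` entries each."""
    for start in range(0, len(rev_ids), size):
        yield rev_ids[start:start + size]


def build_request(rev_ids):
    """Build one API request that asks for the content of several revisions."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "ids|content",
        "revids": "|".join(str(r) for r in rev_ids),
        "format": "json",
        "maxlag": "5",  # ask the servers to refuse the request when they lag
    }
    url = API_URL + "?" + urllib.parse.urlencode(params)
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def crawl(rev_ids, fetch=None, sleep=time.sleep):
    """Fetch all batches strictly one after another, pausing after each one."""
    if fetch is None:
        fetch = lambda req: urllib.request.urlopen(req).read()
    for batch in batches(rev_ids):
        yield fetch(build_request(batch))
        sleep(DELAY_SECONDS)
```

With these assumptions, 800,000 revisions come to 16,000 requests, so a one-second pause alone puts the run at roughly four and a half hours. The `fetch` and `sleep` parameters exist only so the loop can be exercised without touching the network.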

Thanks,
Marco

PS: I already have a revision list, created on the Toolserver. I
used the following query: "select fp_stable,fp_page_id from
flaggedpages where fp_reviewed=1;". Is it correct that this gives me a
list of all articles with flagged revs, fp_stable being the rev ID of
the most recent flagged rev for each article?
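
To make the question concrete, here is a small sketch that runs the same query against a toy table in an in-memory SQLite database. Only the three column names come from the query above. The column types, the sample rows and the variable names are made up for illustration, and the real flaggedpages table has more columns than this.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Toy stand-in for flaggedpages; column types are assumptions.
conn.execute(
    "CREATE TABLE flaggedpages ("
    " fp_page_id INTEGER PRIMARY KEY,"
    " fp_stable INTEGER NOT NULL,"
    " fp_reviewed INTEGER NOT NULL)"
)
# Invented sample rows: (page id, stable revision id, reviewed flag).
conn.executemany(
    "INSERT INTO flaggedpages (fp_page_id, fp_stable, fp_reviewed)"
    " VALUES (?, ?, ?)",
    [(10, 1001, 1), (11, 1002, 0), (12, 1003, 1)],
)

# The query from the message, with an ORDER BY added for stable output.
rows = conn.execute(
    "SELECT fp_stable, fp_page_id FROM flaggedpages"
    " WHERE fp_reviewed = 1 ORDER BY fp_page_id"
).fetchall()

# Map each page id to the revision id that would be fetched for it.
stable_rev_by_page = {page_id: rev_id for rev_id, page_id in rows}
print(stable_rev_by_page)
```

In this toy data, page 11 has an fp_stable value but is dropped by the fp_reviewed=1 filter. Whether rows like that exist on the real wiki is exactly what the question above asks, and the sketch does not answer it.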

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Crawling deWP

2009-01-27 Thread Rolf Lampa
Marco Schuster wrote:
> I want to crawl around 800.000 flagged revisions from the German
> Wikipedia, in order to make a dump containing only flagged revisions.
[...]
> flaggedpages where fp_reviewed=1;". Is it correct this one gives me a
> list of all articles with flagged revs, 


Don't the XML dumps contain the flag for flagged revs?

// Rolf Lampa



Re: [Wikitech-l] Crawling deWP

2009-01-27 Thread Daniel Kinzler
Rolf Lampa wrote:
> Marco Schuster wrote:
>> I want to crawl around 800.000 flagged revisions from the German
>> Wikipedia, in order to make a dump containing only flagged revisions.
> [...]
>> flaggedpages where fp_reviewed=1;". Is it correct this one gives me a
>> list of all articles with flagged revs, 
> 
> 
> Don't the XML dumps contain the flag for flagged revs?
> 
They don't. And that's very sad.

-- daniel



Re: [Wikitech-l] Crawling deWP

2009-01-27 Thread Platonides
Marco Schuster wrote:
> Hi all,
> 
> I want to crawl around 800.000 flagged revisions from the German
> Wikipedia, in order to make a dump containing only flagged revisions.
> For this, I obviously need to spider Wikipedia.
> What are the limits (rate!) here, what UA should I use and what
> caveats do I have to take care of?
> 
> Thanks,
> Marco
> 
> PS: I already have a revisions list, created with the Toolserver. I
> used the following query: "select fp_stable,fp_page_id from
> flaggedpages where fp_reviewed=1;". Is it correct this one gives me a
> list of all articles with flagged revs, fp_stable being the revid of
> the most current flagged rev for this article?

Fetch them from the toolserver (there's a tool by duesentrieb for that).
It will catch almost all of them from the toolserver cluster, and make a
request to wikipedia only if needed.




Re: [Wikitech-l] Crawling deWP

2009-01-27 Thread Marco Schuster

On Wed, Jan 28, 2009 at 12:49 AM, Rolf Lampa wrote:
> Marco Schuster wrote:
>> I want to crawl around 800.000 flagged revisions from the German
>> Wikipedia, in order to make a dump containing only flagged revisions.
> [...]
>> flaggedpages where fp_reviewed=1;". Is it correct this one gives me a
>> list of all articles with flagged revs,
>
>
> Don't the XML dumps contain the flag for flagged revs?

The XML dumps don't work for me: way too much overhead (in particular,
they are old, and I want to use single files, which are easier to
process than one huge XML file). And they don't contain the flagged
revision flags :(

Marco



Re: [Wikitech-l] Crawling deWP

2009-01-27 Thread Marco Schuster

On Wed, Jan 28, 2009 at 12:53 AM, Platonides wrote:
> Marco Schuster wrote:
>> Hi all,
>>
>> I want to crawl around 800.000 flagged revisions from the German
>> Wikipedia, in order to make a dump containing only flagged revisions.
>> For this, I obviously need to spider Wikipedia.
>> What are the limits (rate!) here, what UA should I use and what
>> caveats do I have to take care of?
>>
>> Thanks,
>> Marco
>>
>> PS: I already have a revisions list, created with the Toolserver. I
>> used the following query: "select fp_stable,fp_page_id from
>> flaggedpages where fp_reviewed=1;". Is it correct this one gives me a
>> list of all articles with flagged revs, fp_stable being the revid of
>> the most current flagged rev for this article?
>
> Fetch them from the toolserver (there's a tool by duesentrieb for that).
> It will catch almost all of them from the toolserver cluster, and make a
> request to wikipedia only if needed.
I highly doubt this is a "legal" use of the toolserver, and I'd guess
that fetching 800k revisions would be a huge resource load.

Thanks, Marco

PS: CC-ing toolserver list.



Re: [Wikitech-l] Crawling deWP

2009-01-28 Thread Rolf Lampa
Marco Schuster wrote:

> Rolf Lampa wrote:
>>
>> Don't the XML dumps contain the flag for flagged revs?
> 
> The xml dumps are nothing for me, way too much overhead (especially,
> they are old, and I want to use single files, it's easier to process
> these than one huge xml file). And they don't contain flagged
> revisions flags :(

I traverse the last enwiki dump (last revision only) in 15 minutes (or
the Swedish svwiki in < 3 min) with my stream tool (written in Delphi
Pascal).

Along the way I can copy the whole thing (it takes no longer), and while
at it I can create the "big three" SQL tables (page, revision & text)
from the XML dump as well, in less than 20 minutes.

I like Xml dumps. :)
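
For readers who want to try the streaming approach, here is a rough Python sketch of a single-pass traversal using the standard library's incremental parser. It is not the Delphi tool described above and makes no claim about its speed. The function names and the tiny sample document are invented for illustration, and the namespace URI in the sample is only an example since the code ignores namespaces.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix so the code works across schema versions."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, revision id, text) per page in one pass over the stream."""
    title = rev_id = text = None
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "revision":
            # Look only at direct children, so a contributor's <id> is ignored.
            for child in elem:
                child_name = local_name(child.tag)
                if child_name == "id":
                    rev_id = int(child.text)
                elif child_name == "text":
                    text = child.text or ""
        elif name == "page":
            yield title, rev_id, text
            title = rev_id = text = None
            elem.clear()  # drop the page's children to keep memory use low


# Hand-made sample in the general shape of a last-revision-only dump.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/">
  <page>
    <title>Foo</title>
    <id>1</id>
    <revision>
      <id>100</id>
      <contributor><username>A</username><id>7</id></contributor>
      <text>Hello</text>
    </revision>
  </page>
  <page>
    <title>Bar</title>
    <id>2</id>
    <revision>
      <id>200</id>
      <text>World</text>
    </revision>
  </page>
</mediawiki>"""

pages = list(iter_pages(io.BytesIO(SAMPLE)))
print(pages)
```

If a page carried several revisions, this sketch would keep only the last one it sees. The emptied page elements still hang off the root element, which a tool meant for full-size dumps would also need to release.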

I'd love, however, to see the flagged rev status as an attribute in one 
of the tags, for example 

Regards,

// Rolf Lampa




Re: [Wikitech-l] Crawling deWP

2009-01-28 Thread Daniel Kinzler
Rolf Lampa wrote:
> I'd love, however, to see the flagged rev status as an attribute in one 
> of the tags, for example 
> 
> Regards,

Naw, it's more complex than that. You can have any number of different flags. It
would probably have to be 
foobar

-- daniel



Re: [Wikitech-l] Crawling deWP

2009-01-28 Thread Platonides
Daniel Kinzler wrote:
> Rolf Lampa wrote:
>> I'd love, however, to see the flagged rev status as an attribute in one 
>> of the tags, for example 
>>
>> Regards,
> 
> Naw, it's more complex than that. You can have any number of different flags.
> It would probably have to be 
> foobar
> 
> -- daniel

It would be "", child of , just as 




Re: [Wikitech-l] Crawling deWP

2009-01-28 Thread Thomas Dalton
2009/1/28 Platonides:
> Daniel Kinzler wrote:
>> Rolf Lampa wrote:
>>> I'd love, however, to see the flagged rev status as an attribute in one
>>> of the tags, for example 
>>>
>>> Regards,
>>
>> Naw, it's more complex than that. You can have any number of different flags.
>> It would probably have to be 
>> foobar
>>
>> -- daniel
>
> It would be "", child of , just as 

But, as daniel said, "flagged" isn't enough, you need to know what flag.
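
The point about needing to know which flag can be illustrated with a small sketch. The markup proposed in the messages above did not survive in this archive, so the element and attribute names below ("flag", "name", "level") and the flag names are invented purely for illustration. They are not a reconstruction of what anyone in the thread proposed, and no dump format is known to look like this.

```python
import xml.etree.ElementTree as ET


def revision_with_flags(rev_id, flags):
    """Build a <revision> element with one hypothetical <flag> child per flag."""
    rev = ET.Element("revision")
    ET.SubElement(rev, "id").text = str(rev_id)
    for name, level in sorted(flags.items()):
        ET.SubElement(rev, "flag", {"name": name, "level": str(level)})
    return rev


def read_flags(rev):
    """Read the hypothetical <flag> children back into a name -> level dict."""
    return {f.get("name"): int(f.get("level")) for f in rev.findall("flag")}


# Hypothetical flags: a revision can carry any number of them.
rev = revision_with_flags(1001, {"accuracy": 2, "depth": 1})
xml_text = ET.tostring(rev, encoding="unicode")
print(xml_text)
print(read_flags(ET.fromstring(xml_text)))
```

A single boolean attribute could not carry this information, which is why a repeatable child element per flag is used in the sketch.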
