Re: [Wikitech-l] Crawling deWP

2009-01-28 Thread Rolf Lampa
Marco Schuster wrote:

> Rolf Lampa wrote:
>
>> Don't the XML dumps contain the flag for flagged revs?
>
> The XML dumps are no use to me: way too much overhead (in particular,
> they are old, and I want single files, which are easier to process
> than one huge XML file). And they don't contain the flagged
> revision flags :(

I traverse the latest enwiki dump (last revision only) in 15 minutes (or
the Swedish svwiki in 3 min) with my stream tool (written in Delphi
Pascal).

In the same pass I can copy the whole thing (it takes no longer), and while
at it I can also create the big three SQL tables (page, revision, text) from
the XML dump, in less than 20 minutes.
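
For readers who want to try the streaming approach without Delphi, here is a minimal sketch in Python. It is not the tool described above; it only illustrates the same idea of reading the dump one page at a time instead of loading the whole file. The sample document is made up and much simpler than a real dump, and the function names are hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def local(tag):
    """Strip an XML namespace prefix, e.g. '{uri}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield (title, revision_id, text) for each <page>, one page at a time.

    If a page carries several revisions, the last one seen wins.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local(elem.tag) != "page":
            continue
        title = rev_id = text = None
        for node in elem.iter():
            name = local(node.tag)
            if name == "title":
                title = node.text
            elif name == "revision":
                # Only direct children: <contributor> has its own <id>.
                for child in node:
                    child_name = local(child.tag)
                    if child_name == "id":
                        rev_id = int(child.text)
                    elif child_name == "text":
                        text = child.text or ""
        yield title, rev_id, text
        elem.clear()  # release the finished page's subtree


# Made-up sample, far simpler than a real dump.
SAMPLE = b"""<mediawiki>
  <page>
    <title>Foo</title>
    <id>1</id>
    <revision><id>10</id><text>Hello</text></revision>
  </page>
  <page>
    <title>Bar</title>
    <id>2</id>
    <revision><id>20</id><text>World</text></revision>
  </page>
</mediawiki>"""

pages = list(iter_pages(io.BytesIO(SAMPLE)))
```

On a real dump you would pass an open (possibly decompressed) file object instead of `io.BytesIO(SAMPLE)` and consume the generator lazily rather than building a list.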

I like XML dumps. :)

I'd love, however, to see the flagged rev status as an attribute in one
of the tags, for example <revision flagged_rev="true">

Regards,

// Rolf Lampa


___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Crawling deWP

2009-01-28 Thread Platonides
Daniel Kinzler wrote:
> Rolf Lampa wrote:
>> I'd love, however, to see the flagged rev status as an attribute in one
>> of the tags, for example <revision flagged_rev="true">
>>
>> Regards,
>
> Naw, it's more complex than that. You can have any number of different
> flags. It would probably have to be
> <revision><flag>foo</flag><flag>bar</flag>...</revision>.
>
> -- daniel

It would be <flagged/>, child of <revision>, just as <minor/> is.




Re: [Wikitech-l] Crawling deWP

2009-01-28 Thread Thomas Dalton
2009/1/28 Platonides platoni...@gmail.com:
> Daniel Kinzler wrote:
>> Rolf Lampa wrote:
>>> I'd love, however, to see the flagged rev status as an attribute in one
>>> of the tags, for example <revision flagged_rev="true">
>>>
>>> Regards,
>>
>> Naw, it's more complex than that. You can have any number of different
>> flags. It would probably have to be
>> <revision><flag>foo</flag><flag>bar</flag>...</revision>.
>>
>> -- daniel
>
> It would be <flagged/>, child of <revision>, just as <minor/>

But, as Daniel said, "flagged" isn't enough; you need to know which flag.
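
To make the difference between the two proposals concrete, here is a small Python sketch that parses both shapes. Neither <flag> nor <flagged/> is part of the actual dump format; both are the hypothetical elements suggested in this thread, and "foo" and "bar" are placeholder flag names.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup from this thread, not the real dump schema.
WITH_FLAGS = (
    "<revision><id>10</id><minor/>"
    "<flag>foo</flag><flag>bar</flag></revision>"
)
WITH_MARKER = "<revision><id>10</id><minor/><flagged/></revision>"


def describe(revision_xml):
    """Report what a consumer can learn from one <revision> element."""
    rev = ET.fromstring(revision_xml)
    flags = [f.text for f in rev.findall("flag")]
    return {
        # Empty elements such as <minor/> act as booleans: present or absent.
        "minor": rev.find("minor") is not None,
        "flagged": rev.find("flagged") is not None or bool(flags),
        "flags": flags,
    }


per_flag = describe(WITH_FLAGS)
marker_only = describe(WITH_MARKER)
```

Both shapes tell a consumer that the revision is flagged, but only the per-flag shape yields a non-empty `flags` list, which is the point made above.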



[Wikitech-l] Crawling deWP

2009-01-27 Thread Marco Schuster

Hi all,

I want to crawl around 800,000 flagged revisions from the German
Wikipedia, in order to make a dump containing only flagged revisions.
For this, I obviously need to spider Wikipedia.
What are the limits (rate!) here, what UA should I use and what
caveats do I have to take care of?

Thanks,
Marco

PS: I already have a list of revisions, created on the Toolserver with
the following query:

  select fp_stable, fp_page_id from flaggedpages where fp_reviewed=1;

Is it correct that this gives me a list of all articles with flagged
revs, with fp_stable being the rev ID of the most recent flagged rev
for each article?
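
As an illustration of the shape of that result, here is a minimal Python sketch that runs the same query against an in-memory SQLite table. The rows are made up, the table has only the three columns named in the query, and the ORDER BY clause is added here just to make the output deterministic. It does not confirm what fp_stable means on the real Toolserver database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE flaggedpages ("
    " fp_page_id INTEGER PRIMARY KEY,"
    " fp_reviewed INTEGER,"
    " fp_stable INTEGER)"
)
# Made-up rows: (page id, reviewed flag, stable revision id).
conn.executemany(
    "INSERT INTO flaggedpages VALUES (?, ?, ?)",
    [(1, 1, 100), (2, 0, 200), (3, 1, 300)],
)

rows = conn.execute(
    "SELECT fp_stable, fp_page_id FROM flaggedpages"
    " WHERE fp_reviewed = 1 ORDER BY fp_page_id"
).fetchall()
conn.close()
```

With these sample rows, only pages 1 and 3 are returned, each paired with its fp_stable value.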


Re: [Wikitech-l] Crawling deWP

2009-01-27 Thread Rolf Lampa
Marco Schuster wrote:
> I want to crawl around 800,000 flagged revisions from the German
> Wikipedia, in order to make a dump containing only flagged revisions.
> [...]
> flaggedpages where fp_reviewed=1;. Is it correct this one gives me a
> list of all articles with flagged revs,


Don't the XML dumps contain the flag for flagged revs?

// Rolf Lampa



Re: [Wikitech-l] Crawling deWP

2009-01-27 Thread Daniel Kinzler
Rolf Lampa wrote:
> Marco Schuster wrote:
>> I want to crawl around 800,000 flagged revisions from the German
>> Wikipedia, in order to make a dump containing only flagged revisions.
>> [...]
>> flaggedpages where fp_reviewed=1;. Is it correct this one gives me a
>> list of all articles with flagged revs,
>
> Don't the XML dumps contain the flag for flagged revs?

They don't. And that's very sad.

-- daniel



Re: [Wikitech-l] Crawling deWP

2009-01-27 Thread Platonides
Marco Schuster wrote:
> Hi all,
>
> I want to crawl around 800,000 flagged revisions from the German
> Wikipedia, in order to make a dump containing only flagged revisions.
> For this, I obviously need to spider Wikipedia.
> What are the limits (rate!) here, what UA should I use and what
> caveats do I have to take care of?
>
> Thanks,
> Marco
>
> PS: I already have a revisions list, created with the Toolserver. I
> used the following query: select fp_stable,fp_page_id from
> flaggedpages where fp_reviewed=1;. Is it correct this one gives me a
> list of all articles with flagged revs, fp_stable being the revid of
> the most current flagged rev for this article?

Fetch them from the Toolserver (there's a tool by duesentrieb for that).
It will get almost all of them from the Toolserver cluster, and make a
request to Wikipedia only if needed.
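
The "local first, Wikipedia only if needed" pattern can be sketched in Python as below. This is not duesentrieb's tool, whose interface is not described in this thread: `local_lookup` is a hypothetical stand-in for whatever the Toolserver copy offers. The delay and User-Agent string are placeholder assumptions, not official limits, so check the current site policy before crawling. The URL uses MediaWiki's index.php?oldid=...&action=raw form for raw revision text.

```python
import time
import urllib.request
from urllib.parse import urlencode

# Assumed placeholder values, not official limits.
DELAY_SECONDS = 1.0
USER_AGENT = "flaggedrev-dump-sketch/0.1 (contact: you@example.org)"


def raw_url(rev_id, host="de.wikipedia.org"):
    """URL for the raw wikitext of one revision."""
    query = urlencode({"oldid": rev_id, "action": "raw"})
    return f"https://{host}/w/index.php?{query}"


def http_fetch(url):
    """Fetch one URL with an identifying User-Agent (needs network access)."""
    req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")


def fetch_all(rev_ids, local_lookup, remote_fetch=http_fetch,
              delay=DELAY_SECONDS, sleep=time.sleep):
    """Try local_lookup first; call remote_fetch, throttled, only on a miss."""
    texts, remote_calls = {}, 0
    for rev_id in rev_ids:
        text = local_lookup(rev_id)  # hypothetical local copy; None on a miss
        if text is None:
            if remote_calls:
                sleep(delay)  # pause between consecutive remote requests
            text = remote_fetch(raw_url(rev_id))
            remote_calls += 1
        texts[rev_id] = text
    return texts, remote_calls


# Offline demonstration with stubs instead of real network access.
cache = {100: "cached text"}
texts, calls = fetch_all(
    [100, 300],
    cache.get,
    remote_fetch=lambda url: "fetched " + url,
    sleep=lambda seconds: None,
)
```

In the stubbed run, revision 100 comes from the local copy and only revision 300 triggers a remote request, so `calls` is 1.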




Re: [Wikitech-l] Crawling deWP

2009-01-27 Thread Marco Schuster

On Wed, Jan 28, 2009 at 12:49 AM, Rolf Lampa wrote:
> Marco Schuster wrote:
>> I want to crawl around 800,000 flagged revisions from the German
>> Wikipedia, in order to make a dump containing only flagged revisions.
>> [...]
>> flaggedpages where fp_reviewed=1;. Is it correct this one gives me a
>> list of all articles with flagged revs,
>
> Don't the XML dumps contain the flag for flagged revs?

The XML dumps are no use to me: way too much overhead (in particular,
they are old, and I want single files, which are easier to process
than one huge XML file). And they don't contain the flagged
revision flags :(

Marco