Re: [Wikitech-l] Crawling deWP
Marco Schuster wrote:
> Rolf Lampa wrote:
>> Don't the XML dumps contain the flag for flagged revs?
> The XML dumps are not for me: way too much overhead (in particular, they are old, and I want to use single files, which are easier to process than one huge XML file). And they don't contain the flagged revisions flags :(

I traverse the latest enwiki dump (last revision only) in 15 minutes (or the Swedish svwiki in 3 minutes) with my stream tool (written in Delphi Pascal). On the fly I can copy the whole thing (which takes no longer), and while at it I can also create the big three SQL tables (page, revision, text) from the XML dump, in less than 20 minutes.

I like XML dumps. :)

I'd love, however, to see the flagged rev status as an attribute in one of the tags, for example <revision flagged_rev=true>

Regards,
// Rolf Lampa
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
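For illustration, the single-pass traversal described above can be sketched in Python with the standard library's incremental parser. This is not the Delphi tool from the message, only the same streaming idea; the sample document is a made-up, heavily reduced stand-in for a real dump, and `iter_pages` and `local` are hypothetical helper names. The sketch assumes one revision per page, as in a "last revision only" dump (with several revisions, the last one wins).

```python
import io
import xml.etree.ElementTree as ET

def local(tag):
    """Strip an XML namespace prefix, if any: '{ns}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]

def iter_pages(stream):
    """Yield (title, revision id, text) for each <page>, one page at a time."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local(elem.tag) != "page":
            continue
        title = rev_id = text = None
        for child in elem:
            if local(child.tag) == "title":
                title = child.text
            elif local(child.tag) == "revision":
                for sub in child:
                    if local(sub.tag) == "id":
                        rev_id = int(sub.text)
                    elif local(sub.tag) == "text":
                        text = sub.text or ""
        yield title, rev_id, text
        elem.clear()  # drop the finished page's children to keep memory flat

# Made-up miniature dump, for demonstration only.
SAMPLE = """<mediawiki>
  <page><title>Foo</title><revision><id>11</id><text>foo text</text></revision></page>
  <page><title>Bar</title><revision><id>22</id><text>bar text</text></revision></page>
</mediawiki>"""

pages = list(iter_pages(io.BytesIO(SAMPLE.encode("utf-8"))))
```

Because the parser only ever holds the current page, the same loop could write each page to its own file or emit rows for the page, revision and text tables as it goes.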
Re: [Wikitech-l] Crawling deWP
Daniel Kinzler wrote:
> Rolf Lampa wrote:
>> I'd love, however, to see the flagged rev status as an attribute in one of the tags, for example <revision flagged_rev=true>
>> Regards,
> Naw, it's more complex than that. You can have any number of different flags. It would probably have to be <revision><flag>foo</flag><flag>bar</flag>...</revision>.
> -- daniel

It would be <flagged/>, a child of <revision>, just like <minor/>.
Re: [Wikitech-l] Crawling deWP
2009/1/28 Platonides platoni...@gmail.com:
> Daniel Kinzler wrote:
>> Rolf Lampa wrote:
>>> I'd love, however, to see the flagged rev status as an attribute in one of the tags, for example <revision flagged_rev=true>
>>> Regards,
>> Naw, it's more complex than that. You can have any number of different flags. It would probably have to be <revision><flag>foo</flag><flag>bar</flag>...</revision>.
>> -- daniel
> It would be <flagged/>, a child of <revision>, just like <minor/>.

But, as Daniel said, <flagged/> isn't enough; you need to know which flag.
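To make the proposal concrete, here is a small sketch of how a consumer might read several flags from a revision element. The markup is hypothetical: it follows the <flag> child elements suggested in this thread and is not part of the real dump format, which (as noted elsewhere in the thread) carries no flagged-revision information. `revision_flags` is likewise an illustrative name.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup following the proposal in this thread; real dumps do not contain <flag>.
SAMPLE = "<revision><id>42</id><minor/><flag>foo</flag><flag>bar</flag></revision>"

def revision_flags(revision_xml):
    """Return the text of every <flag> child of a <revision> element."""
    rev = ET.fromstring(revision_xml)
    return [flag.text for flag in rev.findall("flag")]

flags = revision_flags(SAMPLE)
# An empty marker element such as <minor/> only signals presence or absence.
is_minor = ET.fromstring(SAMPLE).find("minor") is not None
```

The contrast is the point of the objection: an empty <flagged/> element could only say yes or no, while repeated <flag> children can name each flag.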
[Wikitech-l] Crawling deWP
Hi all,

I want to crawl around 800,000 flagged revisions from the German Wikipedia, in order to make a dump containing only flagged revisions. For this, I obviously need to spider Wikipedia. What are the limits (rate!) here, what UA should I use, and what caveats do I have to take care of?

Thanks,
Marco

PS: I already have a list of revisions, created on the Toolserver with the following query:

    select fp_stable,fp_page_id from flaggedpages where fp_reviewed=1;

Is it correct that this gives me a list of all articles with flagged revs, with fp_stable being the revid of the most recent flagged rev for each article?
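As a rough sketch of what such a crawl loop could look like: the code below fetches revisions strictly one at a time, sends an identifying User-Agent, and pauses between requests. The one-second delay and the User-Agent string are placeholders chosen for illustration, not documented limits or required values; the names `raw_url`, `http_fetch` and `crawl` are hypothetical. The URL form uses MediaWiki's `index.php` with `oldid` and `action=raw`, which returns the wikitext of a given revision.

```python
import time
import urllib.parse
import urllib.request

BASE_URL = "https://de.wikipedia.org/w/index.php"
# Placeholder: name your tool and give a way to contact you.
USER_AGENT = "FlaggedRevsDumpSketch/0.1 (contact: you@example.org)"

def raw_url(rev_id):
    """Build the URL for the raw wikitext of one revision id."""
    query = urllib.parse.urlencode({"oldid": rev_id, "action": "raw"})
    return f"{BASE_URL}?{query}"

def http_fetch(url):
    """Fetch one URL with the identifying User-Agent header."""
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read()

def crawl(rev_ids, fetch=http_fetch, delay=1.0, sleep=time.sleep):
    """Fetch revisions sequentially, pausing `delay` seconds between requests."""
    results = {}
    for i, rev_id in enumerate(rev_ids):
        if i:  # no pause before the first request
            sleep(delay)
        results[rev_id] = fetch(raw_url(rev_id))
    return results
```

The `fetch` and `sleep` parameters exist so the loop can be exercised without touching the network, for example `crawl([1, 2], fetch=lambda url: b"", sleep=lambda s: None)`. The revision ids would come from the fp_stable column of the query above.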
Re: [Wikitech-l] Crawling deWP
Marco Schuster wrote:
> I want to crawl around 800,000 flagged revisions from the German Wikipedia, in order to make a dump containing only flagged revisions.
> [...]
> flaggedpages where fp_reviewed=1;. Is it correct this one gives me a list of all articles with flagged revs,

Don't the XML dumps contain the flag for flagged revs?

// Rolf Lampa
Re: [Wikitech-l] Crawling deWP
Rolf Lampa wrote:
> Marco Schuster wrote:
>> I want to crawl around 800,000 flagged revisions from the German Wikipedia, in order to make a dump containing only flagged revisions.
>> [...]
>> flaggedpages where fp_reviewed=1;. Is it correct this one gives me a list of all articles with flagged revs,
> Don't the XML dumps contain the flag for flagged revs?

They don't. And that's very sad.

-- daniel
Re: [Wikitech-l] Crawling deWP
Marco Schuster wrote:
> Hi all,
> I want to crawl around 800,000 flagged revisions from the German Wikipedia, in order to make a dump containing only flagged revisions. For this, I obviously need to spider Wikipedia. What are the limits (rate!) here, what UA should I use and what caveats do I have to take care of?
> Thanks,
> Marco
> PS: I already have a revisions list, created with the Toolserver. I used the following query: select fp_stable,fp_page_id from flaggedpages where fp_reviewed=1;. Is it correct this one gives me a list of all articles with flagged revs, fp_stable being the revid of the most current flagged rev for this article?

Fetch them from the toolserver (there's a tool by duesentrieb for that). It will get almost all of them from the toolserver cluster, and make a request to Wikipedia only if needed.
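The behaviour described here, try the toolserver copy first and go to Wikipedia only on a miss, is a plain lookup with fallback. The sketch below shows just that pattern; it does not reflect the actual interface of the tool mentioned, which is not given in this thread. Both sources are passed in as ordinary functions, and `fetch_with_fallback` and the sample data are hypothetical.

```python
def fetch_with_fallback(rev_id, primary, fallback):
    """Return (text, source): use `primary`, and call `fallback` only if it returns None."""
    text = primary(rev_id)
    if text is not None:
        return text, "primary"
    return fallback(rev_id), "fallback"

# Stand-ins for the two sources: a local text store that is missing one revision,
# and a slower remote fetch that counts how often it is used.
local_store = {101: "text of 101", 102: "text of 102"}
remote_calls = []

def remote_fetch(rev_id):
    remote_calls.append(rev_id)
    return f"remote text of {rev_id}"

results = [fetch_with_fallback(r, local_store.get, remote_fetch) for r in (101, 102, 103)]
```

Only revision 103 reaches `remote_fetch`, so the number of requests to the live site shrinks to the misses.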
Re: [Wikitech-l] Crawling deWP
On Wed, Jan 28, 2009 at 12:49 AM, Rolf Lampa wrote:
> Marco Schuster wrote:
>> I want to crawl around 800,000 flagged revisions from the German Wikipedia, in order to make a dump containing only flagged revisions.
>> [...]
>> flaggedpages where fp_reviewed=1;. Is it correct this one gives me a list of all articles with flagged revs,
> Don't the XML dumps contain the flag for flagged revs?

The XML dumps are not for me: way too much overhead (in particular, they are old, and I want to use single files, which are easier to process than one huge XML file). And they don't contain the flagged revisions flags :(

Marco