Re: [Toolserver-l] Extracting basic revision data

2010-11-29 Thread Михајло Анђелковић
Hm, that's quite right. The data I've got don't have this info.

But I won't run such a query again soon, since this still does the
job: for now I only want to note when somebody has left sr.wp and to
record the reason by reviewing the talk and other relevant pages from
that time.

Thank you both for the tip.

M

Re: [Toolserver-l] Extracting basic revision data

2010-11-29 Thread River Tarnell

Михајло Анђелковић:
> Namespaces are easily determined from the page prefix; I am not
> bothered if there are any anomalies out there (e.g. a page starting
> with "User talk:" sitting in NS 0)

There are no page namespace prefixes in the databases.  IOW, "Foo" and 
"Talk:Foo" both have page_title="Foo".

I can't believe that's what you want, since the output will be useless.
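
For instance, a sketch against the standard MediaWiki page table; both
rows can exist at once, and only page_namespace tells them apart:

-- "Foo" (NS 0) and "Talk:Foo" (NS 1) share the same page_title
SELECT page_namespace, page_title
FROM page
WHERE page_title = 'Foo';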

- river.

Re: [Toolserver-l] Extracting basic revision data

2010-11-29 Thread Михајло Анђелковић
Thank you, guys, I've already taken what I needed.

Namespaces are easily determined from the page prefix; I am not
bothered if there are any anomalies out there (e.g. a page starting
with "User talk:" sitting in NS 0), and the query is lighter when the
namespace isn't pulled from the DB. Overall, it took about 17 seconds
to write out the data for about 1M revisions.

Setting a span of rev_ids was only there to take the data in chunks; I
wouldn't have had to specify it if I had wanted to take everything at
once. But hey, such chunks are even easier to sort by rev_id.
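
For reference, one chunk amounted to something like this (the bounds
are placeholders; page_namespace deliberately left out):

SELECT page.page_title AS title, rev_user_text AS user,
       rev_timestamp AS timestamp, rev_len AS len
FROM revision
JOIN page ON page.page_id = rev_page
WHERE rev_id BETWEEN 1 AND 100000   -- one chunk; shift the bounds per run
  AND rev_deleted = 0
ORDER BY rev_id;                    -- each chunk comes out already sorted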

M

Re: [Toolserver-l] Extracting basic revision data

2010-11-29 Thread MZMcBride
Михајло Анђелковић wrote:
> I would like to ask permission to run a query that could be
> resource-consuming if not properly scaled:
>
> SELECT page.page_title AS title, rev_user_text AS user,
>        rev_timestamp AS timestamp, rev_len AS len
> FROM revision
> JOIN page ON page.page_id = rev_page
> WHERE rev_id > 0 AND rev_id < [...] AND rev_deleted = 0;
> 
> This is intended to extract basic data about all publicly visible
> revisions from 1 to [...]. Info about each revision would be a 4-tuple
> title/user name/time/length. I need this data to start generating a
> timeline of editing of srwiki, so it is intended to be run only once
> for each revision.
> 
> If this is generally allowed, my questions are how large a chunk of
> data I can take at once, and how long I should wait between two
> runs?

srwiki_p isn't very large (3,665,333 revisions and 413,987 pages), so I
personally wouldn't worry about performance very much at all. If you were
going to run this query on enwiki_p or another larger database, it might be
more of a concern. Run the queries that you need to run.

The "Queries" page on the Toolserver wiki might be helpful to you.[1]

Looking at your query, you should pull page.page_namespace or specify
page_namespace = 0. Pulling only page.page_title without specifying a
namespace will output useless results. I'm also unclear why you'd need to
specify rev_id > 0, though you might have your reasons for doing so.
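
Something along these lines (an untested sketch) would keep the titles
meaningful:

SELECT page.page_namespace AS ns, page.page_title AS title,
       rev_user_text AS user, rev_timestamp AS timestamp,
       rev_len AS len
FROM revision
JOIN page ON page.page_id = rev_page
WHERE rev_id < [...]   -- the rev_id > 0 lower bound is redundant here
  AND rev_deleted = 0;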

Your Toolserver account has a quota (viewable with 'quota -v') that you
might hit if you're outputting a lot of data to disk. You can always use
/mnt/user-store/ or file a ticket in JIRA if you need an increased quota.

MZMcBride

[1] https://wiki.toolserver.org/view/Queries

Re: [Toolserver-l] Extracting basic revision data

2010-11-29 Thread Platonides
Михајло Анђелковић wrote:
> Hello,
> 
> I would like to ask permission to run a query that could be
> resource-consuming if not properly scaled:
>
> SELECT page.page_title AS title, rev_user_text AS user,
>        rev_timestamp AS timestamp, rev_len AS len
> FROM revision
> JOIN page ON page.page_id = rev_page
> WHERE rev_id > 0 AND rev_id < [...] AND rev_deleted = 0;
> 
> This is intended to extract basic data about all publicly visible
> revisions from 1 to [...]. Info about each revision would be a 4-tuple
> title/user name/time/length. I need this data to start generating a
> timeline of editing of srwiki, so it is intended to be run only once
> for each revision.
> 
> If this is generally allowed, my questions are how large a chunk of
> data I can take at once, and how long I should wait between two
> runs?
> 
> M

Have you considered generating the early timeline from dumps?

Re: [Toolserver-l] Extracting basic revision data

2010-11-29 Thread River Tarnell

Михајло Анђелковић:
> WHERE rev_id > 0 AND rev_id < [...] AND rev_deleted = 0;

Please check that MySQL plans this correctly (using the rev_id index).
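
For example (a sketch; rev_id is the revision table's primary key in
the standard schema, and 100000 stands in for your upper bound):

EXPLAIN SELECT rev_id, rev_timestamp
FROM revision
WHERE rev_id > 0 AND rev_id < 100000 AND rev_deleted = 0;
-- "key" should show PRIMARY and "type" should be "range";
-- "ALL" there would mean a full table scan.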

> If this is generally allowed to do, my question is how large chunks of
> data can I take at once, and how long should be waited between two
> takes?

A simple fetch like this doesn't usually take many resources.  You can probably 
pick any random number for chunking.  If you run each chunk under SGE[0] as a 
separate job, and request an sqlprocs resource, SGE will handle scheduling for 
you so there's no need to worry about how long to wait between runs.

- river.

[0] https://wiki.toolserver.org/view/Job_scheduling

Re: [Toolserver-l] Extracting basic revision data

2010-11-29 Thread Михајло Анђелковић
Unfortunately, the complete dumps contain lots of data I don't
actually need, and I am afraid I am not willing to inflict such an
impact on my small HDD. What's more, they have been genuinely
unavailable since 10 November, which is already quite a long time.

Right now I have time for this research, and I want to use the
Toolserver for it unless the admins won't let me. I asked per the
'ask first' rule.

M

Re: [Toolserver-l] Extracting basic revision data

2010-11-29 Thread Petr Kadlec
2010/11/29 Михајло Анђелковић:
> This is intended to extract basic data about all publicly visible
> revisions from 1 to [...]. Info about each revision would be a 4-tuple
> title/user name/time/length. I need this data to start generating a
> timeline of editing of srwiki, so it is intended to be run only once
> for each revision.

It seems to me you might want to use the dumps instead, even though
they are unavailable at the moment…

-- [[ cs:User:Mormegil | Petr Kadlec ]]

[Toolserver-l] Extracting basic revision data

2010-11-28 Thread Михајло Анђелковић
Hello,

I would like to ask permission to run a query that could be
resource-consuming if not properly scaled:

SELECT page.page_title AS title, rev_user_text AS user,
       rev_timestamp AS timestamp, rev_len AS len
FROM revision
JOIN page ON page.page_id = rev_page
WHERE rev_id > 0 AND rev_id < [...] AND rev_deleted = 0;

This is intended to extract basic data about all publicly visible
revisions from 1 to [...]. Info about each revision would be a 4-tuple
title/user name/time/length. I need this data to start generating a
timeline of editing of srwiki, so it is intended to be run only once
for each revision.

If this is generally allowed, my questions are how large a chunk of
data I can take at once, and how long I should wait between two
runs?

M

___
Toolserver-l mailing list (Toolserver-l@lists.wikimedia.org)
https://lists.wikimedia.org/mailman/listinfo/toolserver-l
Posting guidelines for this list: https://wiki.toolserver.org/view/Mailing_list_etiquette