Re: [Toolserver-l] Monthly pageviews

2010-11-17 Thread Kolossos
A text file would be a good first step towards putting the data into a
database. I did this in the past with my public database u_kolossos_wp_logs_p,
back when http://wikistics.falsikon.de/dumps.htm provided such text files, but
the last update there was in 2009.
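
As a rough sketch only (the file and table names here are made up, and I am
assuming the usual Domas line format of "project page_title view_count
bytes_transferred"), importing such a text file into a small database could
look like this:

    import gzip
    import sqlite3

    INFILE = "pagecounts-2010-06.gz"        # hypothetical monthly text file
    conn = sqlite3.connect("pageviews.db")  # hypothetical target database
    conn.execute("CREATE TABLE IF NOT EXISTS views "
                 "(project TEXT, page TEXT, views INTEGER)")

    with gzip.open(INFILE, "rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            parts = line.split()
            if len(parts) < 3:
                continue                    # skip malformed lines
            conn.execute("INSERT INTO views VALUES (?, ?, ?)",
                         (parts[0], parts[1], int(parts[2])))

    conn.commit()
    conn.close()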

Greetings Kolossos


Frédéric Schütz wrote:
> Would text files (similar to the current page views, but summarised over 
> each month) be ok ? I have a few scripts (some from Erik Zachte, and 
> some of mine) that could be adapted to do this.
> 
> It's not the most efficient way to do it if you need random access (the 
> API from stats.grok.se is probably better for this), but it would still 
> be quite straightforward to parse.
> 
> Frédéric
> 
> On 17.11.2010 22:08, Kolossos wrote:
>> Hello, I'm also very interested in easily getting monthly statistics, which I
>> can use as a criterion for the importance of articles, to show them on a map [1].
>> So, I hope we get them.
>>
>> Greetings Kolossos
>> [1] http://de.wikipedia.org/wiki/Hilfe:OpenStreetMap/en
>>
>> Magnus Manske wrote:
>>> On Mon, Nov 15, 2010 at 10:35 PM, MZMcBride  wrote:
>>>> Magnus Manske wrote:
>>>>> I know there are lots'o'files for daily (hourly?) pageview stats on
>>>>> the toolserver.
>>>>>
>>>>> Are there aggregated counts for the whole month? So I only have to
>>>>> check 1 file instead of hundreds (the aggregated file would, of
>>>>> course, be smaller than the concatenated hourly ones).
>>>>> Or maybe even as a database? (One can dream...)
>>>>>
>>>>> If not, does anyone volunteer to generate them? They'd really help
>>>>> with my GLAM tools, increase Wikimedia outreach etc.
>>>> Pageview stats are still a mess and there's no centralized or clean
>>>> database, as far as I'm aware. Henrik's tool (stats.grok.se) has an API you
>>>> can hit for monthly stats: http://stats.grok.se/json/en/201006/Barack_Obama
>>>>
>>>> That's probably your best bet right now.
>>> And that's what I'm doing, but I need to look for tens of thousands of
>>> pages, and it's very slow, not to mention traffic.
>>>
>>>> From what I understand, Wikimedia is devoting resources to setting up Open
>>>> Web Analytics. The first test run is supposed to be this week, I think.
>>> That sounds good. Was that announced anywhere?
>>>
>>> Thanks,
>>> Magnus
>>>
>>
> 
> 




Re: [Toolserver-l] Fast local storage

2010-11-17 Thread River Tarnell

Marco Schuster:
> I can imagine a use of this for everything that uses the static or xml
> dumps, as well as unpacking or packing large files.

Do you have any numbers for a particular use case?

- river.



Re: [Toolserver-l] Fast local storage

2010-11-17 Thread Marco Schuster
Hi,

I can imagine a use of this for everything that uses the static or xml
dumps, as well as unpacking or packing large files. Working with large
files over NFS sucks *ss.

Marco

On Thu, Nov 18, 2010 at 12:17 AM, River Tarnell wrote:
>
> Hi,
>
> A feature we may provide to users in the future is servers with fast local 
> (not
> NFS) storage for use by tools.  This would be something like user-store, 
> except
> per-server.  This storage would be redundant (RAID), but not backed up.
>
> If anyone feels like they may make use of this, it would be useful to know
> exactly how you plan to use it, e.g. how much storage you would need and for
> how long, as well as the sort of tasks it would be used for.
>
>        - river.
>



-- 
VMSoft GbR
Nabburger Str. 15
81737 München
Geschäftsführer: Marco Schuster, Volker Hemmert
http://vmsoft-gbr.de


[Toolserver-l] Fast local storage

2010-11-17 Thread River Tarnell

Hi,

A feature we may provide to users in the future is servers with fast local (not 
NFS) storage for use by tools.  This would be something like user-store, except 
per-server.  This storage would be redundant (RAID), but not backed up.

If anyone feels like they may make use of this, it would be useful to know 
exactly how you plan to use it, e.g. how much storage you would need and for 
how long, as well as the sort of tasks it would be used for.

- river.



Re: [Toolserver-l] Projectcounts

2010-11-17 Thread Johan G
2010/11/17 Frédéric Schütz :
> On 17.11.2010 23:52, Johan G wrote:
>
>>> On 07.11.2010 09:47, emijrp wrote:
>>>
 Who are downloading the Domas visits logs in /mnt/user-store/stats? The
>>>
>>> I am !
>>
>> Excellent. It may or may not interest you that
>> /mnt/user-store/stats/pagecounts-20101116-130001.gz is corrupt.
>> Probably a partial download.
>
> Yes, seems to be the case. I've deleted the file and rerun the script
> that downloads the stats and it should be ok now. The file was correct
> in my personal archive (but the archiving process there does some
> integrity checking).

I confirm that the file is no longer corrupt. Thanks for the fast response.

>
> Thanks for the information !
>
> Frédéric
>



Re: [Toolserver-l] Projectcounts

2010-11-17 Thread Frédéric Schütz
On 17.11.2010 23:52, Johan G wrote:

>> On 07.11.2010 09:47, emijrp wrote:
>>
>>> Who are downloading the Domas visits logs in /mnt/user-store/stats? The
>>
>> I am !
>
> Excellent. It may or may not interest you that
> /mnt/user-store/stats/pagecounts-20101116-130001.gz is corrupt.
> Probably a partial download.

Yes, seems to be the case. I've deleted the file and rerun the script 
that downloads the stats and it should be ok now. The file was correct 
in my personal archive (but the archiving process there does some 
integrity checking).
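
For illustration, a minimal sketch of the kind of check meant here (not the
actual archiving script; the path is just the one from this thread):
decompress each file completely and report any that fail.

    import glob
    import gzip

    for path in sorted(glob.glob("/mnt/user-store/stats/pagecounts-*.gz")):
        try:
            with gzip.open(path, "rb") as f:
                while f.read(1024 * 1024):   # reading to EOF verifies the stream
                    pass
        except (OSError, EOFError) as err:
            print("corrupt: %s (%s)" % (path, err))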

Thanks for the information !

Frédéric



Re: [Toolserver-l] Projectcounts

2010-11-17 Thread Johan G
2010/11/7 Frédéric Schütz :
> On 07.11.2010 09:47, emijrp wrote:
>
>> Who are downloading the Domas visits logs in /mnt/user-store/stats? The
>
> I am !

Excellent. It may or may not interest you that
/mnt/user-store/stats/pagecounts-20101116-130001.gz is corrupt.
Probably a partial download.



Re: [Toolserver-l] Monthly pageviews

2010-11-17 Thread Frédéric Schütz
Would text files (similar to the current page views, but summarised over 
each month) be ok ? I have a few scripts (some from Erik Zachte, and 
some of mine) that could be adapted to do this.

It's not the most efficient way to do it if you need random access (the 
API from stats.grok.se is probably better for this), but it would still 
be quite straightforward to parse.
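
For comparison, a minimal sketch of querying the stats.grok.se JSON API for one
article and one month (the URL pattern is the one from MZMcBride's example
quoted below; that the reply carries a "daily_views" mapping is my assumption):

    import json
    import urllib.request

    # Monthly stats for one article, as in the example from this thread.
    url = "http://stats.grok.se/json/en/201006/Barack_Obama"

    with urllib.request.urlopen(url) as resp:
        data = json.load(resp)

    # Assumption: per-day counts live under "daily_views"; summing them
    # gives the monthly total.
    print(sum(data.get("daily_views", {}).values()))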

Frédéric

On 17.11.2010 22:08, Kolossos wrote:
> Hello, I'm also very interested in easily getting monthly statistics, which I
> can use as a criterion for the importance of articles, to show them on a map [1].
> So, I hope we get them.
>
> Greetings Kolossos
> [1] http://de.wikipedia.org/wiki/Hilfe:OpenStreetMap/en
>
> Magnus Manske wrote:
>> On Mon, Nov 15, 2010 at 10:35 PM, MZMcBride  wrote:
>>> Magnus Manske wrote:
>>>> I know there are lots'o'files for daily (hourly?) pageview stats on
>>>> the toolserver.
>>>>
>>>> Are there aggregated counts for the whole month? So I only have to
>>>> check 1 file instead of hundreds (the aggregated file would, of
>>>> course, be smaller than the concatenated hourly ones).
>>>> Or maybe even as a database? (One can dream...)
>>>>
>>>> If not, does anyone volunteer to generate them? They'd really help
>>>> with my GLAM tools, increase Wikimedia outreach etc.
>>> Pageview stats are still a mess and there's no centralized or clean
>>> database, as far as I'm aware. Henrik's tool (stats.grok.se) has an API you
>>> can hit for monthly stats: http://stats.grok.se/json/en/201006/Barack_Obama
>>>
>>> That's probably your best bet right now.
>>
>> And that's what I'm doing, but I need to look for tens of thousands of
>> pages, and it's very slow, not to mention traffic.
>>
>>>  From what I understand, Wikimedia is devoting resources to setting up Open
>>> Web Analytics. The first test run is supposed to be this week, I think.
>>
>> That sounds good. Was that announced anywhere?
>>
>> Thanks,
>> Magnus
>>
>
>




Re: [Toolserver-l] Jobs with running time limitation

2010-11-17 Thread River Tarnell

Mauro Girotto:
> Is it possible to have some notification when a job exceeds the running
> time set by h_rt?

Yes, by setting s_rt to a lower value.  When s_rt is exceeded, the program will 
receive SIGUSR1, so you can clean up and exit yourself.

 $ qsub -l h_rt=1:00:00 -l s_rt=0:55:00 myjob.sh
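
In a Python job, catching that warning signal could look roughly like this (a
sketch only; what the cleanup actually does is up to the tool):

    import signal
    import sys

    def on_soft_limit(signum, frame):
        # s_rt was exceeded: save state and exit cleanly before h_rt is
        # reached and the job is killed with SIGKILL.
        sys.stderr.write("soft runtime limit reached, shutting down\n")
        sys.exit(0)

    signal.signal(signal.SIGUSR1, on_soft_limit)

    # ... the normal work of the job goes here ...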

> What is the signal used to terminate the job?  SIGTERM or SIGKILL?

KILL.

> I've tried setting h_rt=0:1:0 and the logs (output and error) are
> empty, although the application had produced some output before it was
> killed.

I tested this and the output did appear in the file when the job was killed.  
Perhaps the output wasn't flushed?  (Remember that stdout is only line-buffered 
when it is a terminal; once it is redirected to a log file it becomes fully 
buffered, so unflushed output is lost when the process is killed.)
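
If that is the cause, flushing explicitly (or starting the interpreter
unbuffered, e.g. python -u for Python) should make the output survive the
kill; a trivial sketch:

    import sys

    for step in range(100):
        # ... do some work ...
        print("progress: finished step %d" % step)
        sys.stdout.flush()   # push output to the log now, in case the job is killed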

- river.



Re: [Toolserver-l] Monthly pageviews

2010-11-17 Thread Kolossos
Hello, I'm also very interested in easily getting monthly statistics, which I 
can use as a criterion for the importance of articles, to show them on a map [1]. 
So, I hope we get them.

Greetings Kolossos
[1] http://de.wikipedia.org/wiki/Hilfe:OpenStreetMap/en

Magnus Manske wrote:
> On Mon, Nov 15, 2010 at 10:35 PM, MZMcBride  wrote:
>> Magnus Manske wrote:
>>> I know there are lots'o'files for daily (hourly?) pageview stats on
>>> the toolserver.
>>>
>>> Are there aggregated counts for the whole month? So I only have to
>>> check 1 file instead of hundreds (the aggregated file would, of
>>> course, be smaller than the concatenated hourly ones).
>>> Or maybe even as a database? (One can dream...)
>>>
>>> If not, does anyone volunteer to generate them? They'd really help
>>> with my GLAM tools, increase Wikimedia outreach etc.
>> Pageview stats are still a mess and there's no centralized or clean
>> database, as far as I'm aware. Henrik's tool (stats.grok.se) has an API you
>> can hit for monthly stats: http://stats.grok.se/json/en/201006/Barack_Obama
>>
>> That's probably your best bet right now.
> 
> And that's what I'm doing, but I need to look for tens of thousands of
> pages, and it's very slow, not to mention traffic.
> 
>> From what I understand, Wikimedia is devoting resources to setting up Open
>> Web Analytics. The first test run is supposed to be this week, I think.
> 
> That sounds good. Was that announced anywhere?
> 
> Thanks,
> Magnus
> 




[Toolserver-l] Jobs with running time limitation

2010-11-17 Thread Mauro Girotto
Hi,

I've a couple of questions about the SGE.

Is it possible to have some notification when a job exceeds the running
time set by h_rt? What is the signal used to terminate the job?
SIGTERM or SIGKILL?

I've tried setting h_rt=0:1:0 and the logs (output and error) are
empty, although the application had produced some output before it was
killed.

Mauro



Re: [Toolserver-l] Newbie question: definition of a "heavy job" for toolserver

2010-11-17 Thread Alex Brollo
2010/11/17 River Tarnell 

>
> You shouldn't worry too much about resource use as long as it's not
> excessive.
> We (TS admins) will let you know if you seem to be using too many
> resources.


This is exactly what I'd like to know. :-)


> > Feel free to send me your best link to a tutorial "Unix for dummies".
>
> I usually recommend the book "Understanding UNIX" by Stan Kelly-Bootle, but
> unfortunately it's out of print, and it also doesn't answer this particular
> question.  (But it does tell you just about everything else you'd want to
> know.)
>

Thanks. There are lots of tutorials online; at the moment I'm very proud of
banal achievements like changing PATH from Python and running my first bash
"Hello world" script. Just to let you know how much of a dummy I am.

Alex

Re: [Toolserver-l] Newbie question: definition of a "heavy job" for toolserver

2010-11-17 Thread River Tarnell

Alex Brollo:
> Is there some Unix tool to evaluate the server resources used when running a script?

Yes, but it depends on the operating system.  If you're using a Solaris server 
(willow), use prstat:

 $ prstat -aU $LOGNAME

Or on Linux (nightshade), use top:

 $ top -u $LOGNAME

Both of these will show a list of all your processes, and various statistics, 
including CPU and memory use.  (prstat will also show a summary of your entire 
resource use.)

The interesting figures are CPU use and resident memory (called "RES" in top, 
and "RSS" in prstat).  "CPU" is the amount of CPU (in percent) that your 
process is using.  prstat reports this as a share of all CPU cores combined, so 
12.5% CPU means your process is using one entire core (the login servers have 8 
cores, and 100/8 = 12.5).  top reports it relative to a single core, so one 
fully used core shows as 100% (and 200% means two cores).
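
As a worked example of the difference (assuming the 8 cores mentioned above):

    # prstat reports CPU as a share of all cores combined; top reports it
    # relative to a single core.
    cores = 8
    prstat_pct = 12.5
    top_pct = 100.0

    print(prstat_pct / 100.0 * cores)   # 1.0 -> one full core, per prstat
    print(top_pct / 100.0)              # 1.0 -> one full core, per top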

Generally you shouldn't use 100% CPU constantly, but it's okay for a program to 
use a lot of CPU sometimes and sleep (using none) at other times.  What is
acceptable CPU use really depends on what the program does; if it makes 1 edit 
to the wiki per hour and does nothing else, using 10% CPU is probably 
excessive.  On the other hand, a script that does complicated processing might 
easily use 25%. 

RES/RSS (resident memory) is the amount of RAM being used by the program.  
There is a hard limit of 1GB per user on each server, but most tools should use 
a *lot* less than that.  Again, it really depends on what the tool does.
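
If a tool wants to check its own resident memory from inside Python, something
like this works (note that ru_maxrss is reported in kilobytes on Linux; on
Solaris it may not be filled in, so prstat is the safer bet there):

    import resource

    usage = resource.getrusage(resource.RUSAGE_SELF)
    print("peak RSS: %d kB" % usage.ru_maxrss)   # kilobytes on Linux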

You shouldn't worry too much about resource use as long as it's not excessive.  
We (TS admins) will let you know if you seem to be using too many resources.

> Feel free to send me your best link to a tutorial "Unix for dummies".

I usually recommend the book "Understanding UNIX" by Stan Kelly-Bootle, but 
unfortunately it's out of print, and it also doesn't answer this particular 
question.  (But it does tell you just about everything else you'd want to 
know.)

- river.



Re: [Toolserver-l] Newbie question: definition of a "heavy job" for toolserver

2010-11-17 Thread Alex Brollo
2010/11/17 River Tarnell 

> Alex Brollo:
> > My question is: how many pywikipedia readings/editing per hour are a
> "heavy"
> > toolserver job?
>
> We don't usually measure resource use in "pywikipedia readings/editing per
> hour" ;-) The most common indicators of resource use are CPU time and
> memory.
>

My premise was "A newbie question" :-P

OK. Is there some Unix tool to evaluate the server resources used when running
a script? And - if such a tool exists and it gives me back some esoteric
result - how can I interpret it in terms of "heaviness"? I guess the best
solution is to copy and paste the result here, if any, and ask you again.
:-)

Feel free to send me your best link to a tutorial "Unix for dummies".

Alex

Re: [Toolserver-l] Newbie question: definition of a "heavy job" for toolserver

2010-11-17 Thread River Tarnell

Alex Brollo:
> My question is: how many pywikipedia readings/editing per hour are a "heavy"
> toolserver job?

We don't usually measure resource use in "pywikipedia readings/editing per 
hour" ;-) The most common indicators of resource use are CPU time and memory.

There's also no clear point at which a job becomes "heavy" -- not least because 
"heavy" is a vague description and could mean different things to different 
people.

It would be easier to answer your question if you could provide some context...

- river.



[Toolserver-l] Newbie question: definition of a "heavy job" for toolserver

2010-11-17 Thread Alex Brollo
Hi all, I'm a new and very intimidated toolserver user :-)
At the moment I'm only doing some "Hello world" tests, as I said in a previous
message, but I'd like to run my pywikipedia bot, Alebot, from the toolserver;
it is a rather busy one
(http://stats.wikimedia.org/wikisource/EN/BotActivityMatrix.htm).

Alebot reads the itwikisource RecentChanges at intervals, selects new edits by
type, contributor and namespace, reads the new or edited pages and "does
things". Often those things are nothing; often they imply an edit of one
wikisource page; sometimes they are more complex, implying a few more reads or
edits of different, related pages (at most 3-4 reads/edits, plus reading and
updating local pickle files).
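
The overall shape is just a polling loop with pauses in between; in the sketch
below the two helpers are placeholders for the pywikipedia calls, not real
function names:

    import time

    POLL_INTERVAL = 300   # seconds between RecentChanges checks

    def fetch_recent_changes():
        # placeholder: in the real bot this reads it.wikisource RecentChanges
        # through pywikipedia and filters by type, contributor and namespace
        return []

    def handle(change):
        # placeholder: read the page and, when needed, edit it and the few
        # related pages, updating the local pickle files
        pass

    while True:
        for change in fetch_recent_changes():
            handle(change)
        time.sleep(POLL_INTERVAL)   # sleeping keeps the average CPU use low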

My question is: how many pywikipedia readings/editing per hour are a "heavy"
toolserver job?

Alex