Re: [gridengine users] OT: IBM to acquire Platform Computing!

2011-10-13 Thread Chris Smith
On Tue, Oct 11, 2011 at 11:19, Rayson Ho wrote:
> On Tue, Oct 11, 2011 at 2:08 PM, Chris Dagdigian wrote:
>> On a related note I was talking to a former Platform person who I'm sure
>> many of us know on this list and he mentioned that the stripped down older
>> variant of Platform LSF that platform produced back in the day ("lava") has
>> a new open source home and developer group:
>>
>>  http://openlava.net/
>
> Hmm, OpenLava is not backed by Platform Computing... and according to
> the domain record, seems like the project is started by a Bright
> Computing employee.
>
> BTW, given IBM's open source track record, I believe Platform LSF will
> be a bit more open when IBM finally takes control.
>
While I would love to see LSF open sourced, I don't have quite the
faith you have, Rayson. IBM also has a lot of software that they don't
open source (LL being one!). I'm the one who mentioned openlava to
Chris, by the way. For the folks I deal with, it's got enough
functionality and horsepower to get the job done. And I'm motivated to
create some momentum around it so that it continues to improve.

-- Chris

___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


[gridengine users] Strange Errors requesting advance reservation

2011-10-13 Thread Mauro Donadello
Hi List, 
I need your help, if possible.
I would like to reserve some resources for an upcoming event.
I used this command:
qrsub -w v -q "*@puffo.ifom-ieo-campus.it" -l "num_proc=10,h_vmem=50G" -a 10131740 -d 3:00:00 -u semmuser


Your job (-l h_vmem=50G,num_proc=10) cannot run at host 
"puffo.bioinfo.ifom-ieo-campus.it" because it offers only hl:num_proc=24.00
verification: no suitable queues

OK, then I tried without the processor request:
qrsub -w v -q "*@puffo.ifom-ieo-campus.it" -u semmuser -l h_vmem=50G -a 10131740 -d 3:00:00
Your job (-l h_vmem=50G) cannot run in queue "al...@puffo.ifom-ieo-campus.it" 
because job requests unknown resource (h_vmem)

I do not understand why these errors occur.
Could you help me find the issue?


-- 
Mauro Donadello 

Sistemi Informativi

COGENTECH - Consortium for Genomic Technologies

Via Adamello, 16 - 20139 Milan,Italy
T 39 02 574303048
E mauro.donade...@ifom-ieo-campus.it
W www.ifom-ieo-campus.it 
W www.ieo.it 
---




Re: [gridengine users] OT: IBM to acquire Platform Computing!

2011-10-13 Thread Rayson Ho
On Thu, Oct 13, 2011 at 11:35 AM, Chris Smith wrote:
> While I would love to see LSF open sourced, I don't have quite the
> faith you have, Rayson.

Hi Chris,

Haven't worked with you for a long time... (10+ years?)

> IBM also has a lot of software that they don't
> open source (LL being one!).

I agree. Tivoli, which I have experience with, is not open source.
(And the list goes on: WebSphere, DB2, AIX, JIT/compilers, z/OS, IBM
i, etc.)

By open, I hope "IBM LSF" would at least release the documentation for
download. More importantly, IBM sales have better integrity, and they
don't commonly use sneaky FUD-based sales tactics.


> I'm the one who mentioned openlava to
> Chris, by the way. For the folks I deal with, it's got enough
> functionality and horsepower to get the job done. And I'm motivated to
> create some momentum around it so that it continues to improve.

Yup, given your experience with LSF, I guess you and other LSF experts
would find OpenLava easier to migrate to.

Rayson




Re: [gridengine users] Strange Errors requesting advance reservation

2011-10-13 Thread Reuti
Hi,

On 13.10.2011 at 17:42, Mauro Donadello wrote:

> Hi List, 
> I need your help, if possible.
> I would like to reserve some resources for an upcoming event.
> I used this command:
> qrsub -w v -q "*@puffo.ifom-ieo-campus.it" -l "num_proc=10,h_vmem=50G" -a 
> 10131740 -d 3:00:00  -u semmuser

num_proc describes a property of an exechost and should be treated as such, 
i.e. you are requesting a machine that has 10 cores built in, but the machine 
in question has 24. If you want to reserve 10 slots out of the 24, it's better 
to request a PE (parallel environment) with 10 slots on this machine.
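
As a rough sketch of the suggestion above (not a tested recipe), the snippet below assembles a `qrsub` call that requests 10 slots of a PE on the host instead of `num_proc=10`. The PE name `smp` and the helper `build_qrsub_args` are hypothetical; a PE would have to exist and be attached to a queue on that host. The `-a` value follows the MMDDhhmm form used in the original command.

```python
import shlex
from datetime import datetime, timedelta


def build_qrsub_args(start, duration, host, pe_name, slots, user):
    """Build a qrsub argument list reserving `slots` slots of a PE on one host.

    `pe_name` is an assumption: it must name a PE configured on the cluster.
    """
    total = int(duration.total_seconds())
    hours, rem = divmod(total, 3600)
    minutes, seconds = divmod(rem, 60)
    return [
        "qrsub",
        "-q", f"*@{host}",                 # any queue instance on this host
        "-pe", pe_name, str(slots),        # request slots via a PE, not num_proc
        "-a", start.strftime("%m%d%H%M"),  # start time as MMDDhhmm
        "-d", f"{hours}:{minutes:02d}:{seconds:02d}",  # duration as h:mm:ss
        "-u", user,
    ]


args = build_qrsub_args(
    start=datetime(2011, 10, 13, 17, 40),
    duration=timedelta(hours=3),
    host="puffo.ifom-ieo-campus.it",
    pe_name="smp",  # hypothetical PE name
    slots=10,
    user="semmuser",
)
# Print a shell-quoted command line for inspection before running it.
print(shlex.join(args))
```

Printing the command first makes it easy to review the request; passing `args` to `subprocess.run` would actually submit the reservation.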


> Your job (-l h_vmem=50G,num_proc=10) cannot run at host 
> "puffo.bioinfo.ifom-ieo-campus.it" because it offers only 
> hl:num_proc=24.00
> verification: no suitable queues
> 
> OK, then I tried without the processor request:
> qrsub -w v -q "*@puffo.ifom-ieo-campus.it" -u semmuser -l h_vmem=50G -a 
> 10131740 -d 3:00:00 
> Your job (-l h_vmem=50G) cannot run in queue "al...@puffo.ifom-ieo-campus.it" 
> because job requests unknown resource (h_vmem)

This is indeed strange. I assume h_vmem is still listed in the output of `qconf -sc`?
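
One way to run that check from a script is sketched below. It scans text in the column layout that `qconf -sc` prints (name, shortcut, type, relop, requestable, consumable, default, urgency) and reports whether a complex is defined and requestable. The helper `find_complex` is hypothetical, and the sample text is illustrative only; it is not taken from the cluster in question.

```python
def find_complex(qconf_sc_output, name):
    """Return the column dict for complex `name`, or None if it is not defined."""
    columns = ["name", "shortcut", "type", "relop",
               "requestable", "consumable", "default", "urgency"]
    for line in qconf_sc_output.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the commented header
        fields = line.split()
        # Match on either the full name or the shortcut.
        if len(fields) == len(columns) and name in (fields[0], fields[1]):
            return dict(zip(columns, fields))
    return None


# Illustrative sample in the `qconf -sc` layout (an assumption, not real cluster output).
sample = """\
#name      shortcut  type    relop  requestable  consumable  default  urgency
#----------------------------------------------------------------------------
h_vmem     h_vmem    MEMORY  <=     YES          NO          0        0
num_proc   p         INT     ==     YES          NO          0        0
"""

entry = find_complex(sample, "h_vmem")
if entry is None:
    print("h_vmem is not defined in the complex configuration")
else:
    print("h_vmem requestable:", entry["requestable"])
```

On a live cluster the text could come from `subprocess.run(["qconf", "-sc"], capture_output=True, text=True).stdout` instead of the sample string.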

-- Reuti




[gridengine users] Accounting file problem

2011-10-13 Thread Laurent Duchesne
Hi all,

I'd like to have your input on a problem we are facing right now:

We have a small script which parses the SGE (6.2u5) accounting file
and writes information in a SQL database. We just found out about what
seems to be a problem in the accounting file. From man 5 accounting:

ru_wallclock
       Difference between end_time and start_time (see above).

We use that particular field to gather statistics for our users. What
we found out was that when the "failed" field is 37, the ru_wallclock
field is always 0, even if the job did run. We don't know exactly
under which circumstances this happens yet.

Here's one such entry from the accounting file:

med:r104-n7:nne-790-01:sboisver12:SRA024407-Ray-1.4.0-k31-group1:2903640:sge:0:1306781385:1307195150:1307470755:37:0:0:1023454.939168:617405.204111:0.00:0:0:0:0:134261699:23127:0:0.00:0:0:0:0:23568146:18934035:nne-790-ab:defaultdepartment:default:512:0:0.00:0.00:0.00:-l h_rt=86400 -pe default 512:0.00:NONE:0.00:0:0

And its qacct output:

==
qname        med
hostname     r104-n7
group        nne-790-01
owner        sboisver12
project      nne-790-ab
department   defaultdepartment
jobname      SRA024407-Ray-1.4.0-k31-group1
jobnumber    2903640
taskid       undefined
account      sge
priority     0
qsub_time    Mon May 30 14:49:45 2011
start_time   Sat Jun  4 09:45:50 2011
end_time     Tue Jun  7 14:19:15 2011
granted_pe   default
slots        512
failed       37  : qmaster enforced h_rt limit
exit_status  0
ru_wallclock 0
ru_utime     1023454.939
ru_stime     617405.204
ru_maxrss    0
ru_ixrss     0
ru_ismrss    0
ru_idrss     0
ru_isrss     0
ru_minflt    134261699
ru_majflt    23127
ru_nswap     0
ru_inblock   0
ru_oublock   0
ru_msgsnd    0
ru_msgrcv    0
ru_nsignals  0
ru_nvcsw     23568146
ru_nivcsw    18934035
cpu          0.000
mem          0.000
io           0.000
iow          0.000
maxvmem      0.000
arid         undefined
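
For a parsing script like ours, a possible workaround is sketched below: when ru_wallclock is 0 but the record carries valid start_time and end_time values, fall back to their difference, which is how the man page defines the field. This is only a sketch under assumptions: the function name `wallclock_seconds` is made up, the field positions (start_time, end_time, failed, ru_wallclock at zero-based indexes 9, 10, 11 and 13) are read off the record above, and the first 14 fields are assumed to contain no extra colons.

```python
def wallclock_seconds(record):
    """Return (seconds, failed, derived) for one colon-separated accounting record.

    `derived` is True when ru_wallclock was 0 and end_time - start_time
    was used instead.
    """
    fields = record.rstrip("\n").split(":")
    start_time = int(fields[9])
    end_time = int(fields[10])
    failed = int(fields[11])
    ru_wallclock = int(float(fields[13]))
    if ru_wallclock == 0 and start_time > 0 and end_time >= start_time:
        return end_time - start_time, failed, True
    return ru_wallclock, failed, False


# The record quoted above, written as a single line.
record = (
    "med:r104-n7:nne-790-01:sboisver12:SRA024407-Ray-1.4.0-k31-group1:"
    "2903640:sge:0:1306781385:1307195150:1307470755:37:0:0:"
    "1023454.939168:617405.204111:0.00:0:0:0:0:134261699:23127:0:0.00:"
    "0:0:0:0:23568146:18934035:nne-790-ab:defaultdepartment:default:"
    "512:0:0.00:0.00:0.00:-l h_rt=86400 -pe default 512:0.00:NONE:0.00:0:0"
)

seconds, failed, derived = wallclock_seconds(record)
print(seconds, failed, derived)  # 275605 37 True
```

The derived value of 275605 seconds (3 days, 4:33:25) matches the gap between start_time and end_time in the qacct output.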

Has anyone experienced this before? Is this a known "bug/feature"?

Thanks,

--
Laurent Duchesne
CLUMEQ, Université Laval



Re: [gridengine users] Accounting file problem

2011-10-13 Thread Reuti
On 13.10.2011 at 18:10, Laurent Duchesne wrote:

> I'd like to have your input on a problem we are facing right now:
> 
> We have a small script which parses the SGE (6.2u5) accounting file
> and writes information in a SQL database. We just found out about what
> seems to be a problem in the accounting file. From man 5 accounting:
> 
> ru_wallclock
>        Difference between end_time and start_time (see above).
> 
> We use that particular field to gather statistics for our users. What
> we found out was that when the "failed" field is 37, the ru_wallclock
> field is always 0, even if the job did run. We don't know exactly
> under which circumstances this happens yet.
> 
> Here's one such entry from the accounting file:
> 
> med:r104-n7:nne-790-01:sboisver12:SRA024407-Ray-1.4.0-k31-group1:2903640:sge:0:1306781385:1307195150:1307470755:37:0:0:1023454.939168:617405.204111:0.00:0:0:0:0:134261699:23127:0:0.00:0:0:0:0:23568146:18934035:nne-790-ab:defaultdepartment:default:512:0:0.00:0.00:0.00:-l
> h_rt=86400 -pe default 512:0.00:NONE:0.00:0:0
> 
> And its qacct output:
> 
> ==
> qname        med
> hostname     r104-n7
> group        nne-790-01
> owner        sboisver12
> project      nne-790-ab
> department   defaultdepartment
> jobname      SRA024407-Ray-1.4.0-k31-group1
> jobnumber    2903640
> taskid       undefined
> account      sge
> priority     0
> qsub_time    Mon May 30 14:49:45 2011
> start_time   Sat Jun  4 09:45:50 2011
> end_time     Tue Jun  7 14:19:15 2011
> granted_pe   default
> slots        512

What is your definition of the PE? Normally you get one accounting entry per 
`qrsh` call, unless you specify in the PE that they should be summed up. Or are 
all 512 slots allocated on one and the same machine?

-- Reuti




Re: [gridengine users] Accounting file problem

2011-10-13 Thread Laurent Duchesne
Hi Reuti,

On Thu, Oct 13, 2011 at 12:26 PM, Reuti wrote:
> What is your definition of the PE? Normally you get one accounting entry per 
> `qrsh` call, unless you specify in the PE that they should be summed up. Or are 
> all 512 slots allocated on one and the same machine?
>
> -- Reuti
>

Here's our pe definition:

pe_name            default
slots
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    8
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary TRUE

We have only 1 entry per job/task because of the accounting_summary setting.

Thanks,

-- 
Laurent Duchesne
CLUMEQ, Université Laval
