Re: [gridengine users] OT: IBM to acquire Platform Computing!
On Tue, Oct 11, 2011 at 11:19, Rayson Ho wrote:
> On Tue, Oct 11, 2011 at 2:08 PM, Chris Dagdigian wrote:
>> On a related note, I was talking to a former Platform person whom I'm sure
>> many of us on this list know, and he mentioned that the stripped-down older
>> variant of Platform LSF that Platform produced back in the day ("lava") has
>> a new open-source home and developer group:
>>
>> http://openlava.net/
>
> Hmm, OpenLava is not backed by Platform Computing... and according to
> the domain record, it seems the project was started by a Bright
> Computing employee.
>
> BTW, given IBM's open-source track record, I believe Platform LSF will
> be a bit more open when IBM finally takes control.

While I would love to see LSF open sourced, I don't have quite the faith
you have, Rayson. IBM also has a lot of software that they don't open
source (LL being one!).

I'm the one who mentioned OpenLava to Chris, by the way. For the folks I
deal with, it's got enough functionality and horsepower to get the job
done. And I'm motivated to create some momentum around it so that it
continues to improve.

-- Chris

_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
[gridengine users] Strange Errors requesting advance reservation
Hi List,

I need your help, if possible. I would like to reserve some resources
for an upcoming event. I proceed with this command:

qrsub -w v -q "*@puffo.ifom-ieo-campus.it" -l "num_proc=10,h_vmem=50G" -a 10131740 -d 3:00:00 -u semmuser

Your job (-l h_vmem=50G,num_proc=10) cannot run at host
"puffo.bioinfo.ifom-ieo-campus.it" because it offers only
hl:num_proc=24.00
verification: no suitable queues

OK, then I try without requesting processors:

qrsub -w v -q "*@puffo.ifom-ieo-campus.it" -u semmuser -l h_vmem=50G -a 10131740 -d 3:00:00

Your job (-l h_vmem=50G) cannot run in queue "al...@puffo.ifom-ieo-campus.it"
because job requests unknown resource (h_vmem)

I do not understand why I get these errors. Could you help me find the
cause?

--
Mauro Donadello

Sistemi Informativi

COGENTECH - Consortium for Genomic Technologies

Via Adamello, 16 - 20139 Milan, Italy
T 39 02 574303048
E mauro.donade...@ifom-ieo-campus.it
W www.ifom-ieo-campus.it
W www.ieo.it
Re: [gridengine users] OT: IBM to acquire Platform Computing!
On Thu, Oct 13, 2011 at 11:35 AM, Chris Smith wrote:
> While I would love to see LSF open sourced, I don't have quite the
> faith you have, Rayson.

Hi Chris,

Haven't worked with you for a long time... (10+ years?)

> IBM also has a lot of software that they don't
> open source (LL being one!).

I agree. Tivoli, which I have experience with, is not open source. (And
the list goes on: WebSphere, DB2, AIX, JIT/compilers, z/OS, IBM i, etc.)

By "open", I hope "IBM LSF" would at least make the documentation
available for download. More importantly, IBM sales have better
integrity, and they don't commonly use sneaky FUD-based sales tactics.

> I'm the one who mentioned openlava to
> Chris, by the way. For the folks I deal with, it's got enough
> functionality and horsepower to get the job done. And I'm motivated to
> create some momentum around it so that it continues to improve.

Yup, given your experience with LSF, I guess you and other LSF experts
would find OpenLava easier to migrate to.

Rayson

>
> -- Chris
Re: [gridengine users] Strange Errors requesting advance reservation
Hi,

On 13.10.2011 at 17:42, Mauro Donadello wrote:
> Hi List,
> I need your help, if is possible.
> I would like to reserve some resources for an incoming event.
> I proceed with this row:
> qrsub -w v -q "*@puffo.ifom-ieo-campus.it" -l "num_proc=10,h_vmem=50G" -a
> 10131740 -d 3:00:00 -u semmuser

num_proc is a fixed attribute of an exec host and should be treated as
such, i.e. with num_proc=10 you are requesting a machine which has
exactly 10 cores built in. But the machine in question has 24. If you
want to reserve 10 slots out of the 24, it's better to request a PE
(parallel environment) with 10 slots on this machine.

> Your job (-l h_vmem=50G,num_proc=10) cannot run at host
> "puffo.bioinfo.ifom-ieo-campus.it" because it offers only
> hl:num_proc=24.00
> verification: no suitable queues
>
> Ok, then I try without processors:
> qrsub -w v -q "*@puffo.ifom-ieo-campus.it" -u semmuser -l h_vmem=50G -a
> 10131740 -d 3:00:00
> Your job (-l h_vmem=50G) cannot run in queue "al...@puffo.ifom-ieo-campus.it"
> because job requests unknown resource (h_vmem)

This is indeed strange. h_vmem is still included in `qconf -sc`, I assume?

-- Reuti

> I do not understand why such errors.
> Could you help me to find the issue?
>
> --
> Mauro Donadello
>
> Sistemi Informativi
>
> COGENTECH - Consortium for Genomic Technologies
>
> Via Adamello, 16 - 20139 Milan, Italy
> T 39 02 574303048
> E mauro.donade...@ifom-ieo-campus.it
> W www.ifom-ieo-campus.it
> W www.ieo.it
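A rough diagnostic sketch along these lines (queue and host names are taken from this thread; these commands only make sense against a live cluster, and the PE name `reserve_pe` is a placeholder for whatever PE is actually attached to the queue):

```shell
# 1. Confirm h_vmem is still defined as a complex on this cluster; it
#    should appear with a MEMORY type and a requestable/consumable flag:
qconf -sc | grep h_vmem

# 2. See what the target queue instance actually offers for h_vmem:
qstat -F h_vmem -q "*@puffo.ifom-ieo-campus.it"

# 3. Reserve 10 slots through a parallel environment instead of
#    requesting num_proc ("reserve_pe" is a placeholder PE name):
qrsub -w v -q "*@puffo.ifom-ieo-campus.it" -pe reserve_pe 10 \
      -l h_vmem=50G -a 10131740 -d 3:00:00 -u semmuser
```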
[gridengine users] Accounting file problem
Hi all,

I'd like to have your input on a problem we are facing right now:

We have a small script which parses the SGE (6.2u5) accounting file and
writes the information into an SQL database. We just found what seems to
be a problem in the accounting file. From man 5 accounting:

    ru_wallclock
        Difference between end_time and start_time (see above).

We use that particular field to gather statistics for our users. What we
found is that when the "failed" field is 37, the ru_wallclock field is
always 0, even if the job did run. We don't yet know exactly under which
circumstances this happens.

Here's one such entry from the accounting file:

med:r104-n7:nne-790-01:sboisver12:SRA024407-Ray-1.4.0-k31-group1:2903640:sge:0:1306781385:1307195150:1307470755:37:0:0:1023454.939168:617405.204111:0.00:0:0:0:0:134261699:23127:0:0.00:0:0:0:0:23568146:18934035:nne-790-ab:defaultdepartment:default:512:0:0.00:0.00:0.00:-l h_rt=86400 -pe default 512:0.00:NONE:0.00:0:0

And its qacct output:

==============================================================
qname        med
hostname     r104-n7
group        nne-790-01
owner        sboisver12
project      nne-790-ab
department   defaultdepartment
jobname      SRA024407-Ray-1.4.0-k31-group1
jobnumber    2903640
taskid       undefined
account      sge
priority     0
qsub_time    Mon May 30 14:49:45 2011
start_time   Sat Jun  4 09:45:50 2011
end_time     Tue Jun  7 14:19:15 2011
granted_pe   default
slots        512
failed       37 : qmaster enforced h_rt limit
exit_status  0
ru_wallclock 0
ru_utime     1023454.939
ru_stime     617405.204
ru_maxrss    0
ru_ixrss     0
ru_ismrss    0
ru_idrss     0
ru_isrss     0
ru_minflt    134261699
ru_majflt    23127
ru_nswap     0
ru_inblock   0
ru_oublock   0
ru_msgsnd    0
ru_msgrcv    0
ru_nsignals  0
ru_nvcsw     23568146
ru_nivcsw    18934035
cpu          0.000
mem          0.000
io           0.000
iow          0.000
maxvmem      0.000
arid         undefined

Has anyone experienced this before? Is this a known bug/feature?

Thanks,

--
Laurent Duchesne
CLUMEQ, Université Laval
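A possible workaround sketch for the parsing script, under the assumption that end_time minus start_time is the value ru_wallclock should have held (which is what man 5 accounting says). The field positions follow accounting(5): $6 = job_number, $9 = submission_time, $10 = start_time, $11 = end_time, $12 = failed, $14 = ru_wallclock.

```shell
# Recompute ru_wallclock as end_time - start_time whenever failed == 37
# left it at 0 but the job actually started ($10 > 0).
# The sample entry from this thread, saved as a test accounting file:
cat > accounting.sample <<'EOF'
med:r104-n7:nne-790-01:sboisver12:SRA024407-Ray-1.4.0-k31-group1:2903640:sge:0:1306781385:1307195150:1307470755:37:0:0:1023454.939168:617405.204111:0.00:0:0:0:0:134261699:23127:0:0.00:0:0:0:0:23568146:18934035:nne-790-ab:defaultdepartment:default:512:0:0.00:0.00:0.00:-l h_rt=86400 -pe default 512:0.00:NONE:0.00:0:0
EOF

# Print job number and (corrected) wallclock for every entry:
awk -F: '{ wc = $14
           if ($12 == 37 && wc == 0 && $10 > 0) wc = $11 - $10
           print $6, wc }' accounting.sample
# prints: 2903640 275605
```

For the sample entry this yields 1307470755 - 1307195150 = 275605 seconds, matching the qacct start_time/end_time difference.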
Re: [gridengine users] Accounting file problem
On 13.10.2011 at 18:10, Laurent Duchesne wrote:
> I'd like to have your input on a problem we are facing right now:
>
> We have a small script which parses the SGE (6.2u5) accounting file
> and writes information in a SQL database. We just found out about what
> seems to be a problem in the accounting file. From man 5 accounting:
>
> ru_wallclock
>     Difference between end_time and start_time (see above).
>
> We use that particular field to gather statistics for our users. What
> we found out was that when the "failed" field is 37, the ru_wallclock
> field is always 0, even if the job did run. We don't know exactly
> under which circumstances this happens yet.
>
> Here's one such entry from the accounting file:
>
> med:r104-n7:nne-790-01:sboisver12:SRA024407-Ray-1.4.0-k31-group1:2903640:sge:0:1306781385:1307195150:1307470755:37:0:0:1023454.939168:617405.204111:0.00:0:0:0:0:134261699:23127:0:0.00:0:0:0:0:23568146:18934035:nne-790-ab:defaultdepartment:default:512:0:0.00:0.00:0.00:-l
> h_rt=86400 -pe default 512:0.00:NONE:0.00:0:0
>
> And its qacct output:
>
> ==
> qname        med
> hostname     r104-n7
> group        nne-790-01
> owner        sboisver12
> project      nne-790-ab
> department   defaultdepartment
> jobname      SRA024407-Ray-1.4.0-k31-group1
> jobnumber    2903640
> taskid       undefined
> account      sge
> priority     0
> qsub_time    Mon May 30 14:49:45 2011
> start_time   Sat Jun  4 09:45:50 2011
> end_time     Tue Jun  7 14:19:15 2011
> granted_pe   default
> slots        512

What is your definition of the PE? Normally you get one accounting entry
per `qrsh` call, unless you specify in the PE that usage should be
summed up into one entry — or are all 512 slots allocated on one and the
same machine?

-- Reuti

> failed       37 : qmaster enforced h_rt limit
> exit_status  0
> ru_wallclock 0
> ru_utime     1023454.939
> ru_stime     617405.204
> ru_maxrss    0
> ru_ixrss     0
> ru_ismrss    0
> ru_idrss     0
> ru_isrss     0
> ru_minflt    134261699
> ru_majflt    23127
> ru_nswap     0
> ru_inblock   0
> ru_oublock   0
> ru_msgsnd    0
> ru_msgrcv    0
> ru_nsignals  0
> ru_nvcsw     23568146
> ru_nivcsw    18934035
> cpu          0.000
> mem          0.000
> io           0.000
> iow          0.000
> maxvmem      0.000
> arid         undefined
>
> Has anyone experienced this before? Is this a known "bug/feature"?
>
> Thanks,
>
> --
> Laurent Duchesne
> CLUMEQ, Université Laval
Re: [gridengine users] Accounting file problem
Hi Reuti,

On Thu, Oct 13, 2011 at 12:26 PM, Reuti wrote:
> On 13.10.2011 at 18:10, Laurent Duchesne wrote:
>
>> I'd like to have your input on a problem we are facing right now:
>>
>> We have a small script which parses the SGE (6.2u5) accounting file
>> and writes information in a SQL database. We just found out about what
>> seems to be a problem in the accounting file. From man 5 accounting:
>>
>> ru_wallclock
>>     Difference between end_time and start_time (see above).
>>
>> We use that particular field to gather statistics for our users. What
>> we found out was that when the "failed" field is 37, the ru_wallclock
>> field is always 0, even if the job did run. We don't know exactly
>> under which circumstances this happens yet.
>>
>> Here's one such entry from the accounting file:
>>
>> med:r104-n7:nne-790-01:sboisver12:SRA024407-Ray-1.4.0-k31-group1:2903640:sge:0:1306781385:1307195150:1307470755:37:0:0:1023454.939168:617405.204111:0.00:0:0:0:0:134261699:23127:0:0.00:0:0:0:0:23568146:18934035:nne-790-ab:defaultdepartment:default:512:0:0.00:0.00:0.00:-l
>> h_rt=86400 -pe default 512:0.00:NONE:0.00:0:0
>>
>> And its qacct output:
>>
>> ==
>> qname        med
>> hostname     r104-n7
>> group        nne-790-01
>> owner        sboisver12
>> project      nne-790-ab
>> department   defaultdepartment
>> jobname      SRA024407-Ray-1.4.0-k31-group1
>> jobnumber    2903640
>> taskid       undefined
>> account      sge
>> priority     0
>> qsub_time    Mon May 30 14:49:45 2011
>> start_time   Sat Jun  4 09:45:50 2011
>> end_time     Tue Jun  7 14:19:15 2011
>> granted_pe   default
>> slots        512
>
> What is your definition of the PE? Normally you have one entry per
> `qrsh` call, or are all 512 slots allocated on one and the same
> machine, unless you specify in the PE to sum it up.
>
> -- Reuti

Here's our PE definition:

pe_name            default
slots
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    8
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary TRUE

We have only one accounting entry per job/task because of the
accounting_summary setting.

>> failed       37 : qmaster enforced h_rt limit
>> exit_status  0
>> ru_wallclock 0
>> ru_utime     1023454.939
>> ru_stime     617405.204
>> ru_maxrss    0
>> ru_ixrss     0
>> ru_ismrss    0
>> ru_idrss     0
>> ru_isrss     0
>> ru_minflt    134261699
>> ru_majflt    23127
>> ru_nswap     0
>> ru_inblock   0
>> ru_oublock   0
>> ru_msgsnd    0
>> ru_msgrcv    0
>> ru_nsignals  0
>> ru_nvcsw     23568146
>> ru_nivcsw    18934035
>> cpu          0.000
>> mem          0.000
>> io           0.000
>> iow          0.000
>> maxvmem      0.000
>> arid         undefined
>>
>> Has anyone experienced this before? Is this a known "bug/feature"?

Thanks,

--
Laurent Duchesne
CLUMEQ, Université Laval