Re: [gridengine users] Resource quotas and parallel jobs across multiple queues

2012-01-12 Thread Reuti
Hi,

On 12.01.2012 at 08:00, Brendan Moloney wrote:

> I seem to have found a combination of resource quotas that is preventing
> the scheduler from scheduling parallel jobs across multiple queues.
> 
> I have multiple queues for jobs with different run times: veryshort.q,
> short.q, long.q, and verylong.q. Each of these queues has an increasing
> 'h_rt' limit and an increasing sequence number (I have the scheduler sort
> by sequence numbers). Each of these queues also has a decreasing number of
> slots available. Jobs are then submitted with an h_rt value, and the
> shortest queue with an open slot is used. I also have a parallel
> environment "mpi" that is enabled in all of these queues.
> 
> The problem only occurs if I use resource quota sets to both limit the total 
> number of slots for the queues and limit the number of slots on each node.
> 
> For example:
> 
> {
>   name         nodelimit
>   description  NONE
>   enabled      TRUE
>   limit        queues !debug.q hosts {*} to slots=$num_proc
> }
> {
>   name         shortlimit
>   description  NONE
>   enabled      TRUE
>   limit        queues short.q hosts * to slots=32

I think you can leave the "hosts *" out here and the other RQS below. It means 
"used slots across all machines" limited to 32 in this queue. The same can be 
achieved by specifying only the queue.

> }
> {
>   name         longlimit
>   description  NONE
>   enabled      TRUE
>   limit        queues long.q hosts * to slots=16
> }
> {
>   name         verylonglimit
>   description  NONE
>   enabled      TRUE
>   limit        queues verylong.q hosts * to slots=4
> }
> {
>   name         urgentlimit
>   description  NONE
>   enabled      TRUE
>   limit        users {*} queues urgent.q hosts * to slots=1
> }
> {
>   name         debuglimit
>   description  NONE
>   enabled      TRUE
>   limit        users {*} queues debug.q hosts {*} to slots=1
> }

As the above 5 limits are disjoint, they can also be put into one and the
same RQS. You can give each rule a name to get it listed by that name
instead of by the rule number, which is always 1 right now.
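A sketch of such a combined RQS with named rules (the rule names are only
examples):

{
   name         alllimits
   description  NONE
   enabled      TRUE
   limit        name nodes    queues !debug.q hosts {*} to slots=$num_proc
   limit        name short    queues short.q to slots=32
   limit        name long     queues long.q to slots=16
   limit        name verylong queues verylong.q to slots=4
   limit        name urgent   users {*} queues urgent.q to slots=1
   limit        name debug    users {*} queues debug.q hosts {*} to slots=1
}

A violated rule would then be reported as, e.g., "alllimits/short" instead
of "shortlimit/1".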


> This will cause a parallel job across multiple queues to never schedule. If 
> I get rid of the "nodelimit" and instead set the number of slots using 
> the complex value in the host configuration, then everything works (except
> my debug queue).

Do you have many machine types? What happens if you don't use $num_proc
there, but specify a hard-coded limit per hostgroup for each machine type or
so?

limit queues !debug.q hosts {@quadcore} to slots=4
limit queues !debug.q hosts {@hexacore} to slots=6


> Below I give an example of a hanging job (with the scheduler output enabled).
> I set h_rt to 3:50:00 as this will allow the queues short.q, long.q, and 
> verylong.q. I request 40 slots as that will have to span multiple queues. 

If I get you right, SGE could find different combinations for the slot
allocation, depending on the algorithm which is used, as all the queues are
on the same machines?

-- Reuti


> $ qsub -w e -l h_rt=3:50:00 -pe mpi 40 test.sh
> Your job 13280 ("test.sh") has been submitted
> 
> $ qstat -u '*'
> job-ID  prior  name     user     state  submit/start at      queue  slots  ja-task-ID
> --------------------------------------------------------------------------------------
>  13280  0.0    test.sh  moloney  qw     01/11/2012 21:21:32          40
> 
> $ qstat -j 13280
> ==============================================================
> job_number: 13280
> exec_file:  job_scripts/13280
> submission_time:Wed Jan 11 21:21:32 2012
> owner:  moloney
> ...
> scheduling info:cannot run in queue "debug.q" because PE "mpi" is 
> not in pe list
>cannot run in queue "urgent.q" because PE "mpi" is 
> not in pe list
>cannot run because it exceeds limit "piggy/" 
> in rule "nodelimit/1"
>cannot run because it exceeds limit "piggy/" 
> in rule "nodelimit/1"
>cannot run because it exceeds limit "piggy/" 
> in rule "nodelimit/1"
>cannot run because it exceeds limit "piggy/" 
> in rule "nodelimit/1"
>cannot run because it exceeds limit "kermit/" 
> in rule "nodelimit/1"
>cannot run because it exceeds limit "kermit/" 
> in rule "nodelimit/1"
>cannot run because it exceeds limit "kermit/" 
> in rule "nodelimit/1"
>cannot run because it exceeds limit "kermit/" 
> in rule "nodelimit/1"
>cannot run because it exceeds limit "animal/" 
> in rule "nodelimit/1"
>cannot run because it exceeds limit "animal/" 
> in rul

[gridengine users] How setup queue priority?

2012-01-12 Thread Semi

I need to set up high- and low-priority queues for the same nodes.
I would prefer to do it without subordinate lists.
I know that the following parameters deal with this:
seq_no    10
priority  20
If I'm right, please explain the meaning of these numbers;
if not, please correct me.



Re: [gridengine users] How setup queue priority?

2012-01-12 Thread William Hay
On 12 January 2012 11:41, Semi  wrote:
> I need to set up high- and low-priority queues for the same nodes.


> I would prefer to do it without subordinate lists.
> I know that the following parameters deal with this:
> seq_no                10

The seq_no is used to determine which queue a job will run in.

> priority              20
> If I'm right, please explain the meaning of these numbers;
Whether you are right depends on what you mean by high- and low-priority
queues.

This is the nice value. What this means in practice is that if there are
more processes/threads running on the node than there are cores/threads to
service them, then the jobs with the lower nice value will get more of the
available CPU time.
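For concreteness, a sketch of two such queues on the same hosts (queue names
and values are only an illustration):

# qconf -sq low.q   (excerpt)
seq_no                20
priority              19        # heavily niced, loses CPU under contention

# qconf -sq high.q  (excerpt)
seq_no                10        # preferred by the scheduler with queue_sort_method seqno
priority              0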






Re: [gridengine users] How setup queue priority?

2012-01-12 Thread Semi

I have 3 queues. I want:
all.q lowest priority
mid.q middle
hig.q highest

Can I solve this problem only with a subordinate list?
qconf -sq hig.q
subordinate_list  all.q=1, mid.q=1
qconf -sq mid.q
subordinate_list  all.q=1



Re: [gridengine users] Automatic CPU core binding - JSV script

2012-01-12 Thread Daniel Gruber
While core binding itself should work with such a topology in 6.2u5 (I never
tried it), the reporting of the topology string will be wrong. As you might
have noticed, string-based load values are only reported up to a length of
1024 bytes, which means that with 1000 nodes the full topology string does
not arrive. Hence m_topology and m_topology_inuse, as well as the selected
cores reported by "qstat -cb", are meaningless.

Unfortunately there is no free example script I'm aware of. SGE 6.2u6
contained one, but it is not free.

You should consider using (setting) the following JSV parameters:

binding_strategy, binding_type, binding_amount, binding_step, binding_socket,
binding_core, binding_exp_n, binding_exp_socket, binding_exp_core

with jsv_set_param, for example (for general JSV usage there are some
examples in 6.2u5).

Often it is required to set the length of the explicit request to 0
(jsv_set_param binding_exp_n 0), but doing it twice is not a good idea in
SGE 6.2u5.

binding_strategy can be set to either: linear, linear_automatic, striding,
striding_automatic, explicit, or no_job_binding.

Usually you want to set "linear_automatic" or "striding_automatic", which
only requires setting "binding_amount" (plus "binding_step" for striding).
This corresponds to "qsub -binding linear:N" and "qsub -binding
striding:S:N".

Using "linear" here requires setting a start core and a start socket, which
on the command line is something like "qsub -binding linear:2:0,0".

Does a "qsub -binding linear:1" actually work with 6.2u5  on the UltraViolet 
(meaning that it binds on the execution host without any issues)?

Regards, 

Daniel


On 12.01.2012 at 07:25, Gabor Roczei wrote:

> Dear Reuti,
> 
> On 2012.01.11., at 22:19, Reuti wrote:
> 
>> Hi,
>> 
>> On 11.01.2012 at 22:02, Gabor Roczei wrote:
>> 
>>> I am looking for a JSV script which can assign CPU core binding to parallel 
>>> and  array jobs in automatic way. Is there such script somewhere on the web 
>>> which can be used for SGE 6.2u5? I will be very glad if someone could share 
>>> with me this.
>> 
>> what are you looking for in detail? The built-in `qsub -binding ...` is not 
>> sufficient, or you just want to automate this?
> 
> Our users did not use "-binding" parameter when they submit a job and I would 
> like to set it with JSV script. We have a SGI UltraViolet 1000 SMP (ccNUMA)  
> supercomputer which has this CPU topology:
> 
> hl:m_topology=SCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCC!
 
SCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCCSCC
> 
> The processes context switch is too high and we would like to use this 
> automatic binding feaute to decrease it and speed up the jobs.
> 
> Gabor


Re: [gridengine users] How setup queue priority?

2012-01-12 Thread Reuti
Hi,

On 12.01.2012 at 12:41, Semi wrote:

> I need to set up high- and low-priority queues for the same nodes.
> I would prefer to do it without subordinate lists.
> I know that the following parameters deal with this:
> seq_no    10
> priority  20

In addition to William's remarks, as I'm also unsure what you refer to by
high/low priority:

If you refer more to "urgent jobs", it's better not to think in queues (this
is a Torque/PBS thing), but to define a boolean complex like "high" with an
attached urgency and request it for urgent jobs (`qsub -l high job.sh`), so
that they are pushed to the top of the list of waiting jobs.
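A minimal sketch of that complex (the urgency value of 1000 is only an
example); the line is added via `qconf -mc`:

#name  shortcut  type  relop  requestable  consumable  default  urgency
high   high      BOOL  ==     YES          NO           0        1000

Urgent jobs submitted with `qsub -l high job.sh` then gain the extra urgency
in the priority calculation.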

-- Reuti




Re: [gridengine users] How setup queue priority?

2012-01-12 Thread Reuti
On 12.01.2012 at 13:14, Semi wrote:

> I have 3 queues. I want:
> all.q lowest priority
> mid.q middle
> hig.q highest

Still the question: which effect do you want to achieve by this? Should jobs
start earlier in hig.q, should jobs get more CPU cycles, or should jobs get
suspended in all.q?

-- Reuti




Re: [gridengine users] How setup queue priority?

2012-01-12 Thread Semi

I want to free the nodes from running jobs when a high-priority job is submitted.



Re: [gridengine users] How setup queue priority?

2012-01-12 Thread Reuti
On 12.01.2012 at 13:37, Semi wrote:

> I want to free the nodes from running jobs when a high-priority job is submitted.

You mean to remove them from the nodes completely? This is tricky, as SGE
can't look ahead like: "I will get an increment of h_vmem by 4 GB if I
remove job 45636." For this you would need a co-scheduler.

If jobs can still be scheduled to the node, then you could combine the
subordination (which will suspend a queue instance and hence a job running
therein) with a checkpointing interface, to reschedule the low-priority job
when it gets suspended - so it's back in the list of waiting jobs, but at
the top (instead of deleting and resubmitting the job, where it would be at
the end).
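A sketch of how the two pieces fit together, assuming a checkpointing
environment named "resched" has been defined (see the checkpoint(5) man page
for the exact attributes) and the application can restart from its own
restart files:

# suspend subordination on the high-priority queue (qconf -mq hig.q):
#   subordinate_list  all.q=1

# checkpointing environment (qconf -ackpt resched), migrating on suspend:
#   interface  APPLICATION-LEVEL
#   when       xs

# low-priority jobs are then submitted migratable:
qsub -q all.q -ckpt resched job.sh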

-- Reuti




[gridengine users] documentation for SGE

2012-01-12 Thread Peskin, Eric
All,

What is the best source of documentation for SGE?

I had been using http://wikis.sun.com/display/GridEngine,
but that seems to have disappeared.

Thanks,
Eric





Re: [gridengine users] documentation for SGE

2012-01-12 Thread Gerard Henry

Hello,

I've noted http://gridscheduler.sourceforge.net/documentation.html
and
http://arc.liv.ac.uk/SGE/


On 01/12/12 04:41 PM, Peskin, Eric wrote:

All,

What is the best source of documentation for SGE?

I had been using http://wikis.sun.com/display/GridEngine,
but that seems to have disappeared.

Thanks,
Eric





[gridengine users] deciding spool directory location

2012-01-12 Thread Wolf, Dale
We are in the planning phase for the initial installation of grid engine.
The initial configuration is a single cluster with 30 SLES 11 machines. This
number may grow to as many as 100 SLES 11 servers.

The Oracle N1 Grid Engine 6 Installation Guide, under "sge-root Installation
Directory", indicates placing the spool directory under sge-root may be
avoided for efficiency reasons. Later on, under "Spool Directories Under the
Root Directory", it states:

"You do not need to export these directories to other machines. However,
exporting the entire sge-root tree and making it write-accessible for the
master host and all executable hosts makes administration easier."

We are trying to determine where the spool directory should reside, based on
performance versus ease of administration. Can somebody explain how
administration would be made easier?

Thanks in advance.

Dale






Re: [gridengine users] deciding spool directory location

2012-01-12 Thread Rayson Ho
You can reference this HOWTO:

http://gridscheduler.sourceforge.net/howto/nfsreduce.html

You can put everything on NFS, and if the NFS server can't handle the load,
then change the configuration to local spooling later on.

Rayson




Re: [gridengine users] deciding spool directory location

2012-01-12 Thread Reuti
Hi,

On 12.01.2012 at 18:17, Wolf, Dale wrote:

> We are in the planning phase for the initial installation of grid engine.
> The initial configuration is a single cluster with 30 SLES 11 machines.
> This number may grow to as many as 100 SLES 11 servers.
>
> The Oracle N1 Grid Engine 6 Installation Guide, under "sge-root
> Installation Directory", indicates placing the spool directory under
> sge-root may be avoided for efficiency reasons. Later on, under "Spool
> Directories Under the Root Directory", it states:
>
> "You do not need to export these directories to other machines. However,
> exporting the entire sge-root tree and making it write-accessible for the
> master host and all executable hosts makes administration easier."

Well, the spool directory is inside $SGE_ROOT/default/spool, but the best
way for me is in the middle: export $SGE_ROOT to all machines, while
redirecting the spool directory to a local path like /var/spool/sge, which
only needs to be writable by the SGE admin user (the "/var/spool/sge/qmaster"
directory needs to be created beforehand, while the spool directories for
the nodes will be created automatically when sge_execd starts).

http://arc.liv.ac.uk/SGE/howto/nfsreduce.html
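A small sketch of that layout, assuming the admin user is called "sgeadmin"
(adjust names to your site):

# on the qmaster, before the installation:
mkdir -p /var/spool/sge/qmaster
chown -R sgeadmin /var/spool/sge

# during installation, answer /var/spool/sge/qmaster for the qmaster spool
# directory and /var/spool/sge for the execd spool directory; the per-node
# directories /var/spool/sge/<hostname> are then created by sge_execd.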

-- Reuti




Re: [gridengine users] More Univa FUD???

2012-01-12 Thread William Deegan
Chi,

On Jan 11, 2012, at 6:44 PM, Chi Chan wrote:

> So what's your point, William? Like others have already said, did you read 
> what Ron said, or are you just not happy with many forks and each with 
> features that are different, and like you said before that you needed to 
> choose one to use?

It's a pain to deal with multiple forks.

> 
> And where are your contributions? While it is perfectly fine to use SGE 
> without contributing, you are acting as if you built the SGE community like 
> others who have done it here for years, and you are feeling despair because 
> others just made your effort worthless.

As far as I know, this list is open for all to comment, regardless of their
contributions to any particular project.

> 
> Dave Love complained because Univa is spreading what he believe untrue 
> information about his fork. The Linux community complained when Microsoft did 
> the same thing to Linux. So now are you saying that Dave should not even 
> complain in the first place just because this way more people can use SGE?

Feel free to complain about anything all you want.
It would be wise to address the FUD on Dave's or Rayson's site, or even on
gridengine.org (if Chris feels it's an appropriate place).

> 
> And you keep on asking for new features, and keep posting questions on the 
> list - but you are supporting your clients and customers for money. And do 
> you redirect your customers' feature requests to this list and ask people to 
> implement them for you?
> 

If this list isn't for asking questions, then what is its purpose?
Do you not earn money using gridengine? Does anyone on this list use
gridengine just for fun? (Not saying it's not fun to use.)
Are you seriously saying that it's wrong to use gridengine to make money,
that it's further wrong to ask questions here to support that work, or even
wrong to sell support for gridengine?

> * I can't find a better way to describe you than a "freeloader".

Then by your definition almost the entire world is a freeloader?

I contribute to a few open source projects (buildbot, plone) and co-manage
SCons (scons.org).
Is reporting bugs not contributing to the project? Is asking for
clarifications, which then become part of the mailing list archive that
other people can find via Google in the middle of the night when they hit
the same problem (rather than having to wait for a response), not providing
value to the community?
I am by no means a freeloader on the open source community as a whole.

Having worked in many startups, let me explain something to you: no product
has any value if no one uses it; furthermore, without requests from those
users, products rarely get more usable.

-Bill

> 
> 
> --Chi
> 
> 
> 
> ----- Original Message -----
> From: William Deegan
> To: Ron Chen
> Cc: users
> Sent: 2012/1/11 (Wed) 3:13 PM
> Subject: Re: [gridengine users] More Univa FUD???
> 
> All,
> 
> I've been reading these discussions for a while.
> I think generally they are counter productive.
> 
> If you have a fork (currently there are 3 that I see: Univa, Son of Grid
> Engine, and gridengine; what's Ron Chen's project's name?), then just work
> on making your fork the most attractive to use.
> If it's not, nobody will use it.
> 
> Don't worry about the other guy and/or company.
> You're not going to change their actions (at this point) by complaining.
> If people are concerned about a given fork, they won't use it.
> 
> It would be great if there was only one repo, but I don't see that happening 
> any time soon, or ever given the dialogues on this mailing list.
> 
> That's my 2cents.
> -Bill
> 
> On Jan 10, 2012, at 10:46 PM, Ron Chen wrote:
> 
>> And I just found this one today:
>> 
>> 
>> http://www.univagridengine.com/
>> 
>> Again, as a contributor who has stayed with Oracle and Sun Grid Engine and
>> Open Grid Scheduler for over 10 years, I think it is unacceptable to
>> register a domain using another company's product name.
>> 
>> While Univa has been using FUD against open source, this is not the way to
>> take revenge.
>> 
>> 
>>   -Ron
>> 
>> 
>> 
>> 
>> - Original Message -
>> From: Ron Chen 
>> To: Mark Magento 
>> Cc: users 
>> Sent: Friday, January 6, 2012 12:46 AM
>> Subject: Re: [gridengine users] More Univa FUD???
>> 
>> Hi Mark,
>> 
>> (Just back from my vacation and I am really late in this discussion.)
>> 
>> 
>> Did you create this website?
>> 
>> http://unicloud.wordpress.com/
>> 
>> While I am not a fan of Univa (mainly I have a problem with its market
>> practices), I am also not a fan of those who create a website using the
>> name of other people's products (i.e. Univa's UniCloud), with the content
>> of the website all about bashing Univa's products.
>> 
>> If possible, please unregister that blog - it is not helping anyone in the 
>> HPC or Cloud industry.
>> 
>> Also, the "Grid Engine Truth" website referenced in the article is dead:
>> 
>> http://www.gridenginetruth.com/
>> 
>> But from WHOIS, the registrant of the site 

Re: [gridengine users] More Univa FUD???

2012-01-12 Thread Joe Landman

On 01/11/2012 01:46 AM, Ron Chen wrote:

And I just found this one today:


http://www.univagridengine.com/

Again, as a contributor who has stayed with Oracle and Sun Grid Engine and
Open Grid Scheduler for over 10 years, I think it is unacceptable to register
a domain using another company's product name.


More than merely wrong, it opens up the people/company who registered it to
legal action in the US, if Univa and/or GridEngine are trademarks or
copyrights of a particular entity.


IANAL (included so I don't get sued for providing legal advice, which I am
definitely not doing). I can tell you my pedestrian understanding is that if
someone else (person/entity) owns the name and marks you are misrepresenting
as your own, you open yourself up to a world of legal hurt.


Trademark and copyright infringement all go to intent. If Chris Dag owns
"Dag's Muffin Shop" and has been operating it for a while, and along comes
Joe and he opens (weird, writing in the third person) a website named
www.dagsmuffinshop.com in order to steal business from Chris, then Joe is
*begging* to be taken to court.


The same is true with the words Univa (very likely a trademark of Univa UD)
and GridEngine (probably a trademark of Oracle, or Univa, or ...).


Chris's gridengine.info site is a good, positive site that ought to be
highly encouraged by the owners of the marks, as it makes their marks more
valuable (hint hint, but no one put me up to saying this, nor do I have any
financial or other interest in any of this). It's quite unlikely that
someone who owns the marks would perceive that as a threat. On the other
hand, the purposefully misdirecting sites would ... quite likely ... inspire
the ire of corporate legal folks. Highly not recommended.



While Univa has been using FUD against open source, this is not the way to
take revenge.


Success is the best revenge.

--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: land...@scalableinformatics.com
web  : http://scalableinformatics.com
   http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615



Re: [gridengine users] deciding spool directory location

2012-01-12 Thread Chris Dagdigian

Hi Dale,


We are trying to determine where the spool directory should reside, based on
performance versus ease of administration. Can somebody explain how
administration would be made easier?


Here is a short answer:

When the spool directory is shared, it is far easier for an administrator to
troubleshoot node-specific job issues, because you can see/access all of the
spool locations without having to hop to a specific machine.

When spool is not shared, your spool data and messages are on local disk on
the compute nodes. This means that you have to connect to that node in order
to read or examine the files.


More detail ...

The decision to do shared or not-shared generally revolves around the power
of your NFS server, what else is talking on that same
network/subnet/VLAN/wire, and probably more importantly how many jobs you
might be running through your system during a day. The number of jobs
entering and exiting the system is the real factor in how often and how hard
your spool share is getting hit. Some of my pharma clusters run hours-long
jobs and might only do a few hundred or a few thousand jobs per day. Another
biotech cluster of similar size might be doing 150,000 jobs per day running
short chemical simulations.

My gut answer is usually to do shared spool first and only move away from
that if performance demands it. Changing the spooling location post-install
is not a huge deal.
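For reference, a sketch of such a post-install change (the paths are
examples; admin rights required):

# move execd spooling off the share by editing the cluster configuration:
qconf -mconf              # set: execd_spool_dir  /var/spool/sge
# or override it for a single host:
qconf -mconf node01
# restart the execd on the affected nodes afterwards.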


I'm also a classic-spooling zealot. I hate berkeleydb spooling, and even on
the 2000-core cluster that does 150,000 jobs per day we still use classic
spooling on an NFS-shared SGE root and spool. We are, however, using Isilon
scale-out NAS for the NFS, and that means we have no real performance issues
at all.


My $.02

-Chris





Re: [gridengine users] More Univa FUD???

2012-01-12 Thread Rayson Ho
On Thu, Jan 12, 2012 at 1:46 PM, Joe Landman
 wrote:
> More than merely wrong, it opens up the people/company who registered it to
> legal action in the US if univa and/or gridengine are trademarks, or
> copyrighted of a particular entity.

Joe, you haven't shown up on the Grid Engine lists for a long while! :-D

"Univa" is a registered trademark (for computer software design, ... ,
grid computing applications, etc, then the owner is Univa Corporation.
For other industries, then the trademark is owned by other "Univa"
companies).

"Grid Engine" is not a trademark. I think Oracle/Sun can still
trademark it, but others may not be possible to do so as we have 4
implementations of "Grid Engine", plus Xoreax Grid Engine which has
nothing to do with SGE. And if I understand the US copyright &
trademark rules correctly, one can't trademark commonly used terms or
words like "computer", "chair", or even "windows" unless one uses it
in a different context (like naming an OS "Windows"). On the other
hand, Linus applied for the "Linux" trademark after Linux was wide
spread, but he is the first person using the word "Linux" to refer to
his products.

Rayson





Re: [gridengine users] More Univa FUD???

2012-01-12 Thread Joe Landman

On 01/12/2012 02:14 PM, Rayson Ho wrote:

On Thu, Jan 12, 2012 at 1:46 PM, Joe Landman
  wrote:

More than merely wrong, it opens up the people/company who registered it to
legal action in the US if univa and/or gridengine are trademarks, or
copyrighted of a particular entity.


Joe, you haven't showed up on the Grid Engine lists for a long while! :-D


Hi Rayson,

  Been lurking.



"Univa" is a registered trademark (for computer software design, ... ,
grid computing applications, etc, then the owner is Univa Corporation.
For other industries, then the trademark is owned by other "Univa"
companies).


The question is always one of possible market confusion. If there is no
possible confusion, then generally there is no issue. That said, some
companies who make otherwise fine products have an ... er ... unfortunate
posture of filing iLawsuits against others for even remote similarities
... :(




"Grid Engine" is not a trademark. I think Oracle/Sun can still
trademark it, but others may not be possible to do so as we have 4
implementations of "Grid Engine", plus Xoreax Grid Engine which has
nothing to do with SGE. And if I understand the US copyright &


The US is a "first to file" regime, with proof of use of the trademark 
in market.  That is, you can't trademark something that hasn't actually 
been sold and in market yet.  Weird.



trademark rules correctly, one can't trademark commonly used terms or
words like "computer", "chair", or even "windows" unless one uses it


Heh ... well, with sufficient resources, you can try (e.g. Microsoft, 
Apple, ...).



in a different context (like naming an OS "Windows"). On the other
hand, Linus applied for the "Linux" trademark after Linux was wide
spread, but he is the first person using the word "Linux" to refer to
his products.


We've gone through this with our products ... I am pretty up to speed on
what is required. In short, the product has to be in the market for at least
6 months before the mark can be granted, has to be non-conflicting (e.g. I
can't create "Joe's Microsoft Windows" and trademark that; I'd be taken to
court), and a bunch of other things. Shortly after we received our mark for
JackRabbit, we found out that Apache had a project of the same name, oddly
enough, for "storage". It's not overlapping, and non-conflicting. Yes, you
can have Apache JackRabbit on Scalable's JackRabbit, but they are in
different markets, serving different needs. No conflict as far as we are
concerned.


There was a lawyer who had filed for, and was granted, a trademark on Linux
before Linus had it. The USPTO had to effectively cancel the original
assignment and reassign it to Linus. I don't know the particulars of that
case, but it was something of a source of concern in the mid/late 90s.


Its generally "first to file" and you have to show that you have a 
legitimate claim to it.  Without that legitimate claim (you own the 
products/IP/... that bear that name, etc.) this is problematic.


Windows et al. are hard. Not sure about Univa and GridEngine. When reviewing
things, the USPTO takes a very dim view of off-by-one names.









Re: [gridengine users] Resource quotas and parallel jobs across multiple queues

2012-01-12 Thread Brendan Moloney
Hello,

>> {
>>   name         shortlimit
>>   description  NONE
>>   enabled      TRUE
>>   limit        queues short.q hosts * to slots=32

> I think you can leave the "hosts *" out here and the other RQS below. It 
> means "used slots across all machines" limited to 32 in this queue. The same 
> can be achieved by specifying only the queue.

Yes, I ended up making some things overly explicit while trying to debug the 
issue.

>> }
>> {
>>   name         longlimit
>>   description  NONE
>>   enabled      TRUE
>>   limit        queues long.q hosts * to slots=16
>> }
>> {
>>   name         verylonglimit
>>   description  NONE
>>   enabled      TRUE
>>   limit        queues verylong.q hosts * to slots=4
>> }
>> {
>>   name         urgentlimit
>>   description  NONE
>>   enabled      TRUE
>>   limit        users {*} queues urgent.q hosts * to slots=1
>> }
>> {
>>   name         debuglimit
>>   description  NONE
>>   enabled      TRUE
>>   limit        users {*} queues debug.q hosts {*} to slots=1
>> }

> As the above 5 limits are disjoint, they can also be put into one and the
> same RQS. You can give each rule a name to get it listed by that name
> instead of by the rule number, which is always 1 right now.

I originally had these as one RQS, but again tried to make things more explicit 
(or at least easier for me to understand) while debugging.

>> This will cause a parallel job across multiple queues to never schedule. If
>> I get rid of the "nodelimit" and instead set the number of slots using
>> the complex value in the host configuration, then everything works (except
>> my debug queue).

> Do you have many machine types? What happens if you don't use $num_proc
> there, but specify a hard-coded limit per hostgroup for each machine type
> or so?
>
> limit queues !debug.q hosts {@quadcore} to slots=4
> limit queues !debug.q hosts {@hexacore} to slots=6

I don't have many machine types; in fact, I don't have many machines! I
tried to replace the nodelimit RQS with:

{
   name         nodelimit
   description  NONE
   enabled      TRUE
   limit        queues !debug.q hosts {animal.ohsu.edu,kermit.ohsu.edu} to slots=24
   limit        queues !debug.q hosts {piggy.ohsu.edu} to slots=8
}

This gives the same result as the original nodelimit RQS that used $num_proc 
(the job never gets scheduled).

>> Below I give an example of a hanging job (with the scheduler output enabled).
>> I set h_rt to 3:50:00 as this will allow the queues short.q, long.q, and
>> verylong.q. I request 40 slots as that will have to span multiple queues.

>If I get you right, SGE could find different combinations for the slot 
>allocation, depending on the algorithm which is used as all the queues are on 
>the same machines?

All the queues are on the same machines. I am not sure which "algorithm" you
refer to. As mentioned, the scheduler sorts by sequence number, so the
queues are checked in shortest-to-longest order. Thus my job that requests
40 slots with the given h_rt value should take 32 slots from short.q and 8
slots from long.q (provided nothing else is running on the cluster, which is
the case for my testing).

Thanks,
Brendan



Re: [gridengine users] Resource quotas and parallel jobs across multiple queues

2012-01-12 Thread Reuti
Hi,

On 12.01.2012 at 22:07, Brendan Moloney wrote:

> All the queues are on the same machines. I am not sure which "algorithm" you 
> refer to.

I refer to the internal algorithm SGE uses to collect slots from various
queues.

> As mentioned, the scheduler sorts by sequence number so the queues are 
> checked in shortest to longest order.

Not for parallel jobs. Only the allocation_rule is used (except for $pe_slots).

http://blogs.oracle.com/sgrell/entry/grid_engine_scheduler_hacks_least

Does your observation fit the aspects of parallel jobs at the end of the
above link?

> Thus my job that requests 40 slots with the given h_rt value should take 32 
> slots from short.q and 8 slots from long.q (provided nothing else is running 
> on the cluster, which is the case for my testing).

Interesting. Collecting slots from different queues has some implications 
anyway:

- the name of the $TMPDIR depends on the name of the queue, hence it's not the 
same on all nodes
- `qrsh -inherit ...` can't distinguish between the granted queues:

https://arc.liv.ac.uk/trac/SGE/ticket/813

-- Reuti


[gridengine users] My notes on building Open GridScheduler 2011.11 on RedHat/CentOS 6.x based systems

2012-01-12 Thread Chris Dagdigian


Tried to reverse-engineer my crusty old build environment into something
that I (or even others) can actually replicate or follow.

Going to try similar for 32-bit binaries, as well as document the process
for RHEL/CentOS 5.x based systems, in the near future...


Short link:
http://biote.am/6y

Long link:
http://bioteam.net/2012/01/building-open-grid-scheduler-on-centos-rhel-6-2/

Feedback welcome.

-dag





Re: [gridengine users] Resource quotas and parallel jobs across multiple queues

2012-01-12 Thread Brendan Moloney
>> All the queues are on the same machines. I am not sure which "algorithm" you 
>> refer to.
>
>I refer to the internal algorithm of SGE how to collect slots from various 
>queues.
>
>> As mentioned, the scheduler sorts by sequence number so the queues are 
>> checked in shortest to longest order.
>
>Not for parallel jobs. Only the allocation_rule is used (except for $pe_slots).
>
>http://blogs.oracle.com/sgrell/entry/grid_engine_scheduler_hacks_least
>
> Does your observation fit the aspects of parallel jobs at the end of the
> above link?

There is definitely still some interaction between the scheduler
configuration and the PE allocation rule. The allocation rule for the "mpi"
PE is $round_robin. When I run this example successfully (with the per-node
slot limits done through complex values), the grid engine will do
round-robin allocation in short.q (animal and kermit get 12 slots, piggy
gets 8), followed by round-robin allocation in long.q (animal and kermit get
4 slots).

>Interesting. Collecting slots from different queues has some implications 
>anyway:
>
>- the name of the $TMPDIR depends on the name of the queue, hence it's not the 
>same on all nodes

This should not be an issue for correctly written software, right?

>- `qrsh -inherit ...` can't distinguish between the granted queues:
>https://arc.liv.ac.uk/trac/SGE/ticket/813

I don't think this will affect us. We only run MPI programs with a tightly
integrated MPICH2, or SMP programs with the allocation rule set to
$pe_slots.

So is it safe to say that I have found a bug? It seems like my original RQS
should work. Or at least doing qsub with '-w e' should fail immediately,
instead of allowing the job to wait in 'qw' state forever.

Thanks,
Brendan


Re: [gridengine users] My notes on building Open GridScheduler 2011.11 on RedHat/CentOS 6.x based systems

2012-01-12 Thread Rayson Ho
Thanks Chris for posting this - I've never tried to build OGS outside of
our machines or EC2 images.

We needed to use Berkeley DB version 4.4.20 because the on-disk data
structure is not compatible across different releases of Berkeley DB - it's
not Oracle's fault; it's just not engineered that way. In order to read back
the configuration & jobs of an existing SGE installation, we need BDB 4.4.x
-- we think compatibility with older SGE versions is more important, and
thus we try not to break it if possible. For a fresh install, on-disk data
is not an issue, and one can safely use newer releases of Berkeley DB. I
modified OGS to use newer releases of Berkeley DB (including Berkeley DB 11g
R2), but that change and other changes were not included in GE 2011.11
because we needed to release "something" for SC11, and thus non-critical
features were all skipped - I am going to integrate some of the changes back
into trunk soon.
Rayson





[gridengine users] Available values in prolog

2012-01-12 Thread Michael Coffman
We are in the process of developing a gridwatcher utility that is launched
in the background from the prolog script. The intent is to have a process
that monitors various aspects of the job and stores or reports on them.

It currently determines the PID of the shepherd process and then watches all
of the child processes.

Initially it will watch memory usage, and if a job begins using more
physical memory than requested, the user will be notified. That's where my
question comes from.

Is there any way in the prolog to get access to the hard_request options
besides using qstat?

What I'm currently doing:

  cmd = "bash -c '. #{@sge_root}/default/common/settings.sh && qstat
-xml -j #{@number}'"

I have thought of possibly setting an environment variable via a jsv script
that can be queried by the prolog script.  Is this a good idea?  How much impact
on submission time does jsv_send_env() add?

Anyone else doing anything like this have any suggestions?

The end goal is to have a utility that users can also interact with to
monitor their jobs, by either setting environment variables or grid
complexes to affect the behavior of what is being watched and how they
are notified.

Thanks.

-- 
-MichaelC


Re: [gridengine users] Available values in prolog

2012-01-12 Thread Reuti
Hi,

On 13.01.2012 at 01:03, Michael Coffman wrote:

> We are in the process of developing a gridwatcher utility that is launched
> in the background from the prolog script.   The intent is to have a
> process monitor various aspects of the job and store or report on them.

this is of course an interesting goal. What are you missing right now?


> It currently determines the pid of the shepherd process then watches all
> the children processes.

I think it's easier to use the additional group ID, which is attached by SGE
to all child processes, whether they jump out of the process tree or not. It
is recorded in $SGE_JOB_SPOOL_DIR in the file "addgrpid".


> Initially it will be watching memory usage and if a job begins using more
> physical memory than requested, the user will be notified.  That's where
> my question comes from.

What about setting a soft limit for h_vmem and preparing the job script to
handle the signal and send an email? How will they request memory - by
virtual_free?


> Is there any way in the prolog to get access to the hard_request options
> besides using qstat?
> 
> What I'm currently doing:
> 
>  cmd = "bash -c '. #{@sge_root}/default/common/settings.sh && qstat
> -xml -j #{@number}'"
> 
> I have thought of possibly setting an environment variable via a jsv script
> that can be queried by the prolog script.  Is this a good idea?  How much 
> impact
> on submission time does jsv_send_env() add?

You can use either a JSV or a `qsub` wrapper for it.


> Any one else doing anything like this have any suggestions?
> 
> 
> The end goal is to have a utility that users can also interact with to
> monitor their jobs.  By either setting environment variables or grid
> complexes

Complexes are only handled internally by SGE. There is no user command for a
non-admin to change them.


> to affect the behavior of what is being watched and how they
> are notified.

AFAIK you can't change the content of an already inherited variable, as the
process got a copy of the value. Also, /proc/12345/environ is read-only. And
your "observation daemon" will run on all nodes - one per job, started from
the prolog, if I get you right?

But a nice solution could be the usage of the job context. It can be set by
the user on the command line, and your job can access it by issuing a
command similar to the one you already use. If the exec hosts are submit
hosts, the job can also change it by using `qalter`, like the user does on
the command line. We use the job context only for documentation purposes, to
record the issued command and append it to the email which is sent after the
job.

http://gridengine.org/pipermail/users/2011-September/001629.html

$ qstat -j 12345
...
context:COMMAND=subturbo -v 631 -g -m 3500 -p 8 -t infinity 
-s aoforce,OUTPUT=/home/foobar/carbene/gecl4_2carb228/trans_tzvp_3.out

It's only one long line, and I split it later into individual entries. In
your case you have to watch out for commas, as they are already used to
separate entries.

-- Reuti




Re: [gridengine users] Resource quotas and parallel jobs across multiple queues

2012-01-12 Thread Reuti
On 12.01.2012 at 23:52, Brendan Moloney wrote:

>> Interesting. Collecting slots from different queues has some implications 
>> anyway:
>> 
>> - the name of the $TMPDIR depends on the name of the queue, hence it's not 
>> the same on all nodes
> 
> This should not be an issue for correctly written software, right?

This depends on what you define as "correctly":

Case 1: You have no queuing system, and users are asked to create something 
like /scratch/reuti/foobar17 by hand on all nodes for a particular job. You 
pass this value as an argument to `mpiexec` and are quite happy that the 
application forwards it internally to all nodes. Setting it in ~/.profile at 
ssh login would mean changing it before each `mpiexec`. Even if only 
/scratch/reuti has to be created as a one-time setup, the path is the same on 
all nodes, so there is no need to set any variable.

Case 2: You have a queuing system and want to use $TMPDIR - it must be the one 
on each node, not the one forwarded from the master node of the parallel job 
as in case 1. Whether this works depends on whether the software honors 
something like $TMP or $TMPDIR, or behaves as in case 1.

Case 3: The software just uses $PWD for its scratch data. Hence you 
`cd $TMPDIR` on the master node, and this path will also be used on all slave 
nodes. If the directory isn't there, you are out of luck, or you fall back to 
/tmp (or your home directory) and lose SGE's handling of $TMPDIR.

In fact, this was tricky with some applications under Codine 5.3 - there were 
no cluster queues yet, and although the $TMPDIR was created on the slave 
nodes, it had a different name on each of them, as each queue had a unique 
name like node01.long.q, node02.long.q (with only one host per queue)... IIRC 
I made a loop across the involved nodes to create a symbolic link, with a name 
of my choosing, pointing to the $TMPDIR Codine had created. Oh dear, long 
ago...
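
A rough modern equivalent of that workaround might look like this in the job 
script (assuming passwordless ssh to the nodes and SGE's usual 
<tmpdir>/<job_id>.<task_id>.<queue> naming of the per-job directory - both 
assumptions should be verified for your setup):

# $PE_HOSTFILE lines look like: node01 4 long.q@node01 UNDEFINED
while read host nslots queue rest; do
  ssh $host "ln -s /tmp/$JOB_ID.1.${queue%%@*} /tmp/job_$JOB_ID"
done < $PE_HOSTFILE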


>> - `qrsh -inherit ...` can't distinguish between the granted queues:
>> https://arc.liv.ac.uk/trac/SGE/ticket/813
> 
> I don't think this will affect us. We only run MPI programs with a tightly 
> integrated MPICH2 or SMP programs with the allocation rule set to $pe_slots.
> 
> So is it safe to say that I have found a bug?

I think so. The limit in the RQS should be handled as you expect, especially 
since, as you note, it works when the slot counts are set individually in the 
exechost definitions.
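
For comparison, the working per-host variant can be set without editing each 
host by hand (the host name and slot count here are just placeholders):

$ qconf -mattr exechost complex_values slots=12 animal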


> It seems like my original RQS should work.

Yes.


> Or at least doing qsub with '-w e' should fail immediately instead of 
> allowing the job to wait in 'qw' state forever.

That would require a "no suitable queue" decision up front, but here the 
scheduler first finds a possible assignment and only fails to collect the 
slots later on.

-- Reuti
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


[gridengine users] qconf -sh fails on Centos 4 guest.

2012-01-12 Thread Simon Matthews
I have an installation of SGE 6.2U4 that I downloaded some years ago and have
installed on a couple of qmaster hosts.

I hope that I do not offend the users of this list by asking for help with a
binary installation, using binaries built by Sun.

I hope that someone can shed some light on the problem.

I have built some new virtualized clients using KVM on a Centos 6 host. The
Centos 5 client seems to work properly, but the Centos 4 client does not. I
need a Centos 4 execd for testing purposes.

I cannot install sge_execd because of the qconf problem.

qconf -sh results in:

qconf -sh
ERROR: failed receiving gdi request response for mid=1 (got no message).

I get this message whether I run this client against the new cluster or
against a cluster that has been running for several years. Other Centos 4
clients can run "qconf -sh" against both clusters without problems.

qping works from the problematic client:
 qping -info sgemaster 6444 qmaster 1
01/12/2012 20:59:38:
SIRM version: 0.1
SIRM message id:  1
start time:   01/12/2012 16:31:57 (1326414717)
run time [s]: 16052
messages in read buffer:  0
messages in write buffer: 0
nr. of connected clients: 2
status:   2
info: MAIN: E (16052.50) | signaler000: E (16052.05) |
event_master000: E (0.58) | timer000: E (1.58) | worker000: W (41.59) |
worker001: W (101.61) | listener000: W (5.58) | listener001: W (5.58) |
scheduler000: W (5.57) | ERROR
malloc:   arena(0) |ordblks(1) | smblks(0) | hblksr(0) |
hblhkd(0) usmblks(0) | fsmblks(0) | uordblks(0) | fordblks(0) | keepcost(0)
Monitor:  disabled

Simon
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] qconf -sh fails on Centos 4 guest.

2012-01-12 Thread Rayson Ho
On Fri, Jan 13, 2012 at 12:02 AM, Simon Matthews
 wrote:
> I have an installation of SGE 6.2U4 that I downloaded some years ago and
> have installed on a couple of qmaster hosts.

Are you using the same version of SGE (SGE 6.2u4) on both the qmaster
& the node? You can run "qconf -help | head -1" on both the master &
the node to show the version.
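
For example (the settings.sh path is a placeholder; the exact output format 
may differ slightly):

$ . /opt/sge/default/common/settings.sh
$ qconf -help | head -1
GE 6.2u4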

Most common GDI errors are due to mismatched versions of Grid Engine. If you
are running the same version of SGE, then let us know and I will dig into the
code to see what could possibly go wrong.

And don't worry about still using the Sun binaries; I work with sites that
have even older versions of Grid Engine. Sun contributed the code to open
source, and without Sun we wouldn't have this community.

Rayson





___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] qconf -sh fails on Centos 4 guest.

2012-01-12 Thread Simon Matthews
I am running the same version. I have one installation tree that is NFS
mounted. All clients use the same binaries.

I had wanted to move to 6.2U5, but I can't find a source to download it.

Simon

___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] qconf -sh fails on Centos 4 guest.

2012-01-12 Thread Simon Matthews
On Thu, Jan 12, 2012 at 10:00 PM, Simon Matthews  wrote:

> I am running the same version. I have one installation tree that is NFS
> mounted. All clients use the same binaries.
>
> I had wanted to move to 6.2U5, but I can't find a source to download it.
>

Arrgh --- apologies for the top posting!

___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] qconf -sh fails on Centos 4 guest.

2012-01-12 Thread Simon Matthews
On Thu, Jan 12, 2012 at 9:50 PM, Rayson Ho  wrote:

> On Fri, Jan 13, 2012 at 12:02 AM, Simon Matthews
>  wrote:
> > I have an installation of SGE 6.2U4 that I downloaded some years ago and
> > have installed on a couple of qmaster hosts.
>
> Are you using the same version of SGE (SGE 6.2u4) on both the qmaster
> & the node? You can run "qconf -help | head -1" on both the master &
> the node to show the version.
>
> Most common GDI errors are due to mismatching versions of Grid Engine
> - and if you are running the same version of SGE, then let us know, I
> will dig the code to see what can possibly go wrong.
>
> And don't worry about still using the Sun binaries, I work with sites
> that have even older versions of Grid Engine. Sun contributed the code
> to open source, and without Sun we wouldn't have this community.
>


Just in case it is related -- occasionally, I see the following on the
console:
 warning many ticks lost
your time source seems to be instable or some driver is hogging interrupts.
rip default_idle+0x20/0x23
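
If the unstable clock turns out to be the culprit, a commonly suggested 
mitigation for CentOS 4 KVM guests is to pin the guest's time source with a 
kernel boot parameter in grub.conf - the exact option depends on kernel 
version and architecture, so treat these as candidates to verify (kernel 
lines abbreviated):

# /boot/grub/grub.conf on the guest:
kernel /vmlinuz-2.6.9-... ro root=LABEL=/ clock=pmtmr
# or, on x86_64 kernels:
kernel /vmlinuz-2.6.9-... ro root=LABEL=/ notsc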

Simon

___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] qconf -sh fails on Centos 4 guest.

2012-01-12 Thread Rayson Ho
Does it hang when you issue the qconf command on that node, or does it
return the error message immediately?

Rayson



___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] qconf -sh fails on Centos 4 guest.

2012-01-12 Thread Simon Matthews
On Thu, Jan 12, 2012 at 10:15 PM, Rayson Ho wrote:

> Does it hang when you issue the qconf command on that node, or does it
> return the error message immediately??
>

It hangs. I see the message either after it times out or if I kill it.

Simon

___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] qconf -sh fails on Centos 4 guest.

2012-01-12 Thread Rayson Ho
Good! That means qconf is waiting for the master's response but not getting it.

If an IP filter or firewall is configured on that node, it is very likely the
cause. Make sure that firewalls are turned off or configured properly... I
used to use sniffers like tcpdump to debug issues like this, but I have not
used sniffers for a long while.
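
For instance, on the CentOS 4 guest (6444 is the qmaster port from your qping 
call; the interface name is a guess):

$ service iptables status
$ service iptables stop          # temporarily, for testing

# watch the qmaster traffic while re-running qconf:
$ tcpdump -i eth0 host sgemaster and port 6444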

Rayson




___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] qconf -sh fails on Centos 4 guest.

2012-01-12 Thread Rayson Ho
Simon,

I'm logging off now; please let the list know whether it's still causing
problems, and share your findings.

(I'm in North America - EST timezone - and I normally don't stay up this
late, but it usually takes me some time to get back to my normal daily
schedule after the holidays :-D )

Rayson




___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users