Re: [hwloc-users] Solaris and hwloc

Jeff Squyres Thu, 13 Sep 2012 11:09:22 -0400

These are all good points.

That being said, Brock Palen made another good point on the OMPI list recently.
It was with regard to OpenFabrics registered memory, but the issue is quite
analogous.

OMPI used to issue a warning if there wasn't enough registered memory 
available, but allow the job to run anyway (at lower performance).  Brock was 
firmly opposed to that (he's an HPC sysadmin): he didn't want jobs to run at 
all if there wasn't enough registered memory.  

One rationale here is that users won't tend to notice a warning at the 
top of a job's stdout/stderr -- if the job ran, that's good enough (until much 
later when they realize that they're not getting the right performance, or, 
worse, this job is impacting other jobs because its affinity is wrong).  But if 
the job doesn't run, that will get noticed immediately, and the problem will be 
fixed by a human.

Hence, it seems safer to fall back on the "if we can't give the user what they 
asked for, fail and let a human figure it out" philosophy.  Even if it means 
changing the default.  Keep in mind that if they run hwloc-bind, they're 
specifically asking for binding.

I think I'm now 80/20 in the "abort hwloc-bind if it fails to bind" camp.  
:-)

After a little more thought, I'm also thinking that having an "it's ok if 
binding fails" CLI flag is a bad idea.  If the user really wants something to 
run without binding, then they can just do that in the shell:

-----
hwloc-bind ...whatever... my_executable
if test "$?" != "0"; then
        # run without binding
        my_executable
fi
-----
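
A slightly more reusable sketch of the same fallback, for what it's worth. The function name run_with_fallback is made up for illustration, and its first argument stands in for "hwloc-bind ...whatever...". One caveat that also applies to the snippet above: a nonzero status can come from the program itself rather than from the binding step, in which case the program gets run a second time.

```shell
#!/bin/sh
# Hypothetical helper: try the bound launch first, fall back to unbound.
# $1 is the binder command (e.g. "hwloc-bind ...whatever..."); the rest
# is the program and its arguments.
run_with_fallback() {
    binder=$1
    shift
    # $binder is intentionally unquoted so that its options are word-split.
    if $binder "$@"; then
        return 0
    fi
    # The bound launch returned nonzero: say so loudly, then run unbound.
    echo "warning: bound launch failed; running without binding" >&2
    "$@"
}
```

Usage would look like: run_with_fallback "hwloc-bind ...whatever..." my_executable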

My $0.02.  :)

On Sep 13, 2012, at 4:09 AM, Brice Goglin wrote:

> (resending because the formatting was bad)
> 
> 
> On 13/09/2012 00:26, Jeff Squyres wrote:
>> On Sep 12, 2012, at 10:30 AM, Samuel Thibault wrote:
>> 
>>>> Sidenote: if hwloc-bind fails to bind, should we still launch the child 
>>>> process?
>>> Well, it's up to you to decide :)
>> 
>> Anyone have an opinion?  I'm 60/40 in favor of not letting it run, under the 
>> rationale that the user asked for something that we can't deliver, so we 
>> shouldn't continue.
>> 
>> Any idea what numactl does if it can't bind?
> 
> Let me add taskset to the list of tools to compare to, and distinguish
> several cases:
> 
> 1) invalid command line
> * taskset (with invalid list "2,") errors out
> * numactl (with invalid list "2,") errors out
> * hwloc-bind (with invalid location followed by "-- executable") errors
> out (considers the invalid location as the executable name)
> 
> 2) valid command-line containing *only* non-existing objects:
> * taskset errors out
> * numactl errors out
> * hwloc-bind succeeds, binds to nothing
> 
> 3) valid command-line containing some existing objects and some
> non-existing:
> * taskset succeeds (ignores nonexistent objects, binds to others)
> * numactl errors out
> * hwloc-bind succeeds (ignores nonexistent objects, binds to others)
> 
> 4) valid command-line with only valid objects but missing OS support
> * doesn't apply to taskset and numactl afaik
> * hwloc-bind succeeds (ignores failure to bind)
> 
> 
> We have a --strict option, which translates into the STRICT binding flag,
> which is documented as
>  "Request strict binding from the OS.  The function will fail if the
> binding can not be guaranteed / completely enforced."
> I usually see "non-strict" as "if you can't do what I want, do something
> similar". It wouldn't be too bad to say that this applies to (3) (bind to
> a smaller set than requested).
> 
> But (2) and (4) are different. Not binding at all or binding to nothing
> is far from "non-strict". But I wonder if adding a new command-line flag
> to exit on such errors would be confusing with respect to the existing
> --strict.
> 
> We could also change the default to exit on error, and add --force to
> launch the process even on failure to bind. But changing defaults isn't
> always a good idea.
> 
> Brice
> 

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/
