[OMPI devel] OpenMPI, PLPA and Linux cpuset/cgroup support

2009-07-15 Thread Chris Samuel
Hi all,

Not sure if this is an OpenMPI query or a PLPA query,
but given that PLPA seems to have some support for it
already I thought I'd start here. :-)

We run a quad core Opteron cluster with Torque 2.3.x,
which uses the kernel's cpuset support to constrain
a job to just the cores it has been allocated.

However, we are occasionally seeing that, where a job
has been allocated multiple cores on the same node, we
get two compute-bound MPI processes in the job scheduled
onto the same core (obviously a kernel issue).

So CPU affinity would be an obvious solution, but it
needs to be done with reference to the cores that are
available to the job in its cpuset.

This information is already retrievable by PLPA (for
instance "plpa-taskset -cp $$" will retrieve the cores
allocated to the shell you run the command from), but
I'm not sure whether OpenMPI makes use of this when
binding CPUs using the linux paffinity MCA parameter.

Our testing (with 1.3.2) seems to show it doesn't, and
I don't think there are any significant differences with
the snapshots in 1.4.

Am I correct in this? If so, are there any plans to
make it do this?

cheers,
Chris
-- 
Christopher Samuel - (03) 9925 4751 - Systems Manager
 The Victorian Partnership for Advanced Computing
 P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency
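
For reference, a minimal sketch (illustrative only, not OMPI or PLPA code)
of what "plpa-taskset -cp $$" reports, using the raw Linux affinity call
that PLPA wraps:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    /* Query the affinity mask of this process; inside a cpuset this
       returns only the cores the cpuset allows. */
    if (sched_getaffinity(0 /* this process */, sizeof(set), &set) != 0) {
        perror("sched_getaffinity");
        return 1;
    }
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu)
        if (CPU_ISSET(cpu, &set))
            printf("%d ", cpu);   /* a core this process may run on */
    printf("\n");
    return 0;
}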


Re: [OMPI devel] [OMPI users] where can i get a tracing tool

2009-07-15 Thread Nikolay Molchanov

Hi Eugene,

The FAQ page looks very good!
Some links on the left side do not work, but I assume
they will work tomorrow, when the real page goes live.

Thanks,
Nik

Eugene Loh wrote:

Zou, Lin (GE, Research, Consultant) wrote:

Hi all,
I want to trace my program. Having used VampirTrace to generate
tracing info, where can I download free tools other than Vampir to
parse the tracing info?

Thanks in advance.
Lin
This message appeared on the users list yesterday.  For a long time, 
I've been meaning to add a perf-tool section to the FAQ.  I finally 
did so, incorporating questions and answers from the users and devel 
lists that I've seen on this subject in the last few months.  I just 
put the changes back and as soon as I see the pages "live" I'll 
respond to the user on the user list.  Please take a look.  You can 
make changes as you like or give me feedback and I can do it.


I acknowledge that there is a conflict of interest in my recommending
Sun MPI Analyzer, but I believe I've done so tastefully and 
appropriately!  Throw tomatoes if you see fit.


P.S.  Until the page goes live, I'll also leave it at 
http://www.osl.iu.edu/~eloh/faq/?category=perftools .  Or, check out a 
workspace.



___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel
  




Re: [OMPI devel] [OMPI users] where can i get a tracing tool

2009-07-15 Thread Matthias Jurenz
Hi Eugene,

the FAQ page looks very nice.
I just sent the following answer to Lin Zou:

.
For a quick view of what is inside the trace you could try 'otfprofile'
to generate a TeX/PS file with some information. This tool is a
component of the latest stand-alone version of the Open Trace Format
(OTF) - see http://www.tu-dresden.de/zih/otf/.

However, if you need more detailed information about the trace you would
need to get an evaluation version of Vampir - see http://www.vampir.eu.
In addition to the evaluation version of Vampir, a free version with some
functional limitations will be available in the near future.
.

Could you also mention the tool 'otfprofile' under section 7,
please? As soon as the free version of Vampir is available, this could
also be mentioned.

Thanks,
  Matthias

On Tue, 2009-07-14 at 18:54 -0700, Eugene Loh wrote:
> Zou, Lin (GE, Research, Consultant) wrote: 
> > Hi all, 
> > I want to trace my program. Having used VampirTrace to generate
> > tracing info, where can I download free tools other than Vampir to
> > parse the tracing info?
> > Thanks in advance.
> > Lin
> This message appeared on the users list yesterday.  For a long time,
> I've been meaning to add a perf-tool section to the FAQ.  I finally
> did so, incorporating questions and answers from the users and devel
> lists that I've seen on this subject in the last few months.  I just
> put the changes back and as soon as I see the pages "live" I'll
> respond to the user on the user list.  Please take a look.  You can
> make changes as you like or give me feedback and I can do it.
> 
> I acknowledge that there is a conflict of interest in my recommending
> Sun MPI Analyzer, but I believe I've done so tastefully and
> appropriately!  Throw tomatoes if you see fit.
> 
> P.S.  Until the page goes live, I'll also leave it at
> http://www.osl.iu.edu/~eloh/faq/?category=perftools .  Or, check out a
> workspace.
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
-- 
Matthias Jurenz
Technische Universität Dresden
Center for Information Services and High Performance Computing (ZIH)
01062 Dresden, Germany

Phone : (+49) 351/463-31945
Fax   : (+49) 351/463-37773
e-mail: matthias.jur...@tu-dresden.de
WWW   : http://www.tu-dresden.de/zih




Re: [OMPI devel] OpenMPI, PLPA and Linux cpuset/cgroup support

2009-07-15 Thread Ralph Castain
Interesting. No, we don't take PLPA cpu sets into account when  
retrieving the allocation.


Just to be clear: from an OMPI perspective, I don't think this is an  
issue of binding, but rather an issue of allocation. If we knew we had  
been allocated only a certain number of cores on a node, then we would  
only map that many procs to the node. When we subsequently "bind", we  
should then bind those procs to the correct cores (I think).


Could you check this? You can run a trivial job using the -npernode x  
option, where x matches the #cores you were allocated on the nodes.


If you do this, do we bind to the correct cores?

If we do, then that would confirm that we just aren't picking up the  
right number of cores allocated to us. If it is wrong, then this is a  
PLPA issue where it isn't binding to the right core.
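
Concretely, such a test might look like this (program name illustrative;
mpi_paffinity_alone is the MCA parameter that turns binding on):

  mpiexec -npernode 4 --mca mpi_paffinity_alone 1 ./my_mpi_test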


Thanks
Ralph

On Jul 15, 2009, at 12:28 AM, Chris Samuel wrote:


Hi all,

Not sure if this is an OpenMPI query or a PLPA query,
but given that PLPA seems to have some support for it
already I thought I'd start here. :-)

We run a quad core Opteron cluster with Torque 2.3.x,
which uses the kernel's cpuset support to constrain
a job to just the cores it has been allocated.

However, we are occasionally seeing that, where a job
has been allocated multiple cores on the same node, we
get two compute-bound MPI processes in the job scheduled
onto the same core (obviously a kernel issue).

So CPU affinity would be an obvious solution, but it
needs to be done with reference to the cores that are
available to the job in its cpuset.

This information is already retrievable by PLPA (for
instance "plpa-taskset -cp $$" will retrieve the cores
allocated to the shell you run the command from), but
I'm not sure whether OpenMPI makes use of this when
binding CPUs using the linux paffinity MCA parameter.

Our testing (with 1.3.2) seems to show it doesn't, and
I don't think there are any significant differences with
the snapshots in 1.4.

Am I correct in this? If so, are there any plans to
make it do this?

cheers,
Chris
--
Christopher Samuel - (03) 9925 4751 - Systems Manager
The Victorian Partnership for Advanced Computing
P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency
___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel




Re: [OMPI devel] [OMPI users] where can i get a tracing tool

2009-07-15 Thread Jeff Squyres

On Jul 15, 2009, at 6:17 AM, Matthias Jurenz wrote:


the FAQ page looks very nice.



Ditto -- thanks for doing it, Eugene!


I just sent the following answer to Lin Zou:



Did that go on-list?  It would be good to see that stuff in the  
publicly-searchable web archives.  I mention this because our Google  
Analytics clearly show that lots of people are searching our mailing  
list, looking for answers to their questions.



Could you also mention the tool 'otfprofile' under the section 7,
please? As soon as the free version of Vampir is available this could
also be mentioned.




Do you guys not have write access to the SVN repo for the web pages?   
If not, we should just add you -- that would certainly make it easier...


--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] [RFC] Move the datatype engine in the OPAL layer

2009-07-15 Thread Jeff Squyres

On Jul 14, 2009, at 1:23 PM, Rainer Keller wrote:


https://svn.open-mpi.org/trac/ompi/wiki/HowtoTesting



That is most helpful -- thanks!

What about the latency issue?

> >> Performance tests on the ompi-ddt branch have proven that there are no
> >> performance penalties associated with this change (tests done using
> >> NetPipe-3.7.1 on smoky using BTL/sm, giving 1.6usecs on this
> >> platform).
> >
> > 1.6us sounds like pretty high sm latency...  Is this a slow platform?
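
For context, an sm latency number like that is typically produced by a
two-process NetPipe run, along the lines of (assuming the MPI binary is
named NPmpi, as in NetPipe-3.7.1 builds):

  mpirun -np 2 --mca btl sm,self ./NPmpi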





--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] OpenMPI, PLPA and Linux cpuset/cgroup support

2009-07-15 Thread Chris Samuel

- "Ralph Castain"  wrote:

Hi Ralph,

> Interesting. No, we don't take PLPA cpu sets into account when  
> retrieving the allocation.

Understood.

> Just to be clear: from an OMPI perspective, I don't think this is an 
> issue of binding, but rather an issue of allocation. If we knew we had
> been allocated only a certain number of cores on a node, then we would
> only map that many procs to the node. When we subsequently "bind", we 
> should then bind those procs to the correct cores (I think).

Hmm, OpenMPI should already know this from the PBS TM API when
launching the job; we've never had to get our users to specify
how many procs per node to start (and they will generally have
no idea how many to ask for in advance, as they are at the mercy
of the scheduler, unless they select whole nodes with ppn=8).

> Could you check this? You can run a trivial job using the
> -npernode x option, where x matched the #cores you were
> allocated on the nodes.
>
> If you do this, do we bind to the correct cores?

I'll give this a shot tomorrow when I'm back in the office
(just checking email late at night here); I'll try it under
strace to see what it tries to sched_setaffinity() to.

> If we do, then that would confirm that we just aren't
> picking up the right number of cores allocated to us.
> If it is wrong, then this is a PLPA issue where it
> isn't binding to the right core.

Interesting, will let you know the test results tomorrow!

cheers,
Chris
-- 
Christopher Samuel - (03) 9925 4751 - Systems Manager
 The Victorian Partnership for Advanced Computing
 P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency


Re: [OMPI devel] [OMPI users] where can i get a tracing tool

2009-07-15 Thread Ashley Pittman
On Tue, 2009-07-14 at 18:54 -0700, Eugene Loh wrote:
> P.S.  Until the page goes live, I'll also leave it at
> http://www.osl.iu.edu/~eloh/faq/?category=perftools .  Or, check out a
> workspace.

I'm happy with it.

Ashley.

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk



Re: [OMPI devel] [OMPI users] where can i get a tracing tool

2009-07-15 Thread Matthias Jurenz
Hi Jeff,

On Wed, 2009-07-15 at 07:13 -0400, Jeff Squyres wrote:
> On Jul 15, 2009, at 6:17 AM, Matthias Jurenz wrote:
> 
> > the FAQ page looks very nice.
> >
> 
> Ditto -- thanks for doing it, Eugene!
> 
> > I just sent the following answer to Lin Zou:
> >
> 
> Did that go on-list?  It would be good to see that stuff in the  
> publicly-searchable web archives.  I mention this because our Google  
> Analytics clearly show that lots of people are searching our mailing  
> list, looking for answers to their questions.
> 

I sent the answer directly to the user, 'cause I didn't subscribe to the
user-list. I'll do that asap ;-)

> > Could you also mention the tool 'otfprofile' under the section 7,
> > please? As soon as the free version of Vampir is available this could
> > also be mentioned.
> >
> 
> 
> Do you guys not have write access to the SVN repo for the web pages?   
> If not, we should just add you -- that would certainly make it easier...
> 

Unfortunately, we don't have write access to the repository for the web
pages. Could you add us (me and Andreas), please?

Thanks,
  Matthias




[OMPI devel] selectively bind MPI to one HCA out of available ones

2009-07-15 Thread neeraj
Hi all,

I have a cluster where both HCAs of each blade are active, but
connected to different subnets.
Is there an option in MPI to select one HCA out of the available
ones? I know it can be done by making changes in the OpenMPI code, but I
need a clean interface, like an option at MPI launch time, to select
mthca0 or mthca1.

Any help is appreciated. Btw, I just checked MVAPICH and the feature is
there.

Regards

Neeraj Chourasia (MTS)
Computational Research Laboratories Ltd.
(A wholly Owned Subsidiary of TATA SONS Ltd)
B-101, ICC Trade Towers, Senapati Bapat Road
Pune 411016 (Mah) INDIA
(O) +91-20-6620 9863  (Fax) +91-20-6620 9862
M: +91.9225520634
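
For what it's worth, newer Open MPI releases grew include/exclude MCA
parameters for the openib BTL that address exactly this; whether your
installed version has them can be checked with "ompi_info --param btl
openib". Assuming it does, selecting one HCA would look something like:

  mpirun --mca btl_openib_if_include mthca0 -np 16 ./app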



Re: [OMPI devel] OpenMPI, PLPA and Linux cpuset/cgroup support

2009-07-15 Thread Ralph Castain
Hmmm...I believe I made a mis-statement. Shocking to those who know me, I am
sure! :-)

Just to correct my comments: OMPI knows how many "slots" have been allocated
to us, but not which "cores". So I'll assign the correct number of procs to
each node, but they won't know that we were allocated cores 2 and 4 (for
example), as opposed to some other combination.

When we subsequently bind, we bind to logical cpus based on our node rank -
i.e., what rank I am relative to my local peers on this node. PLPA then
translates that into a physical core.

My guess is that you are correct and PLPA isn't looking to see specifically
-which- cores were allocated to the job, but instead is simply translating
logical cpu=0 to the first physical core in the node.
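
In other words, the desired behavior would be to bind local rank r to the
r-th core of the *allowed* set rather than to physical core r. An
illustrative sketch (not OMPI code) using the plain Linux affinity API
that PLPA wraps:

#define _GNU_SOURCE
#include <sched.h>

int bind_to_nth_allowed_core(int node_rank)
{
    cpu_set_t allowed, target;
    CPU_ZERO(&allowed);
    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0)
        return -1;

    int seen = 0;
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (CPU_ISSET(cpu, &allowed) && seen++ == node_rank) {
            CPU_ZERO(&target);
            CPU_SET(cpu, &target);   /* the node_rank-th allowed core */
            return sched_setaffinity(0, sizeof(target), &target);
        }
    }
    return -1;   /* node_rank exceeds the number of allowed cores */
}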

The test I asked you to run, though, will confirm this. Please do let us
know as this is definitely something we should fix.

Thanks!
Ralph


On Wed, Jul 15, 2009 at 6:11 AM, Chris Samuel  wrote:

>
> - "Ralph Castain"  wrote:
>
> Hi Ralph,
>
> > Interesting. No, we don't take PLPA cpu sets into account when
> > retrieving the allocation.
>
> Understood.
>
> > Just to be clear: from an OMPI perspective, I don't think this is an
> > issue of binding, but rather an issue of allocation. If we knew we had
> > been allocated only a certain number of cores on a node, then we would
> > only map that many procs to the node. When we subsequently "bind", we
> > should then bind those procs to the correct cores (I think).
>
> Hmm, OpenMPI should already know this from the PBS TM API when
> launching the job; we've never had to get our users to specify
> how many procs per node to start (and they will generally have
> no idea how many to ask for in advance, as they are at the mercy
> of the scheduler, unless they select whole nodes with ppn=8).


>
> > Could you check this? You can run a trivial job using the
> > -npernode x option, where x matched the #cores you were
> > allocated on the nodes.
> >
> > If you do this, do we bind to the correct cores?
>
> I'll give this a shot tomorrow when I'm back in the office
> (just checking email late at night here); I'll try it under
> strace to see what it tries to sched_setaffinity() to.
>
> > If we do, then that would confirm that we just aren't
> > picking up the right number of cores allocated to us.
> > If it is wrong, then this is a PLPA issue where it
> > isn't binding to the right core.
>
> Interesting, will let you know the test results tomorrow!
>
> cheers,
> Chris
> --
> Christopher Samuel - (03) 9925 4751 - Systems Manager
>  The Victorian Partnership for Advanced Computing
>  P.O. Box 201, Carlton South, VIC 3053, Australia
> VPAC is a not-for-profit Registered Research Agency
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>


Re: [OMPI devel] [OMPI users] where can i get a tracing tool

2009-07-15 Thread Jeff Squyres

On Jul 15, 2009, at 8:57 AM, Matthias Jurenz wrote:

I sent the answer directly to the user, 'cause I didn't subscribe to the
user-list. I'll do that asap ;-)



Thanks -- I appreciate it.  I know it's a somewhat high-volume list.   
I can bounce you the original question so that you can reply to it and  
have it threaded properly.



> > Could you also mention the tool 'otfprofile' under section 7,
> > please? As soon as the free version of Vampir is available, this
> > could also be mentioned.
>
> Do you guys not have write access to the SVN repo for the web pages?
> If not, we should just add you -- that would certainly make it
> easier...


Unfortunately, we don't have write access to the repository for the web
pages. Could you add us (me and Andreas), please?




Will do -- can you remind me of your SVN ID's again?

--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] [OMPI users] where can i get a tracing tool

2009-07-15 Thread Jeff Squyres

On Jul 15, 2009, at 10:24 AM, Jeff Squyres (jsquyres) wrote:


Thanks -- I appreciate it.  I know it's a somewhat high-volume list.
I can bounce you the original question so that you can reply to it and
have it threaded properly.



Disregard -- you replied already.  Many thanks!

--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] [OMPI users] where can i get a tracing tool

2009-07-15 Thread Matthias Jurenz
On Wed, 2009-07-15 at 10:24 -0400, Jeff Squyres wrote:
> On Jul 15, 2009, at 8:57 AM, Matthias Jurenz wrote:
> 
> > I sent the answer directly to the user, 'cause I didn't subscribe to
> > the user-list. I'll do that asap ;-)
> >
> 
> Thanks -- I appreciate it.  I know it's a somewhat high-volume list.   
> I can bounce you the original question so that you can reply to it and  
> have it threaded properly.
> 
> > > > Could you also mention the tool 'otfprofile' under section 7,
> > > > please? As soon as the free version of Vampir is available, this
> > > > could also be mentioned.
> > >
> > > Do you guys not have write access to the SVN repo for the web pages?
> > > If not, we should just add you -- that would certainly make it
> > > easier...
> >
> > Unfortunately, we don't have write access to the repository for the web
> > pages. Could you add us (me and Andreas), please?
> >
> 
> 
> Will do -- can you remind me of your SVN ID's again?
> 

Sure! Our SVN ID's are:
jurenz
and
knuepfer





Re: [OMPI devel] [OMPI users] where can i get a tracing tool

2009-07-15 Thread Jeff Squyres

On Jul 15, 2009, at 10:37 AM, Matthias Jurenz wrote:


Sure! Our SVN ID's are:
jurenz
and
knuepfer



Done!  You should have write access -- let me know if you don't.

I think you guys have seen it before, but here's the wiki page about  
adding / editing wiki pages:


https://svn.open-mpi.org/trac/ompi/wiki/OMPIFAQEntries

Eugene recently added a bunch of good stuff in there.

--
Jeff Squyres
Cisco Systems



[OMPI devel] Fwd: [all-osl-users] Upgrading of the OSL SVN server

2009-07-15 Thread Jeff Squyres

FYI.

Begin forwarded message:


From: "DongInn Kim" 
Date: July 15, 2009 10:39:01 AM EDT
To: 
Subject: Re: [all-osl-users] Upgrading of the OSL SVN server

I am sorry that we cannot upgrade subversion this time because of
the technical issues in the interaction between the new subversion
and the SourceHaven web application.


Once this issue is cleared, I will schedule another upgrade of
subversion. Until then, we will use the old version (1.4.2) of
subversion like we did before.


The servers are up and running with subversion-1.4.2 now.

Best Regards,

- DongInn

On 7/13/09 10:20 AM, Kim, DongInn wrote:
> Hi,
>
> The new version (1.6.3) of subversion was released in June 2009. It
> has a lot of good features included and many bugs are fixed.
> http://subversion.tigris.org/servlets/ProjectNewsList
>
> The OSL would like to upgrade the current subversion (1.4.2) to get
> the benefit of the new version.
> The upgrade would start at 8:00AM (E.T.) on July 15, 2009.
>
> The subversion service and Trac website services would NOT be
> available during the following time period.
> - 5:00am-11:00am Pacific US time
> - 6:00am-12:00pm Mountain US time
> - 7:00am-1:00pm Central US time
> - 8:00am-2:00pm Eastern US time
> - 12:00pm-6:00pm GMT
>
> Please let me know if you have any concerns or questions about
> this upgrade.

>
> Regards,
>





--
Jeff Squyres
Cisco Systems



[OMPI devel] Fwd: [all-osl-users] Upgrading of the OSL SVN server

2009-07-15 Thread Josh Hursey

FYI.


Begin forwarded message:


From: DongInn Kim 
Date: July 15, 2009 10:39:01 AM EDT
To: all-osl-us...@osl.iu.edu
Subject: Re: [all-osl-users] Upgrading of the OSL SVN server

I am sorry that we cannot upgrade subversion this time because of
the technical issues in the interaction between the new subversion
and the SourceHaven web application.


Once this issue is cleared, I will schedule another upgrade of
subversion. Until then, we will use the old version (1.4.2) of
subversion like we did before.


The servers are up and running with subversion-1.4.2 now.

Best Regards,

- DongInn

On 7/13/09 10:20 AM, Kim, DongInn wrote:

Hi,

The new version (1.6.3) of subversion was released in June 2009. It
has a lot of good features included and many bugs are fixed.

http://subversion.tigris.org/servlets/ProjectNewsList

The OSL would like to upgrade the current subversion (1.4.2) to get
the benefit of the new version.

The upgrade would start at 8:00AM (E.T.) on July 15, 2009.

The subversion service and Trac website services would NOT be
available during the following time period.

- 5:00am-11:00am Pacific US time
- 6:00am-12:00pm Mountain US time
- 7:00am-1:00pm Central US time
- 8:00am-2:00pm Eastern US time
- 12:00pm-6:00pm GMT

Please let me know if you have any concerns or questions about this  
upgrade.


Regards,





Re: [OMPI devel] Fwd: [all-osl-users] Upgrading of the OSL SVN server

2009-07-15 Thread Holger Mickler
*Quickness competition round 1*

Jeff vs. Josh
  1   :   0

;-))



Josh Hursey wrote:
> FYI.
> 
> 
> Begin forwarded message:
> 
>> From: DongInn Kim 
>> Date: July 15, 2009 10:39:01 AM EDT
>> To: all-osl-us...@osl.iu.edu
>> Subject: Re: [all-osl-users] Upgrading of the OSL SVN server
>>
>> I am sorry that we cannot upgrade subversion this time because of the
>> technical issues in the interaction between the new subversion and
>> the SourceHaven web application.
>>
>> Once this issue is cleared, I will schedule another upgrade of
>> subversion. Until then, we will use the old version (1.4.2) of
>> subversion like we did before.
>>
>> The servers are up and running with subversion-1.4.2 now.
>>
>> Best Regards,
>>
>> - DongInn
>>
>> On 7/13/09 10:20 AM, Kim, DongInn wrote:
>>> Hi,
>>>
>>> The new version (1.6.3) of subversion was released in June 2009. It
>>> has a lot of good features included and many bugs are fixed.
>>> http://subversion.tigris.org/servlets/ProjectNewsList
>>>
>>> The OSL would like to upgrade the current subversion (1.4.2) to get
>>> the benefit of the new version.
>>> The upgrade would start at 8:00AM (E.T.) on July 15, 2009.
>>>
>>> The subversion service and Trac website services would NOT be
>>> available during the following time period.
>>> - 5:00am-11:00am Pacific US time
>>> - 6:00am-12:00pm Mountain US time
>>> - 7:00am-1:00pm Central US time
>>> - 8:00am-2:00pm Eastern US time
>>> - 12:00pm-6:00pm GMT
>>>
>>> Please let me know if you have any concerns or questions about this
>>> upgrade.
>>>
>>> Regards,
>>>
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 

-- 
Holger Mickler

Technische Universität Dresden
Center for Information Services
and High Performance Computing (ZIH)
01062 Dresden
Germany

Contact
Room:   Willers-Bau A306
Phone:  +49 351 463-37903
Fax:+49 351 463-37773
email:  holger.mick...@tu-dresden.de




[OMPI devel] DDT and spawn issue?

2009-07-15 Thread Jeff Squyres
I [very briefly] read about the DDT spawn issues, so I went to look at  
ompi/op/op.c.  I notice that there's a new comment above the op  
datatype<-->op map construction area that says:


/* XXX TODO */

svn blame says:

 21641   rusraink /* XXX TODO */

r21641 is the big merge from the past weekend where the DDT split came  
in.


Has this area been looked at and the comment is out of date?  Or does  
it need to be updated with new mappings?  (I honestly have not looked  
any farther than this -- the new comment caught my eye)


--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] DDT and spawn issue?

2009-07-15 Thread George Bosilca
Yes, this appears to be at least partially part of the problem Edgar
is seeing. We're trying to figure out how most of the tests passed so
far with a wrong mapping. Interestingly enough, while the mapping seems
wrong, the lookup is symmetric, so most of the time we end up with the
correct op by pure luck.


We're looking into this.

  george.
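
A toy illustration (not the actual OMPI op table) of how a consistently
wrong mapping can still round-trip:

#include <assert.h>

#define N 8
static int table[N];

/* Deliberately "wrong" index function: reverses the id space. */
static int idx(int id) { return (N - 1) - id; }

int main(void)
{
    for (int id = 0; id < N; ++id)
        table[idx(id)] = id * 100;          /* store with the wrong index */
    for (int id = 0; id < N; ++id)
        assert(table[idx(id)] == id * 100); /* same wrong index on lookup */
    return 0;   /* every lookup succeeds "by pure luck" */
}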

On Jul 15, 2009, at 11:50 , Jeff Squyres wrote:

I [very briefly] read about the DDT spawn issues, so I went to look  
at ompi/op/op.c.  I notice that there's a new comment above the op  
datatype<-->op map construction area that says:


   /* XXX TODO */

svn blame says:

21641   rusraink /* XXX TODO */

r21641 is the big merge from the past weekend where the DDT split  
came in.


Has this area been looked at and the comment is out of date?  Or  
does it need to be updated with new mappings?  (I honestly have not  
looked any farther than this -- the new comment caught my eye)


--
Jeff Squyres
Cisco Systems

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel




Re: [OMPI devel] DDT and spawn issue?

2009-07-15 Thread Ralph Castain
Thanks George!!

On Wed, Jul 15, 2009 at 9:57 AM, George Bosilca wrote:

> Yes, this appears to be at least partially part of the problem Edgar is
> seeing. We're trying to figure out how most of the tests passed so far with
> a wrong mapping. Interestingly enough, while the mapping seems wrong, the
> lookup is symmetric, so most of the time we end up with the correct op by
> pure luck.
>
> We're looking into this.
>
>  george.
>
>
> On Jul 15, 2009, at 11:50 , Jeff Squyres wrote:
>
>  I [very briefly] read about the DDT spawn issues, so I went to look at
>> ompi/op/op.c.  I notice that there's a new comment above the op
>> datatype<-->op map construction area that says:
>>
>>   /* XXX TODO */
>>
>> svn blame says:
>>
>> 21641   rusraink /* XXX TODO */
>>
>> r21641 is the big merge from the past weekend where the DDT split came in.
>>
>> Has this area been looked at and the comment is out of date?  Or does it
>> need to be updated with new mappings?  (I honestly have not looked any
>> farther than this -- the new comment caught my eye)
>>
>> --
>> Jeff Squyres
>> Cisco Systems
>>
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>
>
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>


Re: [OMPI devel] DDT and spawn issue?

2009-07-15 Thread Rainer Keller
Hi Jeff,
Ralph and Edgar forwarded an email about this.
We (George and myself) are currently looking into it.

With the changes we have, I can get IBM/spawn to work "sometimes" -- aka
sometimes it segfaults.

Thanks,
Rainer




On Wednesday 15 July 2009 11:50:13 am Jeff Squyres wrote:
> I [very briefly] read about the DDT spawn issues, so I went to look at
> ompi/op/op.c.  I notice that there's a new comment above the op
> datatype<-->op map construction area that says:
>
>  /* XXX TODO */
>
> svn blame says:
>
>   21641   rusraink /* XXX TODO */
>
> r21641 is the big merge from the past weekend where the DDT split came
> in.
>
> Has this area been looked at and the comment is out of date?  Or does
> it need to be updated with new mappings?  (I honestly have not looked
> any farther than this -- the new comment caught my eye)

-- 

Rainer Keller, PhD  Tel: +1 (865) 241-6293
Oak Ridge National Lab  Fax: +1 (865) 241-4811
PO Box 2008 MS 6164   Email: kel...@ornl.gov
Oak Ridge, TN 37831-2008AIM/Skype: rusraink




Re: [OMPI devel] [OMPI users] where can i get a tracing tool

2009-07-15 Thread Eugene Loh
Done.  Hit "reload" on the URL below, check out an SVN repository, or 
wait for these changes to be pushed to the live site.


Matthias Jurenz wrote:


Could you also mention the tool 'otfprofile' under the section 7,
please?

On Tue, 2009-07-14 at 18:54 -0700, Eugene Loh wrote:
 


P.S.  Until the page goes live, I'll also leave it at
http://www.osl.iu.edu/~eloh/faq/?category=perftools .





Re: [OMPI devel] DDT and spawn issue?

2009-07-15 Thread Jeff Squyres
Perhaps we should add a requirement for testing on 2-3 different  
systems before long-term (or "big change") branches like this come to  
the trunk?  I say this because it seems like at least some of these  
problems were based on bad luck -- i.e., the stuff worked on the  
platform that it was being tested and developed on, even though there  
are bugs left.  Having fallen victim to this myself many times  
("worked for me on Cisco machines!  I dunno why it's failing for  
you... :-("), I think we all recognize the value of just running the  
same code on someone else's systems -- it has a good tendency to turn  
up issues that don't show up on yours.  I'm not trying to say that  
every little trunk commit needs to be validated -- but "big" changes  
like this could certainly benefit from multiple validations.


Cisco is very willing to be a 2nd platform for testing for stuff that  
we can run without too much trouble, especially via MTT (e.g., I  
already have the right kind of networks to test, etc.).


BTW, is anyone going to comment about the latency issue that I asked  
about?


(in case you can't tell, I'm moderately displeased about how this  
whole branch came to the trunk... :-\ )




On Jul 15, 2009, at 12:04 PM, Rainer Keller wrote:


Hi Jeff,
Ralph and Edgar forwarded an email about this.
We (George and myself) are currently looking into it.

With the changes we have, I can get IBM/spawn to work "sometimes" -- aka
sometimes it segfaults.

Thanks,
Rainer




On Wednesday 15 July 2009 11:50:13 am Jeff Squyres wrote:
> I [very briefly] read about the DDT spawn issues, so I went to look at
> ompi/op/op.c.  I notice that there's a new comment above the op
> datatype<-->op map construction area that says:
>
>  /* XXX TODO */
>
> svn blame says:
>
>   21641   rusraink /* XXX TODO */
>
> r21641 is the big merge from the past weekend where the DDT split came
> in.
>
> Has this area been looked at and the comment is out of date?  Or does
> it need to be updated with new mappings?  (I honestly have not looked
> any farther than this -- the new comment caught my eye)

--

Rainer Keller, PhD  Tel: +1 (865) 241-6293
Oak Ridge National Lab  Fax: +1 (865) 241-4811
PO Box 2008 MS 6164   Email: kel...@ornl.gov
Oak Ridge, TN 37831-2008AIM/Skype: rusraink






--
Jeff Squyres
Cisco Systems



[OMPI devel] Fwd: [OMPI svn-full] svn:open-mpi r21686

2009-07-15 Thread George Bosilca
I have a question regarding the mapping. How can I declare a partial
mapping? In fact I only care about how some of the processes are
mapped on some specific nodes. Right now, if the rmaps file doesn't
contain information about all nodes, we give up (before this patch we
segfaulted).


Does it mean we always have to declare the whole mapping, or is it just
that we overlooked this strange case?


  george.

Begin forwarded message:


Author: bosilca
Date: 2009-07-15 15:36:53 EDT (Wed, 15 Jul 2009)
New Revision: 21686
URL: https://svn.open-mpi.org/trac/ompi/changeset/21686

Log:
Reorder the nidmap encoding function. Add a check to make sure we don't
write outside the boundaries of the allocated array.

However, the problem is still there. If we have rmaps file containing only
partial information the num_procs get set to the wrong value (the number of
hosts in the rmaps file instead of the number of processes requested on the
command line).




Re: [OMPI devel] Fwd: [OMPI svn-full] svn:open-mpi r21686

2009-07-15 Thread Ralph Castain
The routed comm system relies on each daemon having complete information as
to where every process is located, so the expectation was that only full
maps would ever be sent. Thus, the nidmap code is setup to always send a
full map.

I don't know how to even generate a "partial" map. I assume you are doing
something offline? Is this to update changed info? If so, you'll also have
to do something to update the daemon's maps or the comm system will break
down.

Ralph

On Wed, Jul 15, 2009 at 1:40 PM, George Bosilca wrote:

> I have a question regarding the mapping. How can I declare a partial
> mapping? In fact I only care about how some of the processes are mapped on
> some specific nodes. Right now, if the rmaps file doesn't contain information
> about all nodes, we give up (before this patch we segfaulted).
>
> Does it mean we always have to declare the whole mapping, or is it just that
> we overlooked this strange case?
>
>  george.
>
> Begin forwarded message:
>
>  Author: bosilca
>> Date: 2009-07-15 15:36:53 EDT (Wed, 15 Jul 2009)
>> New Revision: 21686
>> URL: https://svn.open-mpi.org/trac/ompi/changeset/21686
>>
>> Log:
>> Reorder the nidmap encoding function. Add a check to make sure we don't
>> write
>> outside the boundaries of the allocated array.
>>
>> However, the problem is still there. If we have rmaps file containing only
>> partial information the num_procs get set to the wrong value (the number
>> of
>> hosts in the rmaps file instead of the number of processes requested on
>> the
>> command line).
>>
>
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>


Re: [OMPI devel] DDT and spawn issue?

2009-07-15 Thread George Bosilca
Actually I don't think this will help. I looked on MTT and there are
no errors related to this (logically all reductions should have
failed) ... and MTT is supposed to run on several platforms. What
happens inside is really strange, but as we make the same mistake when
we look up the op as when we store it, this works in most cases.
Moreover, even with the op corrected we still see segfaults, and it
looks more and more like some memory-overwrite problem... Before the
commit we even tested it on a SiCortex machine (which is clearly a
different architecture than x86_64) and this didn't trigger any
errors either.


Regarding the latency issue, there is not much to say. The
platform we tested on is clearly older than what other people test on,
but that is all there is to it. The two versions (before and after the
datatype move) have the same latency; there is no reason to focus on
the latency number.


  george.


On Jul 15, 2009, at 12:18 , Jeff Squyres wrote:

Perhaps we should add a requirement for testing on 2-3 different  
systems before long-term (or "big change") branches like this come  
to the trunk?  I say this because it seems like at least some of  
these problems were based on bad luck -- i.e., the stuff worked on  
the platform that it was being tested and developed on, even though  
there are bugs left.  Having fallen victim to this myself many times  
("worked for me on Cisco machines!  I dunno why it's failing for  
you... :-("), I think we all recognize the value of just running the  
same code on someone else's systems -- it has a good tendency to  
turn up issues that don't show up on yours.  I'm not trying to say  
that every little trunk commit needs to be validated -- but "big"  
changes like this could certainly benefit from multiple validations.


Cisco is very willing to be a 2nd platform for testing for stuff  
that we can run without too much trouble, especially via MTT (e.g.,  
I already have the right kind of networks to test, etc.).


BTW, is anyone going to comment about the latency issue that I asked  
about?


(in case you can't tell, I'm moderately displeased about how this  
whole branch came to the trunk... :-\ )




On Jul 15, 2009, at 12:04 PM, Rainer Keller wrote:


Hi Jeff,
Ralph and Edgar forwarded an email about this.
We (George and myself) are currently looking into it.

With the changes we have, I can get IBM/spawn to work "sometimes" -- aka
sometimes it segfaults.

Thanks,
Rainer




On Wednesday 15 July 2009 11:50:13 am Jeff Squyres wrote:
> I [very briefly] read about the DDT spawn issues, so I went to look at
> ompi/op/op.c.  I notice that there's a new comment above the op
> datatype<-->op map construction area that says:
>
>  /* XXX TODO */
>
> svn blame says:
>
>   21641   rusraink /* XXX TODO */
>
> r21641 is the big merge from the past weekend where the DDT split came
> in.
>
> Has this area been looked at and the comment is out of date?  Or does
> it need to be updated with new mappings?  (I honestly have not looked
> any farther than this -- the new comment caught my eye)

--

Rainer Keller, PhD  Tel: +1 (865) 241-6293
Oak Ridge National Lab  Fax: +1 (865) 241-4811
PO Box 2008 MS 6164   Email: kel...@ornl.gov
Oak Ridge, TN 37831-2008AIM/Skype: rusraink






--
Jeff Squyres
Cisco Systems

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel




Re: [OMPI devel] Fwd: [OMPI svn-full] svn:open-mpi r21686

2009-07-15 Thread George Bosilca
I think I found a better solution (in r21688). Here is what I was  
trying to do.


I have a more or less homogeneous cluster. In fact all processors are  
identical, except that some are quad core and some dual core. Of  
course I care how my processes are mapped on the quad cores, but not  
really on the dual cores.


My approach was to use the following configuration files.

In /home/bosilca/.openmpi/mca-params.conf I have:

orte_default_hostfile=/home/bosilca/.openmpi/machinefile
rmaps_rank_file_path = /home/bosilca/.openmpi/rankfile
rmaps_rank_file_priority = 100

In /home/bosilca/.openmpi/machinefile I have the full description of  
the cluster. As an example:

node01 slots=4
node02 slots=4
node03 slots=2
node04 slots=2

And in the /home/bosilca/.openmpi/rankfile file I have:
rank 0=+n0 slot=0
rank 1=+n0 slot=1
rank 2=+n1 slot=0
rank 3=+n1 slot=1

As long as I spawned jobs with fewer than 4 processes everything worked
fine. But when I used more than 4 processes, orterun segfaulted. After
debugging I found that the nodes, lrank and nrank arrays were
allocated based on jdata->num_procs, but then filled based on the
total number of processes in the jdata->nodes array. As it appears
that jdata->num_procs is somehow modified based on the number of
entries in the rankfile, we end up writing outside the allocation and
then segfault. Now with the latest patch, we can cope with such a
scenario by only packing the known information (and thus not writing
outside the allocated arrays).


This might not be the best approach, but it is doing what I'm looking  
for ...


  george.

On Jul 15, 2009, at 15:50 , Ralph Castain wrote:

The routed comm system relies on each daemon having complete  
information as to where every process is located, so the expectation  
was that only full maps would ever be sent. Thus, the nidmap code is  
setup to always send a full map.


I don't know how to even generate a "partial" map. I assume you are  
doing something offline? Is this to update changed info? If so,  
you'll also have to do something to update the daemon's maps or the  
comm system will break down.


Ralph

On Wed, Jul 15, 2009 at 1:40 PM, George Bosilca  
 wrote:
I have a question regarding the mapping. How can I declare a partial
mapping? In fact I only care about how some of the processes are
mapped on some specific nodes. Right now, if the rmaps file doesn't
contain information about all nodes, we give up (before this patch
we segfaulted).


Does it mean we always have to declare the whole mapping, or is it
just that we overlooked this strange case?


 george.

Begin forwarded message:


Author: bosilca
Date: 2009-07-15 15:36:53 EDT (Wed, 15 Jul 2009)
New Revision: 21686
URL: https://svn.open-mpi.org/trac/ompi/changeset/21686

Log:
Reorder the nidmap encoding function. Add a check to make sure we don't
write outside the boundaries of the allocated array.

However, the problem is still there. If we have rmaps file containing only
partial information the num_procs get set to the wrong value (the number of
hosts in the rmaps file instead of the number of processes requested on the
command line).

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel




Re: [OMPI devel] Fwd: [OMPI svn-full] svn:open-mpi r21686

2009-07-15 Thread Ralph Castain
Ah - interesting scenario!

Definitely a "bug" in the code, then. What it looks like, though, is that
the jdata->num_procs is wrong. There shouldn't be any way that the num_procs
in the node array is different than jdata->num_procs.

My guess is that the rank_file mapper isn't correctly maintaining the
bookkeeping when we map the procs beyond those in the rankfile. I'll dig
into it - have to fix something for Lenny anyway.

Meantime, this change looks fine regardless as it (a) is better code and (b)
protects us against such errors.

Thanks for catching it!
Ralph


On Wed, Jul 15, 2009 at 2:30 PM, George Bosilca wrote:

> I think I found a better solution (in r21688). Here is what I was trying to
> do.
>
> I have a more or less homogeneous cluster. In fact all processors are
> identical, except that some are quad core and some dual core. Of course I
> care how my processes are mapped on the quad cores, but not really on the
> dual cores.
>
> My approach was to use the following configuration files.
>
> In /home/bosilca/.openmpi/mca-params.conf I have:
>
> orte_default_hostfile=/home/bosilca/.openmpi/machinefile
> rmaps_rank_file_path = /home/bosilca/.openmpi/rankfile
> rmaps_rank_file_priority = 100
>
> In /home/bosilca/.openmpi/machinefile I have the full description of the
> cluster. As an example:
> node01 slots=4
> node02 slots=4
> node03 slots=2
> node04 slots=2
>
> And in the /home/bosilca/.openmpi/rankfile file I have:
> rank 0=+n0 slot=0
> rank 1=+n0 slot=1
> rank 2=+n1 slot=0
> rank 3=+n1 slot=1
>
> As long as I spawned jobs with fewer than 4 processes everything worked fine.
> But when I used more than 4 processes, orterun segfaulted. After debugging I
> found that the nodes, lrank and nrank arrays were allocated based on
> jdata->num_procs, but then filled based on the total number of processes in
> the jdata->nodes array. As it appears that jdata->num_procs is somehow
> modified based on the number of entries in the rankfile, we end up writing
> outside the allocation and then segfault. Now with the latest patch, we can
> cope with such a scenario by only packing the known information (and thus
> not writing outside the allocated arrays).
>
> This might not be the best approach, but it is doing what I'm looking for
> ...
>
>  george.
>
>
> On Jul 15, 2009, at 15:50 , Ralph Castain wrote:
>
>  The routed comm system relies on each daemon having complete information
>> as to where every process is located, so the expectation was that only full
>> maps would ever be sent. Thus, the nidmap code is setup to always send a
>> full map.
>>
>> I don't know how to even generate a "partial" map. I assume you are doing
>> something offline? Is this to update changed info? If so, you'll also have
>> to do something to update the daemon's maps or the comm system will break
>> down.
>>
>> Ralph
>>
>> On Wed, Jul 15, 2009 at 1:40 PM, George Bosilca 
>> wrote:
>> I have a question regarding the mapping. How can I declare a partial
>> mapping? In fact I only care about how some of the processes are mapped on
>> some specific nodes. Right now, if the rmaps file doesn't contain
>> information about all nodes, we give up (before this patch we segfaulted).
>>
>> Does it mean we always have to declare the whole mapping, or is it just
>> that we overlooked this strange case?
>>
>>  george.
>>
>> Begin forwarded message:
>>
>>
>> Author: bosilca
>> Date: 2009-07-15 15:36:53 EDT (Wed, 15 Jul 2009)
>> New Revision: 21686
>> URL: https://svn.open-mpi.org/trac/ompi/changeset/21686
>>
>> Log:
>> Reorder the nidmap encoding function. Add a check to make sure we don't
>> write
>> outside the boundaries of the allocated array.
>>
>> However, the problem is still there. If we have rmaps file containing only
>> partial information the num_procs get set to the wrong value (the number
>> of
>> hosts in the rmaps file instead of the number of processes requested on
>> the
>> command line).
>>
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>
>
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>


Re: [OMPI devel] Fwd: [OMPI svn-full] svn:open-mpi r21686

2009-07-15 Thread Ralph Castain
Found the bug - we indeed failed to update the jdata->num_procs field  
when adding the non-rf-mapped procs to the job.


Fix coming shortly.

On Jul 15, 2009, at 2:40 PM, Ralph Castain wrote:


Ah - interesting scenario!

Definitely a "bug" in the code, then. What it looks like, though, is  
that the jdata->num_procs is wrong. There shouldn't be any way that  
the num_procs in the node array is different than jdata->num_procs.


My guess is that the rank_file mapper isn't correctly maintaining  
the bookkeeping when we map the procs beyond those in the rankfile.  
I'll dig into it - have to fix something for Lenny anyway.


Meantime, this change looks fine regardless as it (a) is better code  
and (b) protects us against such errors.


Thanks for catching it!
Ralph


On Wed, Jul 15, 2009 at 2:30 PM, George Bosilca  
 wrote:
I think I found a better solution (in r21688). Here is what I was  
trying to do.


I have a more or less homogeneous cluster. In fact all processors  
are identical, except that some are quad core and some dual core. Of  
course I care how my processes are mapped on the quad cores, but not  
really on the dual cores.


My approach was to use the following configuration files.

In /home/bosilca/.openmpi/mca-params.conf I have:

orte_default_hostfile=/home/bosilca/.openmpi/machinefile
rmaps_rank_file_path = /home/bosilca/.openmpi/rankfile
rmaps_rank_file_priority = 100

In /home/bosilca/.openmpi/machinefile I have the full description of  
the cluster. As an example:

node01 slots=4
node02 slots=4
node03 slots=2
node04 slots=2

And in the /home/bosilca/.openmpi/rankfile file I have:
rank 0=+n0 slot=0
rank 1=+n0 slot=1
rank 2=+n1 slot=0
rank 3=+n1 slot=1

As long as I spawned jobs with fewer than 4 processes everything worked
fine. But when I used more than 4 processes, orterun segfaulted.
After debugging I found that the nodes, lrank and nrank arrays were
allocated based on jdata->num_procs, but then filled based on
the total number of processes in the jdata->nodes array. As it
appears that jdata->num_procs is somehow modified based on the
number of entries in the rankfile, we end up writing outside the
allocation and then segfault. Now with the latest patch, we can cope
with such a scenario by only packing the known information (and thus
not writing outside the allocated arrays).


This might not be the best approach, but it is doing what I'm  
looking for ...


 george.


On Jul 15, 2009, at 15:50 , Ralph Castain wrote:

The routed comm system relies on each daemon having complete  
information as to where every process is located, so the expectation  
was that only full maps would ever be sent. Thus, the nidmap code is  
setup to always send a full map.


I don't know how to even generate a "partial" map. I assume you are  
doing something offline? Is this to update changed info? If so,  
you'll also have to do something to update the daemon's maps or the  
comm system will break down.


Ralph

On Wed, Jul 15, 2009 at 1:40 PM, George Bosilca  
 wrote:
I have a question regarding the mapping. How can I declare a partial
mapping? In fact I only care about how some of the processes are
mapped on some specific nodes. Right now, if the rmaps file doesn't
contain information about all nodes, we give up (before this patch
we segfaulted).


Does it mean we always have to declare the whole mapping, or is it
just that we overlooked this strange case?


 george.

Begin forwarded message:


Author: bosilca
Date: 2009-07-15 15:36:53 EDT (Wed, 15 Jul 2009)
New Revision: 21686
URL: https://svn.open-mpi.org/trac/ompi/changeset/21686

Log:
Reorder the nidmap encoding function. Add a check to make sure we don't
write outside the boundaries of the allocated array.

However, the problem is still there. If we have rmaps file containing only
partial information the num_procs get set to the wrong value (the number of
hosts in the rmaps file instead of the number of processes requested on the
command line).

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel





[OMPI devel] MPI_Accumulate() with MPI_PROC_NULL target rank

2009-07-15 Thread Lisandro Dalcin
The MPI 2.1 standard says:

"MPI_PROC_NULL is a valid target rank in the MPI RMA calls
MPI_ACCUMULATE, MPI_GET, and MPI_PUT. The effect is the same as for
MPI_PROC_NULL in MPI point-to-point communication. After any RMA
operation with rank MPI_PROC_NULL, it is still necessary to finish the
RMA epoch with the synchronization method that started the epoch."


Unfortunately, MPI_Accumulate() is not quite the same as
point-to-point, as a reduction is involved. Suppose you make this call
(let me abuse and use keyword arguments):

MPI_Accumulate(..., target_rank=MPI_PROC_NULL,
target_datatype=MPI_BYTE, op=MPI_SUM, ...)

IIUC, the call fails (with MPI_ERR_OP) in Open MPI because MPI_BYTE is
an invalid datatype for MPI_SUM.

But provided that the target rank is MPI_PROC_NULL, would it make
sense for the call to succeed?
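
For concreteness, a compilable sketch of the case in question (hypothetical
reproducer; per the behavior described above, with the default error handler
Open MPI aborts in the MPI_Accumulate call with MPI_ERR_OP, even though the
MPI_PROC_NULL target moves no data):

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Win win;
    char buf[4] = {0, 0, 0, 0};

    MPI_Init(&argc, &argv);
    MPI_Win_create(buf, 4, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &win);
    MPI_Win_fence(0, win);
    /* MPI_SUM is not defined for MPI_BYTE, hence the MPI_ERR_OP. */
    MPI_Accumulate(buf, 4, MPI_BYTE, MPI_PROC_NULL,
                   0, 4, MPI_BYTE, MPI_SUM, win);
    MPI_Win_fence(0, win);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}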


-- 
Lisandro Dalcín
---
Centro Internacional de Métodos Computacionales en Ingeniería (CIMEC)
Instituto de Desarrollo Tecnológico para la Industria Química (INTEC)
Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET)
PTLC - Güemes 3450, (3000) Santa Fe, Argentina
Tel/Fax: +54-(0)342-451.1594



Re: [OMPI devel] MPI_Accumulate() with MPI_PROC_NULL target rank

2009-07-15 Thread Brian W. Barrett

On Wed, 15 Jul 2009, Lisandro Dalcin wrote:


The MPI 2.1 standard says:

"MPI_PROC_NULL is a valid target rank in the MPI RMA calls
MPI_ACCUMULATE, MPI_GET, and MPI_PUT. The effect is the same as for
MPI_PROC_NULL in MPI point-to-point communication. After any RMA
operation with rank MPI_PROC_NULL, it is still necessary to finish the
RMA epoch with the synchronization method that started the epoch."

Unfortunately, MPI_Accumulate() is not quite the same as
point-to-point, as a reduction is involved. Suppose you make this call
(let me abuse and use keyword arguments):

MPI_Accumulate(..., target_rank=MPI_PROC_NULL,
target_datatype=MPI_BYTE, op=MPI_SUM, ...)

IIUC, the call fails (with MPI_ERR_OP) in Open MPI because MPI_BYTE is
an invalid datatype for MPI_SUM.

But provided that the target rank is MPI_PROC_NULL, would it make
sense for the call to succeed?


I believe no.  We do full argument error checking (that you provided a 
valid communicator and datatype) on send, receive, put, and get when the 
source/dest is MPI_PROC_NULL.  Therefore, I think it's logical that we 
extend that to include valid operations for accumulate.


Brian


Re: [OMPI devel] Fwd: [OMPI svn-full] svn:open-mpi r21686

2009-07-15 Thread Ralph Castain

Okay, George - this is fixed in r21690.

Thanks again
Ralph

On Jul 15, 2009, at 2:40 PM, Ralph Castain wrote:


Ah - interesting scenario!

Definitely a "bug" in the code, then. What it looks like, though, is  
that the jdata->num_procs is wrong. There shouldn't be any way that  
the num_procs in the node array is different than jdata->num_procs.


My guess is that the rank_file mapper isn't correctly maintaining  
the bookkeeping when we map the procs beyond those in the rankfile.  
I'll dig into it - have to fix something for Lenny anyway.


Meantime, this change looks fine regardless as it (a) is better code  
and (b) protects us against such errors.


Thanks for catching it!
Ralph


On Wed, Jul 15, 2009 at 2:30 PM, George Bosilca  
 wrote:
I think I found a better solution (in r21688). Here is what I was  
trying to do.


I have a more or less homogeneous cluster. In fact all processors  
are identical, except that some are quad core and some dual core. Of  
course I care how my processes are mapped on the quad cores, but not  
really on the dual cores.


My approach was to use the following configuration files.

In /home/bosilca/.openmpi/mca-params.conf I have:

orte_default_hostfile=/home/bosilca/.openmpi/machinefile
rmaps_rank_file_path = /home/bosilca/.openmpi/rankfile
rmaps_rank_file_priority = 100

In /home/bosilca/.openmpi/machinefile I have the full description of  
the cluster. As an example:

node01 slots=4
node02 slots=4
node03 slots=2
node04 slots=2

And in the /home/bosilca/.openmpi/rankfile file I have:
rank 0=+n0 slot=0
rank 1=+n0 slot=1
rank 2=+n1 slot=0
rank 3=+n1 slot=1

As long as I spawned jobs with fewer than 4 processes everything worked
fine. But when I used more than 4 processes, orterun segfaulted.
After debugging I found that the nodes, lrank and nrank arrays were
allocated based on jdata->num_procs, but then filled based on
the total number of processes in the jdata->nodes array. As it
appears that jdata->num_procs is somehow modified based on the
number of entries in the rankfile, we end up writing outside the
allocation and then segfault. Now with the latest patch, we can cope
with such a scenario by only packing the known information (and thus
not writing outside the allocated arrays).


This might not be the best approach, but it is doing what I'm  
looking for ...


 george.


On Jul 15, 2009, at 15:50 , Ralph Castain wrote:

The routed comm system relies on each daemon having complete  
information as to where every process is located, so the expectation  
was that only full maps would ever be sent. Thus, the nidmap code is  
setup to always send a full map.


I don't know how to even generate a "partial" map. I assume you are  
doing something offline? Is this to update changed info? If so,  
you'll also have to do something to update the daemon's maps or the  
comm system will break down.


Ralph

On Wed, Jul 15, 2009 at 1:40 PM, George Bosilca  
 wrote:
I have a question regarding the mapping. How can I declare a partial
mapping? In fact I only care about how some of the processes are
mapped on some specific nodes. Right now, if the rmaps file doesn't
contain information about all nodes, we give up (before this patch
we segfaulted).


Does it mean we always have to declare the whole mapping, or is it
just that we overlooked this strange case?


 george.

Begin forwarded message:


Author: bosilca
Date: 2009-07-15 15:36:53 EDT (Wed, 15 Jul 2009)
New Revision: 21686
URL: https://svn.open-mpi.org/trac/ompi/changeset/21686

Log:
Reorder the nidmap encoding function. Add a check to make sure we don't
write outside the boundaries of the allocated array.

However, the problem is still there. If we have rmaps file containing only
partial information the num_procs get set to the wrong value (the number of
hosts in the rmaps file instead of the number of processes requested on the
command line).

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel





Re: [OMPI devel] OpenMPI, PLPA and Linux cpuset/cgroup support

2009-07-15 Thread Chris Samuel

- "Ralph Castain"  wrote:

> Could you check this? You can run a trivial job using the -npernode x 
> option, where x matched the #cores you were allocated on the nodes.
> If you do this, do we bind to the correct cores?

Nope, I'm afraid it doesn't - submitted a job asking
for 4 cores on one node and was allocated cores 0-3 in
the cpuset.

Grep'ing the strace output for anything mentioning affinity shows:

[csamuel@tango027 CPI]$ fgrep affinity cpi-trace.txt
11412 execve("/usr/local/openmpi/1.3.3-gcc/bin/mpiexec", ["mpiexec", "--mca", 
"paffinity", "linux", "-npernode", "4", "/home/csamuel/Sources/Tests/CPI/"...], 
[/* 56 vars */]) = 0
11412 sched_getaffinity(0, 128,  { f }) = 8
11412 sched_setaffinity(0, 8,  { 0 })   = -1 EFAULT (Bad address)
11416 sched_getaffinity(0, 128,  
11416 <... sched_getaffinity resumed>  { f }) = 8
11416 sched_setaffinity(0, 8,  { 0 } 
11416 <... sched_setaffinity resumed> ) = -1 EFAULT (Bad address)
11414 sched_getaffinity(0, 128,  
11414 <... sched_getaffinity resumed>  { f }) = 8
11414 sched_setaffinity(0, 8,  { 0 } 
11414 <... sched_setaffinity resumed> ) = -1 EFAULT (Bad address)
11413 sched_getaffinity(0, 128,  
11413 <... sched_getaffinity resumed>  { f }) = 8
11413 sched_setaffinity(0, 8,  { 0 } 
11413 <... sched_setaffinity resumed> ) = -1 EFAULT (Bad address)
11415 sched_getaffinity(0, 128,  
11415 <... sched_getaffinity resumed>  { f }) = 8
11415 sched_setaffinity(0, 8,  { 0 } 
11415 <... sched_setaffinity resumed> ) = -1 EFAULT (Bad address)
11413 sched_getaffinity(11413, 8,  
11415 sched_getaffinity(11415, 8,  
11413 <... sched_getaffinity resumed>  { f }) = 8
11415 <... sched_getaffinity resumed>  { f }) = 8
11414 sched_getaffinity(11414, 8,  
11414 <... sched_getaffinity resumed>  { f }) = 8
11416 sched_getaffinity(11416, 8,  
11416 <... sched_getaffinity resumed>  { f }) = 8

I can confirm that it's not worked by checking what
plpa-taskset says about a process (for example 11414):

[root@tango027 plpa-taskset]# ./plpa-taskset -cp 11414
pid 11414's current affinity list: 0-3

According to the manual page:

   EFAULT A supplied memory address was invalid.

This is on a dual socket quad core AMD Shanghai system
running the 2.6.28.9 kernel (not had a chance to upgrade
recently).

Will do some more poking around after lunch.

cheers,
Chris
-- 
Christopher Samuel - (03) 9925 4751 - Systems Manager
 The Victorian Partnership for Advanced Computing
 P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency


Re: [OMPI devel] OpenMPI, PLPA and Linux cpuset/cgroup support

2009-07-15 Thread Ralph Castain
Looking at your command line, did you remember to set -mca  
mpi_paffinity_alone 1? If not, we won't set affinity on the processes.
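
That is, something like the following (program name illustrative):

  mpiexec --mca paffinity linux --mca mpi_paffinity_alone 1 -npernode 4 ./my_mpi_prog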



On Jul 15, 2009, at 8:11 PM, Chris Samuel wrote:



- "Ralph Castain"  wrote:


Could you check this? You can run a trivial job using the -npernode x
option, where x matched the #cores you were allocated on the nodes.
If you do this, do we bind to the correct cores?


Nope, I'm afraid it doesn't - submitted a job asking
for 4 cores on one node and was allocated cores 0-3 in
the cpuset.

Grep'ing the strace output for anything mentioning affinity shows:

[csamuel@tango027 CPI]$ fgrep affinity cpi-trace.txt
11412 execve("/usr/local/openmpi/1.3.3-gcc/bin/mpiexec", ["mpiexec",  
"--mca", "paffinity", "linux", "-npernode", "4", "/home/csamuel/ 
Sources/Tests/CPI/"...], [/* 56 vars */]) = 0

11412 sched_getaffinity(0, 128,  { f }) = 8
11412 sched_setaffinity(0, 8,  { 0 })   = -1 EFAULT (Bad address)
11416 sched_getaffinity(0, 128,  
11416 <... sched_getaffinity resumed>  { f }) = 8
11416 sched_setaffinity(0, 8,  { 0 } 
11416 <... sched_setaffinity resumed> ) = -1 EFAULT (Bad address)
11414 sched_getaffinity(0, 128,  
11414 <... sched_getaffinity resumed>  { f }) = 8
11414 sched_setaffinity(0, 8,  { 0 } 
11414 <... sched_setaffinity resumed> ) = -1 EFAULT (Bad address)
11413 sched_getaffinity(0, 128,  
11413 <... sched_getaffinity resumed>  { f }) = 8
11413 sched_setaffinity(0, 8,  { 0 } 
11413 <... sched_setaffinity resumed> ) = -1 EFAULT (Bad address)
11415 sched_getaffinity(0, 128,  
11415 <... sched_getaffinity resumed>  { f }) = 8
11415 sched_setaffinity(0, 8,  { 0 } 
11415 <... sched_setaffinity resumed> ) = -1 EFAULT (Bad address)
11413 sched_getaffinity(11413, 8,  
11415 sched_getaffinity(11415, 8,  
11413 <... sched_getaffinity resumed>  { f }) = 8
11415 <... sched_getaffinity resumed>  { f }) = 8
11414 sched_getaffinity(11414, 8,  
11414 <... sched_getaffinity resumed>  { f }) = 8
11416 sched_getaffinity(11416, 8,  
11416 <... sched_getaffinity resumed>  { f }) = 8

I can confirm that it's not worked by checking what
plpa-taskset says about a process (for example 11414):

[root@tango027 plpa-taskset]# ./plpa-taskset -cp 11414
pid 11414's current affinity list: 0-3

According to the manual page:

  EFAULT A supplied memory address was invalid.

This is on a dual socket quad core AMD Shanghai system
running the 2.6.28.9 kernel (not had a chance to upgrade
recently).

Will do some more poking around after lunch.

cheers,
Chris
--
Christopher Samuel - (03) 9925 4751 - Systems Manager
The Victorian Partnership for Advanced Computing
P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency
___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel