Re: [hwloc-devel] upcoming releases

2011-03-30 Thread Christopher Samuel
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

On 31/03/11 07:13, Brice Goglin wrote:

> Comments?

Sounds reasonable to me.

- -- 
Christopher Samuel - Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.unimelb.edu.au/

-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.10 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk2T46QACgkQO2KABBYQAh9DfgCeL/XMokMPaKTUnEJYm+kj3zwE
GzQAoIQLAzsayfT7yNNUxwXXcA2/ny8J
=cQDB
-END PGP SIGNATURE-


Re: [OMPI devel] Too many open files (24)

2011-03-30 Thread Samuel K. Gutierrez
Hi Tim,

Great news!  Happy calculating :-).

--
Samuel K. Gutierrez
Los Alamos National Laboratory

> Dear Samuel,
>
> Just as you replied I was trying that on the compute nodes. Surprise,
> surprise...the value returned as the hard and soft limits is 1024.
>
> Thanks for confirming my suspicions...
>
> Regards,
>
> Tim.
>
> On Mar 30, 2011, at 7:41 PM, Samuel K. Gutierrez wrote:
>
> Hi,
>
> It sounds like Open MPI is hitting your system's open file descriptor
> limit.  If that's the case, one potential workaround is to have your
> system administrator raise file descriptor limits.
>
> On a compute node, what does "ulimit -a" show (using bash)?
>
> Hope that helps,
>
> --
> Samuel K. Gutierrez
> Los Alamos National Laboratory
>
> On Mar 30, 2011, at 5:22 PM, Timothy Stitt wrote:
>
> Dear OpenMPI developers,
>
> One of our users was running a benchmark on a 1032 core simulation. He had
> a successful run at 900 cores but when he stepped up to 1032 cores the job
> just stalled and his logs contained many occurrences of the following
> line:
>
> [d6copt368.crc.nd.edu][[25621,1],0][btl_tcp_component.c:885:mca_btl_tcp_component_accept_handler]
> accept() failed: Too many open files (24)
>
> The simulation has a single master task that communicates with all the
> other tasks to write out some I/O via the master. We are assuming the
> message is related to this bottleneck. Is there a 1024 limit on the number
> of open files/connections for instance?
>
> Can anyone confirm the meaning of this error and, secondly, provide a
> resolution that hopefully doesn't involve a code rewrite?
>
> Thanks in advance,
>
> Tim.
>
> Tim Stitt PhD (User Support Manager).
> Center for Research Computing | University of Notre Dame |
> P.O. Box 539, Notre Dame, IN 46556 | Phone:  574-631-5287 | Email:
> tst...@nd.edu
>
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
> 
>
> Tim Stitt PhD (User Support Manager).
> Center for Research Computing | University of Notre Dame |
> P.O. Box 539, Notre Dame, IN 46556 | Phone:  574-631-5287 | Email:
> tst...@nd.edu
>



[hwloc-devel] Create success (hwloc r1.1.2a1r3339)

2011-03-30 Thread MPI Team
Creating nightly hwloc snapshot SVN tarball was a success.

Snapshot:   hwloc 1.1.2a1r3339
Start time: Wed Mar 30 21:03:28 EDT 2011
End time:   Wed Mar 30 21:05:37 EDT 2011

Your friendly daemon,
Cyrador


[hwloc-devel] Create success (hwloc r1.2a1r3345)

2011-03-30 Thread MPI Team
Creating nightly hwloc snapshot SVN tarball was a success.

Snapshot:   hwloc 1.2a1r3345
Start time: Wed Mar 30 21:01:05 EDT 2011
End time:   Wed Mar 30 21:03:28 EDT 2011

Your friendly daemon,
Cyrador


Re: [OMPI devel] Add child to another parent.

2011-03-30 Thread Ralph Castain
Sorry - should have included the devel list when I sent this.


On Mar 30, 2011, at 6:11 PM, Ralph Castain wrote:

> I'm not the expert on this area - Josh is, so I'll defer to him. I did take a 
> quick glance at the sstore framework, though, and it looks like there are 
> some params you could set that might help.
> 
> "ompi_info --param sstore all"
> 
> should tell you what's available. Also, note that Josh created a man page to 
> explain how sstore works. It's in section 7, looks like "man orte_sstore" 
> should get it.
> 
> 
> On Mar 30, 2011, at 3:09 PM, Hugo Meyer wrote:
> 
>> Hello again.
>> 
>> I'm working on the launch code to handle my checkpoints, but I'm a little 
>> stuck on how to set the path to my checkpoint and the executable 
>> (ompi_blcr_context.PID). I took a look at the code in 
>> odls_base_default_fns.c and this piece of code caught my attention:
>> 
>> #if OPAL_ENABLE_FT_CR == 1
>>         /*
>>          * OPAL CRS components need the opportunity to take action
>>          * before a process is forked.
>>          * Needs access to:
>>          *   - Environment
>>          *   - Rank/ORTE Name
>>          *   - Binary to exec
>>          */
>>         if( NULL != opal_crs.crs_prelaunch ) {
>>             if( OPAL_SUCCESS != (rc = opal_crs.crs_prelaunch(child->name->vpid,
>>                                           orte_sstore_base_prelaunch_location,
>>                                           &(app->app),
>>                                           &(app->cwd),
>>                                           &(app->argv),
>>                                           &(app->env) ) ) ) {
>>                 ORTE_ERROR_LOG(rc);
>>                 goto CLEANUP;
>>             }
>>         }
>> #endif
>> 
>> But I couldn't find out how to set orte_sstore_base_prelaunch_location; I know 
>> that initially this is set in sstore_base_open. For example, since I'm 
>> transferring my checkpoint from one node to another, I store the checkpoint 
>> that has to be restored in /tmp/1/, and it has a name like 
>> ompi_blcr_context.PID.
>> 
>> Is there any function I missed that allows me to do this? I'm 
>> asking because I do not want to change the signatures of the functions 
>> to pass the details of the checkpoint and the PID.
>> 
>> Best Regards.
>> 
>> Hugo Meyer
>> 
>> 2011/3/30 Hugo Meyer 
>> Thanks Ralph.
>> I have finished point (a) and it is now working; now I have to work on 
>> relaunching from my checkpoint as you said.
>> 
>> Best regards.
>> 
>> Hugo Meyer
>> 
>> 
>> 2011/3/29 Ralph Castain 
>> The resilient mapper -only- works on procs being restarted - it cannot map a 
>> job for its initial launch. You shouldn't set any rmaps flag and things will 
>> work correctly - the default round-robin mapper will map the initial launch, 
>> and then the resilient mapper will handle restarts.
>> 
>> 
>> On Mar 29, 2011, at 5:18 AM, Hugo Meyer wrote:
>> 
>>> Ralph.
>>> 
>>> I'm having a problem when I try to select the resilient rmaps component:
>>> 
>>> /home/hmeyer/desarrollo/ompi-code/binarios/bin/mpirun -v -np 4 --hostfile 
>>> ../hostfile --bynode -mca rmaps resilient -mca vprotocol receiver -mca plm 
>>> rsh -mca routed cm ./coll 6 10 2>out.txt 
>>> 
>>> I get this as error:
>>> [clus9:25568] [[53334,0],0] hostfile: checking hostfile ../hostfile for 
>>> nodes
>>> --
>>> Your job failed to map. Either no mapper was available, or none
>>> of the available mappers was able to perform the requested
>>> mapping operation. This can happen if you request a map type
>>> (e.g., loadbalance) and the corresponding mapper was not built.
>>> 
>>> --
>>> [clus9:25568] errmgr:hnp:update_state() [[53334,0],0]) --- App. Process 
>>> state updated for process NULL
>>> [clus9:25568] [[53334,0],0] errmgr:hnp: job [53334,0] reported state NEVER 
>>> LAUNCHED for proc NULL state UNDEFINED pid 0 exit_code 1
>>> [clus9:25568] [[53334,0],0] errmgr:hnp: job [53334,0] reported state NEVER 
>>> LAUNCHED
>>> [clus9:25568] [[53334,0],0] errmgr:hnp: abort called on job [53334,0] with 
>>> status 1
>>> 
>>> Is there a flag that I'm not turning on, or a component that I should have 
>>> selected?
>>> 
>>> Thanks again.
>>> 
>>> Hugo Meyer
>>> 
>>> 
>>> 2011/3/26 Hugo Meyer 
>>> Ok Ralph.
>>> 
>>> Thanks a lot for your help, I will do as you said and then let you know how 
>>> it goes.
>>> 
>>> Best Regards.
>>> 
>>> Hugo Meyer
>>> 
>>> 
>>> 2011/3/25 Ralph Castain 
>>> 
>>> On Mar 25, 2011, at 10:48 AM, Hugo Meyer wrote:
>>> 
 From what you've described before, I suspect all you'll need 

Re: [OMPI devel] Too many open files (24)

2011-03-30 Thread Timothy Stitt
Dear Samuel,

Just as you replied I was trying that on the compute nodes. Surprise, 
surprise...the value returned as the hard and soft limits is 1024.

Thanks for confirming my suspicions...

Regards,

Tim.

On Mar 30, 2011, at 7:41 PM, Samuel K. Gutierrez wrote:

Hi,

It sounds like Open MPI is hitting your system's open file descriptor limit.  
If that's the case, one potential workaround is to have your system 
administrator raise file descriptor limits.

On a compute node, what does "ulimit -a" show (using bash)?

Hope that helps,

--
Samuel K. Gutierrez
Los Alamos National Laboratory

On Mar 30, 2011, at 5:22 PM, Timothy Stitt wrote:

Dear OpenMPI developers,

One of our users was running a benchmark on a 1032 core simulation. He had a 
successful run at 900 cores but when he stepped up to 1032 cores the job just 
stalled and his logs contained many occurrences of the following line:

[d6copt368.crc.nd.edu][[25621,1],0][btl_tcp_component.c:885:mca_btl_tcp_component_accept_handler]
 accept() failed: Too many open files (24)

The simulation has a single master task that communicates with all the other 
tasks to write out some I/O via the master. We are assuming the message is 
related to this bottleneck. Is there a 1024 limit on the number of open 
files/connections for instance?

Can anyone confirm the meaning of this error and, secondly, provide a resolution 
that hopefully doesn't involve a code rewrite?

Thanks in advance,

Tim.

Tim Stitt PhD (User Support Manager).
Center for Research Computing | University of Notre Dame |
P.O. Box 539, Notre Dame, IN 46556 | Phone:  574-631-5287 | Email: 
tst...@nd.edu




Tim Stitt PhD (User Support Manager).
Center for Research Computing | University of Notre Dame |
P.O. Box 539, Notre Dame, IN 46556 | Phone:  574-631-5287 | Email: 
tst...@nd.edu



Re: [OMPI devel] Too many open files (24)

2011-03-30 Thread Samuel K. Gutierrez

Hi,

It sounds like Open MPI is hitting your system's open file descriptor  
limit.  If that's the case, one potential workaround is to have your  
system administrator raise file descriptor limits.


On a compute node, what does "ulimit -a" show (using bash)?

Hope that helps,

--
Samuel K. Gutierrez
Los Alamos National Laboratory

On Mar 30, 2011, at 5:22 PM, Timothy Stitt wrote:


Dear OpenMPI developers,

One of our users was running a benchmark on a 1032 core simulation.  
He had a successful run at 900 cores but when he stepped up to 1032  
cores the job just stalled and his logs contained many occurrences  
of the following line:


[d6copt368.crc.nd.edu][[25621,1],0][btl_tcp_component.c:885:mca_btl_tcp_component_accept_handler] accept() failed: Too many open files (24)


The simulation has a single master task that communicates with all  
the other tasks to write out some I/O via the master. We are  
assuming the message is related to this bottleneck. Is there a 1024  
limit on the number of open files/connections for instance?


Can anyone confirm the meaning of this error and, secondly, provide a 
resolution that hopefully doesn't involve a code rewrite?


Thanks in advance,

Tim.

Tim Stitt PhD (User Support Manager).
Center for Research Computing | University of Notre Dame |
P.O. Box 539, Notre Dame, IN 46556 | Phone:  574-631-5287 | Email: tst...@nd.edu





[OMPI devel] Too many open files (24)

2011-03-30 Thread Timothy Stitt
Dear OpenMPI developers,

One of our users was running a benchmark on a 1032 core simulation. He had a 
successful run at 900 cores but when he stepped up to 1032 cores the job just 
stalled and his logs contained many occurrences of the following line:

[d6copt368.crc.nd.edu][[25621,1],0][btl_tcp_component.c:885:mca_btl_tcp_component_accept_handler]
 accept() failed: Too many open files (24)

The simulation has a single master task that communicates with all the other 
tasks to write out some I/O via the master. We are assuming the message is 
related to this bottleneck. Is there a 1024 limit on the number of open 
files/connections for instance?

Can anyone confirm the meaning of this error and, secondly, provide a resolution 
that hopefully doesn't involve a code rewrite?

Thanks in advance,

Tim.

Tim Stitt PhD (User Support Manager).
Center for Research Computing | University of Notre Dame |
P.O. Box 539, Notre Dame, IN 46556 | Phone:  574-631-5287 | Email: 
tst...@nd.edu



Re: [OMPI devel] Add child to another parent.

2011-03-30 Thread Hugo Meyer
Hello again.

I'm working on the launch code to handle my checkpoints, but I'm a little
stuck on how to set the path to my checkpoint and the executable
(ompi_blcr_context.PID). I took a look at the code in
odls_base_default_fns.c and this piece of code caught my attention:

#if OPAL_ENABLE_FT_CR == 1
        /*
         * OPAL CRS components need the opportunity to take action
         * before a process is forked.
         * Needs access to:
         *   - Environment
         *   - Rank/ORTE Name
         *   - Binary to exec
         */
        if( NULL != opal_crs.crs_prelaunch ) {
            if( OPAL_SUCCESS != (rc = opal_crs.crs_prelaunch(child->name->vpid,
                                          orte_sstore_base_prelaunch_location,
                                          &(app->app),
                                          &(app->cwd),
                                          &(app->argv),
                                          &(app->env) ) ) ) {
                ORTE_ERROR_LOG(rc);
                goto CLEANUP;
            }
        }
#endif


But I couldn't find out how to set orte_sstore_base_prelaunch_location; I know
that initially this is set in sstore_base_open. For example, since I'm
transferring my checkpoint from one node to another, I store the checkpoint
that has to be restored in /tmp/1/, and it has a name like
ompi_blcr_context.PID.

Is there any function I missed that allows me to do this? I'm
asking because I do not want to change the signatures of the functions
to pass the details of the checkpoint and the PID.

Best Regards.

Hugo Meyer

2011/3/30 Hugo Meyer 

> Thanks Ralph.
> I have finished point (a) and it is now working; now I have to work on
> relaunching from my checkpoint as you said.
>
> Best regards.
>
> Hugo Meyer
>
>
> 2011/3/29 Ralph Castain 
>
>> The resilient mapper -only- works on procs being restarted - it cannot map
>> a job for its initial launch. You shouldn't set any rmaps flag and things
>> will work correctly - the default round-robin mapper will map the initial
>> launch, and then the resilient mapper will handle restarts.
>>
>>
>> On Mar 29, 2011, at 5:18 AM, Hugo Meyer wrote:
>>
>> Ralph.
>>
>> I'm having a problem when I try to select the resilient rmaps component:
>>
>> /home/hmeyer/desarrollo/ompi-code/binarios/bin/mpirun -v -np 4 --hostfile
>> ../hostfile --bynode -mca rmaps resilient -mca vprotocol receiver -mca plm
>> rsh -mca routed cm ./coll 6 10 2>out.txt
>>
>>
>> I get this as error:
>>
>> [clus9:25568] [[53334,0],0] hostfile: checking hostfile ../hostfile for
>> nodes
>> --
>> Your job failed to map. Either no mapper was available, or none
>> of the available mappers was able to perform the requested
>> mapping operation. This can happen if you request a map type
>> (e.g., loadbalance) and the corresponding mapper was not built.
>>
>> --
>> [clus9:25568] errmgr:hnp:update_state() [[53334,0],0]) --- App.
>> Process state updated for process NULL
>> [clus9:25568] [[53334,0],0] errmgr:hnp: job [53334,0] reported state NEVER
>> LAUNCHED for proc NULL state UNDEFINED pid 0 exit_code 1
>> [clus9:25568] [[53334,0],0] errmgr:hnp: job [53334,0] reported state NEVER
>> LAUNCHED
>> [clus9:25568] [[53334,0],0] errmgr:hnp: abort called on job [53334,0] with
>> status 1
>>
>>
>> Is there a flag that I'm not turning on, or a component that I should have
>> selected?
>>
>> Thanks again.
>>
>> Hugo Meyer
>>
>>
>> 2011/3/26 Hugo Meyer 
>>
>>> Ok Ralph.
>>>
>>> Thanks a lot for your help, I will do as you said and then let you know
>>> how it goes.
>>>
>>> Best Regards.
>>>
>>> Hugo Meyer
>>>
>>>
>>> 2011/3/25 Ralph Castain 
>>>

 On Mar 25, 2011, at 10:48 AM, Hugo Meyer wrote:

 From what you've described before, I suspect all you'll need to do is
> add some code in orte/mca/odls/base/odls_base_default_fns.c that (a) 
> checks
> to see if a process in the launch message is being relocated (the
> construct_child_list code does that already), and then (b) sends the
> required info to all local child processes so they can take appropriate
> action.
>
> Failure detection, re-launch, etc. have all been taken care of for you.
>


 I looked at the code you mentioned and I realize that I have
 two possible options, which I'm going to share with you to get your
 opinion.

 First of all I will let you know my current situation with the
 implementation. Since I'm working on a fault-tolerant system using
 uncoordinated checkpointing, I'm taking checkpoints of all my processes at
 different times and storing them on the machine where they are residing, 
 but I also send these checkpoints to another node (let's call it the 
 protector), so if this node fails its processes should be restarted in the 
 protector 
[hwloc-devel] upcoming releases

2011-03-30 Thread Brice Goglin
We talked with Jeff today and came to the following proposal:

We're doing 1.1.2 now (let's flush out the many small fixes that have
been pending since early February).

I am not confident enough to release the PCI stuff now because it didn't
get much real testing (it is still missing a bit of work in the tools and
doc). So a new plan could be:
1) branch 1.2 from current trunk and do a first RC now, so that the
current changes do not wait any longer
2) merge libpci into trunk right after branching 1.2, so that it gets
more actual testing (this branch
3) once the final 1.2 is released, do a first 1.3 RC if nothing went
wrong in the meantime

Current trunk already has a very long changelog, but I think it works
well. And it shouldn't require many doc updates during the RC
cycles. So the final 1.2 should arrive quickly, and we can seriously
expect a first PCI-enabled RC before summer.

Comments?

Brice



[OMPI devel] Fwd: [devel-core] Open MPI Developers Meeting

2011-03-30 Thread Joshua Hursey
Rich wanted to make this available to a broader audience. Re-posting to the 
devel list.

Begin forwarded message:

> From: Joshua Hursey 
> Date: March 30, 2011 9:14:03 AM CDT
> Subject: [devel-core] Open MPI Developers Meeting
> 
> It has been requested that we have a face-to-face Open MPI developers 
> meeting. It has been a long time since we were all in the same room to 
> discuss issues. Oak Ridge is willing to host the event.
> 
> 
> To get the ball rolling we need to decide two things:
> 
> 1) When would be the best 3 days that work for the most developers? Please 
> fill out the doodle poll by the next teleconf (April 5) so we can set the 
> date.
>  http://doodle.com/c59p4hrxqu2d9rmu
> 
> 2) What topics do we want on the agenda? I have a few items, but I'll bring 
> those forward later.
> 
> 
> Please send agenda items to the list. I'll bring this up on the next teleconf 
> as well.
> 
> -- Josh


Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey