Re: [OMPI devel] orte question

2011-07-23 Thread Ralph Castain
Gar - have to eat my words a bit. The jobid requested by orte-ps is just the 
"local" jobid - i.e., it is expecting you to provide a number from 0-N, as I 
described below (copied here):

> A jobid of 1 indicates the primary application, 2 and above would specify 
> comm_spawned jobs. 

Not providing the jobid at all corresponds to a wildcard and returns the status 
of all jobs under that mpirun.

To specify which mpirun you want info on, you use the --pid option. It is this 
option that isn't working properly - orte-ps returns info from all mpiruns and 
doesn't restrict the output to the given pid.

I'll fix that part, and implement the parsable output.
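
For anyone scripting against this, here is a minimal sketch of the two-field
jobid described in the quoted message below (a 16-bit number for the mpirun
plus a 16-bit number for the job within that mpirun). The function names are
hypothetical, and the bit order is an assumption: the sketch puts the mpirun
identifier in the upper 16 bits, which the thread does not state explicitly.

```python
from typing import Tuple

FIELD_BITS = 16
FIELD_MASK = (1 << FIELD_BITS) - 1  # 0xFFFF


def split_jobid(jobid: int) -> Tuple[int, int]:
    """Return (mpirun_id, local_jobid) for a 32-bit integer jobid.

    Assumption: the mpirun identifier occupies the upper 16 bits.
    """
    if not 0 <= jobid <= 0xFFFFFFFF:
        raise ValueError(f"jobid out of 32-bit range: {jobid}")
    return (jobid >> FIELD_BITS) & FIELD_MASK, jobid & FIELD_MASK


def join_jobid(mpirun_id: int, local_jobid: int) -> int:
    """Inverse of split_jobid under the same layout assumption."""
    for value in (mpirun_id, local_jobid):
        if not 0 <= value <= FIELD_MASK:
            raise ValueError(f"field out of 16-bit range: {value}")
    return (mpirun_id << FIELD_BITS) | local_jobid


# Integer jobid taken from the sample output quoted in this thread.
mpirun_id, local_jobid = split_jobid(719585280)
print(f"[{mpirun_id},{local_jobid}]")
```

Under that assumption, the 719585280 in Greg's sample splits into mpirun
identifier 10980 and local jobid 0, which is the kind of pair the [x,y]
notation shows.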


On Jul 22, 2011, at 8:55 PM, Ralph Castain wrote:

> 
> On Jul 22, 2011, at 3:57 PM, Greg Watson wrote:
> 
>> Hi Ralph,
>> 
>> I'd like three things :-)
>> 
>> a) A --report-jobid option that prints the jobid on the first line in a form 
>> that can be passed to the -jobid option on ompi-ps. Probably tagging it in 
>> the output if -tag-output is enabled (e.g. jobid:) would be a good 
>> idea.
>> 
>> b) The orte-ps command output to use the same jobid format.
> 
> I started looking at the above, and found that orte-ps is just plain wrong in 
> the way it handles jobid. The jobid consists of two fields: a 16-bit number 
> indicating the mpirun, and a 16-bit number indicating the job within that 
> mpirun. Unfortunately, orte-ps sends a data request to every mpirun out there 
> instead of only to the one corresponding to that jobid.
> 
> What we probably should do is have you indicate the mpirun of interest via 
> the -pid option, and then let jobid tell us which job you want within that 
> mpirun. A jobid of 1 indicates the primary application, 2 and above would 
> specify comm_spawned jobs. A jobid of -1 would return the status of all jobs 
> under that mpirun.
> 
> If multiple mpiruns are being reported, then the "jobid" in the report should 
> again be the "local" jobid within that mpirun.
> 
> After all, you don't really care what the orte-internal 16-bit identifier is 
> for that mpirun.
> 
>> 
>> c) A more easily parsable output format from ompi-ps. It doesn't need to be 
>> a full blown XML format, just something like the following would suffice:
>> 
>> jobid:719585280:state:Running:slots:1:num procs:4
>> process_name:./x:rank:0:pid:3082:node:node1.com:state:Running
>> process_name:./x:rank:1:pid:4567:node:node5.com:state:Running
>> process_name:./x:rank:2:pid:2343:node:node4.com:state:Running
>> process_name:./x:rank:3:pid:3422:node:node7.com:state:Running
>> jobid:345346663:state:running:slots:1:num procs:2
>> process_name:./x:rank:0:pid:5563:node:node2.com:state:Running
>> process_name:./x:rank:1:pid:6677:node:node3.com:state:Running
> 
> Shouldn't be too hard to do - bunch of if-then-else statements required, 
> though.
> 
>> 
>> I'd be happy to help with any or all of these.
> 
> Appreciate the offer - let me see how hard this proves to be...
> 
>> 
>> Cheers,
>> Greg
>> 
>> On Jul 22, 2011, at 10:18 AM, Ralph Castain wrote:
>> 
>>> Hmmm...well, it looks like we could have made this nicer than we did :-/
>>> 
>>> If you add --report-uri to the mpirun command line, you'll get back the uri 
>>> for that mpirun. This has the form of :. As the -h option 
>>> indicates:
>>> 
>>> -report-uri | --report-uri   
>>>   Printout URI on stdout [-], stderr [+], or a file
>>>   [anything else]
>>> 
>>> The "jobid" required by the orte-ps command is the one reported there. We 
>>> could easily add a --report-jobid option if that makes things easier.
>>> 
>>> As to the difference in how orte-ps shows the jobid...well, that's probably 
>>> historical. orte-ps uses an orte utility function to print the jobid, and 
>>> that utility always shows the jobid in component form. Again, could add or 
>>> just use the integer version.
>>> 
>>> 
>>> On Jul 22, 2011, at 7:01 AM, Greg Watson wrote:
>>> 
>>>> Hi all,
>>>> 
>>>> Does anyone know if it's possible to get the orte jobid from the mpirun 
>>>> command? If not, how are you supposed to get it to use with orte-ps? Also, 
>>>> orte-ps reports the jobid in [x,y] notation, but the jobid argument seems 
>>>> to be an integer. How does that work?
>>>> 
>>>> Thanks,
>>>> Greg
>>>> ___
>>>> devel mailing list
>>>> de...@open-mpi.org
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel




[OMPI devel] need help to add a module

2011-07-23 Thread bin wang
Hello,

I'm trying to add a module to the Open MPI MCA framework.
I would like the module to be conditionally compiled and linked:
disabled by default and enabled by certain flags at the configure
step.

When I build it as a dynamic module, everything works fine.
The problem is that when I compile and link it statically,
the compiler complains that the component variable is not defined.

In my build log, I found the following:
--- MCA component btl:mx (m4 configuration macro)
checking for MCA component btl:mx compile mode... static
checking --with-mx value... simple ok (unspecified)
checking --with-mx-libdir value... simple ok (unspecified)
checking myriexpress.h usability... no
checking myriexpress.h presence... no
checking for myriexpress.h... no
checking if MCA component btl:mx can compile... no

Correspondingly, ompi/mca/btl/base/static-components.h had no
extern declaration of the mca_btl_mx_component variable.

I think this is the behavior I want for my module. I checked the
Makefile.am files but found nothing special.
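
One way to confirm what configure decided is to list the components declared
in the generated header. This is only a sketch: it assumes each built-in
component shows up on an "extern" line as mca_<framework>_<name>_component,
the helper name is hypothetical, and the sample header text is made up for
illustration rather than copied from a real build.

```python
import re
from typing import List


def declared_components(header_text: str, framework: str) -> List[str]:
    """Return component names found on 'extern' lines for one framework.

    Assumes declarations of the form:
        extern ... mca_<framework>_<name>_component;
    """
    pattern = re.compile(
        rf"^\s*extern\b[^;\n]*\bmca_{re.escape(framework)}_(\w+)_component\b",
        re.MULTILINE,
    )
    return pattern.findall(header_text)


# Hypothetical header contents, for illustration only.
SAMPLE_HEADER = """\
extern const mca_base_component_t mca_btl_self_component;
extern const mca_base_component_t mca_btl_sm_component;
"""

names = declared_components(SAMPLE_HEADER, "btl")
print(names)
print("mx built in statically:", "mx" in names)
```

To try it on a real tree, read ompi/mca/btl/base/static-components.h into a
string and pass it in place of the sample text.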

I'm not familiar with the autotools. Can anyone give me some detailed
guidance on what I should do?

Thanks in advance.

-- 
Bin WANG


[OMPI devel] shmem error msg

2011-07-23 Thread Ralph Castain
Whenever I run valgrind on orterun (or any OMPI tool), I get the following 
error msg:

--
A system call failed during shared memory initialization that should
not have.  It is likely that your MPI job will now either abort or
experience performance degradation.

  Local host:  Ralph
  System call: shm_unlink(2) 
  Error:   Function not implemented (errno 78)
--

It's coming out of open-rte/help-opal-shmem-posix.txt.

Everything continues, so I'm not sure what this is all about. Anyone recognize 
this???

It's on the trunk, running on a Mac, vanilla configure.
Ralph
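
A side note for anyone matching the number in that message against a header:
errno values are platform-specific. 78 is ENOSYS on Mac OS X, while Linux uses
38 for the same name; both describe it as "Function not implemented". A small
sketch for decoding a value on the machine where it runs (describe_errno is a
hypothetical helper, not part of Open MPI):

```python
import errno
import os


def describe_errno(code: int) -> str:
    """Return 'NAME (number): message' for an errno value on this platform."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name} ({code}): {os.strerror(code)}"


# The number printed here depends on the operating system.
print(describe_errno(errno.ENOSYS))
```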




Re: [OMPI devel] orte question

2011-07-23 Thread Ralph Castain
Okay, you should have it in r24929. Use:

orte-ps --parseable

to get the new output.
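
As a rough illustration of why a colon-delimited layout is easy to consume,
the sketch below parses records shaped like the sample Greg proposed (quoted
below). It assumes the --parseable output follows that proposal, since the
actual r24929 format is not shown in this thread, and it assumes no value
contains a ':' character. parse_record and parse_listing are hypothetical
names.

```python
from typing import Dict, List


def parse_record(line: str) -> Dict[str, str]:
    """Split 'key:value:key:value...' into a dict of strings."""
    fields = line.strip().split(":")
    if len(fields) % 2 != 0:
        raise ValueError(f"odd number of fields: {line!r}")
    return dict(zip(fields[0::2], fields[1::2]))


def parse_listing(text: str) -> List[dict]:
    """Group process records under the job record that precedes them."""
    jobs: List[dict] = []
    for raw in text.splitlines():
        if not raw.strip():
            continue  # skip blank lines
        record = parse_record(raw)
        if "jobid" in record:
            jobs.append({"job": record, "procs": []})
        elif jobs:
            jobs[-1]["procs"].append(record)
        else:
            raise ValueError("process record before any job record")
    return jobs


# Sample text copied from the proposal in this thread.
SAMPLE = """\
jobid:719585280:state:Running:slots:1:num procs:4
process_name:./x:rank:0:pid:3082:node:node1.com:state:Running
process_name:./x:rank:1:pid:4567:node:node5.com:state:Running
jobid:345346663:state:running:slots:1:num procs:2
process_name:./x:rank:0:pid:5563:node:node2.com:state:Running
"""

for entry in parse_listing(SAMPLE):
    print(entry["job"]["jobid"], [p["node"] for p in entry["procs"]])
```

Pairing fields by position keeps the parser independent of field order, and
unknown keys simply end up in the dict, so extra fields appended to a record
would not break it as long as they also come in key:value pairs.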


On Jul 23, 2011, at 11:43 AM, Ralph Castain wrote:

> Gar - have to eat my words a bit. The jobid requested by orte-ps is just the 
> "local" jobid - i.e., it is expecting you to provide a number from 0-N, as I 
> described below (copied here):
> 
>> A jobid of 1 indicates the primary application, 2 and above would specify 
>> comm_spawned jobs. 
> 
> Not providing the jobid at all corresponds to a wildcard and returns the 
> status of all jobs under that mpirun.
> 
> To specify which mpirun you want info on, you use the --pid option. It is 
> this option that isn't working properly - orte-ps returns info from all 
> mpiruns and doesn't restrict the output to the given pid.
> 
> I'll fix that part, and implement the parsable output.
> 
> 
> On Jul 22, 2011, at 8:55 PM, Ralph Castain wrote:
> 
>> 
>> On Jul 22, 2011, at 3:57 PM, Greg Watson wrote:
>> 
>>> Hi Ralph,
>>> 
>>> I'd like three things :-)
>>> 
>>> a) A --report-jobid option that prints the jobid on the first line in a 
>>> form that can be passed to the -jobid option on ompi-ps. Probably tagging 
>>> it in the output if -tag-output is enabled (e.g. jobid:) would be a 
>>> good idea.
>>> 
>>> b) The orte-ps command output to use the same jobid format.
>> 
>> I started looking at the above, and found that orte-ps is just plain wrong 
>> in the way it handles jobid. The jobid consists of two fields: a 16-bit 
>> number indicating the mpirun, and a 16-bit number indicating the job within 
>> that mpirun. Unfortunately, orte-ps sends a data request to every mpirun out 
>> there instead of only to the one corresponding to that jobid.
>> 
>> What we probably should do is have you indicate the mpirun of interest via 
>> the -pid option, and then let jobid tell us which job you want within that 
>> mpirun. A jobid of 1 indicates the primary application, 2 and above would 
>> specify comm_spawned jobs. A jobid of -1 would return the status of all jobs 
>> under that mpirun.
>> 
>> If multiple mpiruns are being reported, then the "jobid" in the report 
>> should again be the "local" jobid within that mpirun.
>> 
>> After all, you don't really care what the orte-internal 16-bit identifier is 
>> for that mpirun.
>> 
>>> 
>>> c) A more easily parsable output format from ompi-ps. It doesn't need to be 
>>> a full blown XML format, just something like the following would suffice:
>>> 
>>> jobid:719585280:state:Running:slots:1:num procs:4
>>> process_name:./x:rank:0:pid:3082:node:node1.com:state:Running
>>> process_name:./x:rank:1:pid:4567:node:node5.com:state:Running
>>> process_name:./x:rank:2:pid:2343:node:node4.com:state:Running
>>> process_name:./x:rank:3:pid:3422:node:node7.com:state:Running
>>> jobid:345346663:state:running:slots:1:num procs:2
>>> process_name:./x:rank:0:pid:5563:node:node2.com:state:Running
>>> process_name:./x:rank:1:pid:6677:node:node3.com:state:Running
>> 
>> Shouldn't be too hard to do - bunch of if-then-else statements required, 
>> though.
>> 
>>> 
>>> I'd be happy to help with any or all of these.
>> 
>> Appreciate the offer - let me see how hard this proves to be...
>> 
>>> 
>>> Cheers,
>>> Greg
>>> 
>>> On Jul 22, 2011, at 10:18 AM, Ralph Castain wrote:
>>> 
>>>> Hmmm...well, it looks like we could have made this nicer than we did :-/
>>>> 
>>>> If you add --report-uri to the mpirun command line, you'll get back the 
>>>> uri for that mpirun. This has the form of :. As the -h option 
>>>> indicates:
>>>> 
>>>> -report-uri | --report-uri   
>>>>  Printout URI on stdout [-], stderr [+], or a file
>>>>  [anything else]
>>>> 
>>>> The "jobid" required by the orte-ps command is the one reported there. We 
>>>> could easily add a --report-jobid option if that makes things easier.
>>>> 
>>>> As to the difference in how orte-ps shows the jobid...well, that's 
>>>> probably historical. orte-ps uses an orte utility function to print the 
>>>> jobid, and that utility always shows the jobid in component form. Again, 
>>>> could add or just use the integer version.
>>>> 
>>>> 
>>>> On Jul 22, 2011, at 7:01 AM, Greg Watson wrote:
>>>> 
>>>>> Hi all,
>>>>> 
>>>>> Does anyone know if it's possible to get the orte jobid from the mpirun 
>>>>> command? If not, how are you supposed to get it to use with orte-ps? 
>>>>> Also, orte-ps reports the jobid in [x,y] notation, but the jobid argument 
>>>>> seems to be an integer. How does that work?
>>>>> 
>>>>> Thanks,
>>>>> Greg

Re: [OMPI devel] orte question

2011-07-23 Thread Ashley Pittman

On 23 Jul 2011, at 03:55, Ralph Castain wrote:
>> c) A more easily parsable output format from ompi-ps. It doesn't need to be 
>> a full blown XML format, just something like the following would suffice:
>> 
>> jobid:719585280:state:Running:slots:1:num procs:4
>> process_name:./x:rank:0:pid:3082:node:node1.com:state:Running
>> process_name:./x:rank:1:pid:4567:node:node5.com:state:Running
>> process_name:./x:rank:2:pid:2343:node:node4.com:state:Running
>> process_name:./x:rank:3:pid:3422:node:node7.com:state:Running
>> jobid:345346663:state:running:slots:1:num procs:2
>> process_name:./x:rank:0:pid:5563:node:node2.com:state:Running
>> process_name:./x:rank:1:pid:6677:node:node3.com:state:Running
> 
> Shouldn't be too hard to do - bunch of if-then-else statements required, 
> though.

I've been more than happy with the current output. The only problem I've had in 
the time I've been using it is that some extra fields are appended when using 
checkpoint-restart.

Ashley.

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk




Re: [OMPI devel] orte question

2011-07-23 Thread Ralph Castain

On Jul 23, 2011, at 5:04 PM, Ashley Pittman wrote:

> 
> On 23 Jul 2011, at 03:55, Ralph Castain wrote:
>>> c) A more easily parsable output format from ompi-ps. It doesn't need to be 
>>> a full blown XML format, just something like the following would suffice:
>>> 
>>> jobid:719585280:state:Running:slots:1:num procs:4
>>> process_name:./x:rank:0:pid:3082:node:node1.com:state:Running
>>> process_name:./x:rank:1:pid:4567:node:node5.com:state:Running
>>> process_name:./x:rank:2:pid:2343:node:node4.com:state:Running
>>> process_name:./x:rank:3:pid:3422:node:node7.com:state:Running
>>> jobid:345346663:state:running:slots:1:num procs:2
>>> process_name:./x:rank:0:pid:5563:node:node2.com:state:Running
>>> process_name:./x:rank:1:pid:6677:node:node3.com:state:Running
>> 
>> Shouldn't be too hard to do - bunch of if-then-else statements required, 
>> though.
> 
> I've been more than happy with the current output. The only problem I've had 
> in the time I've been using it is that some extra fields are appended when 
> using checkpoint-restart.

You don't have to use the new one - just don't put --parseable on the cmd line. 
I can see that this is easier to parse, though.
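
Since the new format is opt-in, a wrapper script can choose per call. The
sketch below only builds the argument list and does not run anything. It
assumes the option spellings used in this thread (--pid, --jobid,
--parseable); build_orte_ps_argv is a hypothetical helper and the pid value is
a placeholder.

```python
from typing import List, Optional


def build_orte_ps_argv(pid: Optional[int] = None,
                       jobid: Optional[int] = None,
                       parseable: bool = False) -> List[str]:
    """Build an orte-ps command line; option names are assumed from the thread."""
    argv = ["orte-ps"]
    if pid is not None:
        argv += ["--pid", str(pid)]      # which mpirun to query
    if jobid is not None:
        argv += ["--jobid", str(jobid)]  # "local" jobid within that mpirun
    if parseable:
        argv.append("--parseable")       # opt in to the new output format
    return argv


# Default call keeps the existing human-readable output.
print(build_orte_ps_argv())
# Placeholder pid; jobid 1 is the primary application per the thread.
print(build_orte_ps_argv(pid=12345, jobid=1, parseable=True))
```

The resulting list can be handed to subprocess.run once the option names have
been checked against the installed orte-ps.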


> 
> Ashley.
> 
> -- 
> 
> Ashley Pittman, Bath, UK.
> 
> Padb - A parallel job inspection tool for cluster computing
> http://padb.pittman.org.uk




Re: [OMPI devel] shmem error msg

2011-07-23 Thread Samuel K. Gutierrez
Hi Ralph,

That's mine - I'll take a look.

Thanks,

Sam

> Whenever I run valgrind on orterun (or any OMPI tool), I get the following
> error msg:
>
> --
> A system call failed during shared memory initialization that should
> not have.  It is likely that your MPI job will now either abort or
> experience performance degradation.
>
>   Local host:  Ralph
>   System call: shm_unlink(2)
>   Error:   Function not implemented (errno 78)
> --
>
> It's coming out of open-rte/help-opal-shmem-posix.txt.
>
> Everything continues, so I'm not sure what this is all about. Anyone
> recognize this???
>
> It's on the trunk, running on a Mac, vanilla configure.
> Ralph