Re: [OMPI devel] Error message improvement

2009-09-09 Thread Nysal Jan
__FUNCTION__ is not portable.
__func__ is, but it needs a C99-compliant compiler.

--Nysal

On Tue, Sep 8, 2009 at 9:06 PM, Lenny Verkhovsky  wrote:

> fixed in r21952
> thanks.
>
> On Tue, Sep 8, 2009 at 5:08 PM, Arthur Huillet wrote:
>
>> Lenny Verkhovsky wrote:
>>
>>> Why not use __FUNCTION__ in all our error messages???
>>>
>>
>> Sounds good, this way the function names are always correct.
>>
>> --
>> Greetings, A. Huillet
>>


Re: [OMPI devel] Error message improvement

2009-09-09 Thread Lenny Verkhovsky
Hi All,
Is a C99-compliant compiler something unusual, or is there a policy
among OMPI developers/users that prevents me from using __func__
instead of hardcoded strings in the code?
Thanks.
Lenny.

On Wed, Sep 9, 2009 at 1:48 PM, Nysal Jan  wrote:

> __FUNCTION__ is not portable.
> __func__ is, but it needs a C99-compliant compiler.
>
> --Nysal
>
> On Tue, Sep 8, 2009 at 9:06 PM, Lenny Verkhovsky <
> lenny.verkhov...@gmail.com> wrote:
>
>> fixed in r21952
>> thanks.
>>
>> On Tue, Sep 8, 2009 at 5:08 PM, Arthur Huillet 
>> wrote:
>>
>>> Lenny Verkhovsky wrote:
>>>
Why not use __FUNCTION__ in all our error messages???

>>>
>>> Sounds good, this way the function names are always correct.
>>>
>>> --
>>> Greetings, A. Huillet
>>>


Re: [OMPI devel] Error message improvement

2009-09-09 Thread George Bosilca
__func__ is what you should use. We take care of having it defined in
_all_ cases. If the compiler doesn't support it we define it manually
(to __FUNCTION__ or to __FILE__ in the worst case), so it is always
available (even if it doesn't contain what one might expect, such as in
the case of __FILE__).


  george.
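A minimal sketch in the spirit of what George describes; the real Open
MPI version is driven by configure checks, and the HAVE_C99_FUNC macro
below is invented for this illustration:

    #include <stdio.h>

    /* Hypothetical fallback: if the compiler lacks C99 __func__, map it
     * to __FUNCTION__ (a GCC extension) or, in the worst case, __FILE__.
     * HAVE_C99_FUNC is a made-up stand-in for a configure result. */
    #if !defined(HAVE_C99_FUNC)
    #  if defined(__GNUC__)
    #    define __func__ __FUNCTION__   /* same contents as __func__ */
    #  else
    #    define __func__ __FILE__       /* worst case: file, not function */
    #  endif
    #endif

    static int open_port(void)
    {
        /* With real __func__ (or __FUNCTION__) this prints "open_port";
         * with the __FILE__ fallback it prints the source file name. */
        fprintf(stderr, "[%s] failed to open port\n", __func__);
        return -1;
    }

    int main(void) { return open_port() == -1 ? 0 : 1; }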

On Sep 9, 2009, at 14:16 , Lenny Verkhovsky wrote:


Hi All,
Is a C99-compliant compiler something unusual, or is there a policy
among OMPI developers/users that prevents me from using __func__
instead of hardcoded strings in the code?
Thanks.
Lenny.

On Wed, Sep 9, 2009 at 1:48 PM, Nysal Jan  wrote:
__FUNCTION__ is not portable.
__func__ is, but it needs a C99-compliant compiler.

--Nysal






Re: [OMPI devel] Error message improvement

2009-09-09 Thread N.M. Maclaren

On Sep 9 2009, George Bosilca wrote:

On Sep 9, 2009, at 14:16 , Lenny Verkhovsky wrote:


Is a C99-compliant compiler something unusual, or is there a policy
among OMPI developers/users that prevents me from using __func__
instead of hardcoded strings in the code?


__func__ is what you should use. We take care of having it defined in
_all_ cases. If the compiler doesn't support it we define it manually
(to __FUNCTION__ or to __FILE__ in the worst case), so it is always
available (even if it doesn't contain what one might expect, such as in
the case of __FILE__).


That's a good, practical solution.  A slight rider is that you shouldn't
be clever with it - such as using it in preprocessor statements.  I tried
some tests at one stage, and there were 'interesting' variations on how
different compilers interpreted C99.  Let alone the fact that it might
map to something else, with different rules.  If you need to play such
games, use hard-coded names.
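To make the caveat concrete: unlike __FILE__, C99 defines __func__ as an
implicitly declared static const char array, not a macro or string
literal, so it cannot be used in preprocessor directives or in
compile-time string concatenation. A small illustration:

    #include <stdio.h>

    void demo(void)
    {
        /* __FILE__ is a string literal, so it concatenates at compile time: */
        const char *where = "in file " __FILE__;

        /* __func__ is not a literal; C99 treats it as if the function began
         * with:  static const char __func__[] = "demo";
         * so the analogous line would not compile:
         *     const char *who = "in function " __func__;
         * and #if/#ifdef tests on __func__ are meaningless. */
        printf("%s, function %s\n", where, __func__);
    }

    int main(void) { demo(); return 0; }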

Things may have stabilised since then, but I wouldn't bet on it.

Regards,
Nick Maclaren.




Re: [OMPI devel] Error message improvement

2009-09-09 Thread Lenny Verkhovsky
fixed in r21956
__FUNCTION__ was replaced with __func__
thanks.
Lenny.
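For reference, the shape of the change at a call site (a hypothetical
example; opal_output is Open MPI's logging call, but this line is not
taken from the actual r21956 diff):

    /* before: function name hardcoded in the message, can go stale */
    opal_output(0, "btl_setup: failed to create CQ");

    /* after: the compiler supplies the name via __func__ */
    opal_output(0, "%s: failed to create CQ", __func__);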

On Wed, Sep 9, 2009 at 2:59 PM, N.M. Maclaren  wrote:

> On Sep 9 2009, George Bosilca wrote:
>
>> On Sep 9, 2009, at 14:16 , Lenny Verkhovsky wrote:
>>
>>> Is a C99-compliant compiler something unusual, or is there a policy
>>> among OMPI developers/users that prevents me from using __func__
>>> instead of hardcoded strings in the code?
>>>
>>
>> __func__ is what you should use. We take care of having it defined in
>> _all_ cases. If the compiler doesn't support it we define it manually
>> (to __FUNCTION__ or to __FILE__ in the worst case), so it is always
>> available (even if it doesn't contain what one might expect, such as
>> in the case of __FILE__).
>>
>
> That's a good, practical solution.  A slight rider is that you shouldn't
> be clever with it - such as using it in preprocessor statements.  I tried
> some tests at one stage, and there were 'interesting' variations on how
> different compilers interpreted C99.  Let alone the fact that it might
> map to something else, with different rules.  If you need to play such
> games, use hard-coded names.
>
> Things may have stabilised since then, but I wouldn't bet on it.
>
> Regards,
> Nick Maclaren.
>
>


[OMPI devel] fix 2014: Problems in romio

2009-09-09 Thread pascal . deveze
I have seen that ROMIO goes wrong with fix 2014: a lot of the ROMIO tests in
ompi/mca/io/romio/romio/test/ are failing.
For example, with noncontig_coll2:

[inti15:28259] *** Process received signal ***
[inti15:28259] Signal: Segmentation fault (11)
[inti15:28259] Signal code: Address not mapped (1)
[inti15:28259] Failing at address: (nil)
[inti15:28259] [ 0] /lib64/libpthread.so.0 [0x3f19c0e4c0]
[inti15:28259] [ 1]
/home_nfs/devezep/ATLAS/openmpi-default/lib/openmpi/mca_btl_openib.so
[0x2b6640c74d79]
[inti15:28259] [ 2]
/home_nfs/devezep/ATLAS/openmpi-default/lib/openmpi/mca_rml_oob.so
[0x2b663e2e6e92]
[inti15:28259] [ 3]
/home_nfs/devezep/ATLAS/openmpi-default/lib/openmpi/mca_oob_tcp.so
[0x2b663e4f8e63]
[inti15:28259] [ 4]
/home_nfs/devezep/ATLAS/openmpi-default/lib/openmpi/mca_oob_tcp.so
[0x2b663e4ff485]
[inti15:28259] [ 5]
/home_nfs/devezep/ATLAS/openmpi-default/lib/libopen-pal.so.0(opal_event_loop+0x5df)
 [0x2b663d3d92ff]
[inti15:28259] [ 6]
/home_nfs/devezep/ATLAS/openmpi-default/lib/libopen-pal.so.0(opal_progress+0x5e)
 [0x2b663d3ba33e]
[inti15:28259] [ 7] /home_nfs/devezep/ATLAS/openmpi-default/lib/libmpi.so.0
[0x2b663ce26624]
[inti15:28259] [ 8]
/home_nfs/devezep/ATLAS/openmpi-default/lib/openmpi/mca_coll_tuned.so
[0x2b664217fda2]
[inti15:28259] [ 9]
/home_nfs/devezep/ATLAS/openmpi-default/lib/openmpi/mca_coll_tuned.so
[0x2b6642179966]
[inti15:28259] [10]
/home_nfs/devezep/ATLAS/openmpi-default/lib/libmpi.so.0(MPI_Alltoall+0x6f)
[0x2b663ce352ef]
[inti15:28259] [11]
/home_nfs/devezep/ATLAS/openmpi-default/lib/openmpi/mca_io_romio.so(ADIOI_Calc_others_req+0x65)
 [0x2aaab1cfc525]
[inti15:28259] [12]
/home_nfs/devezep/ATLAS/openmpi-default/lib/openmpi/mca_io_romio.so(ADIOI_GEN_WriteStridedColl+0x433)
 [0x2aaab1cf0ac3]
[inti15:28259] [13]
/home_nfs/devezep/ATLAS/openmpi-default/lib/openmpi/mca_io_romio.so(MPIOI_File_write_all+0xc0)
 [0x2aaab1d0a8f0]
[inti15:28259] [14]
/home_nfs/devezep/ATLAS/openmpi-default/lib/openmpi/mca_io_romio.so(mca_io_romio_dist_MPI_File_write_all+0x23)
 [0x2aaab1d0a823]
[inti15:28259] [15]
/home_nfs/devezep/ATLAS/openmpi-default/lib/openmpi/mca_io_romio.so
[0x2aaab1cedce9]
[inti15:28259] [16]
/home_nfs/devezep/ATLAS/openmpi-default/lib/libmpi.so.0(MPI_File_write_all+0x4e)
 [0x2b663ce64f9e]
[inti15:28259] [17] ./noncontig_coll2(test_file+0x32b) [0x4034bb]
[inti15:28259] [18] ./noncontig_coll2(main+0x58b) [0x402d03]
[inti15:28259] [19] /lib64/libc.so.6(__libc_start_main+0xf4) [0x3f1901d974]
[inti15:28259] [20] ./noncontig_coll2 [0x4026c9]
[inti15:28259] *** End of error message ***

All the ROMIO tests pass without this fix.

Is there a problem in ROMIO with the datatype interface?

Pascal

Here is the export of the corresponding patch:

hg export 16301
# HG changeset patch
# User rusraink
# Date 1251912841 0
# Node ID eefd4bd4551969dc7454e63c2f42871cc9376a8f
# Parent  8aab76743e58474f1341be6f9d0ac9ae338507f1
 - This fixes #2014:
   As noted in
http://www.open-mpi.org/community/lists/devel/2009/08/6741.php,
   we do not correctly free a dupped predefined datatype.
   The fix is a bit more involving. See ticket for details.
   Tested with ibm tests and mpi_test_suite (though there's two "old" failures
   zero5.c and zero6.c)

   Thanks to Lisandro Dalcin for bringing this up.

diff -r 8aab76743e58 -r eefd4bd45519 ompi/datatype/ompi_datatype.h
--- a/ompi/datatype/ompi_datatype.h Wed Sep 02 11:23:54 2009 +0000
+++ b/ompi/datatype/ompi_datatype.h Wed Sep 02 17:34:01 2009 +0000
@@ -202,11 +202,14 @@
     }
     opal_datatype_clone ( &oldType->super, &new_ompi_datatype->super);
 
+    new_ompi_datatype->super.flags &= (~OMPI_DATATYPE_FLAG_PREDEFINED);
+
     /* Set the keyhash to NULL -- copying attributes is *only* done at
        the top level (specifically, MPI_TYPE_DUP). */
     new_ompi_datatype->d_keyhash = NULL;
     new_ompi_datatype->args = NULL;
-    strncpy (new_ompi_datatype->name, oldType->name, MPI_MAX_OBJECT_NAME);
+    snprintf (new_ompi_datatype->name, MPI_MAX_OBJECT_NAME, "Dup %s",
+              oldType->name);
 
     return OMPI_SUCCESS;
 }
diff -r 8aab76743e58 -r eefd4bd45519 opal/datatype/opal_datatype_clone.c
--- a/opal/datatype/opal_datatype_clone.c   Wed Sep 02 11:23:54 2009 +0000
+++ b/opal/datatype/opal_datatype_clone.c   Wed Sep 02 17:34:01 2009 +0000
@@ -33,9 +33,13 @@
 int32_t opal_datatype_clone( const opal_datatype_t * src_type,
                              opal_datatype_t * dest_type )
 {
     int32_t desc_length = src_type->desc.used + 1;  /* +1 because of the fake OPAL_DATATYPE_END_LOOP entry */
-    dt_elem_desc_t* temp = dest_type->desc.desc; /* temporary copy of the desc pointer */
+    dt_elem_desc_t* temp = dest_type->desc.desc;/* temporary copy of the desc pointer */
 
-    memcpy( dest_type, src_type, sizeof(opal_datatype_t) );
+    /* copy _excluding_ the super object, we want to keep the cls_destruct_array */
+    memcpy( dest_type+sizeof(opal_object_t),
+            src_type+sizeof(opal_object_t),
+            sizeof(opal_datatype_t)-sizeof(opal_object_t) )
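For context, the case the commit message describes (freeing a duplicate
of a predefined datatype) can be triggered by a program as small as the
following hypothetical sketch; this is not the test case from the
ticket:

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Datatype dup;

        MPI_Init(&argc, &argv);
        MPI_Type_dup(MPI_INT, &dup);   /* duplicate a predefined type */
        MPI_Type_free(&dup);           /* freeing the dup must not touch MPI_INT */
        MPI_Finalize();
        return 0;
    }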

Re: [OMPI devel] XML request

2009-09-09 Thread Greg Watson

Hi Ralph,

Looks good so far. The way I want to use this is to use /dev/tty as
the xml-file and send any other stdout or stderr to /dev/null. I could
use something like 'mpirun -xml-file /dev/tty  >/dev/null 2>&1', but
the syntax is shell specific, which causes a problem for the ssh exec
service. I noticed that mpirun has a -output-filename option, but when
I try -output-filename /dev/null, I get:


[Jarrah.local:01581] opal_os_dirpath_create: Error: Unable to create  
directory (/dev), unable to set the correct mode [-1]
[Jarrah.local:01581] [[22927,0],0] ORTE_ERROR_LOG: Error in file  
ess_hnp_module.c at line 406


Also, I'm not sure if -output-filename redirects both stdout and  
stderr, or just stdout.


Any suggestions would be appreciated.

Thanks,
Greg


On Sep 2, 2009, at 2:04 PM, Ralph Castain wrote:

Okay Greg - give r21930 a whirl. It takes a new cmd line arg -xml-file
foo as discussed below.


You can also specify it as an MCA param: -mca orte_xml_file foo, or  
OMPI_MCA_orte_xml_file=foo


Let me know how it works
Ralph

On Aug 31, 2009, at 7:26 PM, Greg Watson wrote:


Hey Ralph,

Unfortunately I don't think this is going to work for us. Most of  
the time we're starting the mpirun command using the ssh exec or  
shell service, neither of which provide any mechanism for reading  
from file descriptors other than 1 or 2. The only alternatives I  
see are:


1. Provide a separate command that starts mpirun at the end of a
pipe that is connected to the fd passed using the -xml-fd argument.
This command would need to be part of the OMPI distribution,
because the whole purpose of the XML was to provide an out-of-the-box
experience when using PTP with OMPI.


2. Implement an -xml-file option, but I could write the code for you.

3. Go back to limiting XML output to the map only.

None of these are particularly ideal. If you can think of anything  
else, let me know.


Regards,
Greg

On Aug 30, 2009, at 10:36 AM, Ralph Castain wrote:

What if we instead offered a -xml-fd N option? I would rather not  
create a file myself. However, since you are calling mpirun  
yourself, this would allow you to create a pipe on your end, and  
then pass us the write end of the pipe. We would then send all XML  
output down that pipe.


Jeff and I chatted about this and felt this might represent the  
cleanest solution. Sound okay?



On Aug 28, 2009, at 6:33 AM, Greg Watson wrote:


Ralph,

Would this be doable? If we could guarantee that the only output  
that went to the file was XML then that would solve the problem.


Greg

On Aug 28, 2009, at 5:39 AM, Ashley Pittman wrote:


On Thu, 2009-08-27 at 23:46 -0400, Greg Watson wrote:
I didn't realize it would be such a problem. Unfortunately there is
simply no way to reliably parse this kind of output, because it is
impossible to know what the error messages are going to be, and
presumably they could include XML-like formatting as well. The whole
point of the XML was to try and simplify the parsing of the mpirun
output, but it now looks like it's actually more difficult.


I thought this might be difficult when I saw you were attempting it.

Let me tell you about what Valgrind does because they have similar
problems.  Initially they just added an --xml=yes option which put most
of the valgrind (as distinct from application) output in xml tags.  This
works for simple cases and if you mix it with --log-file=<file> it
keeps the valgrind output separate from the application output.

Unfortunately there are lots of places throughout the code where
developers have inserted print statements (in the valgrind case these
all go to the logfile) which means the xml is interspersed with non-xml
output and hence impossible to parse reliably.

What they have now done in the current release is to add an extra
--xml-file=<file> option as well as the --log-file=<file> option.  Now
in the simple case all output from a normal run goes well formatted to
the xml file and the log file remains empty, any tool that wraps around
valgrind can parse the xml which is guaranteed to be well formatted and
it can detect the presence of other messages by looking for output in
the standard log file.  The onus is then on tool writers to look at the
remaining cases and decide if they are common or important enough to
wrap in xml and propose a patch or removal of the non-formatted message
entirely.

The above seems to work well, having a separate log file for xml is a
huge step forward as it means whilst the xml isn't necessarily complete
you can both parse it and are able to tell when it's missing something.

Of course when looking at this level of tool integration it's better to
use sockets rather than files (e.g. --xml-socket=localhost:1234 rather than
--xml-file=/tmp/app_.xml) but I'll leave that up to you.
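For reference, the scheme described above looks like this on the
command line (valgrind 3.5 option syntax; %p expands to the process id):

    valgrind --xml=yes --xml-file=app_%p.xml --log-file=app_%p.log ./app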

I hope this gives you something to think over.

Ashley,

--

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk

Re: [OMPI devel] application hangs with multiple dup

2009-09-09 Thread Ashley Pittman
On Tue, 2009-09-08 at 15:00 +0200, Thomas Ropars wrote:
> Hi,
> 
> I'm working on r21949 of the trunk.
> 
> When I run this simple program, which calls MPI_Comm_dup twice, on a
> single node with 4 processes, the processes hang from time to time in
> the 2nd dup.
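Thomas's program is not included in the digest; a minimal version
matching his description (two consecutive MPI_Comm_dup calls, run with
4 processes) might look like this sketch:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Comm dup1, dup2;
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Comm_dup(MPI_COMM_WORLD, &dup1);
        MPI_Comm_dup(MPI_COMM_WORLD, &dup2);   /* the hang is reported here */

        printf("rank %d: both dups completed\n", rank);

        MPI_Comm_free(&dup2);
        MPI_Comm_free(&dup1);
        MPI_Finalize();
        return 0;
    }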

I can't reproduce this, how often does it fail?  I've run it in a loop
hundreds of times here and not had one hang.

Off-topic I know but this is exactly the type of problem that padb is
designed to help with, if you could get it to hang and then run "padb
-axt" in another window on the same node and send along the output I'm
sure it would be of help.

Ashley,

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk



Re: [OMPI devel] application hangs with multiple dup

2009-09-09 Thread Thomas Ropars

Ashley Pittman wrote:

On Tue, 2009-09-08 at 15:00 +0200, Thomas Ropars wrote:

Hi,

I'm working on r21949 of the trunk.

When I run this simple program, which calls MPI_Comm_dup twice, on a
single node with 4 processes, the processes hang from time to time in
the 2nd dup.

I can't reproduce this, how often does it fail?  I've run it in a loop
hundreds of times here and not had one hang.

It happens once every 4 or 5 runs. And it also happens if the processes
are on different nodes.


Here is the output I get from padb -axt:

main() at ?:?
  PMPI_Comm_dup() at pcomm_dup.c:62
    ompi_comm_dup() at communicator/comm.c:661
      -----------------
      [0,2] (2 processes)
      -----------------
      ompi_comm_nextcid() at communicator/comm_cid.c:264
        ompi_comm_allreduce_intra() at communicator/comm_cid.c:619
          ompi_coll_tuned_allreduce_intra_dec_fixed() at coll_tuned_decision_fixed.c:61
            ompi_coll_tuned_allreduce_intra_recursivedoubling() at coll_tuned_allreduce.c:223
              ompi_request_default_wait_all() at request/req_wait.c:262
                opal_condition_wait() at ../opal/threads/condition.h:99
      -----------------
      [1,3] (2 processes)
      -----------------
      ompi_comm_nextcid() at communicator/comm_cid.c:245
        ompi_comm_allreduce_intra() at communicator/comm_cid.c:619
          ompi_coll_tuned_allreduce_intra_dec_fixed() at coll_tuned_decision_fixed.c:61
            ompi_coll_tuned_allreduce_intra_recursivedoubling() at coll_tuned_allreduce.c:223
              ompi_request_default_wait_all() at request/req_wait.c:262
                opal_condition_wait() at ../opal/threads/condition.h:99

Thomas

Off-topic I know but this is exactly the type of problem that padb is
designed to help with, if you could get it to hang and then run "padb
-axt" in another window on the same node and send along the output I'm
sure it would be of help.

Ashley,

  




Re: [OMPI devel] XML request

2009-09-09 Thread Ralph Castain
Hmmm... I never considered the possibility of output-filename being
used that way. Interesting idea!


I can fix that one, I think - let me see what I can do.

BTW: output-filename redirects stdout, stderr, and stddiag. So you  
would get rid of everything that doesn't come through the xml path.



On Sep 9, 2009, at 7:54 AM, Greg Watson wrote:


Hi Ralph,

Looks good so far. The way I want to use this is to use /dev/tty as
the xml-file and send any other stdout or stderr to /dev/null. I could
use something like 'mpirun -xml-file /dev/tty  >/dev/null 2>&1', but
the syntax is shell specific, which causes a problem for the ssh exec
service. I noticed that mpirun has a -output-filename option, but when
I try -output-filename /dev/null, I get:


[Jarrah.local:01581] opal_os_dirpath_create: Error: Unable to create  
directory (/dev), unable to set the correct mode [-1]
[Jarrah.local:01581] [[22927,0],0] ORTE_ERROR_LOG: Error in file  
ess_hnp_module.c at line 406


Also, I'm not sure if -output-filename redirects both stdout and  
stderr, or just stdout.


Any suggestions would be appreciated.

Thanks,
Greg



Re: [OMPI devel] application hangs with multiple dup

2009-09-09 Thread Ashley Pittman
On Wed, 2009-09-09 at 17:44 +0200, Thomas Ropars wrote:

Thank you.  I think you missed the top three lines of the output but
that doesn't matter.

> main() at ?:?
>   PMPI_Comm_dup() at pcomm_dup.c:62
>     ompi_comm_dup() at communicator/comm.c:661
>   -----------------
>   [0,2] (2 processes)
>   -----------------
>   ompi_comm_nextcid() at communicator/comm_cid.c:264
>     ompi_comm_allreduce_intra() at communicator/comm_cid.c:619
>       ompi_coll_tuned_allreduce_intra_dec_fixed() at coll_tuned_decision_fixed.c:61
>         ompi_coll_tuned_allreduce_intra_recursivedoubling() at coll_tuned_allreduce.c:223
>           ompi_request_default_wait_all() at request/req_wait.c:262
>             opal_condition_wait() at ../opal/threads/condition.h:99
>   -----------------
>   [1,3] (2 processes)
>   -----------------
>   ompi_comm_nextcid() at communicator/comm_cid.c:245
>     ompi_comm_allreduce_intra() at communicator/comm_cid.c:619
>       ompi_coll_tuned_allreduce_intra_dec_fixed() at coll_tuned_decision_fixed.c:61
>         ompi_coll_tuned_allreduce_intra_recursivedoubling() at coll_tuned_allreduce.c:223
>           ompi_request_default_wait_all() at request/req_wait.c:262
>             opal_condition_wait() at ../opal/threads/condition.h:99

Lines 264 and 245 of comm_cid.c are both in a for loop which calls
allreduce() twice per iteration until a certain condition is met.  As
such it's hard to tell from this trace whether processes [0,2] are
"ahead" or [1,3] are "behind".  Either way you look at it, however, the
allreduce() should not deadlock like that, so from the trace it's as
likely to be a bug in allreduce as in ompi_comm_nextcid().

I assume all four processes are actually in the same call to comm_dup;
re-compiling your program with -g and re-running padb would confirm
this, as it would show the line numbers.

Ashley,

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk



Re: [OMPI devel] Error message improvement

2009-09-09 Thread Eugene Loh

N.M. Maclaren wrote:



That's a good, practical solution.  A slight rider is that you shouldn't
be clever with it - such as using it in preprocessor statements.  I tried
some tests at one stage, and there were 'interesting' variations on how
different compilers interpreted C99.  Let alone the fact that it might
map to something else, with different rules.  If you need to play such
games, use hard-coded names.

Things may have stabilised since then, but I wouldn't bet on it.


Would it make sense for someone who understands this thread to update 
the devel FAQs?  E.g., one of:

https://svn.open-mpi.org/trac/ompi/wiki/CodingStyle
https://svn.open-mpi.org/trac/ompi/wiki/ErrorMessages


Re: [OMPI devel] Error message improvement

2009-09-09 Thread Ralph Castain

Bottom line is: just hardcode the function name.

It isn't that hard, and it avoids the confusion of sometimes getting
function names and sometimes getting file names... which is why you'll
find everything (at least, everything I have seen) hardcoded.



On Sep 9, 2009, at 12:45 PM, Eugene Loh wrote:



Would it make sense for someone who understands this thread to  
update the devel FAQs?  E.g., one of:

https://svn.open-mpi.org/trac/ompi/wiki/CodingStyle
https://svn.open-mpi.org/trac/ompi/wiki/ErrorMessages




Re: [OMPI devel] XML request

2009-09-09 Thread Ralph Castain
Well, I fixed it so that output-filename can take /dev/null as an  
argument. Unfortunately, that doesn't help this issue as it merrily  
redirects all stdout/err from the procs to /dev/null. :-/


Of course, that -is- what output-filename was supposed to do. It is a  
way of sending the output from the procs to rank-specific files, not a  
way of redirecting mpirun's stdout/err.


I guess I don't see any way to do what you want other than to do the  
output redirection at the shell level. We face the same problem  
regarding the shell-specific syntax when we do ssh, but we get around  
it by (a) sensing the local shell type, and then (b) adding the logic  
to create the proper shell command. You are welcome to look at our  
code as an example of how to do this in C:


orte/mca/plm/rsh/plm_rsh_module.c

I would think there are Java classes already setup to resolve that  
problem, though - seems a pretty basic issue.
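A minimal sketch of that approach, assuming the remote shell can be
identified from $SHELL (the real Open MPI logic in plm_rsh_module.c is
probe-driven and more involved; this is only an illustration):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
        const char *shell = getenv("SHELL");   /* e.g. /bin/bash or /bin/tcsh */
        const char *base  = "sh";
        char cmd[256];

        if (shell != NULL) {
            const char *slash = strrchr(shell, '/');
            base = (slash != NULL) ? slash + 1 : shell;
        }

        if (strcmp(base, "csh") == 0 || strcmp(base, "tcsh") == 0) {
            /* csh family has no 2>&1; ">&" merges stdout and stderr */
            snprintf(cmd, sizeof(cmd),
                     "mpirun -xml-file /dev/tty ./app >& /dev/null");
        } else {
            /* Bourne-family redirection syntax */
            snprintf(cmd, sizeof(cmd),
                     "mpirun -xml-file /dev/tty ./app >/dev/null 2>&1");
        }
        puts(cmd);
        return 0;
    }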


Sorry I can't be of more help...
Ralph

