Re: [OMPI devel] Error message improvement
__FUNCTION__ is not portable. __func__ is, but it needs a C99-compliant compiler.

--Nysal

On Tue, Sep 8, 2009 at 9:06 PM, Lenny Verkhovsky wrote:
> fixed in r21952
> thanks.
>
> On Tue, Sep 8, 2009 at 5:08 PM, Arthur Huillet wrote:
>> Lenny Verkhovsky wrote:
>>> Why not using __FUNCTION__ in all our error messages ???
>>
>> Sounds good, this way the function names are always correct.
>>
>> --
>> Greetings, A. Huillet
Re: [OMPI devel] Error message improvement
Hi All,

Is a C99-compliant compiler something unusual, or is there a policy among OMPI developers/users that prevents me from using __func__ instead of hardcoded strings in the code?

Thanks.
Lenny.

On Wed, Sep 9, 2009 at 1:48 PM, Nysal Jan wrote:
> __FUNCTION__ is not portable. __func__ is, but it needs a C99-compliant compiler.
>
> --Nysal
Re: [OMPI devel] Error message improvement
__func__ is what you should use. We take care of having it defined in _all_ cases. If the compiler doesn't support it, we define it manually (to __FUNCTION__, or to __FILE__ in the worst case), so it is always available (even if it doesn't contain what one might expect, such as in the case of __FILE__).

  george.

On Sep 9, 2009, at 14:16, Lenny Verkhovsky wrote:
> Hi All,
>
> Is a C99-compliant compiler something unusual, or is there a policy among OMPI developers/users that prevents me from using __func__ instead of hardcoded strings in the code?
>
> Thanks.
> Lenny.
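A minimal sketch of the kind of fallback George describes. The guard macro name here is illustrative only; OMPI's actual configure-generated symbol may differ:

```c
/* Sketch of a __func__ fallback; HAVE_DECL___FUNC__ stands in for
 * whatever the real configure test defines. */
#if !defined(HAVE_DECL___FUNC__)          /* compiler lacks C99 __func__ */
# if defined(__GNUC__)
#  define __func__ __FUNCTION__           /* GCC and friends */
# else
#  define __func__ __FILE__               /* worst case: file, not function */
# endif
#endif
```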
Re: [OMPI devel] Error message improvement
On Sep 9 2009, George Bosilca wrote:

> On Sep 9, 2009, at 14:16, Lenny Verkhovsky wrote:
>> Is a C99-compliant compiler something unusual, or is there a policy among OMPI developers/users that prevents me from using __func__ instead of hardcoded strings in the code?
>
> __func__ is what you should use. We take care of having it defined in _all_ cases. If the compiler doesn't support it, we define it manually (to __FUNCTION__, or to __FILE__ in the worst case), so it is always available (even if it doesn't contain what one might expect, such as in the case of __FILE__).

That's a good, practical solution. A slight rider is that you shouldn't be clever with it - such as using it in preprocessor statements. I tried some tests at one stage, and there were 'interesting' variations in how different compilers interpreted C99. Let alone the fact that it might map to something else, with different rules. If you need to play such games, use hard-coded names.

Things may have stabilised since then, but I wouldn't bet on it.

Regards,
Nick Maclaren.
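To make the caution above concrete, here is a small self-contained sketch of why __func__ cannot be used in preprocessor statements: in C99 it is a predefined identifier (effectively a static array of char), not a macro and not a string literal.

```c
#include <stdio.h>

/* __func__ behaves roughly as if each function contained:
 *     static const char __func__[] = "function-name";
 * The preprocessor knows nothing about it, and it cannot be pasted
 * with string literals. */

#if defined(__func__)   /* always false: __func__ is not a macro */
#error "never reached"
#endif

void report(void)
{
    /* printf("in " __func__ "\n");   string pasting: does not compile */
    printf("in %s\n", __func__);      /* ordinary expression use is portable */
}

int main(void) { report(); return 0; }
```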
Re: [OMPI devel] Error message improvement
fixed in r21956
__FUNCTION__ was replaced with __func__
thanks.
Lenny.

On Wed, Sep 9, 2009 at 2:59 PM, N.M. Maclaren wrote:
> That's a good, practical solution. A slight rider is that you shouldn't be clever with it - such as using it in preprocessor statements. I tried some tests at one stage, and there were 'interesting' variations in how different compilers interpreted C99. Let alone the fact that it might map to something else, with different rules. If you need to play such games, use hard-coded names.
>
> Things may have stabilised since then, but I wouldn't bet on it.
>
> Regards,
> Nick Maclaren.
[OMPI devel] fix 2014: Problems in romio
I have seen that ROMIO goes wrong with fix 2014: a lot of the ROMIO tests in ompi/mca/io/romio/romio/test/ are failing. For example, with noncontig_coll2:

[inti15:28259] *** Process received signal ***
[inti15:28259] Signal: Segmentation fault (11)
[inti15:28259] Signal code: Address not mapped (1)
[inti15:28259] Failing at address: (nil)
[inti15:28259] [ 0] /lib64/libpthread.so.0 [0x3f19c0e4c0]
[inti15:28259] [ 1] /home_nfs/devezep/ATLAS/openmpi-default/lib/openmpi/mca_btl_openib.so [0x2b6640c74d79]
[inti15:28259] [ 2] /home_nfs/devezep/ATLAS/openmpi-default/lib/openmpi/mca_rml_oob.so [0x2b663e2e6e92]
[inti15:28259] [ 3] /home_nfs/devezep/ATLAS/openmpi-default/lib/openmpi/mca_oob_tcp.so [0x2b663e4f8e63]
[inti15:28259] [ 4] /home_nfs/devezep/ATLAS/openmpi-default/lib/openmpi/mca_oob_tcp.so [0x2b663e4ff485]
[inti15:28259] [ 5] /home_nfs/devezep/ATLAS/openmpi-default/lib/libopen-pal.so.0(opal_event_loop+0x5df) [0x2b663d3d92ff]
[inti15:28259] [ 6] /home_nfs/devezep/ATLAS/openmpi-default/lib/libopen-pal.so.0(opal_progress+0x5e) [0x2b663d3ba33e]
[inti15:28259] [ 7] /home_nfs/devezep/ATLAS/openmpi-default/lib/libmpi.so.0 [0x2b663ce26624]
[inti15:28259] [ 8] /home_nfs/devezep/ATLAS/openmpi-default/lib/openmpi/mca_coll_tuned.so [0x2b664217fda2]
[inti15:28259] [ 9] /home_nfs/devezep/ATLAS/openmpi-default/lib/openmpi/mca_coll_tuned.so [0x2b6642179966]
[inti15:28259] [10] /home_nfs/devezep/ATLAS/openmpi-default/lib/libmpi.so.0(MPI_Alltoall+0x6f) [0x2b663ce352ef]
[inti15:28259] [11] /home_nfs/devezep/ATLAS/openmpi-default/lib/openmpi/mca_io_romio.so(ADIOI_Calc_others_req+0x65) [0x2aaab1cfc525]
[inti15:28259] [12] /home_nfs/devezep/ATLAS/openmpi-default/lib/openmpi/mca_io_romio.so(ADIOI_GEN_WriteStridedColl+0x433) [0x2aaab1cf0ac3]
[inti15:28259] [13] /home_nfs/devezep/ATLAS/openmpi-default/lib/openmpi/mca_io_romio.so(MPIOI_File_write_all+0xc0) [0x2aaab1d0a8f0]
[inti15:28259] [14] /home_nfs/devezep/ATLAS/openmpi-default/lib/openmpi/mca_io_romio.so(mca_io_romio_dist_MPI_File_write_all+0x23) [0x2aaab1d0a823]
[inti15:28259] [15] /home_nfs/devezep/ATLAS/openmpi-default/lib/openmpi/mca_io_romio.so [0x2aaab1cedce9]
[inti15:28259] [16] /home_nfs/devezep/ATLAS/openmpi-default/lib/libmpi.so.0(MPI_File_write_all+0x4e) [0x2b663ce64f9e]
[inti15:28259] [17] ./noncontig_coll2(test_file+0x32b) [0x4034bb]
[inti15:28259] [18] ./noncontig_coll2(main+0x58b) [0x402d03]
[inti15:28259] [19] /lib64/libc.so.6(__libc_start_main+0xf4) [0x3f1901d974]
[inti15:28259] [20] ./noncontig_coll2 [0x4026c9]
[inti15:28259] *** End of error message ***

All the ROMIO tests pass without this fix. Is there a problem in ROMIO with the datatype interface?

Pascal

Here is the export of the corresponding patch:

hg export 16301

# HG changeset patch
# User rusraink
# Date 1251912841 0
# Node ID eefd4bd4551969dc7454e63c2f42871cc9376a8f
# Parent 8aab76743e58474f1341be6f9d0ac9ae338507f1
- This fixes #2014: As noted in http://www.open-mpi.org/community/lists/devel/2009/08/6741.php, we do not correctly free a dupped predefined datatype. The fix is a bit more involving. See ticket for details.
  Tested with ibm tests and mpi_test_suite (though there's two "old" failures zero5.c and zero6.c)
  Thanks to Lisandro Dalcin for bringing this up.
diff -r 8aab76743e58 -r eefd4bd45519 ompi/datatype/ompi_datatype.h
--- a/ompi/datatype/ompi_datatype.h	Wed Sep 02 11:23:54 2009 +0000
+++ b/ompi/datatype/ompi_datatype.h	Wed Sep 02 17:34:01 2009 +0000
@@ -202,11 +202,14 @@
     }
     opal_datatype_clone ( &oldType->super, &new_ompi_datatype->super);
+    new_ompi_datatype->super.flags &= (~OMPI_DATATYPE_FLAG_PREDEFINED);
+
     /* Set the keyhash to NULL -- copying attributes is *only* done at
        the top level (specifically, MPI_TYPE_DUP). */
     new_ompi_datatype->d_keyhash = NULL;
     new_ompi_datatype->args = NULL;
-    strncpy (new_ompi_datatype->name, oldType->name, MPI_MAX_OBJECT_NAME);
+    snprintf (new_ompi_datatype->name, MPI_MAX_OBJECT_NAME, "Dup %s",
+              oldType->name);

     return OMPI_SUCCESS;
 }

diff -r 8aab76743e58 -r eefd4bd45519 opal/datatype/opal_datatype_clone.c
--- a/opal/datatype/opal_datatype_clone.c	Wed Sep 02 11:23:54 2009 +0000
+++ b/opal/datatype/opal_datatype_clone.c	Wed Sep 02 17:34:01 2009 +0000
@@ -33,9 +33,13 @@
 int32_t opal_datatype_clone( const opal_datatype_t * src_type, opal_datatype_t * dest_type )
 {
     int32_t desc_length = src_type->desc.used + 1; /* +1 because of the fake OPAL_DATATYPE_END_LOOP entry */
-    dt_elem_desc_t* temp = dest_type->desc.desc; /* temporary copy of the desc pointer */
+    dt_elem_desc_t* temp = dest_type->desc.desc;    /* temporary copy of the desc pointer */

-    memcpy( dest_type, src_type, sizeof(opal_datatype_t) );
+    /* copy _excluding_ the super object, we want to keep the cls_destruct_array */
+    memcpy( dest_type+sizeof(opal_object_t),
+            src_type+sizeof(opal_object_t),
+            sizeof(opal_datatype_t)-sizeof(opal_object_t) )
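One thing that stands out when reading the quoted patch (an observation worth checking, not a confirmed diagnosis of the ROMIO failures): the new memcpy() applies what look like byte offsets to typed pointers. Since dest_type and src_type are opal_datatype_t*, "dest_type + sizeof(opal_object_t)" advances by that many *elements*, each sizeof(opal_datatype_t) bytes, landing far outside the object. A byte-accurate form would cast through char* first; a hypothetical corrected version of the memcpy:

```c
/* hedged sketch: byte-wise offsets for the memcpy in the patch above */
memcpy( (char *)dest_type + sizeof(opal_object_t),
        (const char *)src_type + sizeof(opal_object_t),
        sizeof(opal_datatype_t) - sizeof(opal_object_t) );
```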
Re: [OMPI devel] XML request
Hi Ralph,

Looks good so far. The way I want to use this is to use /dev/tty as the xml-file and send any other stdout or stderr to /dev/null. I could use something like 'mpirun -xml-file /dev/tty >/dev/null 2>&1', but the syntax is shell specific, which causes a problem for the ssh exec service.

I noticed that mpirun has a -output-filename option, but when I try -output-filename /dev/null, I get:

[Jarrah.local:01581] opal_os_dirpath_create: Error: Unable to create directory (/dev), unable to set the correct mode [-1]
[Jarrah.local:01581] [[22927,0],0] ORTE_ERROR_LOG: Error in file ess_hnp_module.c at line 406

Also, I'm not sure if -output-filename redirects both stdout and stderr, or just stdout.

Any suggestions would be appreciated.

Thanks,
Greg

On Sep 2, 2009, at 2:04 PM, Ralph Castain wrote:

> Okay Greg - give r21930 a whirl. It takes a new cmd line arg -xml-file foo as discussed below. You can also specify it as an MCA param: -mca orte_xml_file foo, or OMPI_MCA_orte_xml_file=foo
>
> Let me know how it works
> Ralph

On Aug 31, 2009, at 7:26 PM, Greg Watson wrote:

> Hey Ralph,
>
> Unfortunately I don't think this is going to work for us. Most of the time we're starting the mpirun command using the ssh exec or shell service, neither of which provides any mechanism for reading from file descriptors other than 1 or 2. The only alternatives I see are:
>
> 1. Provide a separate command that starts mpirun at the end of a pipe that is connected to the fd passed using the -xml-fd argument. This command would need to be part of the OMPI distribution, because the whole purpose of the XML was to provide an out-of-the-box experience when using PTP with OMPI.
> 2. Implement an -xml-file option, but I could write the code for you.
> 3. Go back to limiting XML output to the map only.
>
> None of these are particularly ideal. If you can think of anything else, let me know.
>
> Regards,
> Greg

On Aug 30, 2009, at 10:36 AM, Ralph Castain wrote:

> What if we instead offered a -xml-fd N option? I would rather not create a file myself. However, since you are calling mpirun yourself, this would allow you to create a pipe on your end, and then pass us the write end of the pipe. We would then send all XML output down that pipe.
>
> Jeff and I chatted about this and felt this might represent the cleanest solution. Sound okay?

On Aug 28, 2009, at 6:33 AM, Greg Watson wrote:

> Ralph,
>
> Would this be doable? If we could guarantee that the only output that went to the file was XML then that would solve the problem.
>
> Greg

On Aug 28, 2009, at 5:39 AM, Ashley Pittman wrote:

> On Thu, 2009-08-27 at 23:46 -0400, Greg Watson wrote:
>> I didn't realize it would be such a problem. Unfortunately there is simply no way to reliably parse this kind of output, because it is impossible to know what the error messages are going to be, and presumably they could include XML-like formatting as well. The whole point of the XML was to try and simplify the parsing of the mpirun output, but it now looks like it's actually more difficult.
>
> I thought this might be difficult when I saw you were attempting it. Let me tell you about what Valgrind does, because they have similar problems.
>
> Initially they just added an --xml=yes option which put most of the valgrind (as distinct from application) output in xml tags. This works for simple cases, and if you mix it with --log-file= it keeps the valgrind output separate from the application output.
>
> Unfortunately there are lots of places throughout the code where developers have inserted print statements (in the valgrind case these all go to the logfile), which means the xml is interspersed with non-xml output and hence impossible to parse reliably.
>
> What they have now done in the current release is to add an extra --xml-file= option as well as the --log-file= option. Now in the simple case all output from a normal run goes well formatted to the xml file and the log file remains empty; any tool that wraps around valgrind can parse the xml, which is guaranteed to be well formatted, and it can detect the presence of other messages by looking for output in the standard log file. The onus is then on tool writers to look at the remaining cases and decide if they are common or important enough to wrap in xml and propose a patch or removal of the non-formatted message entirely.
>
> The above seems to work well; having a separate log file for xml is a huge step forward, as it means that whilst the xml isn't necessarily complete you can both parse it and are able to tell when it's missing something.
>
> Of course when looking at this level of tool integration it's better to use sockets than files (e.g. --xml-socket=localhost:1234 rather than --xml-file=/tmp/app_.xml) but I'll leave that up to you.
>
> I hope this gives you something to think over.
>
> Ashley,
>
> --
> Ashley Pittman, Bath, UK.
> Padb - A parallel job inspection tool for cluster computing
> http://padb.pittman.org.uk
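For reference, the valgrind usage Ashley describes looks roughly like this (flag spellings as in valgrind releases of that era; the file names are placeholders):

    valgrind --xml=yes --xml-file=run.xml --log-file=run.log ./my_app

A wrapper tool then parses run.xml, which should always be well formed, and treats any output appearing in run.log as a signal that unwrapped diagnostic messages were produced.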
Re: [OMPI devel] application hangs with multiple dup
On Tue, 2009-09-08 at 15:00 +0200, Thomas Ropars wrote:
> Hi,
>
> I'm working on r21949 of the trunk.
>
> When I run on a single node with 4 processes this simple program calling 2 times MPI_Comm_dup, the processes hang from time to time in the 2nd dup.

I can't reproduce this, how often does it fail? I've run it in a loop hundreds of times here and not had one hang.

Off-topic I know, but this is exactly the type of problem that padb is designed to help with. If you could get it to hang and then run "padb -axt" in another window on the same node and send along the output, I'm sure it would be of help.

Ashley,

--
Ashley Pittman, Bath, UK.
Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk
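The program itself was not included in this excerpt; a minimal reproducer matching the description (two consecutive MPI_Comm_dup calls, run with mpirun -np 4) would look something like this sketch:

```c
#include <mpi.h>

/* Minimal sketch of the reported scenario: two MPI_Comm_dup calls in a
 * row; the hang is reported to occur intermittently in the second dup. */
int main(int argc, char **argv)
{
    MPI_Comm dup1, dup2;
    MPI_Init(&argc, &argv);
    MPI_Comm_dup(MPI_COMM_WORLD, &dup1);   /* first dup */
    MPI_Comm_dup(MPI_COMM_WORLD, &dup2);   /* second dup: reported hang site */
    MPI_Comm_free(&dup2);
    MPI_Comm_free(&dup1);
    MPI_Finalize();
    return 0;
}
```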
Re: [OMPI devel] application hangs with multiple dup
Ashley Pittman wrote:
> On Tue, 2009-09-08 at 15:00 +0200, Thomas Ropars wrote:
>> Hi,
>>
>> I'm working on r21949 of the trunk.
>>
>> When I run on a single node with 4 processes this simple program calling 2 times MPI_Comm_dup, the processes hang from time to time in the 2nd dup.
>
> I can't reproduce this, how often does it fail? I've run it in a loop hundreds of times here and not had one hang.

It happens once every 4 or 5 runs. And it also happens if the processes are on different nodes.

Here is the output I get from padb -axt:

main() at ?:?
PMPI_Comm_dup() at pcomm_dup.c:62
ompi_comm_dup() at communicator/comm.c:661
-----------------
[0,2] (2 processes)
-----------------
ompi_comm_nextcid() at communicator/comm_cid.c:264
ompi_comm_allreduce_intra() at communicator/comm_cid.c:619
ompi_coll_tuned_allreduce_intra_dec_fixed() at coll_tuned_decision_fixed.c:61
ompi_coll_tuned_allreduce_intra_recursivedoubling() at coll_tuned_allreduce.c:223
ompi_request_default_wait_all() at request/req_wait.c:262
opal_condition_wait() at ../opal/threads/condition.h:99
-----------------
[1,3] (2 processes)
-----------------
ompi_comm_nextcid() at communicator/comm_cid.c:245
ompi_comm_allreduce_intra() at communicator/comm_cid.c:619
ompi_coll_tuned_allreduce_intra_dec_fixed() at coll_tuned_decision_fixed.c:61
ompi_coll_tuned_allreduce_intra_recursivedoubling() at coll_tuned_allreduce.c:223
ompi_request_default_wait_all() at request/req_wait.c:262
opal_condition_wait() at ../opal/threads/condition.h:99

Thomas

> Off-topic I know, but this is exactly the type of problem that padb is designed to help with. If you could get it to hang and then run "padb -axt" in another window on the same node and send along the output, I'm sure it would be of help.
>
> Ashley,
Re: [OMPI devel] XML request
Hmmm... I never considered the possibility of output-filename being used that way. Interesting idea! I can fix that one, I think - let me see what I can do.

BTW: output-filename redirects stdout, stderr, and stddiag. So you would get rid of everything that doesn't come through the xml path.

On Sep 9, 2009, at 7:54 AM, Greg Watson wrote:

> Hi Ralph,
>
> Looks good so far. The way I want to use this is to use /dev/tty as the xml-file and send any other stdout or stderr to /dev/null. I could use something like 'mpirun -xml-file /dev/tty >/dev/null 2>&1', but the syntax is shell specific, which causes a problem for the ssh exec service.
>
> I noticed that mpirun has a -output-filename option, but when I try -output-filename /dev/null, I get:
>
> [Jarrah.local:01581] opal_os_dirpath_create: Error: Unable to create directory (/dev), unable to set the correct mode [-1]
> [Jarrah.local:01581] [[22927,0],0] ORTE_ERROR_LOG: Error in file ess_hnp_module.c at line 406
>
> Also, I'm not sure if -output-filename redirects both stdout and stderr, or just stdout.
>
> Any suggestions would be appreciated.
>
> Thanks,
> Greg
Re: [OMPI devel] application hangs with multiple dup
On Wed, 2009-09-09 at 17:44 +0200, Thomas Ropars wrote:

Thank you. I think you missed the top three lines of the output, but that doesn't matter.

> main() at ?:?
> PMPI_Comm_dup() at pcomm_dup.c:62
> ompi_comm_dup() at communicator/comm.c:661
> -----------------
> [0,2] (2 processes)
> -----------------
> ompi_comm_nextcid() at communicator/comm_cid.c:264
> ompi_comm_allreduce_intra() at communicator/comm_cid.c:619
> ompi_coll_tuned_allreduce_intra_dec_fixed() at coll_tuned_decision_fixed.c:61
> ompi_coll_tuned_allreduce_intra_recursivedoubling() at coll_tuned_allreduce.c:223
> ompi_request_default_wait_all() at request/req_wait.c:262
> opal_condition_wait() at ../opal/threads/condition.h:99
> -----------------
> [1,3] (2 processes)
> -----------------
> ompi_comm_nextcid() at communicator/comm_cid.c:245
> ompi_comm_allreduce_intra() at communicator/comm_cid.c:619
> ompi_coll_tuned_allreduce_intra_dec_fixed() at coll_tuned_decision_fixed.c:61
> ompi_coll_tuned_allreduce_intra_recursivedoubling() at coll_tuned_allreduce.c:223
> ompi_request_default_wait_all() at request/req_wait.c:262
> opal_condition_wait() at ../opal/threads/condition.h:99

Lines 264 and 245 of comm_cid.c are both in a for loop which calls allreduce() twice per iteration until a certain condition is met. As such, it's hard to tell from this trace whether processes [0,2] are "ahead" or [1,3] are "behind". Either way you look at it, however, the allreduce() should not deadlock like that, so from the trace it's as likely to be a bug in allreduce as it is in ompi_comm_nextcid().

I assume all four processes are actually in the same call to comm_dup; re-compiling your program with -g and re-running padb would confirm this, as it would show the line numbers.

Ashley,

--
Ashley Pittman, Bath, UK.
Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk
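To make the "ahead"/"behind" ambiguity concrete, here is a self-contained illustration of the pattern described, not the actual comm_cid.c code: with two collectives per loop iteration, a trace that catches the ranks split across the two call sites cannot by itself say which group is an iteration ahead.

```c
#include <mpi.h>

/* Illustration only. Two collectives per iteration: a snapshot showing
 * some ranks in the first MPI_Allreduce and others in the second cannot
 * distinguish "ranks still finishing iteration i" from "ranks already
 * starting iteration i+1". */
int main(int argc, char **argv)
{
    int agreed = 0;
    MPI_Init(&argc, &argv);
    for (int candidate = 1; !agreed; ++candidate) {
        int avail = 1, confirm;   /* pretend 'candidate' is locally free */
        /* first collective: does everyone consider this value available? */
        MPI_Allreduce(&avail, &agreed, 1, MPI_INT, MPI_MIN, MPI_COMM_WORLD);
        /* second collective: confirm the agreed value */
        MPI_Allreduce(&candidate, &confirm, 1, MPI_INT, MPI_MAX, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}
```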
Re: [OMPI devel] Error message improvement
N.M. Maclaren wrote:
> That's a good, practical solution. A slight rider is that you shouldn't be clever with it - such as using it in preprocessor statements. I tried some tests at one stage, and there were 'interesting' variations in how different compilers interpreted C99. Let alone the fact that it might map to something else, with different rules. If you need to play such games, use hard-coded names.
>
> Things may have stabilised since then, but I wouldn't bet on it.

Would it make sense for someone who understands this thread to update the devel FAQs? E.g., one of:

https://svn.open-mpi.org/trac/ompi/wiki/CodingStyle
https://svn.open-mpi.org/trac/ompi/wiki/ErrorMessages
Re: [OMPI devel] Error message improvement
Bottom line is: just hardcode the function name. It isn't that hard, and it avoids the confusion of sometimes getting function names and sometimes getting file names... which is why you'll find everything (at least, that I have seen) hardcoded.

On Sep 9, 2009, at 12:45 PM, Eugene Loh wrote:

> Would it make sense for someone who understands this thread to update the devel FAQs? E.g., one of:
>
> https://svn.open-mpi.org/trac/ompi/wiki/CodingStyle
> https://svn.open-mpi.org/trac/ompi/wiki/ErrorMessages
Re: [OMPI devel] XML request
Well, I fixed it so that output-filename can take /dev/null as an argument. Unfortunately, that doesn't help this issue, as it merrily redirects all stdout/err from the procs to /dev/null. :-/ Of course, that -is- what output-filename was supposed to do. It is a way of sending the output from the procs to rank-specific files, not a way of redirecting mpirun's stdout/err.

I guess I don't see any way to do what you want other than to do the output redirection at the shell level. We face the same problem regarding the shell-specific syntax when we do ssh, but we get around it by (a) sensing the local shell type, and then (b) adding the logic to create the proper shell command. You are welcome to look at our code as an example of how to do this in C: orte/mca/plm/rsh/plm_rsh_module.c

I would think there are Java classes already set up to resolve that problem, though - it seems a pretty basic issue.

Sorry I can't be of more help...
Ralph

On Sep 9, 2009, at 10:17 AM, Ralph Castain wrote:

> Hmmm... I never considered the possibility of output-filename being used that way. Interesting idea! I can fix that one, I think - let me see what I can do.
>
> BTW: output-filename redirects stdout, stderr, and stddiag. So you would get rid of everything that doesn't come through the xml path.
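A rough sketch of the approach Ralph describes - illustrative only; the real logic lives in plm_rsh_module.c and actually detects the remote shell rather than assuming it. The idea is to choose the redirection syntax by shell family before handing the command string to ssh:

```c
#include <stdio.h>
#include <string.h>

/* Sketch: build a remote command whose stdout/stderr redirection uses
 * the syntax of the detected shell family (csh has no "2>&1"). */
int main(void)
{
    const char *remote_shell = "bash";   /* assume detection happened earlier */
    char cmd[512];

    if (strcmp(remote_shell, "csh") == 0 || strcmp(remote_shell, "tcsh") == 0) {
        /* csh family: ">&" sends both stdout and stderr */
        snprintf(cmd, sizeof(cmd),
                 "mpirun -xml-file /dev/tty -np 4 ./a.out >& /dev/null");
    } else {
        /* Bourne family (sh/bash/ksh/zsh) */
        snprintf(cmd, sizeof(cmd),
                 "mpirun -xml-file /dev/tty -np 4 ./a.out >/dev/null 2>&1");
    }
    printf("ssh <host> '%s'\n", cmd);    /* would be passed to ssh/exec */
    return 0;
}
```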