Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r17983
Gotcha.

Should this stuff go in ompi/config/ompi_microsoft.m4? (I don't really care; I just already see a Microsoft file, so I figured I'd ask the question.)

On Mar 26, 2008, at 9:54 PM, George Bosilca wrote:

Interix (or SUA, or SFU) is the POSIX layer integrated with the latest versions of Windows (such as Vista and Server 2003). It provides fork, rsh, and basically most of the tools we need.

  george.

Jeff Squyres wrote:

What's Interix?

On Mar 26, 2008, at 7:20 PM, bosi...@osl.iu.edu wrote:

Author: bosilca
Date: 2008-03-26 19:20:33 EDT (Wed, 26 Mar 2008)
New Revision: 17983
URL: https://svn.open-mpi.org/trac/ompi/changeset/17983

Log:
Add support for Interix.

Added:
   trunk/config/ompi_interix.m4 (contents, props changed)
Text files modified:
   trunk/acinclude.m4 | 1 +
   trunk/configure.ac | 3 +++
   2 files changed, 4 insertions(+), 0 deletions(-)

Modified: trunk/acinclude.m4
==============================================================================
--- trunk/acinclude.m4 (original)
+++ trunk/acinclude.m4 2008-03-26 19:20:33 EDT (Wed, 26 Mar 2008)
@@ -108,6 +108,7 @@
 # Include the macros for Windows checking
 #
 m4_include(config/ompi_microsoft.m4)
+m4_include(config/ompi_interix.m4)

 #
 # The config/mca_no_configure_components.m4 file is generated by

Added: trunk/config/ompi_interix.m4
==============================================================================
--- (empty file)
+++ trunk/config/ompi_interix.m4 2008-03-26 19:20:33 EDT (Wed, 26 Mar 2008)
@@ -0,0 +1,56 @@
+dnl -*- shell-script -*-
+dnl
+dnl Copyright (c) 2008 The University of Tennessee and The University
+dnl of Tennessee Research Foundation. All rights
+dnl reserved.
+dnl $COPYRIGHT$
+dnl
+dnl Additional copyrights may follow
+dnl
+dnl $HEADER$
+dnl
+
+ ##
+#
+# OMPI_INTERIX
+#
+# Detect if the environment is SUA/SFU (i.e. Interix) and modify
+# the compiling environment accordingly.
+#
+# USAGE:
+# OMPI_INTERIX()
+#
+ ##
+AC_DEFUN([OMPI_INTERIX],[
+
+AC_MSG_CHECKING(for Interix environment)
+AC_TRY_COMPILE([],
+ [#if !defined(__INTERIX)
+#error Normal Unix environment
+#endif],
+ is_interix=yes,
+ is_interix=no)
+AC_MSG_RESULT([$is_interix])
+if test "$is_interix" = "yes"; then
+
+ompi_show_subtitle "Interix detection"
+
+if ! test -d /usr/include/port; then
+AC_MSG_WARN([Compiling Open MPI under Interix require an up-to-date])
+AC_MSG_WARN([version of libport. Please ask your system administrator])
+AC_MSG_WARN([to install it (pkg_update -L libport).])
+AC_MSG_ERROR([*** Cannot continue])
+fi
+#
+# These are the minimum requirements for Interix ...
+#
+AC_MSG_WARN([-lport was added to the linking flags])
+LDFLAGS="-lport $LDFLAGS"
+AC_MSG_WARN([-D_ALL_SOURCE -D_USE_LIBPORT was added to the compilation flags])
+CFLAGS="-D_ALL_SOURCE -D_USE_LIBPORT -I/usr/include/port $CFLAGS"
+CPPFLAGS="-D_ALL_SOURCE -D_USE_LIBPORT -I/usr/include/port $CPPFLAGS"
+CXXFLAGS="-D_ALL_SOURCE -D_USE_LIBPORT -I/usr/include/port $CXXFLAGS"
+
+fi
+
+])

Modified: trunk/configure.ac
==============================================================================
--- trunk/configure.ac (original)
+++ trunk/configure.ac 2008-03-26 19:20:33 EDT (Wed, 26 Mar 2008)
@@ -192,6 +192,9 @@
 AM_CONDITIONAL(OMPI_NEED_WINDOWS_REPLACEMENTS,
 test "$ompi_cv_c_compiler_vendor" = "microsoft" )

+# Do all Interix detections if necessary
+OMPI_INTERIX
+
 # Does the compiler support "ident"-like constructs?
 OMPI_CHECK_IDENT([CC], [CFLAGS], [c], [C])

___
svn-full mailing list
svn-f...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/svn-full

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel

--
Jeff Squyres
Cisco Systems
Re: [OMPI devel] RMAPS rank_file component patch and modifications for review
A few more comments on top of what Tim / Ralph said:

- opal_paffinity MCA params should be defined and registered in the opal paffinity base (in the base open function so that ompi_info can still see them), not opal/runtime/opal_params.c.

- I don't have a problem with setting the paffinity slot list from ompi_mpi_init, but we should probably make the corresponding MCA parameter be an "mpi_*" name, because this is functionality that is being exported through the MPI layer. Additionally, the name "mpi_" will make more sense to users; they don't know anything about opal/orte -- "mpi_" resonates with running their MPI job.

- I don't think we can delete the MCA param ompi_paffinity_alone; it exists in the v1.2 series and has historical precedent.

- Note that symbols that are static don't have to abide by the prefix rule. I'm not saying you need to change anything -- you don't -- I just notice that you made some symbols both static and use the prefix rule. That's fine, but if you want to use shorter symbol names for static symbols, that's fine too.

On Mar 26, 2008, at 6:01 AM, Lenny Verkhovsky wrote:

> Hi, all
>
> Attached patch for modified Rank_File RMAPS component.
>
> 1. introduced new general purpose debug flags
>    mpi_debug
>    opal_debug
> 2. introduced new mca parameter opal_paffinity_slot_list
> 3. ompi_mpi_init cleaned from opal paffinity functions
> 4. opal paffinity functions moved to new file opal/mca/paffinity/base/paffinity_base_service.c
> 5. rank_file component files were renamed according to prefix policy
> 6. global variables renamed as well.
> 7. few bug fixes that were brought during previous discussions.
> 8. If user defines opal_paffinity_alone and rmaps_rank_file_path or opal_paffinity_slot_list, then he gets a Warning that only opal_paffinity_alone will be used.
>
> Best Regards,
> Lenny.

--
Jeff Squyres
Cisco Systems
Re: [OMPI devel] RMAPS rank_file component patch and modifications for review
Hi,

Thanks for the comments. I will definitely implement all of them and commit the code as soon as I have finished.

I am also experiencing a few problems with opal_verbose_output; either there is a bug or I am doing something wrong.

/home/USERS/lenny/OMPI_ORTE_DEBUG/bin/mpirun -mca mca_verbose 0 -mca paffinity_base_verbose 1 --byslot -np 2 -hostfile hostfile -mca btl_openib_max_lmc 1 -mca opal_paffinity_alone 1 -mca btl_openib_verbose 1 /home/USERS/lenny/TESTS/ORTE/mpi_p01_debug -t lt

/home/USERS/lenny/TESTS/ORTE/mpi_p01_debug: symbol lookup error: /home/USERS/lenny/OMPI_ORTE_DEBUG//lib/openmpi/mca_btl_openib.so: undefined symbol: mca_btl_base_out
/home/USERS/lenny/TESTS/ORTE/mpi_p01_debug: symbol lookup error: /home/USERS/lenny/OMPI_ORTE_DEBUG//lib/openmpi/mca_btl_openib.so: undefined symbol: mca_btl_base_out
--
mpirun has exited due to process rank 1 with PID 5896 on
node witch17 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).

On Wed, Mar 26, 2008 at 2:50 PM, Ralph H Castain wrote:

> I would tend to echo Tim's suggestions. I note that you do look up that opal mca param in orte as well. I know you sent me a note about that off-list - I apologize for not getting to it yet, but was swamped yesterday.
>
> I think the solution suggested in #1 below is the right approach. Looking up opal params in orte or ompi is probably not a good idea. We have had problems in the past where params were looked up in multiple places as people -do- sometimes change the names (ahem...).
>
> Also, I would suggest using the macro version of verbose, OPAL_OUTPUT_VERBOSE, so that it compiles out for non-debug builds - up to you. Many of us use it as we don't need the output from optimized builds.
>
> Other than that, I think this looks fine. I do truly appreciate the cleanup of ompi_mpi_init.
>
> Ralph
>
> On 3/26/08 6:09 AM, "Tim Prins" wrote:
>
> > Hi Lenny,
> >
> > This looks good. But I have a couple of suggestions (which others may disagree with):
> >
> > 1. You register an opal mca parameter, but look it up in ompi, then call an opal function with the result. What if you had a function opal_paffinity_base_set_slots(long rank) (or some other name, I don't care) which looked up the mca parameter and then set up the slots as you are doing if it is found? This would make things a bit cleaner IMHO.
> >
> > 2. The functions in the paffinity base should be prefixed with 'opal_paffinity_base_'.
> >
> > 3. Why was the ompi_debug_flag added? It is not used anywhere.
> >
> > 4. You probably do not need to add the opal debug flag. There is already a 'paffinity_base_verbose' flag which should suit your purposes fine. So you should just be able to replace all of the conditional output statements in paffinity with something like
> > opal_output_verbose(10, opal_paffinity_base_output, ...),
> > where 10 is the verbosity level number.
> >
> > Tim
Re: [OMPI devel] FreeBSD timer_base_open error?
Added as https://svn.open-mpi.org/trac/ompi/ticket/1261.

On Mar 26, 2008, at 11:07 AM, Brian W. Barrett wrote:

George -

Good catch -- that's going to cause a problem :). But I think we should add yet another check to also make sure that we're on Linux. So the three tests would be:

1) Am I on a platform that we have timer assembly support for? (That's the long list of architectures that we recently, and incorrectly, added.)
2) Am I on Linux? (We really only know how to parse /proc/cpuinfo on Linux.)
3) Is /proc/cpuinfo readable? (We have a couple of architectures that are reported by config.guess as Linux, but don't have /proc/cpuinfo.)

Make sense?

Brian

On Wed, 26 Mar 2008, George Bosilca wrote:

I was working off-list with Brad on this. Brian is right, the logic in configure.m4 is wrong. It overwrites timer_linux_happy with "yes" if the host matches "i?86-*|x86_64*|ia64-*|powerpc-*|powerpc64-*|sparc*-*". On FreeBSD the host is i386-unknown-freebsd6.2.

Here is a quick and dirty patch. I just moved the selection logic around a little, without any major modifications.

  george.

Index: configure.m4
===================================================================
--- configure.m4 (revision 17970)
+++ configure.m4 (working copy)
@@ -40,14 +40,12 @@
                   [timer_linux_happy="yes"],
                   [timer_linux_happy="no"])])

-AS_IF([test "$timer_linux_happy" = "yes"],
-      [AS_IF([test -r "/proc/cpuinfo"],
-             [timer_linux_happy="yes"],
-             [timer_linux_happy="no"])])
-
 case "${host}" in
 i?86-*|x86_64*|ia64-*|powerpc-*|powerpc64-*|sparc*-*)
-timer_linux_happy="yes"
+AS_IF([test "$timer_linux_happy" = "yes"],
+      [AS_IF([test -r "/proc/cpuinfo"],
+             [timer_linux_happy="yes"],
+             [timer_linux_happy="no"])])
 ;;
 *)
 timer_linux_happy="no"

On Mar 25, 2008, at 10:31 PM, Brian Barrett wrote:

On Mar 25, 2008, at 6:16 PM, Jeff Squyres wrote:

> "linux" is the name of the component. It looks like opal/mca/timer/linux/timer_linux_component.c is doing some checks during component open() and returning an error if it can't be used (e.g., if it's not on Linux).
>
> The timer components are a little different than normal MCA frameworks; they *must* be compiled in libopen-pal statically, and there will only be one of them built. In this case, I'm guessing that linux was built simply because nothing else was selected to be built, but then its component_open() function failed because it didn't find /proc/cpuinfo.

This is actually incorrect. The linux component looks for /proc/cpuinfo and builds if it finds that file. There's a base component that's built if nothing else is found. The configure logic for the linux component is probably not the right thing to do -- it should probably be modified to check both that the file is readable (there are systems that call themselves "linux" but don't have a /proc/cpuinfo) and that we're actually on Linux.

Brian

--
Brian Barrett

There is an art . . . to flying. The knack lies in learning how to
throw yourself at the ground and miss.
    Douglas Adams, 'The Hitchhikers Guide to the Galaxy'

--
Jeff Squyres
Cisco Systems
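The three tests Brian proposes can be modeled as a small predicate. What follows is only an illustrative sketch in C, not Open MPI code: the real check belongs in the component's configure.m4, and the function name timer_linux_usable, the "*linux*" host pattern, and the choice to pass the cpuinfo path as a parameter are all assumptions made for the example. The architecture patterns are the ones quoted in George's mail.

```c
#include <assert.h>
#include <fnmatch.h>
#include <stdbool.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper (not Open MPI code): returns true only when all
 * three proposed tests pass for the given host triple and cpuinfo path. */
bool timer_linux_usable(const char *host, const char *cpuinfo_path)
{
    /* Test 1: architectures with timer assembly support (list from the patch). */
    static const char *const arch_patterns[] = {
        "i?86-*", "x86_64*", "ia64-*", "powerpc-*", "powerpc64-*", "sparc*-*"
    };
    bool arch_ok = false;

    assert(host != NULL && cpuinfo_path != NULL);

    for (size_t i = 0; i < sizeof(arch_patterns) / sizeof(arch_patterns[0]); ++i) {
        if (fnmatch(arch_patterns[i], host, 0) == 0) {
            arch_ok = true;
            break;
        }
    }
    if (!arch_ok) {
        return false;
    }

    /* Test 2: the host triple must name Linux (assumed pattern). */
    if (fnmatch("*linux*", host, 0) != 0) {
        return false;
    }

    /* Test 3: the cpuinfo file must be readable by this process. */
    return access(cpuinfo_path, R_OK) == 0;
}
```

With this ordering, a host such as i386-unknown-freebsd6.2 passes the architecture test but is rejected by the Linux test, which is the case that triggered the original report.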
Re: [OMPI devel] RMAPS rank_file component patch and modifications for review
> -----Original Message-----
> From: Jeff Squyres [mailto:jsquy...@cisco.com]
> Sent: Thursday, March 27, 2008 1:38 PM
> To: Lenny Verkhovsky
> Cc: Ralph H Castain; Sharon Melamed; Open MPI Developers
> Subject: Re: RMAPS rank_file component patch and modifications for review
>
> A few more comments on top of what Tim / Ralph said:
>
> - opal_paffinity MCA params should be defined and registered in the opal paffinity base (in the base open function so that ompi_info can still see them), not opal/runtime/opal_params.c.

OK.

> - I don't have a problem with setting the paffinity slot list from ompi_mpi_init, but we should probably make the corresponding MCA parameter be an "mpi_*" name, because this is functionality that is being exported through the MPI layer. Additionally, the name "mpi_" will make more sense to users; they don't know anything about opal/orte -- "mpi_" resonates with running their MPI job.

I think it makes more sense in opal_paffinity_base, and ompi_mpi_init will look cleaner.

> - I don't think we can delete the MCA param ompi_paffinity_alone; it exists in the v1.2 series and has historical precedent.

It will not be deleted. It will just use the same infrastructure (the slot_list parameter and the opal_base functions), so the change will be transparent to the user. The user has three ways to set it up:

1. mca opal_paffinity_alone 1
   This sets paffinity as it did before.
2. mca opal_paffinity_slot_list "slot_list"
   Defines the slots that will be used for all ranks on all nodes.
3. mca rmaps_rank_file_path rankfile
   Assigns ranks to CPUs according to the file.

rmaps_rank_file_path can be used together with opal_paffinity_slot_list. In that case, all ranks that the rankfile leaves undefined are assigned by the opal_paffinity_slot_list MCA parameter.

> - Note that symbols that are static don't have to abide by the prefix rule. I'm not saying you need to change anything -- you don't -- I just notice that you made some symbols both static and use the prefix rule. That's fine, but if you want to use shorter symbol names for static symbols, that's fine too.
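The precedence Lenny describes among the three parameters can be sketched as a small decision function. This is a hypothetical illustration, not the code from the patch: the type and function names (binding_source, binding_request, choose_binding) are invented here, the warning rule follows item 8 of the original patch description, and returning "no binding" when nothing applies is an assumption the thread does not spell out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* All names below are hypothetical; they only model the rules in the mail. */
typedef enum {
    BIND_NONE,      /* nothing requested for this rank (assumed default) */
    BIND_ALONE,     /* opal_paffinity_alone: behaves as it did before */
    BIND_RANKFILE,  /* this rank is listed in the rankfile */
    BIND_SLOT_LIST  /* fall back to opal_paffinity_slot_list */
} binding_source;

typedef struct {
    bool paffinity_alone;   /* opal_paffinity_alone is set */
    bool rankfile_given;    /* rmaps_rank_file_path is set */
    bool rank_in_rankfile;  /* the rankfile has an entry for this rank */
    bool slot_list_given;   /* opal_paffinity_slot_list is set */
} binding_request;

/* Picks the binding source for one rank; *warn is set when
 * opal_paffinity_alone overrides a rankfile or slot list. */
binding_source choose_binding(const binding_request *req, bool *warn)
{
    assert(req != NULL && warn != NULL);

    *warn = false;
    if (req->paffinity_alone) {
        /* Item 8: only opal_paffinity_alone is used, and the user is warned. */
        *warn = req->rankfile_given || req->slot_list_given;
        return BIND_ALONE;
    }
    if (req->rankfile_given && req->rank_in_rankfile) {
        return BIND_RANKFILE;
    }
    if (req->slot_list_given) {
        /* Ranks the rankfile leaves undefined fall back to the slot list. */
        return BIND_SLOT_LIST;
    }
    return BIND_NONE;
}
```

Keeping the decision in one place like this is in the spirit of Tim's and Ralph's advice to look up the parameters in a single opal-level function rather than in several layers.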
Re: [OMPI devel] RMAPS rank_file component patch and modifications for review
Are you using BTL_OUTPUT or something else from btl_base_error.h?

On Mar 27, 2008, at 7:49 AM, Lenny Verkhovsky wrote:

> Hi,
>
> thanks for the comments. I will definitely implement all of them and commit the code as soon as I have finished.
>
> Also I experience a few problems with using opal_verbose_output; either there is a bug or I am doing something wrong.
>
> /home/USERS/lenny/OMPI_ORTE_DEBUG/bin/mpirun -mca mca_verbose 0 -mca paffinity_base_verbose 1 --byslot -np 2 -hostfile hostfile -mca btl_openib_max_lmc 1 -mca opal_paffinity_alone 1 -mca btl_openib_verbose 1 /home/USERS/lenny/TESTS/ORTE/mpi_p01_debug -t lt
>
> /home/USERS/lenny/TESTS/ORTE/mpi_p01_debug: symbol lookup error: /home/USERS/lenny/OMPI_ORTE_DEBUG//lib/openmpi/mca_btl_openib.so: undefined symbol: mca_btl_base_out
> /home/USERS/lenny/TESTS/ORTE/mpi_p01_debug: symbol lookup error: /home/USERS/lenny/OMPI_ORTE_DEBUG//lib/openmpi/mca_btl_openib.so: undefined symbol: mca_btl_base_out
> --
> mpirun has exited due to process rank 1 with PID 5896 on
> node witch17 exiting without calling "finalize". This may
> have caused other processes in the application to be
> terminated by signals sent by mpirun (as reported here).

--
Jeff Squyres
Cisco Systems
Re: [OMPI devel] trunk segfault
Lenny --

Did this get fixed? We were mucking with some mca param stuff on the trunk yesterday; not sure if it was related to this failure or not.

On Mar 26, 2008, at 10:34 AM, Lenny Verkhovsky wrote:

> Hi, all
>
> I compiled and built the source from trunk, and it causes a segfault:
>
> /home/USERS/lenny/OMPI_ORTE_NEW/bin/mpirun -np 1 -H witch17 /home/USERS/lenny/TESTS/ORTE/mpi_p01_NEW -t lt
>
> --
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or environment
> problems. This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
>
>   mca_mpi_register_params() failed
>   --> Returned "Error" (-1) instead of "Success" (0)
> --
> [witch17:01220] *** Process received signal ***
> [witch17:01220] Signal: Segmentation fault (11)
> [witch17:01220] Signal code: (128)
> [witch17:01220] Failing at address: (nil)
> [witch17:01220] [ 0] /lib64/libpthread.so.0 [0x2aadf7072c10]
> [witch17:01220] [ 1] /home/USERS/lenny/OMPI_ORTE_NEW/lib/libopen-pal.so.0(free+0x56) [0x2aadf6acb6d6]
> [witch17:01220] [ 2] /home/USERS/lenny/OMPI_ORTE_NEW/lib/libopen-pal.so.0(opal_argv_free+0x25) [0x2aadf6ab9635]
> [witch17:01220] [ 3] /home/USERS/lenny/OMPI_ORTE_NEW/lib/libmpi.so.0 [0x2aadf67f4206]
> [witch17:01220] [ 4] /home/USERS/lenny/OMPI_ORTE_NEW/lib/libmpi.so.0(MPI_Init+0xf0) [0x2aadf68117c0]
> [witch17:01220] [ 5] /home/USERS/lenny/TESTS/ORTE/mpi_p01_NEW(main+0xef) [0x40109f]
> [witch17:01220] [ 6] /lib64/libc.so.6(__libc_start_main+0xf4) [0x2aadf7199154]
> [witch17:01220] [ 7] /home/USERS/lenny/TESTS/ORTE/mpi_p01_NEW [0x400ee9]
> [witch17:01220] *** End of error message ***
> --
> mpirun noticed that process rank 0 with PID 1220 on node witch17 exited on signal 11 (Segmentation fault).

--
Jeff Squyres
Cisco Systems
Re: [OMPI devel] RMAPS rank_file component patch and modifications for review
No, I was just trying to see some printouts during the run. In the code I use:

    opal_output_verbose(0, 0, "LNY100 opal_paffinity_base_slot_list_set ver=%d ", 0);
    opal_output_verbose(1, 0, "LNY101 opal_paffinity_base_slot_list_set ver=%d ", 1);
    OPAL_OUTPUT_VERBOSE((1, 0, "VERBOSE LNY102 opal_paffinity_base_slot_list_set ver=%d ", 1));

but all I see is the first line (since I put level 0). I assumed that to see the second line I must configure with --enable-debug, but that is not working for me either.

On Thu, Mar 27, 2008 at 2:02 PM, Jeff Squyres wrote:

> Are you using BTL_OUTPUT or something else from btl_base_error.h?
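One possible reading of the behavior Lenny reports is that verbose output is gated per output stream, so a level-1 message sent to stream 0 stays silent no matter what paffinity_base_verbose is set to. The toy model below sketches that reading. It is not the opal_output implementation: every mock_* name is hypothetical, and it assumes (not confirmed in this thread) that a message is emitted only when the stream's verbosity is at least the message level and that stream 0 defaults to verbosity 0. The macro mirrors Ralph's point that the macro form compiles out of non-debug builds.

```c
#include <assert.h>
#include <stdbool.h>

#define MOCK_MAX_STREAMS 4

/* Hypothetical per-stream verbosity table; index 0 plays the default stream. */
static int mock_verbose_level[MOCK_MAX_STREAMS] = { 0, 0, 0, 0 };

/* Stand-in for what a "<framework>_base_verbose" parameter would configure. */
void mock_set_verbosity(int stream, int level)
{
    assert(stream >= 0 && stream < MOCK_MAX_STREAMS);
    mock_verbose_level[stream] = level;
}

/* Assumed gating rule: print only if the stream is at least as verbose
 * as the message level. */
bool mock_would_print(int level, int stream)
{
    if (stream < 0 || stream >= MOCK_MAX_STREAMS) {
        return false;
    }
    return mock_verbose_level[stream] >= level;
}

/* Macro form: disappears entirely unless the build defines MOCK_ENABLE_DEBUG. */
#ifdef MOCK_ENABLE_DEBUG
#define MOCK_WOULD_PRINT_DEBUG(level, stream) mock_would_print((level), (stream))
#else
#define MOCK_WOULD_PRINT_DEBUG(level, stream) false
#endif
```

Under these assumptions, the level-0 call on stream 0 prints and the level-1 call on stream 0 does not, which matches what Lenny sees. Passing the framework's own stream, as in Tim's suggested opal_output_verbose(10, opal_paffinity_base_output, ...), would be the way to make paffinity_base_verbose take effect.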
Re: [OMPI devel] trunk segfault
Yes, thanks.

On Thu, Mar 27, 2008 at 2:07 PM, Jeff Squyres wrote:

> Lenny --
>
> Did this get fixed? We were mucking with some mca param stuff on the trunk yesterday; not sure if it was related to this failure or not.
Re: [OMPI devel] [OMPI svn] svn:open-mpi r17941
This commit breaks things for me. Running on 3 nodes of odin:

mpirun -mca btl tcp,sm,self examples/ring_c

causes a hang. All of the processes are stuck in orte_grpcomm_base_barrier during MPI_Finalize. Not all programs hang, and the ring program does not hang all the time, but fairly often.

Tim

r...@osl.iu.edu wrote:

Author: rhc
Date: 2008-03-24 16:50:31 EDT (Mon, 24 Mar 2008)
New Revision: 17941
URL: https://svn.open-mpi.org/trac/ompi/changeset/17941

Log:
Fix the allgather and allgather_list functions to avoid deadlocks at large node/proc counts. Violated the RML rules here - we received the allgather buffer and then did an xcast, which causes a send to go out, and is then subsequently received by the sender. This fix breaks that pattern by forcing the recv to complete outside of the function itself - thus, the allgather and allgather_list always complete their recvs before returning or sending.

Reorganize the grpcomm code a little to provide support for soon-to-come new grpcomm components. The revised organization puts what will be common code elements in the base to avoid duplication, while allowing components that don't need those functions to ignore them.

Added:
   trunk/orte/mca/grpcomm/base/grpcomm_base_allgather.c
   trunk/orte/mca/grpcomm/base/grpcomm_base_barrier.c
   trunk/orte/mca/grpcomm/base/grpcomm_base_modex.c
Text files modified:
   trunk/orte/mca/grpcomm/base/Makefile.am                |   5
   trunk/orte/mca/grpcomm/base/base.h                     |  23 +
   trunk/orte/mca/grpcomm/base/grpcomm_base_close.c       |   4
   trunk/orte/mca/grpcomm/base/grpcomm_base_open.c        |   1
   trunk/orte/mca/grpcomm/base/grpcomm_base_select.c      | 121 ++---
   trunk/orte/mca/grpcomm/basic/grpcomm_basic.h           |  16
   trunk/orte/mca/grpcomm/basic/grpcomm_basic_component.c |  30 -
   trunk/orte/mca/grpcomm/basic/grpcomm_basic_module.c    | 845 ++-
   trunk/orte/mca/grpcomm/cnos/grpcomm_cnos.h             |   8
   trunk/orte/mca/grpcomm/cnos/grpcomm_cnos_component.c   |   8
   trunk/orte/mca/grpcomm/cnos/grpcomm_cnos_module.c      |  21
   trunk/orte/mca/grpcomm/grpcomm.h                       |  45 +
   trunk/orte/mca/rml/rml_types.h                         |  31
   trunk/orte/orted/orted_comm.c                          |  27 +
   14 files changed, 226 insertions(+), 959 deletions(-)

Diff not shown due to size (92619 bytes).
To see the diff, run the following command:

svn diff -r 17940:17941 --no-diff-deleted

___
svn mailing list
s...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/svn
Re: [OMPI devel] [OMPI svn] svn:open-mpi r17941
Thanks Tim - I found the problem and will commit a fix shortly. Appreciate your testing and reporting!

On 3/27/08 8:24 AM, "Tim Prins" wrote:
> This commit breaks things for me. Running on 3 nodes of odin:
>
> mpirun -mca btl tcp,sm,self examples/ring_c
>
> causes a hang. All of the processes are stuck in
> orte_grpcomm_base_barrier during MPI_Finalize. Not all programs hang,
> and the ring program does not hang all the time, but fairly often.
>
> Tim
>
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r17983
Well, technically speaking, Interix is not ... 100% Microsoft, even if it is now somewhat integrated into Windows. It does not support the standard Windows environment (such as windows.h) or the compilers. It comes with gcc (3.3) and most of the Unix tools.

george.

On Mar 27, 2008, at 6:13 AM, Jeff Squyres wrote:

Gotcha. Should this stuff go in ompi/config/ompi_microsoft.m4? (I don't really care; I just already see a Microsoft file, so I figured I'd ask the question)

On Mar 26, 2008, at 9:54 PM, George Bosilca wrote:

Interix or SUA or SFU is the POSIX layer integrated with the latest versions of Windows (such as Vista and Server 2003). It provides fork and rsh; basically most of the tools we need.

george.

Jeff Squyres wrote:

What's Interix?

On Mar 26, 2008, at 7:20 PM, bosi...@osl.iu.edu wrote:

Author: bosilca
Date: 2008-03-26 19:20:33 EDT (Wed, 26 Mar 2008)
New Revision: 17983
URL: https://svn.open-mpi.org/trac/ompi/changeset/17983

Log:
Add support for Interix.

Added:
   trunk/config/ompi_interix.m4 (contents, props changed)
Text files modified:
   trunk/acinclude.m4 | 1 +
   trunk/configure.ac | 3 +++
   2 files changed, 4 insertions(+), 0 deletions(-)

Modified: trunk/acinclude.m4
==============================================================================
--- trunk/acinclude.m4 (original)
+++ trunk/acinclude.m4 2008-03-26 19:20:33 EDT (Wed, 26 Mar 2008)
@@ -108,6 +108,7 @@
 # Include the macros for Windows checking
 #
 m4_include(config/ompi_microsoft.m4)
+m4_include(config/ompi_interix.m4)

 #
 # The config/mca_no_configure_components.m4 file is generated by

Added: trunk/config/ompi_interix.m4
==============================================================================
--- (empty file)
+++ trunk/config/ompi_interix.m4 2008-03-26 19:20:33 EDT (Wed, 26 Mar 2008)
@@ -0,0 +1,56 @@
+dnl -*- shell-script -*-
+dnl
+dnl Copyright (c) 2008 The University of Tennessee and The University
+dnl of Tennessee Research Foundation. All rights
+dnl reserved.
+dnl $COPYRIGHT$
+dnl
+dnl Additional copyrights may follow
+dnl
+dnl $HEADER$
+dnl
+
+ ##
+#
+# OMPI_INTERIX
+#
+# Detect if the environment is SUA/SFU (i.e. Interix) and modify
+# the compiling environment accordingly.
+#
+# USAGE:
+# OMPI_INTERIX()
+#
+ ##
+AC_DEFUN([OMPI_INTERIX],[
+
+AC_MSG_CHECKING(for Interix environment)
+AC_TRY_COMPILE([],
+ [#if !defined(__INTERIX)
+#error Normal Unix environment
+#endif],
+ is_interix=yes,
+ is_interix=no)
+AC_MSG_RESULT([$is_interix])
+if test "$is_interix" = "yes"; then
+
+ompi_show_subtitle "Interix detection"
+
+if ! test -d /usr/include/port; then
+AC_MSG_WARN([Compiling Open MPI under Interix require an up-to-date])
+AC_MSG_WARN([version of libport. Please ask your system administrator])
+AC_MSG_WARN([to install it (pkg_update -L libport).])
+AC_MSG_ERROR([*** Cannot continue])
+fi
+#
+# These are the minimum requirements for Interix ...
+#
+AC_MSG_WARN([-lport was added to the linking flags])
+LDFLAGS="-lport $LDFLAGS"
+AC_MSG_WARN([-D_ALL_SOURCE -D_USE_LIBPORT was added to the compilation flags])
+CFLAGS="-D_ALL_SOURCE -D_USE_LIBPORT -I/usr/include/port $CFLAGS"
+CPPFLAGS="-D_ALL_SOURCE -D_USE_LIBPORT -I/usr/include/port $CPPFLAGS"
+CXXFLAGS="-D_ALL_SOURCE -D_USE_LIBPORT -I/usr/include/port $CXXFLAGS"
+
+fi
+
+])

Modified: trunk/configure.ac
==============================================================================
--- trunk/configure.ac (original)
+++ trunk/configure.ac 2008-03-26 19:20:33 EDT (Wed, 26 Mar 2008)
@@ -192,6 +192,9 @@
 AM_CONDITIONAL(OMPI_NEED_WINDOWS_REPLACEMENTS,
 test "$ompi_cv_c_compiler_vendor" = "microsoft" )

+# Do all Interix detections if necessary
+OMPI_INTERIX
+
 # Does the compiler support "ident"-like constructs?
 OMPI_CHECK_IDENT([CC], [CFLAGS], [c], [C])

___
svn-full mailing list
svn-f...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/svn-full

--
Jeff Squyres
Cisco Systems
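[Editor's sketch] The OMPI_INTERIX macro quoted above decides whether it is on Interix by compiling a snippet that fails with #error unless the compiler predefines __INTERIX. The same preprocessor test can be written as a small C helper, shown below as a rough illustration. The function names is_interix and interix_extra_cppflags are made up for this sketch; only the __INTERIX symbol and the flag string come from the commit.

```c
/*
 * Hypothetical helpers mirroring the configure-time check in OMPI_INTERIX.
 * Only __INTERIX and the flag string are taken from the commit; the
 * function names are illustrative.
 */

/* Returns 1 when the compiler predefines __INTERIX, 0 otherwise. */
static int is_interix(void)
{
#if defined(__INTERIX)
    return 1;
#else
    return 0;
#endif
}

/* Preprocessor flags the macro prepends on Interix; empty elsewhere. */
static const char *interix_extra_cppflags(void)
{
    if (is_interix()) {
        return "-D_ALL_SOURCE -D_USE_LIBPORT -I/usr/include/port";
    }
    return "";
}
```

On a regular Unix or Linux compiler __INTERIX is not defined, so is_interix() returns 0 and no extra flags are reported, which corresponds to the is_interix=no branch of the macro.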
Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r17983
Gotcha; thanks for the explanation.
Re: [OMPI devel] [OMPI svn] svn:open-mpi r17941
Unfortunately now with r17988 I cannot run any mpi programs, they seem to hang in the modex.

Tim

Ralph H Castain wrote:
> Thanks Tim - I found the problem and will commit a fix shortly.
>
> Appreciate your testing and reporting!
Re: [OMPI devel] [OMPI svn] svn:open-mpi r17941
Hmmm...puzzling. It is working fine for me on TM machines and on my Mac. However, Galen reports it borked on alps as well.

I'll have to dig a little to check this out and see if there is something missing on those PLMs. Will get back shortly.

Sorry for problem

On 3/27/08 10:28 AM, "Tim Prins" wrote:
> Unfortunately now with r17988 I cannot run any mpi programs, they seem
> to hang in the modex.
>
> Tim
Re: [OMPI devel] [OMPI svn] svn:open-mpi r17941
Found the problem - should have a fix committed soon. Issue is with differences in the number of daemons launched by the various plms (whether or not procs are launched local to mpirun).

On 3/27/08 10:39 AM, "Ralph H Castain" wrote:
> Hmmm...puzzling. It is working fine for me on TM machines and on my Mac.
> However, Galen reports it borked on alps as well.
>
> I'll have to dig a little to check this out and see if there is something
> missing on those PLMs. Will get back shortly.
>
> Sorry for problem
Re: [OMPI devel] [OMPI svn] svn:open-mpi r17941
Appears fixed with r17992 - at least, it works on TM, slurm (odin), and Mac.

On 3/27/08 11:06 AM, "Ralph H Castain" wrote:
> Found the problem - should have a fix committed soon. Issue is with
> differences in the number of daemons launched by the various plms (whether
> or not procs are launched local to mpirun).
Re: [OMPI devel] Switching away from SVN?
On Mon, Mar 24, 2008 at 04:00:18PM -0400, George Bosilca wrote:
> After playing with hg and git for a few days, I tend to agree with the
> emacs guys. It looks to me that any of them will do the job (as did
> svn). I don't really care which one will be selected by the community
> as long as we:
> 1. Don't spend months in deciding which one to choose.
> 2. Don't lose the nice integration of svn with our TRAC. Independent
> of how good/fast the dVCS is, the way svn integrates with trac is a
> real time saver. Tracking bugs, linking to revisions and to the wiki
> are really important features to me, and I think that whatever our
> decision will be we should not lose this.

For what it's worth, I noticed this on one of the planets I read this morning. It looks like someone has already done the work to use git as a backend for TRAC:

http://www.terdmonk.com/using+git+as+a+trac+versioning+system+backend

Yours
Tony

linux.conf.au    http://www.marchsouth.org/
Jan 19 - 24 2009 The Australian Linux Technical Conference!