Re: [OMPI devel] binding with MCA parameters: broken or user error?
In regards to the "-mca XXX" option not overriding the file setting, I thought I saw this working for v1.3.  However, I just retested this and I am seeing the same issue of the "-mca" option not affecting orte_process_binding or rmaps_base_schedule_policy.  This seems to work under the trunk.  I wonder if the issue might be something we did in r22050, where we stopped calling orte_register_params twice?  I'm not sure exactly why that would have prevented the mca option setting from taking effect the first time.

--td

Ralph Castain wrote:

Try adding -display-devel-map to your cmd line so you can see what OMPI thinks the binding and mapping policy is set to - that'll tell you if the problem is in the mapping or in the daemon binding.  Also, it might help to know something about this node - like how many sockets and cores/socket.

On Oct 8, 2009, at 11:17 PM, Eugene Loh wrote:

Here are two problems with openmpi-1.3.4a1r22051.

# Here, I try to run the moral equivalent of -bysocket -bind-to-socket,
# using the MCA parameter form specified on the mpirun command line.
# No binding results.  THIS IS PROBLEM 1.
% mpirun -np 5 --mca rmaps_base_schedule_policy socket --mca orte_process_binding socket -report-bindings hostname
saem9
saem9
saem9
saem9
saem9

# Same thing with the "core" form.
% mpirun -np 5 --mca rmaps_base_schedule_policy core --mca orte_process_binding core -report-bindings hostname
saem9
saem9
saem9
saem9
saem9

# Now, I set the MCA parameters as environment variables.
# I then check the spellings and confirm all is set using ompi_info.
% setenv OMPI_MCA_rmaps_base_schedule_policy socket
% setenv OMPI_MCA_orte_process_binding socket
% ompi_info -a | grep rmaps_base_schedule_policy
    MCA rmaps: parameter "rmaps_base_schedule_policy" (current value: "socket", data source: environment)
% ompi_info -a | grep orte_process_binding
    MCA orte: parameter "orte_process_binding" (current value: "socket", data source: environment)

# So, now I run a simple program.
# I get binding now, but I'm filling up the first socket before going to the second.
# THIS IS PROBLEM 2.
% mpirun -np 5 -report-bindings hostname
[saem9:23947] [[29741,0],0] odls:default:fork binding child [[29741,1],0] to socket 0 cpus 000f
[saem9:23947] [[29741,0],0] odls:default:fork binding child [[29741,1],1] to socket 0 cpus 000f
[saem9:23947] [[29741,0],0] odls:default:fork binding child [[29741,1],2] to socket 0 cpus 000f
[saem9:23947] [[29741,0],0] odls:default:fork binding child [[29741,1],3] to socket 0 cpus 000f
[saem9:23947] [[29741,0],0] odls:default:fork binding child [[29741,1],4] to socket 1 cpus 00f0
saem9
saem9
saem9
saem9
saem9

# Adding -bysocket to the command line fixes things.
% mpirun -np 5 -bysocket -report-bindings hostname
[saem9:23953] [[29751,0],0] odls:default:fork binding child [[29751,1],0] to socket 0 cpus 000f
[saem9:23953] [[29751,0],0] odls:default:fork binding child [[29751,1],1] to socket 1 cpus 00f0
[saem9:23953] [[29751,0],0] odls:default:fork binding child [[29751,1],2] to socket 0 cpus 000f
[saem9:23953] [[29751,0],0] odls:default:fork binding child [[29751,1],3] to socket 1 cpus 00f0
[saem9:23953] [[29751,0],0] odls:default:fork binding child [[29751,1],4] to socket 0 cpus 000f
saem9
saem9
saem9
saem9
saem9

Bug?  Or am I doing something wrong?
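For context on the precedence question in this thread: an MCA parameter can come from the mpirun command line (-mca), the environment (OMPI_MCA_*), a parameter file, or a compiled-in default, and the command line is supposed to win.  The following is only a self-contained illustrative sketch of that precedence logic - it is not the actual mca_base code, and lookup_param() and the variable names are hypothetical - but it shows why looking a parameter up before the -mca command-line values have been processed would make the -mca setting appear to be ignored, which is the flavor of ordering problem the register_params hypothesis above is about.

    /* Hypothetical sketch of MCA-style parameter precedence:
     * command line > environment > param file > default.
     * NOT the real Open MPI mca_base implementation. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    static const char *cmdline_value = NULL;   /* filled in while parsing -mca */
    static const char *file_value    = "core"; /* pretend this came from a param file */

    static const char *lookup_param(const char *name, const char *default_value)
    {
        char env_name[256];

        /* 1. command line wins ... */
        if (cmdline_value != NULL) {
            return cmdline_value;
        }
        /* 2. ... then the OMPI_MCA_<name> environment variable ... */
        snprintf(env_name, sizeof(env_name), "OMPI_MCA_%s", name);
        if (getenv(env_name) != NULL) {
            return getenv(env_name);
        }
        /* 3. ... then the parameter file, 4. then the compiled-in default. */
        return (file_value != NULL) ? file_value : default_value;
    }

    int main(int argc, char **argv)
    {
        /* If the parameter is looked up *before* -mca is parsed, the cached
         * result reflects only the file/default sources. */
        const char *early = lookup_param("orte_process_binding", "none");

        /* e.g. run as:  ./a.out -mca orte_process_binding socket */
        if (argc > 3 && 0 == strcmp(argv[1], "-mca") &&
            0 == strcmp(argv[2], "orte_process_binding")) {
            cmdline_value = argv[3];
        }
        const char *late = lookup_param("orte_process_binding", "none");

        printf("before parsing -mca: %s\nafter parsing -mca:  %s\n", early, late);
        return 0;
    }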
[OMPI devel] segv in coll tuned
Hi,

I experience the following error with the current trunk, r22090.  It also occurs in the 1.3 branch.

#~/work/svn/ompi/branches/1.3//build_x86-64/install/bin/mpirun -H witch21 -np 4 -mca coll_tuned_use_dynamic_rules 1 ./IMB-MPI1

Sometimes it's this error, and sometimes it's a segv.  It reproduces with np > 4.

[witch21:26540] *** An error occurred in MPI_Barrier
[witch21:26540] *** on communicator MPI COMMUNICATOR 3 SPLIT FROM 0
[witch21:26540] *** MPI_ERR_ARG: invalid argument of some other kind
[witch21:26540] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
--
mpirun has exited due to process rank 0 with PID 26540 on node witch21 exiting without calling "finalize".  This may have caused other processes in the application to be terminated by signals sent by mpirun (as reported here).
--
3 total processes killed (some possibly by mpirun during cleanup)

thanks
Lenny.
Re: [OMPI devel] segv in coll tuned
Does that test also pass sometimes?  I am seeing some random set of tests segv'ing in the SM btl, using a v1.3 derivative.

--td

Lenny Verkhovsky wrote:
> Hi,
> I experience the following error with the current trunk, r22090.  It also occurs in the 1.3 branch.
> #~/work/svn/ompi/branches/1.3//build_x86-64/install/bin/mpirun -H witch21 -np 4 -mca coll_tuned_use_dynamic_rules 1 ./IMB-MPI1
> Sometimes it's this error, and sometimes it's a segv.  It reproduces with np > 4.
Re: [OMPI devel] segv in coll tuned
Not since I started testing it :)  It fails somewhere in the ompi_coll_tuned_get_target_method_params function; I am taking a look right now.

On Mon, Oct 12, 2009 at 3:33 PM, Terry Dontje wrote:
> Does that test also pass sometimes?  I am seeing some random set of tests segv'ing in the SM btl, using a v1.3 derivative.
>
> --td
Re: [OMPI devel] segv in coll tuned
Well, I see that it returns 0 at this line, since base_com_rule->n_msg_sizes == 0:

coll_tuned_dynamic_rules.c +359
    if( (NULL == base_com_rule) || (0 == base_com_rule->n_msg_sizes) ) {
        return (0);
    }

Sometimes it passes if I tell IMB -npmin 4.

On Mon, Oct 12, 2009 at 3:37 PM, Lenny Verkhovsky <lenny.verkhov...@gmail.com> wrote:
> not since I started testing it :)
> It fails somewhere in the ompi_coll_tuned_get_target_method_params function; I am taking a look right now.
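As a reading aid, here is a small self-contained sketch of the kind of message-size rule lookup being discussed.  It is not the actual coll tuned code - the struct layout and function name are simplified and hypothetical - but it illustrates the pattern behind the quoted guard: when a per-communicator rule exists but has no message-size entries (n_msg_sizes == 0), the lookup bails out early and returns 0, i.e. "fall back to the default decision".

    /* Hypothetical, simplified sketch of a dynamic-rules lookup; not the
     * real coll_tuned_dynamic_rules.c implementation. */
    #include <stddef.h>
    #include <stdio.h>

    typedef struct msg_rule {
        size_t msg_size;    /* rule applies to messages >= this size */
        int    method;      /* algorithm index to use */
    } msg_rule_t;

    typedef struct com_rule {
        int         n_msg_sizes;   /* number of entries in msg_rules */
        msg_rule_t *msg_rules;
    } com_rule_t;

    /* Returns the configured method for this message size, or 0 ("use the
     * fixed/default decision") when no usable rule exists -- the branch
     * taken in the report above because n_msg_sizes is 0. */
    static int get_target_method(const com_rule_t *base_com_rule, size_t msg_size)
    {
        int i, method;

        if (NULL == base_com_rule || 0 == base_com_rule->n_msg_sizes) {
            return 0;
        }

        method = base_com_rule->msg_rules[0].method;
        for (i = 1; i < base_com_rule->n_msg_sizes; ++i) {
            if (msg_size < base_com_rule->msg_rules[i].msg_size) {
                break;
            }
            method = base_com_rule->msg_rules[i].method;
        }
        return method;
    }

    int main(void)
    {
        com_rule_t empty = { 0, NULL };
        printf("method for 1 KB message with empty rule: %d\n",
               get_target_method(&empty, 1024));
        return 0;
    }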
Re: [OMPI devel] binding with MCA parameters: broken or user error?
I fixed the process schedule issue on the trunk over the weekend (not moved to 1.3 yet while it "soaked") - the binding issue was working fine on the trunk.

I believe I applied the fix to stop calling register_params twice to 1.3 already, but I can check.

On Oct 12, 2009, at 4:36 AM, Terry Dontje wrote:
> In regards to the "-mca XXX" option not overriding the file setting, I thought I saw this working for v1.3.  However, I just retested this and I am seeing the same issue of the "-mca" option not affecting orte_process_binding or rmaps_base_schedule_policy.  This seems to work under the trunk.  I wonder if the issue might be something we did in r22050, where we stopped calling orte_register_params twice?
>
> --td
Re: [OMPI devel] binding with MCA parameters: broken or user error?
Ralph Castain wrote:
> I fixed the process schedule issue on the trunk over the weekend (not moved to 1.3 yet while it "soaked") - the binding issue was working fine on the trunk.

So there was an issue of "-mca orte_process_binding" not being interpreted?

> I believe I applied the fix to stop calling register_params twice to 1.3 already, but I can check.

No, I was asking whether that fix might be causing the orte_process_binding mca param not to be interpreted.  But from what you say in the first paragraph, I guess I was probably wrong.

--td
Re: [OMPI devel] binding with MCA parameters: broken or user error?
On Oct 12, 2009, at 9:19 AM, Terry Dontje wrote:
> Ralph Castain wrote:
>> I fixed the process schedule issue on the trunk over the weekend (not moved to 1.3 yet while it "soaked") - the binding issue was working fine on the trunk.
>
> So there was an issue of "-mca orte_process_binding" not being interpreted?

I could not replicate the binding problem on the trunk.  I haven't explored it further just yet.

>> I believe I applied the fix to stop calling register_params twice to 1.3 already, but I can check.
>
> No, I was asking whether that fix might be causing the orte_process_binding mca param not to be interpreted.  But from what you say in the first paragraph, I guess I was probably wrong.

I don't see how, but I will look at it later.

> --td
Re: [OMPI devel] [OMPI users] cartofile
This e-mail was on the users alias... see http://www.open-mpi.org/community/lists/users/2009/09/10710.php

There wasn't much response, so let me ask another question.  How about if we remove the cartofile section from the DESCRIPTION section of the OMPI mpirun man page?  It's a lot of text that illustrates how to create a cartofile without saying anything about why one would want to go to the trouble.  What does this impact?  What does it change?  What's the motivation for doing this stuff?  What's this stuff good for?

Another alternative would be to move the cartofile description to a FAQ page.  The mpirun man page is rather long, and I was thinking that if we could move some "low impact" stuff out, we could improve the overall signal-to-noise ratio of the page.

In any case, I personally would like to know what cartofiles are good for.

Eugene Loh wrote:
> Thank you, but I don't understand who is consuming this information for what.  E.g., the mpirun man page describes the carto file, but doesn't give users any indication whether they should be worrying about this.
>
> Lenny Verkhovsky wrote:
>> Hi Eugene,
>> A carto file is a file with a static graph topology of your node.  You can see an example in opal/mca/carto/file/carto_file.h.  (Yes, I know, it should be help/man list :) )
>> Basically it describes a map of your node and its internal interconnections.  Hopefully it will be discovered automatically someday, but for now you can describe your node manually.
>> Best regards,
>> Lenny.
>>
>> On Thu, Sep 17, 2009 at 12:38 AM, Eugene Loh wrote:
>>> I feel like I should know, but what's a cartofile?  I guess you supply "topological" information about a host, but I can't tell how this information is used by, say, mpirun.
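For anyone else wondering what such a file looks like, the example below is an approximate reconstruction of the format described in opal/mca/carto/file/carto_file.h, written from memory rather than copied from the header, so treat the exact keywords, node types, and weights as illustrative.  The idea is that EDGE lines declare nodes of the topology graph (a free-form type and name, with "slot" reserved for CPU slots) and BRANCH_BI_DIR lines connect a node to one or more other nodes with a weight (distance).

    # Illustrative carto file (approximate reconstruction of the format
    # shown in opal/mca/carto/file/carto_file.h; keywords/weights may differ).
    #
    # Node declarations:  EDGE  <type>  <name>
    EDGE    Memory      mem0
    EDGE    Memory      mem1
    EDGE    slot        slot0
    EDGE    slot        slot1
    EDGE    Infiniband  mthca0
    EDGE    Ethernet    eth0
    #
    # Connections:  BRANCH_BI_DIR  <from>  <to>:<weight> [<to>:<weight> ...]
    BRANCH_BI_DIR   mem0    slot0:0
    BRANCH_BI_DIR   mem1    slot1:0
    BRANCH_BI_DIR   slot0   slot1:1
    BRANCH_BI_DIR   slot0   mthca0:2 eth0:2
    BRANCH_BI_DIR   slot1   mthca0:2

A file like this would let components that care about locality (e.g. deciding which HCA or NIC is closest to a given CPU slot) weigh the distances between resources on the node, which seems to be the "what is it good for" answer being asked about above.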