Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm
I'm afraid Sylvain is right, and we have a bug in ompi_info:

MCA routed: parameter "routed_binomial_priority" (current value: <0>, data source: default value)
MCA routed: parameter "routed_cm_priority" (current value: <0>, data source: default value)
MCA routed: parameter "routed_direct_priority" (current value: <0>, data source: default value)
MCA routed: parameter "routed_linear_priority" (current value: <0>, data source: default value)
MCA routed: parameter "routed_radix_priority" (current value: <0>, data source: default value)
MCA routed: parameter "routed_slave_priority" (current value: <0>, data source: default value)

Those params do not exist in the code base. I think we -assume- that every component will have an MCA param for setting priority, but most of the ORTE ones do not.

We'll need to review ompi_info and fix this.

On Nov 30, 2009, at 5:22 PM, Jeff Squyres wrote:

> On Nov 30, 2009, at 10:48 AM, Sylvain Jeaugey wrote:
>
>> About my previous e-mail, I was wrong about all components having a 0
>> priority: it was based on default parameters reported by "ompi_info -a |
>> grep routed". It seems that the truth is not always in ompi_info ...
>
> ompi_info *does* always report the truth. Those values are what the run-time
> thinks they are currently set to -- either via environment, file, or whatever
> other mechanism. You might want to check your setup and see if they're being
> set via an unexpected mechanism...? Try using the "--parsable" switch and
> grep for "data_source" to see where values are getting set from.
>
> --
> Jeff Squyres
> jsquy...@cisco.com
>
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
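For anyone who wants to check lines like the ones quoted in this message programmatically, here is a minimal C sketch. It assumes each parameter is printed on one line in exactly the human-readable layout shown in the listing; the struct and function names are hypothetical and not part of Open MPI, and "default value" is the only data-source string that appears in this thread. The "--parsable" output mentioned by Jeff may use a different layout, which this sketch does not attempt to handle.

```c
#include <stdio.h>
#include <string.h>

/* Fields of one human-readable "MCA <framework>: parameter ..." line.
 * Hypothetical helper type, not an Open MPI structure. */
struct mca_param_line {
    char framework[32];
    char name[64];
    char value[32];
    char source[64];
};

/* Parses one line in the layout quoted in the thread.
 * Returns 1 on success, 0 if the line does not match.
 * Note: an empty value ("<>") does not match, because a scanset
 * conversion needs at least one character. */
int parse_param_line(const char *line, struct mca_param_line *out)
{
    int n = sscanf(line,
                   " MCA %31[^:]: parameter \"%63[^\"]\""
                   " (current value: <%31[^>]>, data source: %63[^)])",
                   out->framework, out->name, out->value, out->source);
    return n == 4;
}

/* True when the line reports that the value was never overridden. */
int is_default(const struct mca_param_line *p)
{
    return strcmp(p->source, "default value") == 0;
}
```

Feeding the first quoted line to `parse_param_line` fills in the framework `routed`, the name `routed_binomial_priority`, the value `0`, and a source for which `is_default` is true.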
Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm
This is not a bug, it's a feature. :-)

The MCA base automatically adds a priority MCA parameter for every component.

On Dec 1, 2009, at 7:40 AM, Ralph Castain wrote:

> I'm afraid Sylvain is right, and we have a bug in ompi_info:
>
> MCA routed: parameter "routed_binomial_priority" (current value: <0>, data source: default value)
> MCA routed: parameter "routed_cm_priority" (current value: <0>, data source: default value)
> MCA routed: parameter "routed_direct_priority" (current value: <0>, data source: default value)
> MCA routed: parameter "routed_linear_priority" (current value: <0>, data source: default value)
> MCA routed: parameter "routed_radix_priority" (current value: <0>, data source: default value)
> MCA routed: parameter "routed_slave_priority" (current value: <0>, data source: default value)
>
> Those params do not exist in the code base. I think we -assume- that every
> component will have an MCA param for setting priority, but most of the ORTE
> ones do not.
>
> We'll need to review ompi_info and fix this.

--
Jeff Squyres
jsquy...@cisco.com
Re: [OMPI devel] OPEN-MPI Fault-Tolerance for GASNet
Thomas,

I have not tried to use the checkpoint/restart feature with GASNet over MPI, so I cannot comment directly on how they interact. However, the combination should work as long as the proper arguments (-am ft-enable-cr) are passed along to the mpirun command, and Open MPI is configured properly.

The error message that you copied seems to indicate that the local daemon on one of the nodes failed to start a checkpoint of the target application. Often this is caused by one of two things:
 - Open MPI was not configured with the fault tolerance thread, and the application is waiting for a long time in a computation loop (not entering the MPI library).
 - The '-am ft-enable-cr' flag was not provided to the mpirun process, so the MPI application did not activate the C/R specific code paths and is therefore denying the request to checkpoint.

Can you send me a bit more information:
 - What version of Open MPI are you using?
 - How did you configure Open MPI?
 - What arguments are being passed to 'mpirun' when running with GASNet?
 - Do you have any environment variables/MCA parameters set for Open MPI?

-- Josh

On Nov 22, 2009, at 7:13 PM, Thomas CI Yoon wrote:

> Dear all.
>
> Thanks to developers of OPEN-MPI for Fault-Tolerance, I can use the
> checkpoint/restart function very well for my MPI applications.
> But its checkpoint does not work for my GASNet applications which use the
> MPI conduit. Is here anyone else to help me?
>
> I wrote some code with GASNet API (Global-Address Space Networking:
> http://gasnet.cs.berkeley.edu/) and used MPI conduit for my gasnet
> application, so my program ran well with open-mpirun. Thus I thought that I
> could also use the transparent checkpoint/restart function supported by
> BLCR in Open-mpi. As opposed to my idea, it does not work and show the
> following error message.
>
> --
> Error: The process with PID 13896 is not checkpointable.
> This could be due to one of the following:
> - An application with this PID doesn't currently exist
> - The application with this PID isn't checkpointable
> - The application with this PID isn't an OPAL application.
> We were looking for the named files:
> /tmp/opal_cr_prog_write.13896
> /tmp/opal_cr_prog_read.13896
> --
> 1 more process has sent help message help-opal-checkpoint.txt
> Set MCA parameter "orte_base_help_aggregate" to 0 to see all help
> 0] 13896) Step 53
> 0] 15100) Step 53
> 0] 13896) Step 54
> 0] 15100) Step 54
> 0] 13896) Step 55
>
> In my application, the MPI_Initialized() says it is initialized.
>
> Thank you for your reading and have a great day.
[OMPI devel] Doodle to discuss 2 MPI_Request changes
OK, I set up a Doodle to find a time to chat about the 2 MPI_Request changes:

1. What Brian just did for the ROMIO interface (because others might benefit from the same technique).
2. Details of the pending RFC.

http://doodle.com/qgvvydks5t5s9dp4

Please reply to the Doodle by Thursday COB so that I can set up a WebEx and send around the contact info.

--
Jeff Squyres
jsquy...@cisco.com
Re: [OMPI devel] MPI_Graph_create
It looks like MPI_Cart_create argument checking was fixed in 1.3.4 but not MPI_Graph_create.

vayu1:~/openmpi-1.3.4 > diff -w -u ompi/mpi/c/cart_create.c ompi/mpi/c/graph_create.c
...
-int MPI_Cart_create(MPI_Comm old_comm, int ndims, int *dims,
-int *periods, int reorder, MPI_Comm *comm_cart) {
+int MPI_Graph_create(MPI_Comm old_comm, int nnodes, int *index,
+ int *edges, int reorder, MPI_Comm *comm_graph)
+{
...
+if ((0 > reorder) || (1 < reorder)) {

David

David Singleton wrote:

> Kiril Dichev has already pointed out a problem with MPI_Cart_create
> http://www.open-mpi.org/community/lists/devel/2009/08/6627.php
> MPI_Graph_create has the same problem. I checked all other
> functions with logical in arguments and no others do anything
> similar.
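To make the reported difference concrete, here is a small standalone C sketch. It does not call MPI and is not the Open MPI source. It assumes the problem referred to is that the range test quoted in the diff rejects `reorder` values other than 0 and 1, whereas in the C binding a logical argument is an integer where 0 means false and any nonzero value means true. The function names are hypothetical.

```c
#include <stdbool.h>

/* Mirrors the range test quoted in the diff: only 0 and 1 pass.
 * Hypothetical helper, written here purely for illustration. */
bool reorder_passes_strict_check(int reorder)
{
    return !((0 > reorder) || (1 < reorder));
}

/* Ordinary C truthiness: any nonzero int counts as "true". */
bool reorder_as_logical(int reorder)
{
    return reorder != 0;
}
```

With these two helpers, a caller passing `reorder = 2` is treated as "true" by `reorder_as_logical` but is rejected by `reorder_passes_strict_check`, which is the kind of mismatch the message appears to describe.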
Re: [OMPI devel] OPEN-MPI Fault-Tolerance for GASNet
Thomas,

In connection with Josh's question about mpirun arguments, I suggest you try setting
 MPIRUN_CMD='mpirun -am ft-enable-cr -np %N %P %A'
in your environment before launching the GASNet application. This will instruct GASNet's wrapper around mpirun to include the flag Josh mentioned.

-Paul

Josh Hursey wrote:

> Thomas,
>
> I have not tried to use the checkpoint/restart feature with GASNet over
> MPI, so I cannot comment directly on how they interact. However, the
> combination should work as long as the proper arguments (-am ft-enable-cr)
> are passed along to the mpirun command, and Open MPI is configured
> properly.
>
> The error message that you copied seems to indicate that the local daemon
> on one of the nodes failed to start a checkpoint of the target
> application. Often this is caused by one of two things:
>  - Open MPI was not configured with the fault tolerance thread, and the
>    application is waiting for a long time in a computation loop (not
>    entering the MPI library).
>  - The '-am ft-enable-cr' flag was not provided to the mpirun process, so
>    the MPI application did not activate the C/R specific code paths and is
>    therefore denying the request to checkpoint.
>
> Can you send me a bit more information:
>  - What version of Open MPI are you using?
>  - How did you configure Open MPI?
>  - What arguments are being passed to 'mpirun' when running with GASNet?
>  - Do you have any environment variables/MCA parameters set for Open MPI?
>
> -- Josh

--
Paul H. Hargrove                          phhargr...@lbl.gov
Future Technologies Group                 Tel: +1-510-495-2352
HPC Research Department                   Fax: +1-510-486-6900
Lawrence Berkeley National Laboratory
Re: [OMPI devel] MPI_Graph_create
You are absolutely correct. I've filed CMRs for v1.4 and v1.5.

Thanks for the heads up!

On Dec 1, 2009, at 4:26 PM, David Singleton wrote:

> It looks like MPI_Cart_create argument checking was fixed in 1.3.4 but
> not MPI_Graph_create.
>
> vayu1:~/openmpi-1.3.4 > diff -w -u ompi/mpi/c/cart_create.c ompi/mpi/c/graph_create.c
> ...
> -int MPI_Cart_create(MPI_Comm old_comm, int ndims, int *dims,
> -int *periods, int reorder, MPI_Comm *comm_cart) {
> +int MPI_Graph_create(MPI_Comm old_comm, int nnodes, int *index,
> + int *edges, int reorder, MPI_Comm *comm_graph)
> +{
> ...
> +if ((0 > reorder) || (1 < reorder)) {
>
> David

--
Jeff Squyres
jsquy...@cisco.com
Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm
The only issue with that is it implies there is a param that can be adjusted - and there isn't. So it can confuse a user - or even a developer, as it did here.

I should think we wouldn't want MCA to automatically add any parameter. If the component doesn't register it, then it shouldn't exist. The MCA can just track a value without defining it as a visible param.

True?

On Dec 1, 2009, at 5:48 AM, Jeff Squyres wrote:

> This is not a bug, it's a feature. :-)
>
> The MCA base automatically adds a priority MCA parameter for every component.
Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm
On Dec 1, 2009, at 5:23 PM, Ralph Castain wrote:

> The only issue with that is it implies there is a param that can be
> adjusted - and there isn't. So it can confuse a user - or even a
> developer, as it did here.
>
> I should think we wouldn't want MCA to automatically add any parameter.
> If the component doesn't register it, then it shouldn't exist. The MCA
> can just track a value without defining it as a visible param.
>
> True?

The original code came from long, long ago -- when every component did have a relevant priority (i.e., when priority was about the only way to choose which one was used). Developers didn't want to register a "foo_priority" param for every single component, so we made it automatic.

That doesn't necessarily fit anymore -- as Ralph said, priority isn't relevant for some frameworks.

So perhaps it can become a param in the downcall to the MCA base as to whether the priority params should be automatically registered...?

--
Jeff Squyres
jsquy...@cisco.com
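The idea floated in this message, a flag passed down to the base that controls whether a priority parameter is registered automatically, can be sketched in a few lines of C. This is a toy model under stated assumptions and not the MCA base API: `param_table`, `register_param`, `param_exists`, and `open_component` are all hypothetical names, and the only detail taken from the thread is the `<framework>_<component>_priority` naming pattern visible in the ompi_info output.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define MAX_PARAMS 16
#define NAME_LEN   64

/* Toy registry of visible parameter names (hypothetical, not MCA code). */
struct param_table {
    char names[MAX_PARAMS][NAME_LEN];
    int  count;
};

/* Adds a visible parameter name; returns 0 on success, -1 if the table
 * is full or the name is too long. */
int register_param(struct param_table *t, const char *name)
{
    if (t->count >= MAX_PARAMS || strlen(name) >= NAME_LEN) {
        return -1;
    }
    strcpy(t->names[t->count++], name);
    return 0;
}

/* Reports whether a parameter with this exact name is visible. */
bool param_exists(const struct param_table *t, const char *name)
{
    for (int i = 0; i < t->count; i++) {
        if (strcmp(t->names[i], name) == 0) {
            return true;
        }
    }
    return false;
}

/* Hypothetical "downcall": the framework states whether the base should
 * auto-register "<framework>_<component>_priority" for this component. */
int open_component(struct param_table *t, const char *framework,
                   const char *component, bool auto_priority)
{
    char name[NAME_LEN];
    int n;

    if (!auto_priority) {
        return 0; /* nothing is registered, so nothing shows up to users */
    }
    n = snprintf(name, sizeof name, "%s_%s_priority", framework, component);
    if (n < 0 || n >= (int)sizeof name) {
        return -1;
    }
    return register_param(t, name);
}
```

In this model a framework that does not use priorities would pass `false`, and no `routed_binomial_priority` style entry would ever become visible. A framework that still selects by priority would pass `true` and keep the current behavior.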
Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm
On Dec 1, 2009, at 3:40 PM, Jeff Squyres wrote:

> On Dec 1, 2009, at 5:23 PM, Ralph Castain wrote:
>
>> The only issue with that is it implies there is a param that can be adjusted
>> - and there isn't. So it can confuse a user - or even a developer, as it did
>> here.
>>
>> I should think we wouldn't want MCA to automatically add any parameter. If
>> the component doesn't register it, then it shouldn't exist. The MCA can just
>> track a value without defining it as a visible param.
>>
>> True?
>
> The original code came from long, long ago -- when every component did have a
> relevant priority (i.e., when priority was about the only way to choose which
> one was used). Developers didn't want to register a "foo_priority" param for
> every single component, so we made it automatic.
>
> That doesn't necessarily fit anymore -- as Ralph said, priority isn't
> relevant for some frameworks.
>
> So perhaps it can become a param in the downcall to the MCA base as to
> whether the priority params should be automatically registered...?

I can live with that, though I again question why anything needs to be automatically registered. It sounds like we were lazy, and so now we have things happening automatically that can confuse people.

I think priority has become a bit of an issue whenever we are talking about single-selection frameworks. If a user sets a priority to some value (whatever it is), there is an expectation that this means the component will be selected. As we learned in ORTE, that isn't true, leading to a lot of confusion and explanation. This is why we removed priority from most ORTE frameworks, and instead tell people to directly define the component to be used with -mca frame module.

So I'm willing to go through the ORTE frameworks and revise the downcalls to the MCA base. However, I think we may want to rethink the entire priority scheme to ensure we have what we need today (as opposed to what we wrote a long time ago).

Question: if the system automatically registers a priority param, and someone sets a priority with it, then what happens when the component returns a different (possibly hardwired) value? Does MCA base ignore what was set and use what was returned? I assume that is the case...
Re: [OMPI devel] MPI_Graph_create
"Jeff Squyres" wrote:

> You are absolutely correct. I've filed CMRs
> for v1.4 and v1.5.

To clarify one point for people who weren't at the SC'09 Open-MPI BOF (hopefully I'll get this right!):

1.4 will be the bug-fix only continuation of the 1.3 feature series and will be binary compatible with 1.3.x where x >= 2.

cheers,
Chris
--
Christopher Samuel - (03) 9925 4751 - Systems Manager
The Victorian Partnership for Advanced Computing
P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency
Re: [OMPI devel] MPI_Graph_create
On Dec 1, 2009, at 7:00 PM, Chris Samuel wrote:

>> You are absolutely correct. I've filed CMRs
>> for v1.4 and v1.5.
>
> To clarify one point for people who weren't at the SC'09 Open-MPI BOF
> (hopefully I'll get this right!):
>
> 1.4 will be the bug-fix only continuation of the 1.3 feature series and
> will be binary compatible with 1.3.x where x >= 2.

Absolutely correct. v1.5 is the next "feature" release series. A full explanation of our version numbering scheme is described here:

http://www.open-mpi.org/software/ompi/versions/

--
Jeff Squyres
jsquy...@cisco.com
Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm
On Dec 1, 2009, at 5:52 PM, Ralph Castain wrote:

>> So perhaps it can become a param in the downcall to the MCA base as to
>> whether the priority params should be automatically registered...?
>
> I can live with that, though I again question why anything needs to be
> automatically registered. It sounds like we were lazy, and so now we have
> things happening automatically that can confuse people.

That pretty well sums it up. :-)

> Question: if the system automatically registers a priority param, and
> someone sets a priority with it, then what happens when the component
> returns a different (possibly hardwired) value? Does MCA base ignore what
> was set and use what was returned? I assume that is the case...

If the component re-registers a priority param with a new default value, that new default becomes *the* default (overwriting the prior default value that was registered).

Is that what you're asking?

--
Jeff Squyres
jsquy...@cisco.com
Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm
On Dec 1, 2009, at 5:48 PM, Jeff Squyres wrote:

> On Dec 1, 2009, at 5:52 PM, Ralph Castain wrote:
>
>> > So perhaps it can become a param in the downcall to the MCA base as to
>> > whether the priority params should be automatically registered...?
>>
>> I can live with that, though I again question why anything needs to be
>> automatically registered. It sounds like we were lazy, and so now we have
>> things happening automatically that can confuse people.
>
> That pretty well sums it up. :-)

hehe

>> Question: if the system automatically registers a priority param, and
>> someone sets a priority with it, then what happens when the component
>> returns a different (possibly hardwired) value? Does MCA base ignore what
>> was set and use what was returned? I assume that is the case...
>
> If the component re-registers a priority param with a new default value, that
> new default becomes *the* default (overwriting the prior default value that
> was registered).
>
> Is that what you're asking?

Not exactly - I was more curious about the hardwired case since no param is involved. We just return a value. If the MCA selection logic is looking at param values and not what was returned, then we would have a problem. I'm thinking that isn't the case, though, as I would expect to see strange behavior if that happened.
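The behavior Ralph expects, selection driven by the value a component returns rather than by any registered parameter, can be illustrated with a toy C model. This is not the MCA selection code, and the thread does not confirm how the real logic works. Every name below (`component`, `query`, `param_priority`, `select_by_returned_priority`, and the two `hardwired_*` functions) is hypothetical.

```c
#include <stddef.h>

/* Toy component descriptor (hypothetical, not an Open MPI type). */
struct component {
    const char *name;
    int param_priority;  /* value a user might have set via a parameter */
    int (*query)(void);  /* priority the component itself returns */
};

/* Two components with hardwired priorities, as in the case Ralph describes. */
int hardwired_low(void)  { return 10; }
int hardwired_high(void) { return 50; }

/* Picks the component whose query() returns the highest priority.
 * param_priority is deliberately ignored, which models the assumption
 * that the returned value wins. Returns NULL for an empty list. */
const struct component *
select_by_returned_priority(const struct component *c, size_t n)
{
    const struct component *best = NULL;
    int best_pri = 0;

    for (size_t i = 0; i < n; i++) {
        int pri = c[i].query();
        if (best == NULL || pri > best_pri) {
            best = &c[i];
            best_pri = pri;
        }
    }
    return best;
}
```

In this model, setting `param_priority` to 100 on a component whose `query` returns 10 has no effect on the outcome, which is the kind of user confusion the thread describes for single-selection frameworks.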