Thanks, that helps. Now it actually starts to communicate with the orterun process. This still fails but I will try to fix it.
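For the archive, here is a minimal sketch of why the negated daemon check sent a tool down the app code path. It assumes process types can be modelled as simple bit flags; every `sketch_*` / `SKETCH_*` name is a hypothetical stand-in and not the real definition of ORTE_PROC_IS_HNP, ORTE_PROC_IS_DAEMON or ORTE_PROC_IS_APP.

```c
#include <stdbool.h>

/* Hypothetical process-type flags, only modelling the idea that a process
 * is exactly one of: HNP (orterun), daemon, app, or tool. */
typedef enum {
    SKETCH_PROC_HNP    = 1 << 0,
    SKETCH_PROC_DAEMON = 1 << 1,
    SKETCH_PROC_APP    = 1 << 2,
    SKETCH_PROC_TOOL   = 1 << 3
} sketch_proc_type_t;

bool sketch_is_daemon(sketch_proc_type_t t)
{
    return (t & SKETCH_PROC_DAEMON) != 0;
}

bool sketch_is_app(sketch_proc_type_t t)
{
    return (t & SKETCH_PROC_APP) != 0;
}

/* Old call site: the "app" argument was computed as "not a daemon".
 * A tool is not a daemon, so this yields true for a tool. */
bool sketch_app_arg_old(sketch_proc_type_t t)
{
    return !sketch_is_daemon(t);
}

/* Suggested call site: the "app" argument is true only for a real app,
 * so a tool no longer selects the code path containing the barrier. */
bool sketch_app_arg_fixed(sketch_proc_type_t t)
{
    return sketch_is_app(t);
}
```

With these stand-ins, a tool gets app=true from the old expression and app=false from the suggested one, which matches the app=true seen in the backtrace quoted below.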
On Tue, Jan 21, 2014 at 12:27:55PM -0800, Ralph Castain wrote:
> That second argument is incorrect - it should be ORTE_PROC_IS_APP (note no
> !). The problem is that orte-checkpoint is a tool, and so it isn't a daemon -
> but it is also not an app.
>
>
> On Jan 21, 2014, at 11:56 AM, Adrian Reber <adr...@lisas.de> wrote:
>
> > Good to know that it does not make any sense. So it is not just me.
> >
> > Looking at the call chain I can see
> >
> > orte_snapc_base_select(ORTE_PROC_IS_HNP, !ORTE_PROC_IS_DAEMON);
> >
> > and the second parameter is used to decide if it is an app or not:
> >
> > int orte_snapc_base_select(bool seed, bool app) in
> > orte/mca/snapc/base/snapc_base_select.c
> >
> > and if it is true the code with the barrier is used.
> >
> > In orte/mca/snapc/base/snapc_base_select.c there is also the following
> > comment:
> >
> > /* XXX -- TODO -- framework_subsytem -- this shouldn't be necessary once
> > the framework system is in place */
> >
> > Is this something which needs to be changed and which might be the cause
> > for this problem?
> >
> >
> > On Tue, Jan 21, 2014 at 07:27:32AM -0800, Ralph Castain wrote:
> >> That doesn't make any sense - I can't imagine a reason for orte-checkpoint
> >> itself to be running a barrier. I wonder if it is selecting the wrong
> >> component in snapc?
> >>
> >> As for the patch, that isn't going to work. The collective id has to be
> >> *globally* unique, which means that only orterun can issue a new one. So
> >> you have to get thru orte_init before you can request one as it requires a
> >> communication.
> >>
> >> However, like I said, it makes no sense for orte-checkpoint to do a
> >> barrier as it is a singleton - there is nothing for it to "barrier" with.
> >>
> >> On Jan 21, 2014, at 7:24 AM, Adrian Reber <adr...@lisas.de> wrote:
> >>
> >>> I think I still do not really understand how it works.
> >>>
> >>> The barrier on which orte-checkpoint is currently hanging is in
> >>> app_coord_init().
> >>> You are also saying that orte-checkpoint
> >>> should not be calling a barrier. The backtrace of the point where it
> >>> is hanging now looks like:
> >>>
> >>> #0 0x00007ffff69befa0 in __nanosleep_nocancel () at ../sysdeps/unix/syscall-template.S:81
> >>> #1 0x00007ffff7b45712 in app_coord_init () at ../../../../../orte/mca/snapc/full/snapc_full_app.c:208
> >>> #2 0x00007ffff7b3a5ce in orte_snapc_full_module_init (seed=false, app=true) at ../../../../../orte/mca/snapc/full/snapc_full_module.c:207
> >>> #3 0x00007ffff7b375de in orte_snapc_base_select (seed=false, app=true) at ../../../../orte/mca/snapc/base/snapc_base_select.c:96
> >>> #4 0x00007ffff7a9884a in orte_ess_base_tool_setup () at ../../../../orte/mca/ess/base/ess_base_std_tool.c:192
> >>> #5 0x00007ffff7a9fe85 in rte_init () at ../../../../../orte/mca/ess/tool/ess_tool_module.c:83
> >>> #6 0x00007ffff7a4647f in orte_init (pargc=0x7fffffffd94c, pargv=0x7fffffffd940, flags=8) at ../../orte/runtime/orte_init.c:158
> >>> #7 0x0000000000402859 in ckpt_init (argc=51, argv=0x7fffffffda78) at ../../../../orte/tools/orte-checkpoint/orte-checkpoint.c:610
> >>> #8 0x0000000000401d7a in main (argc=51, argv=0x7fffffffda78) at ../../../../orte/tools/orte-checkpoint/orte-checkpoint.c:245
> >>>
> >>> Maybe I am doing something completely wrong. I am currently
> >>> running 'orterun -np 2 test-programm'.
> >>>
> >>> In another terminal I am starting orte-checkpoint with the PID of
> >>> orterun and the barrier in app_coord_init() is just before it tries
> >>> to communicate with orterun. Is this the correct setup?
> >>>
> >>> Adrian
> >>>
> >>> On Mon, Jan 20, 2014 at 05:33:59PM -0600, Josh Hursey wrote:
> >>>> If it is the application, then there is probably a barrier in the
> >>>> app_coord_init() to make sure all the applications are up and running.
> >>>> After this point then the global coordinator knows that the application
> >>>> can be checkpointed.
> >>>>
> >>>> I don't think orte-checkpoint should be calling a barrier - from what I
> >>>> recall.
> >>>>
> >>>>
> >>>> On Mon, Jan 20, 2014 at 4:46 PM, Ralph Castain <r...@open-mpi.org> wrote:
> >>>>
> >>>>> Is it orte-checkpoint that is hanging, or the app you are trying to
> >>>>> checkpoint?
> >>>>>
> >>>>>
> >>>>> On Jan 20, 2014, at 2:10 PM, Adrian Reber <adr...@lisas.de> wrote:
> >>>>>
> >>>>> Thanks for your help. I tried initializing the barrier correctly (see
> >>>>> attached patch) but now, instead of crashing, it just hangs on the
> >>>>> barrier while running orte-checkpoint
> >>>>>
> >>>>> [dcbz:20150] [[41665,0],0] grpcomm:bad entering barrier
> >>>>> [dcbz:20150] [[41665,0],0] ACTIVATING GRCPCOMM OP 0 at
> >>>>> ../../../../../orte/mca/grpcomm/bad/grpcomm_bad_module.c:206
> >>>>>
> >>>>> #0 0x00007ffff69befa0 in __nanosleep_nocancel () at ../sysdeps/unix/syscall-template.S:81
> >>>>> #1 0x00007ffff7b456ba in app_coord_init () at ../../../../../orte/mca/snapc/full/snapc_full_app.c:207
> >>>>> #2 0x00007ffff7b3a582 in orte_snapc_full_module_init (seed=false, app=true) at ../../../../../orte/mca/snapc/full/snapc_full_module.c:207
> >>>>>
> >>>>> it hangs looping at ORTE_WAIT_FOR_COMPLETION(coll->active);
> >>>>>
> >>>>> I do not understand what the barrier here is actually waiting for.
> >>>>> Where do I need to look to find the place the barrier is waiting for?
> >>>>>
> >>>>> I also tried initializing the collective id's in
> >>>>> orte/mca/plm/base/plm_base_launch_support.c but that code is never
> >>>>> used running the orte-checkpoint tool
> >>>>>
> >>>>> Adrian
> >>>>>
> >>>>> On Sat, Jan 11, 2014 at 11:46:42AM -0800, Ralph Castain wrote:
> >>>>>
> >>>>> I took a look at this, and I'm afraid you have some work to do in the
> >>>>> orte/mca/snapc code base:
> >>>>>
> >>>>> 1. you must use dynamically allocated buffers for rml.send_buffer_nb.
> >>>>> See r30261 for an example of the changes that need to be made - I did
> >>>>> some, but can't swear to catching them all. It was enough to at least
> >>>>> get a proc past the initial snapc registration
> >>>>>
> >>>>> 2. you are reusing collective id's to execute several
> >>>>> orte_grpcomm.barrier calls - those ids are used elsewhere during
> >>>>> MPI_Init. This is not allowed - a collective id can only be used
> >>>>> *once*. What you need to do is go into
> >>>>> orte/mca/plm/base/plm_base_launch_support.c and (when cr is configured)
> >>>>> add cr-specific collective id's for this purpose. I don't know how many
> >>>>> places in the cr code create their own barriers, but they each need a
> >>>>> collective id.
> >>>>>
> >>>>> If you prefer and have the time, you are welcome to extend the
> >>>>> collective code to allow id reuse. This would require that each daemon
> >>>>> and app "reset" the collective fields when a collective is declared
> >>>>> complete. It isn't that hard to do - just never had a reason to do it.
> >>>>> I can take a shot at it when time permits (may have some time this
> >>>>> weekend)
> >>>>>
> >>>>> 3. when you post the non-blocking recv in the snapc/full code, it looks
> >>>>> to me like you need to block until you get the answer. I don't know
> >>>>> where in the code flow this is occurring - if you are not in an event,
> >>>>> then it is okay to block using ORTE_WAIT_FOR_COMPLETION. Look in
> >>>>> orte/mca/routed/base/routed_base_fns.c starting at line 252 for an
> >>>>> example.
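The "a collective id can only be used once" rule in point 2 above can be pictured with a small sketch. This is not the grpcomm implementation: the registry type and both functions are hypothetical, and a single in-process table stands in for the globally unique ids that only orterun can issue.

```c
#include <stdbool.h>

#define SKETCH_MAX_IDS 64

/* Hypothetical registry: ids are handed out in order and each one is
 * marked as used the first time a barrier consumes it. */
typedef struct {
    int  next_id;                 /* next id that has not been issued yet */
    bool used[SKETCH_MAX_IDS];    /* true once a barrier ran on this id   */
} sketch_coll_ids_t;

/* Issue a fresh id, or -1 when the table is exhausted. */
int sketch_coll_id_new(sketch_coll_ids_t *reg)
{
    if (reg->next_id >= SKETCH_MAX_IDS) {
        return -1;
    }
    return reg->next_id++;
}

/* Consume an id for one barrier. Returns false if the id was never
 * issued or has already been used, i.e. reuse is rejected. */
bool sketch_coll_id_consume(sketch_coll_ids_t *reg, int id)
{
    if (id < 0 || id >= reg->next_id) {
        return false;             /* never issued */
    }
    if (reg->used[id]) {
        return false;             /* second use of the same id */
    }
    reg->used[id] = true;
    return true;
}
```

In this model every place that wants its own barrier has to ask for its own id first; running a second barrier on an id that was already consumed fails instead of silently colliding with the earlier use.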
> >>>>>
> >>>>> HTH
> >>>>> Ralph
> >>>>>
> >>>>> On Jan 10, 2014, at 12:55 PM, Ralph Castain <r...@open-mpi.org> wrote:
> >>>>>
> >>>>>
> >>>>> On Jan 10, 2014, at 12:45 PM, Adrian Reber <adr...@lisas.de> wrote:
> >>>>>
> >>>>> On Fri, Jan 10, 2014 at 09:48:14AM -0800, Ralph Castain wrote:
> >>>>>
> >>>>>
> >>>>> On Jan 10, 2014, at 8:02 AM, Adrian Reber <adr...@lisas.de> wrote:
> >>>>>
> >>>>> I am currently trying to understand how callbacks are working. Right now
> >>>>> I am looking at orte/mca/rml/base/rml_base_receive.c
> >>>>> orte_rml_base_comm_start() which does
> >>>>>
> >>>>> orte_rml.recv_buffer_nb(ORTE_NAME_WILDCARD,
> >>>>>                         ORTE_RML_TAG_RML_INFO_UPDATE,
> >>>>>                         ORTE_RML_PERSISTENT,
> >>>>>                         orte_rml_base_recv,
> >>>>>                         NULL);
> >>>>>
> >>>>> As far as I understand it orte_rml_base_recv() is the callback function.
> >>>>> At which point should this function run? When the data is actually
> >>>>> received?
> >>>>>
> >>>>>
> >>>>> Not precisely. When data is received by the OOB, it pushes the data into
> >>>>> an event. When that event gets serviced, it calls the
> >>>>> orte_rml_base_receive function which processes the data to find the
> >>>>> matching tag, and then uses that to execute the callback to the user
> >>>>> code.
> >>>>>
> >>>>>
> >>>>> The same for send_buffer_nb() functions. I do not see the callback
> >>>>> functions actually running. How can I verify that the callback functions
> >>>>> are running. Especially for the send case it sounds pretty obvious how
> >>>>> it should work but I never see the callback function running. At least
> >>>>> in my setup.
> >>>>>
> >>>>>
> >>>>> The data is not immediately sent. It gets pushed into an event. When
> >>>>> that event gets serviced, it calls the orte_oob_base_send function which
> >>>>> then passes the data to each active OOB component until one of them says
> >>>>> it can send it.
> >>>>> The data is then pushed into another event to get it into the
> >>>>> event base for that component's active module - when that event gets
> >>>>> serviced, the data is sent. Once the data is sent, an event is created
> >>>>> that, when serviced, executes the callback to the user code.
> >>>>>
> >>>>> If you aren't seeing callbacks, the most likely cause is that the orte
> >>>>> progress thread isn't running. Without it, none of this will work.
> >>>>>
> >>>>>
> >>>>> Thanks. Running configure without '--with-ft=cr' I can run a program and
> >>>>> use orte-top. In orterun I can see that the callback is running and
> >>>>> orte-top displays the retrieved information. I can also see in orte-top
> >>>>> that the callbacks are working.
> >>>>>
> >>>>>
> >>>>> Actually, I'm rather impressed - I hadn't tested orte-top and didn't
> >>>>> honestly know if it would work any more! Glad to hear it does :-)
> >>>>>
> >>>>> Doing the same with '--with-ft=cr'
> >>>>> enabled orte-top crashes as well as orte-checkpoint and both (-top and
> >>>>> -checkpoint) seem to no longer have working callbacks and that is why
> >>>>> they are probably crashing. So some code which is enabled by
> >>>>> '--with-ft=cr'
> >>>>> seems to break callbacks in orte-top as well as in orte-checkpoint.
> >>>>> orterun handles callbacks no matter if configured with or without
> >>>>> '--with-ft=cr'.
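The send path Ralph describes above (queue an event, run the user callback only once that event is serviced) can be sketched roughly as follows. This is not ORTE or libevent code: all names are hypothetical, a fixed-size queue stands in for the event base, and an explicit call to sketch_progress() stands in for the progress thread.

```c
#include <stddef.h>

typedef void (*sketch_cb_t)(void *cbdata);

/* One pending event: the callback to run and its user data. */
typedef struct {
    sketch_cb_t cb;
    void       *cbdata;
} sketch_event_t;

#define SKETCH_QUEUE_CAP 16

/* Hypothetical event base: a ring buffer of pending events. */
typedef struct {
    sketch_event_t events[SKETCH_QUEUE_CAP];
    size_t head;   /* index of the next event to service */
    size_t tail;   /* index where the next event is stored */
} sketch_event_base_t;

/* Non-blocking "send": nothing is delivered here, the completion
 * event is only queued. Returns 0 on success, -1 if the queue is full. */
int sketch_send_nb(sketch_event_base_t *base, sketch_cb_t cb, void *cbdata)
{
    if (base->tail - base->head >= SKETCH_QUEUE_CAP) {
        return -1;
    }
    base->events[base->tail % SKETCH_QUEUE_CAP] = (sketch_event_t){ cb, cbdata };
    base->tail++;
    return 0;
}

/* Progress step: service every pending event and run its callback.
 * Returns the number of events serviced. */
size_t sketch_progress(sketch_event_base_t *base)
{
    size_t serviced = 0;
    while (base->head != base->tail) {
        sketch_event_t ev = base->events[base->head % SKETCH_QUEUE_CAP];
        base->head++;
        if (ev.cb != NULL) {
            ev.cb(ev.cbdata);
        }
        serviced++;
    }
    return serviced;
}

/* Example callback: counts how often it was invoked. */
void sketch_count_cb(void *cbdata)
{
    int *counter = cbdata;
    (*counter)++;
}
```

In this model a callback registered with sketch_send_nb() stays silent until sketch_progress() runs, which is the same symptom as a missing progress thread: the send "succeeds" but the callback never fires.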
> >>>>>
> >>>>>
> >>>>> I can take a look this weekend - probably something silly
> >>>>>
> >>>>>
> >>>>> Adrian
> >>>>>
> >>>>> <grpcomm.txt>
> >>>>> _______________________________________________
> >>>>> devel mailing list
> >>>>> de...@open-mpi.org
> >>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >>>>
> >>>> --
> >>>> Joshua Hursey
> >>>> Assistant Professor of Computer Science
> >>>> University of Wisconsin-La Crosse
> >>>> http://cs.uwlax.edu/~jjhursey
> >>>
> >>> Adrian
> >>>
> >>> --
> >>> Adrian Reber <adr...@lisas.de> http://lisas.de/~adrian/
> >>> QOTD:
> >>> "I tried buying a goat instead of a lawn tractor; had to return
> >>> it though. Couldn't figure out a way to connect the snow blower."
> >
> > Adrian
> >
> > --
> > Adrian Reber <adr...@lisas.de> http://lisas.de/~adrian/
> > Hempstone's Question:
> > If you have to travel on the Titanic, why not go first class?

Adrian

--
Adrian Reber <adr...@lisas.de> http://lisas.de/~adrian/
Ummm, well, OK.
The network's the network, the computer's the computer. Sorry for the confusion. -- Sun Microsystems