Thanks, that helps. Now it actually starts to communicate with the
orterun process. It still fails, but I will try to fix it.

On Tue, Jan 21, 2014 at 12:27:55PM -0800, Ralph Castain wrote:
> That second argument is incorrect - it should be ORTE_PROC_IS_APP (note no 
> !). The problem is that orte-checkpoint is a tool, and so it isn't a daemon - 
> but it is also not an app.
> 
> 
> On Jan 21, 2014, at 11:56 AM, Adrian Reber <adr...@lisas.de> wrote:
> 
> > Good to know that it does not make any sense. So it is not just me.
> > 
> > Looking at the call chain I can see
> > 
> > orte_snapc_base_select(ORTE_PROC_IS_HNP, !ORTE_PROC_IS_DAEMON);
> > 
> > and the second parameter is used to decide if it is an app or not:
> > 
> > int orte_snapc_base_select(bool seed, bool app) in 
> > orte/mca/snapc/base/snapc_base_select.c
> > 
> > and if it is true the code with the barrier is used.
> > 
> > In orte/mca/snapc/base/snapc_base_select.c there is also the following
> > comment:
> > 
> > /* XXX -- TODO -- framework_subsytem -- this shouldn't be necessary once 
> > the framework system is in place */
> > 
> > Is this something which needs to be changed and which might be the cause
> > for this problem?
> > 
> > 
> > On Tue, Jan 21, 2014 at 07:27:32AM -0800, Ralph Castain wrote:
> >> That doesn't make any sense - I can't imagine a reason for orte-checkpoint 
> >> itself to be running a barrier. I wonder if it is selecting the wrong 
> >> component in snapc?
> >> 
> >> As for the patch, that isn't going to work. The collective id has to be 
> >> *globally* unique, which means that only orterun can issue a new one. So 
> >> you have to get thru orte_init before you can request one as it requires a 
> >> communication.
> >> 
> >> However, like I said, it makes no sense for orte-checkpoint to do a 
> >> barrier as it is a singleton - there is nothing for it to "barrier" with.
> >> 
> >> On Jan 21, 2014, at 7:24 AM, Adrian Reber <adr...@lisas.de> wrote:
> >> 
> >>> I think I still do not really understand how it works.
> >>> 
> >>> The barrier on which orte-checkpoint is currently hanging is in
> >>> app_coord_init(). You are also saying that orte-checkpoint
> >>> should not be calling a barrier. The backtrace of the point where it
> >>> is hanging now looks like:
> >>> 
> >>> #0  0x00007ffff69befa0 in __nanosleep_nocancel () at ../sysdeps/unix/syscall-template.S:81
> >>> #1  0x00007ffff7b45712 in app_coord_init () at ../../../../../orte/mca/snapc/full/snapc_full_app.c:208
> >>> #2  0x00007ffff7b3a5ce in orte_snapc_full_module_init (seed=false, app=true) at ../../../../../orte/mca/snapc/full/snapc_full_module.c:207
> >>> #3  0x00007ffff7b375de in orte_snapc_base_select (seed=false, app=true) at ../../../../orte/mca/snapc/base/snapc_base_select.c:96
> >>> #4  0x00007ffff7a9884a in orte_ess_base_tool_setup () at ../../../../orte/mca/ess/base/ess_base_std_tool.c:192
> >>> #5  0x00007ffff7a9fe85 in rte_init () at ../../../../../orte/mca/ess/tool/ess_tool_module.c:83
> >>> #6  0x00007ffff7a4647f in orte_init (pargc=0x7fffffffd94c, pargv=0x7fffffffd940, flags=8) at ../../orte/runtime/orte_init.c:158
> >>> #7  0x0000000000402859 in ckpt_init (argc=51, argv=0x7fffffffda78) at ../../../../orte/tools/orte-checkpoint/orte-checkpoint.c:610
> >>> #8  0x0000000000401d7a in main (argc=51, argv=0x7fffffffda78) at ../../../../orte/tools/orte-checkpoint/orte-checkpoint.c:245
> >>> 
> >>> Maybe I am doing something completely wrong. I am currently
> >>> running 'orterun -np 2 test-programm'.
> >>> 
> >>> In another terminal I am starting orte-checkpoint with the PID of
> >>> orterun and the barrier in app_coord_init() is just before it tries
> >>> to communicate with orterun. Is this the correct setup?
> >>> 
> >>>           Adrian
> >>> 
> >>> On Mon, Jan 20, 2014 at 05:33:59PM -0600, Josh Hursey wrote:
> >>>> If it is the application, then there is probably a barrier in
> >>>> app_coord_init() to make sure all the applications are up and running.
> >>>> After this point the global coordinator knows that the application
> >>>> can be checkpointed.
> >>>> 
> >>>> I don't think orte-checkpoint should be calling a barrier - from what I
> >>>> recall.
> >>>> 
> >>>> 
> >>>> On Mon, Jan 20, 2014 at 4:46 PM, Ralph Castain <r...@open-mpi.org> wrote:
> >>>> 
> >>>>> Is it orte-checkpoint that is hanging, or the app you are trying to
> >>>>> checkpoint?
> >>>>> 
> >>>>> 
> >>>>> On Jan 20, 2014, at 2:10 PM, Adrian Reber <adr...@lisas.de> wrote:
> >>>>> 
> >>>>> Thanks for your help. I tried initializing the barrier correctly (see
> >>>>> attached patch) but now, instead of crashing, it just hangs on the
> >>>>> barrier while running orte-checkpoint
> >>>>> 
> >>>>> [dcbz:20150] [[41665,0],0] grpcomm:bad entering barrier
> >>>>> [dcbz:20150] [[41665,0],0] ACTIVATING GRCPCOMM OP 0 at ../../../../../orte/mca/grpcomm/bad/grpcomm_bad_module.c:206
> >>>>> 
> >>>>> #0  0x00007ffff69befa0 in __nanosleep_nocancel () at ../sysdeps/unix/syscall-template.S:81
> >>>>> #1  0x00007ffff7b456ba in app_coord_init () at ../../../../../orte/mca/snapc/full/snapc_full_app.c:207
> >>>>> #2  0x00007ffff7b3a582 in orte_snapc_full_module_init (seed=false, app=true) at ../../../../../orte/mca/snapc/full/snapc_full_module.c:207
> >>>>> 
> >>>>> it hangs looping at ORTE_WAIT_FOR_COMPLETION(coll->active);
> >>>>> 
> >>>>> I do not understand what the barrier here is actually waiting for.
> >>>>> Where do I need to look to find what the barrier is waiting on?
> >>>>> 
> >>>>> I also tried initializing the collective ids in
> >>>>> orte/mca/plm/base/plm_base_launch_support.c, but that code is never
> >>>>> used when running the orte-checkpoint tool.
> >>>>> 
> >>>>> Adrian
> >>>>> 
> >>>>> On Sat, Jan 11, 2014 at 11:46:42AM -0800, Ralph Castain wrote:
> >>>>> 
> >>>>> I took a look at this, and I'm afraid you have some work to do in the
> >>>>> orte/mca/snapc code base:
> >>>>> 
> >>>>> 1. you must use dynamically allocated buffers for rml.send_buffer_nb.
> >>>>> See r30261 for an example of the changes that need to be made - I did
> >>>>> some, but can't swear to catching them all. It was enough to at least
> >>>>> get a proc past the initial snapc registration
> >>>>> 
> >>>>> 2. you are reusing collective id's to execute several
> >>>>> orte_grpcomm.barrier calls - those ids are used elsewhere during
> >>>>> MPI_Init. This is not allowed - a collective id can only be used
> >>>>> *once*. What you need to do is go into
> >>>>> orte/mca/plm/base/plm_base_launch_support.c and (when cr is configured)
> >>>>> add cr-specific collective id's for this purpose. I don't know how many
> >>>>> places in the cr code create their own barriers, but they each need a
> >>>>> collective id.
> >>>>> 
> >>>>> If you prefer and have the time, you are welcome to extend the
> >>>>> collective code to allow id reuse. This would require that each daemon
> >>>>> and app "reset" the collective fields when a collective is declared
> >>>>> complete. It isn't that hard to do - just never had a reason to do it.
> >>>>> I can take a shot at it when time permits (may have some time this
> >>>>> weekend)
> >>>>> 
> >>>>> 3. when you post the non-blocking recv in the snapc/full code, it looks
> >>>>> to me like you need to block until you get the answer. I don't know
> >>>>> where in the code flow this is occurring - if you are not in an event,
> >>>>> then it is okay to block using ORTE_WAIT_FOR_COMPLETION. Look in
> >>>>> orte/mca/routed/base/routed_base_fns.c starting at line 252 for an
> >>>>> example.
> >>>>> 
> >>>>> HTH
> >>>>> Ralph
> >>>>> 
> >>>>> On Jan 10, 2014, at 12:55 PM, Ralph Castain <r...@open-mpi.org> wrote:
> >>>>> 
> >>>>> 
> >>>>> On Jan 10, 2014, at 12:45 PM, Adrian Reber <adr...@lisas.de> wrote:
> >>>>> 
> >>>>> On Fri, Jan 10, 2014 at 09:48:14AM -0800, Ralph Castain wrote:
> >>>>> 
> >>>>> 
> >>>>> On Jan 10, 2014, at 8:02 AM, Adrian Reber <adr...@lisas.de> wrote:
> >>>>> 
> >>>>> I am currently trying to understand how callbacks are working. Right now
> >>>>> I am looking at orte/mca/rml/base/rml_base_receive.c
> >>>>> orte_rml_base_comm_start() which does
> >>>>> 
> >>>>> orte_rml.recv_buffer_nb(ORTE_NAME_WILDCARD,
> >>>>>                        ORTE_RML_TAG_RML_INFO_UPDATE,
> >>>>>                        ORTE_RML_PERSISTENT,
> >>>>>                        orte_rml_base_recv,
> >>>>>                        NULL);
> >>>>> 
> >>>>> As far as I understand it orte_rml_base_recv() is the callback function.
> >>>>> At which point should this function run? When the data is actually
> >>>>> received?
> >>>>> 
> >>>>> 
> >>>>> Not precisely. When data is received by the OOB, it pushes the data into
> >>>>> an event. When that event gets serviced, it calls the
> >>>>> orte_rml_base_receive function which processes the data to find the
> >>>>> matching tag, and then uses that to execute the callback to the user
> >>>>> code.
> >>>>> 
> >>>>> 
> >>>>> The same goes for the send_buffer_nb() functions. I do not see the
> >>>>> callback functions actually running. How can I verify that the callback
> >>>>> functions are running? Especially for the send case it sounds pretty
> >>>>> obvious how it should work, but I never see the callback function
> >>>>> running. At least in my setup.
> >>>>> 
> >>>>> 
> >>>>> The data is not immediately sent. It gets pushed into an event. When
> >>>>> that event gets serviced, it calls the orte_oob_base_send function which
> >>>>> then passes the data to each active OOB component until one of them says
> >>>>> it can send it. The data is then pushed into another event to get it
> >>>>> into the event base for that component's active module - when that event
> >>>>> gets serviced, the data is sent. Once the data is sent, an event is
> >>>>> created that, when serviced, executes the callback to the user code.
> >>>>> 
> >>>>> If you aren't seeing callbacks, the most likely cause is that the orte
> >>>>> progress thread isn't running. Without it, none of this will work.
> >>>>> 
> >>>>> 
> >>>>> Thanks. Running configure without '--with-ft=cr' I can run a program and
> >>>>> use orte-top. In orterun I can see that the callback is running and
> >>>>> orte-top displays the retrieved information. I can also see in orte-top
> >>>>> that the callbacks are working.
> >>>>> 
> >>>>> 
> >>>>> Actually, I'm rather impressed - I hadn't tested orte-top and didn't
> >>>>> honestly know if it would work any more! Glad to hear it does :-)
> >>>>> 
> >>>>> Doing the same with '--with-ft=cr' enabled, orte-top crashes, as does
> >>>>> orte-checkpoint. Both (-top and -checkpoint) seem to no longer have
> >>>>> working callbacks, which is probably why they are crashing. So some
> >>>>> code which is enabled by '--with-ft=cr' seems to break callbacks in
> >>>>> orte-top as well as in orte-checkpoint. orterun handles callbacks
> >>>>> whether or not it is configured with '--with-ft=cr'.
> >>>>> 
> >>>>> 
> >>>>> I can take a look this weekend - probably something silly
> >>>>> 
> >>>>> 
> >>>>> Adrian
> >>>>> 
> >>>>> <grpcomm.txt>
> >>>>> _______________________________________________
> >>>>> devel mailing list
> >>>>> de...@open-mpi.org
> >>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >>>>> 
> >>>> 
> >>>> 
> >>>> 
> >>>> -- 
> >>>> Joshua Hursey
> >>>> Assistant Professor of Computer Science
> >>>> University of Wisconsin-La Crosse
> >>>> http://cs.uwlax.edu/~jjhursey
> >>> 
> >>> 
> >>> 
> >>>           Adrian
> >>> 
> >>> -- 
> >>> Adrian Reber <adr...@lisas.de>            http://lisas.de/~adrian/
> >>> QOTD:
> >>>   "I tried buying a goat instead of a lawn tractor; had to return
> >>>   it though.  Couldn't figure out a way to connect the snow blower."
> > 
> >             Adrian
> > 
> > -- 
> > Adrian Reber <adr...@lisas.de>            http://lisas.de/~adrian/
> > Hempstone's Question:
> >     If you have to travel on the Titanic, why not go first class?

                Adrian

-- 
Adrian Reber <adr...@lisas.de>            http://lisas.de/~adrian/
Ummm, well, OK.  The network's the network, the computer's the computer.
Sorry for the confusion.
                -- Sun Microsystems
