[OMPI devel] callback debugging

2014-01-10 Thread Adrian Reber
I am currently trying to understand how callbacks work. Right now
I am looking at orte_rml_base_comm_start() in
orte/mca/rml/base/rml_base_receive.c, which does

    orte_rml.recv_buffer_nb(ORTE_NAME_WILDCARD,
                            ORTE_RML_TAG_RML_INFO_UPDATE,
                            ORTE_RML_PERSISTENT,
                            orte_rml_base_recv,
                            NULL);

As far as I understand it orte_rml_base_recv() is the callback function.
At which point should this function run? When the data is actually
received?

The same goes for the send_buffer_nb() functions: I do not see the
callback functions actually running. How can I verify that they are
running? For the send case in particular it seems obvious how it should
work, but I never see the callback function run, at least not in my
setup.

Adrian


Re: [OMPI devel] callback debugging

2014-01-10 Thread Ralph Castain

On Jan 10, 2014, at 8:02 AM, Adrian Reber  wrote:

> I am currently trying to understand how callbacks are working. Right now
> I am looking at orte/mca/rml/base/rml_base_receive.c
> orte_rml_base_comm_start() which does 
> 
>     orte_rml.recv_buffer_nb(ORTE_NAME_WILDCARD,
>                             ORTE_RML_TAG_RML_INFO_UPDATE,
>                             ORTE_RML_PERSISTENT,
>                             orte_rml_base_recv,
>                             NULL);
> 
> As far as I understand it orte_rml_base_recv() is the callback function.
> At which point should this function run? When the data is actually
> received?

Not precisely. When data is received by the OOB, it pushes the data into an 
event. When that event gets serviced, it calls the orte_rml_base_receive 
function which processes the data to find the matching tag, and then uses that 
to execute the callback to the user code.

> 
> The same for send_buffer_nb() functions. I do not see the callback
> functions actually running. How can I verify that the callback functions
> are running. Especially for the send case it sounds pretty obvious how
> it should work but I never see the callback function running. At least
> in my setup.

The data is not immediately sent. It gets pushed into an event. When that event 
gets serviced, it calls the orte_oob_base_send function which then passes the 
data to each active OOB component until one of them says it can send it. The 
data is then pushed into another event to get it into the event base for that 
component's active module - when that event gets serviced, the data is sent. 
Once the data is sent, an event is created that, when serviced, executes the 
callback to the user code.

If you aren't seeing callbacks, the most likely cause is that the orte progress 
thread isn't running. Without it, none of this will work.
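
A minimal sketch of the deferred-callback flow described above, assuming a single in-memory event queue. All names here (event_base_t, demo_send_nb, demo_progress, count_cb) are hypothetical stand-ins for illustration, not ORTE or libevent APIs. It only shows why a non-blocking send returns before its callback runs, and why the callback never fires if nothing services the queue.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; none of these are ORTE APIs. */
typedef void (*send_cb_t)(int status, void *cbdata);

typedef struct {
    const char *payload;  /* data queued for sending */
    send_cb_t   cb;       /* user callback to run once the send is done */
    void       *cbdata;   /* opaque pointer handed back to the callback */
} event_t;

#define MAX_EVENTS 16

typedef struct {
    event_t events[MAX_EVENTS];
    size_t  head;  /* next event to service */
    size_t  tail;  /* next free slot */
} event_base_t;

/* Non-blocking send: records an event and returns immediately.
 * The callback is NOT invoked here. Returns 0 on success, -1 if full. */
static int demo_send_nb(event_base_t *base, const char *payload,
                        send_cb_t cb, void *cbdata)
{
    assert(base != NULL);
    if (base->tail >= MAX_EVENTS) {
        return -1;
    }
    event_t *ev = &base->events[base->tail++];
    ev->payload = payload;
    ev->cb = cb;
    ev->cbdata = cbdata;
    return 0;
}

/* One pass of the progress loop: service every pending event and run
 * its callback. Returns the number of events serviced. */
static int demo_progress(event_base_t *base)
{
    int serviced = 0;
    while (base->head < base->tail) {
        event_t *ev = &base->events[base->head++];
        /* A real transport would transmit ev->payload at this point. */
        if (ev->cb != NULL) {
            ev->cb(0, ev->cbdata);
        }
        serviced++;
    }
    return serviced;
}

/* Example callback: counts how often it was invoked. */
static void count_cb(int status, void *cbdata)
{
    (void)status;
    (*(int *)cbdata)++;
}
```

In this sketch, calling demo_send_nb() with count_cb leaves the counter untouched until demo_progress() runs. If the progress loop is never driven, the callback is never seen, which is the symptom discussed in this thread.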

> 
>   Adrian
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel



Re: [OMPI devel] callback debugging

2014-01-10 Thread Adrian Reber
Thanks. When I run configure without '--with-ft=cr', I can run a program
and use orte-top. In orterun I can see that the callback is running, and
orte-top displays the retrieved information. I can also see in orte-top
that the callbacks are working. With '--with-ft=cr' enabled, both
orte-top and orte-checkpoint crash. Neither tool seems to have working
callbacks any more, which is probably why they crash. So some code that
is enabled by '--with-ft=cr' seems to break callbacks in orte-top and
orte-checkpoint, while orterun handles callbacks whether or not it is
configured with '--with-ft=cr'.

Adrian


Re: [OMPI devel] callback debugging

2014-01-10 Thread Ralph Castain

On Jan 10, 2014, at 12:45 PM, Adrian Reber  wrote:

> Thanks. Running configure without '--with-ft=cr' I can run a program and
> use orte-top. In orterun I can see that the callback is running and
> orte-top displays the retrieved information. I can also see in orte-top
> that the callbacks are working.

Actually, I'm rather impressed - I hadn't tested orte-top and didn't honestly 
know if it would work any more! Glad to hear it does :-)

> Doing the same with '--with-ft=cr'
> enabled orte-top crashes as well as orte-checkpoint and both (-top and
> -checkpoint) seem to no longer have working callbacks and that is why
> they are probably crashing. So some code which is enabled by '--with-ft=cr'
> seems to break callbacks in orte-top as well as in orte-checkpoint.
> orterun handles callbacks no matter if configured with or without
> '--with-ft=cr'.

I can take a look this weekend - probably something silly




Re: [OMPI devel] callback debugging

2014-01-11 Thread Ralph Castain
I took a look at this, and I'm afraid you have some work to do in the 
orte/mca/snapc code base:

1. You must use dynamically allocated buffers for rml.send_buffer_nb. See
r30261 for an example of the changes that need to be made. I did some, but
can't swear to catching them all. It was enough to at least get a proc past the
initial snapc registration.

2. You are reusing collective ids to execute several orte_grpcomm.barrier
calls, and those ids are used elsewhere during MPI_Init. This is not allowed: a
collective id can only be used *once*. What you need to do is go into
orte/mca/plm/base/plm_base_launch_support.c and (when cr is configured) add
cr-specific collective ids for this purpose. I don't know how many places in
the cr code create their own barriers, but they each need a collective id.

If you prefer and have the time, you are welcome to extend the collective code
to allow id reuse. This would require that each daemon and app "reset" the
collective fields when a collective is declared complete. It isn't that hard to
do; I just never had a reason to do it. I can take a shot at it when time
permits (I may have some time this weekend).

3. When you post the non-blocking recv in the snapc/full code, it looks to me
like you need to block until you get the answer. I don't know where in the code
flow this is occurring. If you are not in an event, then it is okay to block
using ORTE_WAIT_FOR_COMPLETION. Look in orte/mca/routed/base/routed_base_fns.c
starting at line 252 for an example.

HTH
Ralph




Re: [OMPI devel] callback debugging

2014-01-20 Thread Adrian Reber
Thanks for your help. I tried initializing the barrier correctly (see
attached patch), but now, instead of crashing, it just hangs on the
barrier while running orte-checkpoint:

[dcbz:20150] [[41665,0],0] grpcomm:bad entering barrier
[dcbz:20150] [[41665,0],0] ACTIVATING GRCPCOMM OP 0 at ../../../../../orte/mca/grpcomm/bad/grpcomm_bad_module.c:206

#0  0x769befa0 in __nanosleep_nocancel () at ../sysdeps/unix/syscall-template.S:81
#1  0x77b456ba in app_coord_init () at ../../../../../orte/mca/snapc/full/snapc_full_app.c:207
#2  0x77b3a582 in orte_snapc_full_module_init (seed=false, app=true) at ../../../../../orte/mca/snapc/full/snapc_full_module.c:207

It hangs, looping at ORTE_WAIT_FOR_COMPLETION(coll->active);

I do not understand what the barrier is actually waiting for here. Where
do I need to look to find out what it is waiting on?

I also tried initializing the collective ids in
orte/mca/plm/base/plm_base_launch_support.c, but that code is never
executed when running the orte-checkpoint tool.

Adrian


Re: [OMPI devel] callback debugging

2014-01-20 Thread Ralph Castain
Is it orte-checkpoint that is hanging, or the app you are trying to checkpoint?



Re: [OMPI devel] callback debugging

2014-01-20 Thread Josh Hursey
If it is the application, then there is probably a barrier in
app_coord_init() to make sure all the applications are up and running.
After that point, the global coordinator knows that the application can
be checkpointed.

From what I recall, orte-checkpoint should not be calling a barrier.



Re: [OMPI devel] callback debugging

2014-01-21 Thread Adrian Reber
It is orte-checkpoint that hangs, before it communicates with orterun,
which runs the processes I am trying to checkpoint. The full backtrace:

#0  0x769befa0 in __nanosleep_nocancel () at ../sysdeps/unix/syscall-template.S:81
#1  0x77b45712 in app_coord_init () at ../../../../../orte/mca/snapc/full/snapc_full_app.c:208
#2  0x77b3a5ce in orte_snapc_full_module_init (seed=false, app=true) at ../../../../../orte/mca/snapc/full/snapc_full_module.c:207
#3  0x77b375de in orte_snapc_base_select (seed=false, app=true) at ../../../../orte/mca/snapc/base/snapc_base_select.c:96
#4  0x77a9884a in orte_ess_base_tool_setup () at ../../../../orte/mca/ess/base/ess_base_std_tool.c:192
#5  0x77a9fe85 in rte_init () at ../../../../../orte/mca/ess/tool/ess_tool_module.c:83
#6  0x77a4647f in orte_init (pargc=0x7fffd94c, pargv=0x7fffd940, flags=8) at ../../orte/runtime/orte_init.c:158
#7  0x00402859 in ckpt_init (argc=51, argv=0x7fffda78) at ../../../../orte/tools/orte-checkpoint/orte-checkpoint.c:610
#8  0x00401d7a in main (argc=51, argv=0x7fffda78) at ../../../../orte/tools/orte-checkpoint/orte-checkpoint.c:245



Re: [OMPI devel] callback debugging

2014-01-21 Thread Adrian Reber
I think I still do not really understand how it works.

The barrier on which orte-checkpoint is currently hanging is in
app_coord_init(). You are also saying that orte-checkpoint
should not be calling a barrier. The backtrace of the point where it
is hanging now looks like:

#0  0x769befa0 in __nanosleep_nocancel () at ../sysdeps/unix/syscall-template.S:81
#1  0x77b45712 in app_coord_init () at ../../../../../orte/mca/snapc/full/snapc_full_app.c:208
#2  0x77b3a5ce in orte_snapc_full_module_init (seed=false, app=true) at ../../../../../orte/mca/snapc/full/snapc_full_module.c:207
#3  0x77b375de in orte_snapc_base_select (seed=false, app=true) at ../../../../orte/mca/snapc/base/snapc_base_select.c:96
#4  0x77a9884a in orte_ess_base_tool_setup () at ../../../../orte/mca/ess/base/ess_base_std_tool.c:192
#5  0x77a9fe85 in rte_init () at ../../../../../orte/mca/ess/tool/ess_tool_module.c:83
#6  0x77a4647f in orte_init (pargc=0x7fffd94c, pargv=0x7fffd940, flags=8) at ../../orte/runtime/orte_init.c:158
#7  0x00402859 in ckpt_init (argc=51, argv=0x7fffda78) at ../../../../orte/tools/orte-checkpoint/orte-checkpoint.c:610
#8  0x00401d7a in main (argc=51, argv=0x7fffda78) at ../../../../orte/tools/orte-checkpoint/orte-checkpoint.c:245

Maybe I am doing something completely wrong. I am currently
running 'orterun -np 2 test-programm'.

In another terminal I am starting orte-checkpoint with the PID of
orterun and the barrier in app_coord_init() is just before it tries
to communicate with orterun. Is this the correct setup?

Adrian
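
Frame #0 sitting in __nanosleep_nocancel is what one would expect from a
polling wait on a completion flag. As a rough illustration only, here is a
toy model of such a wait. It is not the actual ORTE_WAIT_FOR_COMPLETION
macro; every toy_* name is hypothetical, and the poll limit exists only so
the sketch terminates instead of hanging.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L  /* for nanosleep() */
#endif
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical stand-in for a collective; not an ORTE type. */
typedef struct {
    volatile bool active;  /* stays true until every participant arrived */
    int expected;          /* number of participants the barrier needs */
    int arrived;           /* number of participants seen so far */
} toy_collective_t;

/* Record one participant; clear the flag once all have arrived. */
static void toy_arrive(toy_collective_t *coll)
{
    assert(coll != NULL);
    coll->arrived++;
    if (coll->arrived >= coll->expected) {
        coll->active = false;
    }
}

/* Poll the flag with a short sleep between checks.
 * Returns true if the collective completed within max_polls checks. */
static bool toy_wait_for_completion(toy_collective_t *coll, int max_polls)
{
    struct timespec pause = { 0, 100000 };  /* 100 microseconds */
    assert(coll != NULL);
    for (int i = 0; i < max_polls && coll->active; i++) {
        nanosleep(&pause, NULL);
    }
    return !coll->active;
}
```

In this model a process that enters a barrier expecting two participants,
while only it ever arrives, never sees the flag cleared. Without the poll
limit it would sit in nanosleep indefinitely, much like the backtrace above.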

On Mon, Jan 20, 2014 at 05:33:59PM -0600, Josh Hursey wrote:
> If it is the application, then there is probably a barrier in the
> app_coord_init() to make sure all the applications are up and running.
> After this point then the global coordinator knows that the application can
> be checkpointed.
> 
> I don't think orte-checkpoint should be calling a barrier - from what I
> recall.
> 
> 
> On Mon, Jan 20, 2014 at 4:46 PM, Ralph Castain  wrote:
> 
> > Is it orte-checkpoint that is hanging, or the app you are trying to
> > checkpoint?
> >
> >
> > On Jan 20, 2014, at 2:10 PM, Adrian Reber  wrote:
> >
> > Thanks for your help. I tried initializing the barrier correctly (see
> > attached patch) but now, instead of crashing, it just hangs on the
> > barrier while running orte-checkpoint
> >
> > [dcbz:20150] [[41665,0],0] grpcomm:bad entering barrier
> > [dcbz:20150] [[41665,0],0] ACTIVATING GRCPCOMM OP 0 at
> > ../../../../../orte/mca/grpcomm/bad/grpcomm_bad_module.c:206
> >
> > #0  0x769befa0 in __nanosleep_nocancel () at
> > ../sysdeps/unix/syscall-template.S:81
> > #1  0x77b456ba in app_coord_init () at
> > ../../../../../orte/mca/snapc/full/snapc_full_app.c:207
> > #2  0x77b3a582 in orte_snapc_full_module_init (seed=false,
> > app=true) at ../../../../../orte/mca/snapc/full/snapc_full_module.c:207
> >
> > it hangs looping at ORTE_WAIT_FOR_COMPLETION(coll->active);
> >
> > I do not understand on what the barrier here is actually waiting for. Where
> > do I need to look to find the place the barrier is waiting for?
> >
> > I also tried initializing the collective id's in
> > orte/mca/plm/base/plm_base_launch_support.c but that code is never
> > used running the orte-checkpoint tool
> >
> >  Adrian
> >

Re: [OMPI devel] callback debugging

2014-01-21 Thread Ralph Castain
That doesn't make any sense - I can't imagine a reason for orte-checkpoint 
itself to be running a barrier. I wonder if it is selecting the wrong component 
in snapc?

As for the patch, that isn't going to work. The collective id has to be 
*globally* unique, which means that only orterun can issue a new one. So you 
have to get thru orte_init before you can request one as it requires a 
communication.

However, like I said, it makes no sense for orte-checkpoint to do a barrier as 
it is a singleton - there is nothing for it to "barrier" with.
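
To make the "globally unique, issued by one place, used once" rule concrete,
here is a small sketch under my own assumptions. It models a single
coordinator (standing in for orterun) that hands out ids and rejects reuse.
None of these toy_* names exist in ORTE, and the real request involves a
message exchange that is not modelled here.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define TOY_MAX_IDS 64

/* Hypothetical single authority that issues collective ids. */
typedef struct {
    int next_id;                 /* next id that has not been issued yet */
    bool consumed[TOY_MAX_IDS];  /* true once an id has been used */
} toy_coordinator_t;

static void toy_coordinator_init(toy_coordinator_t *c)
{
    assert(c != NULL);
    memset(c, 0, sizeof(*c));
}

/* Hand out a fresh id, or -1 when the toy table is exhausted. */
static int toy_issue_id(toy_coordinator_t *c)
{
    assert(c != NULL);
    if (c->next_id >= TOY_MAX_IDS) {
        return -1;
    }
    return c->next_id++;
}

/* Mark an id as used. Returns false if the id was never issued
 * or has already been used once. */
static bool toy_consume_id(toy_coordinator_t *c, int id)
{
    assert(c != NULL);
    if (id < 0 || id >= c->next_id || c->consumed[id]) {
        return false;
    }
    c->consumed[id] = true;
    return true;
}
```

Because only the coordinator increments next_id, two callers can never hold
the same id, and an id a process makes up locally is rejected as never
issued.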


Re: [OMPI devel] callback debugging

2014-01-21 Thread Adrian Reber
Good to know that it does not make any sense. So it is not just me.

Looking at the call chain I can see

orte_snapc_base_select(ORTE_PROC_IS_HNP, !ORTE_PROC_IS_DAEMON);

and the second parameter is used to decide if it is an app or not:

int orte_snapc_base_select(bool seed, bool app) in 
orte/mca/snapc/base/snapc_base_select.c

and if it is true the code with the barrier is used.

In orte/mca/snapc/base/snapc_base_select.c there is also the following
comment:

/* XXX -- TODO -- framework_subsytem -- this shouldn't be necessary once the framework system is in place */

Is this something which needs to be changed and which might be the cause
for this problem?



Re: [OMPI devel] callback debugging

2014-01-21 Thread Ralph Castain
That second argument is incorrect - it should be ORTE_PROC_IS_APP (note no !). 
The problem is that orte-checkpoint is a tool, and so it isn't a daemon - but 
it is also not an app.
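
A minimal sketch of why the negated test goes wrong, assuming a simplified
model with just three process types. The toy_* names are hypothetical
stand-ins, not the real ORTE_PROC_IS_* macros, which test bits in the
process type rather than an enum.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified process types for illustration only. */
typedef enum {
    TOY_PROC_DAEMON,
    TOY_PROC_APP,
    TOY_PROC_TOOL
} toy_proc_type_t;

static bool toy_is_daemon(toy_proc_type_t t) { return t == TOY_PROC_DAEMON; }
static bool toy_is_app(toy_proc_type_t t)    { return t == TOY_PROC_APP; }

/* Old form of the second argument: "anything that is not a daemon". */
static bool toy_app_flag_negated(toy_proc_type_t t)
{
    return !toy_is_daemon(t);
}

/* Suggested form of the second argument: "is an app". */
static bool toy_app_flag_direct(toy_proc_type_t t)
{
    return toy_is_app(t);
}
```

For a daemon or an app the two forms agree. For a tool the negated form
yields true, so the tool takes the app path with the barrier, while the
direct form yields false.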



Re: [OMPI devel] callback debugging

2014-01-21 Thread Adrian Reber
Thanks, that helps. Now it actually starts to communicate with the
orterun process. This still fails but I will try to fix it.
