Hi,
I'm trying to develop a BTL for a custom NIC. I studied the btl.h file
to understand the sequence of calls that my component is expected to
implement. To test my development I'm using a simple program (which
works like a charm with the TCP BTL); the code is just an MPI_Send +
MPI_Recv:
    #include <mpi.h>
    #include <stdio.h>

    int main(void)
    {
        MPI_Init(NULL, NULL);

        int world_rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
        int world_size;
        MPI_Comm_size(MPI_COMM_WORLD, &world_size);

        /* Intended for 2 ranks: each rank's partner is the other one. */
        int ping_pong_count = 1;
        int partner_rank = (world_rank + 1) % 2;
        printf("MY RANK: %d PARTNER: %d\n", world_rank, partner_rank);

        if (world_rank == 0) {
            /* Rank 0 increments the counter and sends it. */
            ping_pong_count++;
            MPI_Send(&ping_pong_count, 1, MPI_INT, partner_rank, 0,
                     MPI_COMM_WORLD);
            printf("%d sent and incremented ping_pong_count %d to %d\n",
                   world_rank, ping_pong_count, partner_rank);
        } else {
            /* The other rank blocks until the message arrives. */
            MPI_Recv(&ping_pong_count, 1, MPI_INT, partner_rank, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("%d received ping_pong_count %d from %d\n",
                   world_rank, ping_pong_count, partner_rank);
        }

        MPI_Finalize();
        return 0;
    }
I see that in my component's BTL code the functions called during the
MPI_Send phase are:
1. mca_btl_mycomp_add_procs
2. mca_btl_mycomp_prepare_src
3. mca_btl_mycomp_send (where I set the return value to 1, so the send
   phase should be finished)
I then see the output printed by the test:
    0 sent and incremented ping_pong_count 2 to 1
and this should conclude the MPI_Send phase.
Then, in the btl_mycomp_component_progress function, I implemented a
call to:
    mca_btl_active_message_callback_t *reg =
        mca_btl_base_active_message_trigger + tag;
    reg->cbfunc(&my_btl->super, &desc);
I saw the same code in all the other BTLs and thought this would be
enough to "unlock" the MPI_Recv polling. Instead my test hangs,
apparently waiting for something that never happens.
I also looked at the ob1 function mca_pml_ob1_recv_frag_callback_match
(which I assume is what reg->cbfunc points to), and it seems to run to
the end of the function, actually matching my frag.
So my question is: how can I tell the framework that I have finished my
work, so that the call can return to the user application? What am I
doing wrong?
Is there a way to find out where my code is waiting, and what it is
waiting for?
Best