Re: [OMPI devel] Rationale behind memcpy chunk size (in smsc/xpmem)

2022-08-03 Thread Jeff Squyres (jsquyres) via devel
Sorry for the delay in replies -- it's summer / vacation season, and I think we 
(as a community) are a little behind in answering some of these emails.  :-(

It's hard to say for any given machine, but a bunch of different hardware 
factors can come into play, such as:

- L1, L2, L3 cache sizes
- Cache contention
- Memory controller connectivity and locality

I.e., exactly which hardware resources are the memcpy()'s in question using, 
and how do they interact with each other?  How much overhead is produced, 
and/or how much contention ensues when multiple requests are in flight 
simultaneously?  For example, it may be counter-intuitive, but sometimes 
injecting a small amount of delay in a software pipeline keeps hardware 
resources from becoming overwhelmed, so the overall execution becomes more 
efficient and consumes less wall-clock time.  Hence, doing 2 x 1MB memcpy()'s 
(to effect a 2MB MPI_Send) may actually be more efficient overall, even though 
the individual parts of the transaction are less efficient.  This is a complete 
guess, and may have nothing to do with your system, but it's one of many 
possibilities.
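
To make the "2 x 1MB instead of 1 x 2MB" idea concrete, here is a minimal sketch of a chunked copy. It is not the mca_smsc_xpmem_memmove() code; the name chunked_copy and its signature are assumptions made purely for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not the Open MPI implementation): copy `len` bytes
 * from `src` to `dst`, never passing more than `chunk` bytes to a single
 * memcpy() call. */
void chunked_copy(void *dst, const void *src, size_t len, size_t chunk)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    assert(chunk > 0);
    while (len > 0) {
        size_t n = len < chunk ? len : chunk;  /* last chunk may be short */
        memcpy(d, s, n);
        d += n;
        s += n;
        len -= n;
    }
}
```

With len = 2 MB and chunk = 1 MB this issues two memcpy() calls; with chunk = 2 MB it issues one. The bytes copied are identical either way, so any timing difference comes from how the hardware and the memcpy() implementation handle the two call patterns.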

Another possible factor: the specific memcpy() implementation is highly 
relevant.  It's been a few years since I've paid close attention to memcpy(), 
but at one time, there was significant variation in the quality of memcpy() 
implementations between different compilers and/or versions of libc.  I don't 
know if this is still a factor, or whether memcpy() is pretty well optimized in 
most situations these days.  Additionally, alignment can be an issue (although 
for message sizes of 2MB, I'm guessing your buffer is page-aligned, and this 
probably isn't an issue).

All that being said, I'm not intimately familiar with the internals of XPMEM, 
so I don't know what userspace/kernel space mechanisms will come into play for 
mapping the shared memory (e.g., is it lazily mapping the shared memory?).

Also, you're probably doing this already, but these kinds of things are worth 
mentioning: make sure your performance benchmarks are testing the right things: 
do warmup transfers, make sure you're not swapping, make sure all the processes 
and memory are pinned properly, make sure you're on an otherwise-quiet machine, 
... etc.  All the Usual Benchmarking Things.
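
As a rough illustration of the warmup point, the sketch below times a chunked copy and only starts the clock after a number of untimed warmup iterations. The function name and parameters are hypothetical, and it assumes a POSIX system that provides clock_gettime(); it does not handle pinning or swapping, which have to be arranged outside the program.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Hypothetical benchmark helper: returns the average number of seconds per
 * copy of `len` bytes, done in memcpy() calls of at most `chunk` bytes.
 * The first `warmup` iterations are executed but not timed. */
double time_chunked_copy(void *dst, const void *src, size_t len,
                         size_t chunk, int warmup, int iters)
{
    struct timespec t0 = {0, 0}, t1;

    assert(chunk > 0 && warmup >= 0 && iters > 0);
    for (int i = 0; i < warmup + iters; i++) {
        if (i == warmup)
            clock_gettime(CLOCK_MONOTONIC, &t0);  /* start after warmup */
        for (size_t off = 0; off < len; off += chunk) {
            size_t n = len - off < chunk ? len - off : chunk;
            memcpy((char *)dst + off, (const char *)src + off, n);
        }
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return ((double)(t1.tv_sec - t0.tv_sec) +
            (double)(t1.tv_nsec - t0.tv_nsec) * 1e-9) / iters;
}
```

Calling this for several chunk sizes on the same buffers gives a first-order comparison, but it only measures a local memcpy(); it says nothing about XPMEM mapping costs.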

--
Jeff Squyres
jsquy...@cisco.com


From: devel on behalf of Giorgos Katevainis via devel
Sent: Thursday, July 28, 2022 9:33 AM
To: Open MPI Developers
Cc: Giorgos Katevainis
Subject: [OMPI devel] Rationale behind memcpy chunk size (in smsc/xpmem)

Hello all,

I've come across the "memcpy_chunk_size" MCA parameter in smsc/xpmem, which 
effectively causes memory copies to take place in chunks (used in 
mca_smsc_xpmem_memmove()). The comment reads:

"Maximum size to copy with a single call to memcpy. On some systems a smaller 
or larger number may provide better performance (default: 256k)"

And I have indeed observed a performance difference by adjusting it! E.g., in a 
simple point-to-point test, 2 MB messages do significantly better with the 
parameter set to 1 MB vs. 2 MB. But... why? I suppose I could imagine a memcpy 
of larger size being more efficient, but what would cause many small ones to 
end up being quicker than a single large one? Might it have something to do 
with memcpy intrinsics and different implementations for different sizes?

If someone knows what's going on under the hood and/or could direct me to any 
relevant resources, I would greatly appreciate it!

George


Re: [OMPI devel] How to progress MPI_Recv using custom BTL for NIC under development

2022-08-03 Thread Jeff Squyres (jsquyres) via devel
Sorry for the huge delay in replies -- it's summer / vacation season, and I 
think we (as a community) are a little behind in answering some of these 
emails.  :-(

It's been quite a while since I have been in the depths of BTL internals; I'm 
afraid I don't remember the details offhand.

When I was writing the usnic BTL, I know I found it useful to attach a debugger 
on the sending and/or receiving side processes, and actually step through both 
my BTL code and the OB1 PML code to see what was happening.  I frequently found 
that either my BTL wasn't correctly accounting for network conditions, or it 
wasn't passing up the information that OB1 expected (e.g., it passed the 
wrong length, or the wrong ID number, or ...something else).  You can actually 
follow what happens in OB1 when your BTL invokes the cbfunc -- does it find a 
corresponding MPI_Request, and does it mark it complete?  Or does it treat your 
incoming fragment as an unexpected message for some reason, and put it on the 
unexpected queue?  Look for that kind of stuff.

-- 
Jeff Squyres
jsquy...@cisco.com


From: devel on behalf of Michele Martinelli via devel
Sent: Saturday, July 23, 2022 9:04 AM
To: devel@lists.open-mpi.org
Cc: Michele Martinelli
Subject: [OMPI devel] How to progress MPI_Recv using custom BTL for NIC under 
development

Hi,

I'm trying to develop a BTL for a custom NIC. I studied the btl.h file
to understand the flow of calls that my component is expected to
implement. To test my development I'm using a simple test (which works like
a charm with the TCP BTL); the code is a simple MPI_Send + MPI_Recv:

   #include <mpi.h>
   #include <stdio.h>

   int main(void)
   {
       MPI_Init(NULL, NULL);
       int world_rank;
       MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
       int world_size;
       MPI_Comm_size(MPI_COMM_WORLD, &world_size);
       int ping_pong_count = 1;
       /* Assumes the job runs with exactly two ranks (0 and 1). */
       int partner_rank = (world_rank + 1) % 2;
       printf("MY RANK: %d PARTNER: %d\n", world_rank, partner_rank);
       if (world_rank == 0) {
           /* Rank 0 increments the counter and sends it to its partner. */
           ping_pong_count++;
           MPI_Send(&ping_pong_count, 1, MPI_INT, partner_rank, 0,
                    MPI_COMM_WORLD);
           printf("%d sent and incremented ping_pong_count %d to %d\n",
                  world_rank, ping_pong_count, partner_rank);
       } else {
           /* The other rank blocks until the counter arrives. */
           MPI_Recv(&ping_pong_count, 1, MPI_INT, partner_rank, 0,
                    MPI_COMM_WORLD, MPI_STATUS_IGNORE);
           printf("%d received ping_pong_count %d from %d\n",
                  world_rank, ping_pong_count, partner_rank);
       }
       MPI_Finalize();
       return 0;
   }

I see that in my component's BTL code the functions called during the
"MPI_Send" phase are:

  1. mca_btl_mycomp_add_procs
  2. mca_btl_mycomp_prepare_src
  3. mca_btl_mycomp_send (where I return 1, so the send phase
     should be finished)

I see then the print inside the test:

 0 sent and incremented ping_pong_count 2 to 1

and this should conclude the MPI_Send phase.
Then I implemented in the btl_mycomp_component_progress function a call to:

 mca_btl_active_message_callback_t *reg =
mca_btl_base_active_message_trigger + tag;
 reg->cbfunc(&my_btl->super, &desc);

I saw the same code in all the other BTLs and I thought this was enough
to "unlock" the MPI_Recv "polling". But actually I see my test hangs,
probably "waiting" for something that never happens (?).

I also took a look at the ob1 mca_pml_ob1_recv_frag_callback_match
function (which I assume is the reg->cbfunc), and it seems to get to
the end of the function, actually matching my frag.

So my question is: how can I tell the framework that I have finished my
work, so that the function can return to the user application? What am I
doing wrong?
Is there a way to find out where my code is waiting, and what for?


Best



Re: [OMPI devel] How to progress MPI_Recv using custom BTL for NIC under development

2022-08-03 Thread Michele Martinelli via devel
Thank you for the answer. Actually, I think I solved that problem some 
days ago. Basically (if I understand correctly), MPI "adds", in some sense, 
a header to the data sent (please correct me if I'm wrong), which ob1 
then uses to match the data that arrived with the MPI_Recv posted by 
the user. The problem was a poorly reconstructed header on the 
receiving side.


Unfortunately my happiness didn't last long, because I have already found 
another problem: it seems that the peers are not actually exchanging the 
correct information via the modex protocol (I'm not sure which kind of 
network connection they use in that phase); they receive "local" data 
instead of the remote data. I have only just started debugging this, so 
maybe I could open a new thread specifically about it.
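
To illustrate why a badly reconstructed header prevents a match, here is a minimal sketch of packing a match header on the sender, unpacking it on the receiver, and comparing it with a posted receive. The struct, its fields, and the function names are assumptions for illustration only; they are not ob1's actual header layout. The sketch also assumes both peers have the same endianness.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical match header; the field set is an assumption, not ob1's. */
typedef struct {
    uint16_t ctx;  /* communicator context id */
    int32_t  src;  /* rank of the sender */
    int32_t  tag;  /* MPI tag */
} demo_hdr_t;

enum { DEMO_HDR_WIRE_SIZE = 10 };  /* 2 + 4 + 4 bytes, no padding */

/* Serialize field by field so struct padding never reaches the wire. */
void demo_hdr_pack(unsigned char *wire, const demo_hdr_t *h)
{
    memcpy(wire, &h->ctx, 2);
    memcpy(wire + 2, &h->src, 4);
    memcpy(wire + 6, &h->tag, 4);
}

/* The receiver must read the fields back from the same offsets. */
void demo_hdr_unpack(demo_hdr_t *h, const unsigned char *wire)
{
    memcpy(&h->ctx, wire, 2);
    memcpy(&h->src, wire + 2, 4);
    memcpy(&h->tag, wire + 6, 4);
}

/* Returns 1 if the incoming header satisfies the posted receive. */
int demo_hdr_matches(const demo_hdr_t *in, uint16_t ctx, int32_t src,
                     int32_t tag)
{
    return in->ctx == ctx && in->src == src && in->tag == tag;
}
```

If the receiving side reads any field from the wrong offset or with the wrong width, the comparison fails and the fragment is never matched to the posted receive.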


Michele



Re: [OMPI devel] How to progress MPI_Recv using custom BTL for NIC under development

2022-08-03 Thread Jeff Squyres (jsquyres) via devel
Glad you solved the first issue!

With respect to debugging, if you don't have a parallel debugger, you can do 
something like this: 
https://www.open-mpi.org/faq/?category=debugging#serial-debuggers
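
The serial-debugger technique boils down to having each process print its PID and then spin until a debugger changes a variable. A minimal sketch of that pattern follows; the function name wait_for_debugger and the flag parameter are illustrative choices, and it assumes a POSIX system.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/* Print where this process is running, then block until a debugger (or
 * anything else) sets *attached to a non-zero value. */
void wait_for_debugger(volatile int *attached)
{
    char hostname[256];

    assert(attached != NULL);
    if (gethostname(hostname, sizeof(hostname)) != 0)
        snprintf(hostname, sizeof(hostname), "unknown-host");
    hostname[sizeof(hostname) - 1] = '\0';  /* guard against truncation */

    printf("PID %d on %s ready for attach\n", (int)getpid(), hostname);
    fflush(stdout);
    while (0 == *attached)
        sleep(5);
}
```

Call it early (for example right after MPI_Init) with a flag initialized to 0, attach a debugger to the printed PID, set the flag to a non-zero value from the debugger, and continue execution from there.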

If you haven't done so already, I highly suggest configuring Open MPI with 
"CFLAGS=-g -O0".

As for the modex, it does actually use TCP under the covers, but that shouldn't 
matter to you: the main point is that the BTL is not used for exchanging modex 
information.  Hence, whatever your BTL module puts into the modex and gets out 
of the modex is exchanged asynchronously, without involving the BTL.

--
Jeff Squyres
jsquy...@cisco.com



Re: [OMPI devel] How to progress MPI_Recv using custom BTL for NIC under development

2022-08-03 Thread Nathan Hjelm via devel

Kind of sounds to me like they are using the wrong proc when receiving. Here is 
an example of what a modex receive should look like:

https://github.com/open-mpi/ompi/blob/main/opal/mca/btl/ugni/btl_ugni_endpoint.c#L44

-Nathan
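
To show how using the wrong proc on the receive side produces "local" data, here is a toy, single-process stand-in for the modex. Everything in it (the demo_modex_* names, the fixed-size table) is hypothetical; the real modex exchanges data between processes and is accessed through Open MPI's own interfaces, as in the linked ugni code.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

enum { DEMO_NPROCS = 2, DEMO_ADDR_LEN = 16 };

/* Toy stand-in for the modex: one published address blob per process. */
typedef struct {
    char addr[DEMO_NPROCS][DEMO_ADDR_LEN];
} demo_modex_t;

void demo_modex_init(demo_modex_t *m)
{
    memset(m, 0, sizeof(*m));
}

/* Each process publishes its own data under its own identity. */
void demo_modex_send(demo_modex_t *m, int my_rank, const char *addr)
{
    assert(my_rank >= 0 && my_rank < DEMO_NPROCS);
    snprintf(m->addr[my_rank], DEMO_ADDR_LEN, "%s", addr);
}

/* A receive must name the proc whose data is wanted, i.e. the peer. */
const char *demo_modex_recv(const demo_modex_t *m, int proc)
{
    assert(proc >= 0 && proc < DEMO_NPROCS);
    return m->addr[proc];
}
```

In this toy, rank 0 calling demo_modex_recv() with the peer's rank (1) gets the peer's address, while calling it with its own rank (0) returns what rank 0 itself published, which matches the symptom of receiving local data instead of remote data.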