Re: [OMPI devel] Rationale behind memcpy chunk size (in smsc/xpmem)
Sorry for the delay in replies -- it's summer / vacation season, and I think we (as a community) are a little behind in answering some of these emails. :-(

It's hard to say for any given machine, but a bunch of different hardware factors can come into play, such as:

- L1, L2, L3 cache sizes
- Cache contention
- Memory controller connectivity and locality

I.e., exactly which hardware resources are the memcpy()'s in question using, and how do they interact with each other? How much overhead is produced, and/or how much contention ensues when multiple requests are in flight simultaneously?

For example, it may be counter-intuitive, but sometimes injecting a small amount of delay into a software pipeline keeps the hardware resources from becoming overwhelmed, so the overall execution is more efficient and consumes less wall-clock time. Hence, doing 2 x 1 MB memcpy()'s (to effect a 2 MB MPI_Send) may actually be more efficient overall, even though the individual parts of the transaction are less efficient. This is a complete guess, and may have nothing to do with your system, but it's one of many possibilities.

Another possible factor: the specific memcpy() implementation is highly relevant. It's been a few years since I've paid close attention to memcpy(), but at one time there was significant variation in the quality of memcpy() implementations between different compilers and/or versions of libc. I don't know if this is still a factor, or whether memcpy() is pretty well optimized in most situations these days. Additionally, alignment can be an issue (although for message sizes of 2 MB, I'm guessing your buffer is page-aligned, so this probably isn't an issue).

All that being said, I'm not intimately familiar with the internals of XPMEM, so I don't know what userspace / kernel-space mechanisms come into play for mapping the shared memory (e.g., is it lazily mapping the shared memory?).

Also, you're probably doing this already, but these kinds of things are worth mentioning: make sure your performance benchmarks are testing the right things -- do warmup transfers, make sure you're not swapping, make sure all the processes and memory are pinned properly, make sure you're on an otherwise-quiet machine, etc. All the Usual Benchmarking Things.

--
Jeff Squyres
jsquy...@cisco.com

From: devel on behalf of Giorgos Katevainis via devel
Sent: Thursday, July 28, 2022 9:33 AM
To: Open MPI Developers
Cc: Giorgos Katevainis
Subject: [OMPI devel] Rationale behind memcpy chunk size (in smsc/xpmem)

Hello all,

I've come across the "memcpy_chunk_size" MCA parameter in smsc/xpmem, which effectively causes memory copies to take place in chunks (it is used in mca_smsc_xpmem_memmove()). The comment reads:

"Maximum size to copy with a single call to memcpy. On some systems a smaller or larger number may provide better performance (default: 256k)"

And I have indeed observed a performance difference by adjusting it! E.g., in a simple point-to-point test, 2 MB messages do significantly better with the parameter set to 1 MB vs. 2 MB. But... why? I could imagine a single larger memcpy being more efficient, but what would cause many small ones to end up being quicker than a single large one? Might it have something to do with memcpy intrinsics and different implementations for different sizes?

If someone knows what's going on under the hood and/or could direct me to any relevant resources, I would greatly appreciate it!

George
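For anyone curious what the chunking itself looks like, here is a minimal, standalone sketch (plain C, not the actual smsc/xpmem source) of splitting one large copy into bounded memcpy() calls, which is essentially what memcpy_chunk_size controls inside mca_smsc_xpmem_memmove():

    /* Standalone illustration of chunked copying; not the real Open MPI code. */
    #include <stdlib.h>
    #include <string.h>

    static void chunked_copy(void *dst, const void *src, size_t len, size_t chunk)
    {
        char *d = dst;
        const char *s = src;

        while (len > 0) {
            size_t n = (len < chunk) ? len : chunk;
            memcpy(d, s, n);   /* each call copies at most 'chunk' bytes */
            d += n;
            s += n;
            len -= n;
        }
    }

    int main(void)
    {
        size_t size = 2 * 1024 * 1024;              /* a 2 MB message */
        char *src = malloc(size);
        char *dst = malloc(size);
        if (NULL == src || NULL == dst) {
            return 1;
        }
        memset(src, 0xab, size);

        chunked_copy(dst, src, size, 1024 * 1024);  /* e.g., 1 MB chunks */

        free(src);
        free(dst);
        return 0;
    }

Whether the smaller chunks win then comes down to the cache, memory-controller, and memcpy()-implementation factors discussed above.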
Re: [OMPI devel] How to progress MPI_Recv using custom BTL for NIC under development
Sorry for the huge delay in replies -- it's summer / vacation season, and I think we (as a community) are a little behind in answering some of these emails. :-(

It's been quite a while since I have been in the depths of BTL internals; I'm afraid I don't remember the details offhand.

When I was writing the usnic BTL, I found it useful to attach a debugger to the sending and/or receiving side processes and actually step through both my BTL code and the OB1 PML code to see what was happening. I frequently found that either my BTL wasn't correctly accounting for network conditions, or it wasn't passing information up to OB1 that it expected (e.g., it passed the wrong length, or the wrong ID number, or ...something else). You can actually follow what happens in OB1 when your BTL invokes the cbfunc -- does it find a corresponding MPI_Request, and does it mark it complete? Or does it treat your incoming fragment as an unexpected message for some reason and put it on the unexpected queue? Look for that kind of stuff.

--
Jeff Squyres
jsquy...@cisco.com

From: devel on behalf of Michele Martinelli via devel
Sent: Saturday, July 23, 2022 9:04 AM
To: devel@lists.open-mpi.org
Cc: Michele Martinelli
Subject: [OMPI devel] How to progress MPI_Recv using custom BTL for NIC under development

Hi,

I'm trying to develop a BTL for a custom NIC. I studied the btl.h file to understand the flow of calls that are expected to be implemented in my component. I'm using a simple test (which works like a charm with the TCP BTL) to test my development; the code is a simple MPI_Send + MPI_Recv:

    MPI_Init(NULL, NULL);
    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    int world_size;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    int ping_pong_count = 1;
    int partner_rank = (world_rank + 1) % 2;
    printf("MY RANK: %d PARTNER: %d\n", world_rank, partner_rank);
    if (world_rank == 0) {
        ping_pong_count++;
        MPI_Send(&ping_pong_count, 1, MPI_INT, partner_rank, 0, MPI_COMM_WORLD);
        printf("%d sent and incremented ping_pong_count %d to %d\n",
               world_rank, ping_pong_count, partner_rank);
    } else {
        MPI_Recv(&ping_pong_count, 1, MPI_INT, partner_rank, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("%d received ping_pong_count %d from %d\n",
               world_rank, ping_pong_count, partner_rank);
    }
    MPI_Finalize();

I see that in my component's BTL code the functions called during the MPI_Send phase are:

1. mca_btl_mycomp_add_procs
2. mca_btl_mycomp_prepare_src
3. mca_btl_mycomp_send (where I set the return to 1, so the send phase should be finished)

I then see the print from the test:

    0 sent and incremented ping_pong_count 2 to 1

and this should conclude the MPI_Send phase. Then I implemented in the btl_mycomp_component_progress function a call to:

    mca_btl_active_message_callback_t *reg = mca_btl_base_active_message_trigger + tag;
    reg->cbfunc(&my_btl->super, &desc);

I saw the same code in all the other BTLs and I thought this was enough to "unlock" the MPI_Recv "polling". But actually I see my test hangs, probably "waiting" for something that never happens (?). I also took a look at the ob1 mca_pml_ob1_recv_frag_callback_match function (which I suppose is the reg->cbfunc), and it seems to get to the end of the function, actually matching my frag.

So my question is: how can I tell the framework that I have finished my work, so that the function can return to the user application? What am I doing wrong? Is there a way to understand where and what my code is waiting for?

Best
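To make the dispatch described above a bit more concrete, here is a standalone sketch of the pattern. The names below (fake_descriptor_t, active_message_table, fake_progress, pml_match_callback) are hypothetical stand-ins, not Open MPI's real structures; the real code indexes mca_btl_base_active_message_trigger with the fragment's tag and calls reg->cbfunc exactly as quoted in the message above.

    /* Toy model of tag-indexed active-message dispatch; hypothetical types. */
    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    #define MAX_TAG 256

    typedef struct {
        const void *payload;   /* stands in for the received descriptor/segment */
        size_t      len;
    } fake_descriptor_t;

    typedef void (*am_cbfunc_t)(fake_descriptor_t *desc);

    /* stands in for mca_btl_base_active_message_trigger[] */
    static am_cbfunc_t active_message_table[MAX_TAG];

    static void pml_match_callback(fake_descriptor_t *desc)
    {
        /* in the real stack this is where OB1 tries to match the fragment
         * against a posted receive (or queues it as unexpected) */
        printf("delivered %zu bytes to the upper layer\n", desc->len);
    }

    static void fake_progress(uint8_t tag, fake_descriptor_t *desc)
    {
        am_cbfunc_t cb = active_message_table[tag];
        if (cb != NULL) {
            cb(desc);          /* analogous to reg->cbfunc(&btl->super, &desc) */
        }
    }

    int main(void)
    {
        fake_descriptor_t desc = { "hello", 5 };
        active_message_table[42] = pml_match_callback;   /* upper layer registers its tag */
        fake_progress(42, &desc);                        /* BTL progress delivers a fragment */
        return 0;
    }

The point of the model: progress only delivers fragments to whoever registered the tag; whether MPI_Recv then completes depends entirely on what the callback finds in the descriptor.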
Re: [OMPI devel] How to progress MPI_Recv using custom BTL for NIC under development
thank you for the answer.

Actually, I think I solved that problem some days ago: basically (if I understand correctly) MPI "adds" a header to the data sent (please correct me if I'm wrong), which is then used by ob1 to match the arriving data with the MPI_Recv posted by the user. The problem was a poorly reconstructed header on the receiving side.

Unfortunately my happiness didn't last long, because I have already found another problem: it seems that the peers are not actually exchanging the correct information via the modex protocol (I'm not sure which kind of network connection they use in that phase), receiving "local" data instead of the remote data. But I just started debugging this, so maybe I should open a new thread specifically about it.

Michele
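To illustrate the header issue described above: OB1 prepends its own matching header (communicator context, source rank, tag, sequence) to each fragment, and the BTL must hand those bytes to the callback exactly as sent. The struct below is a hypothetical stand-in, not ob1's real header layout, just to show why a rebuilt/mangled header breaks matching:

    /* Hypothetical sketch; not ob1's real match header. */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    typedef struct {
        uint16_t ctx;   /* communicator context id */
        uint16_t src;   /* source rank */
        int32_t  tag;   /* MPI tag */
        uint16_t seq;   /* per-peer sequence number */
    } fake_match_hdr_t;

    /* stands in for the matching logic in the PML's receive-fragment callback */
    static int fake_match(const fake_match_hdr_t *hdr,
                          uint16_t posted_ctx, uint16_t posted_src, int32_t posted_tag)
    {
        return hdr->ctx == posted_ctx && hdr->src == posted_src && hdr->tag == posted_tag;
    }

    int main(void)
    {
        /* sender side: header + payload packed into one fragment */
        fake_match_hdr_t hdr = { .ctx = 0, .src = 1, .tag = 0, .seq = 7 };
        char wire[sizeof(hdr) + 4];
        memcpy(wire, &hdr, sizeof(hdr));
        memcpy(wire + sizeof(hdr), "ping", 4);

        /* receiver side: the header must be recovered byte-for-byte; if the BTL
         * reconstructs it incorrectly, the fragment no longer matches the posted
         * receive and MPI_Recv never completes */
        fake_match_hdr_t rx;
        memcpy(&rx, wire, sizeof(rx));
        printf("intact header matches:  %s\n", fake_match(&rx, 0, 1, 0) ? "yes" : "no");

        rx.tag = 99;   /* simulate a corrupted/rebuilt header */
        printf("mangled header matches: %s\n", fake_match(&rx, 0, 1, 0) ? "yes" : "no");
        return 0;
    }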
Re: [OMPI devel] How to progress MPI_Recv using custom BTL for NIC under development
Glad you solved the first issue!

With respect to debugging, if you don't have a parallel debugger, you can do something like this: https://www.open-mpi.org/faq/?category=debugging#serial-debuggers

If you haven't done so already, I highly suggest configuring Open MPI with "CFLAGS=-g -O0".

As for the modex, it does actually use TCP under the covers, but that shouldn't matter to you: the main point is that the BTL is not used for exchanging modex information. Hence, whatever your BTL module puts into the modex and gets out of the modex should happen asynchronously, without involving the BTL.

--
Jeff Squyres
jsquy...@cisco.com
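For reference, the serial-debugger approach from that FAQ page boils down to something like the sketch below (reconstructed from memory, not copied verbatim from the FAQ; build Open MPI and the test with -g -O0 as noted above). Each rank announces its PID and host and then spins; you attach gdb to the rank you care about, set i to a non-zero value ("set var i = 1"), and continue.

    #include <mpi.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        volatile int i = 0;
        char hostname[256];

        MPI_Init(&argc, &argv);

        /* print enough info to attach a debugger, then wait */
        gethostname(hostname, sizeof(hostname));
        printf("PID %d on %s ready for attach\n", (int) getpid(), hostname);
        fflush(stdout);
        while (0 == i) {
            sleep(5);
        }

        /* ... the rest of the ping-pong test would go here ... */

        MPI_Finalize();
        return 0;
    }

From there you can set breakpoints in your BTL's progress function and in mca_pml_ob1_recv_frag_callback_match and step through the matching path.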
Re: [OMPI devel] How to progress MPI_Recv using custom BTL for NIC under development
Kind of sounds to me like they are using the wrong proc when receiving. Here is an example of what a modex receive should look like: https://github.com/open-mpi/ompi/blob/main/opal/mca/btl/ugni/btl_ugni_endpoint.c#L44

-Nathan
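To spell out "wrong proc" with a toy example: the modex is essentially a per-process key/value store, so a receive keyed on the local process simply reads back the endpoint data that process published itself. The sketch below is purely hypothetical (none of these names are real Open MPI API; see the linked btl_ugni_endpoint.c for the real OPAL_MODEX_RECV usage), but it shows the symptom of getting "local" data instead of the peer's:

    /* Toy per-process key/value store standing in for the modex. */
    #include <stdio.h>

    #define NPROCS 2

    static const char *modex_store[NPROCS];       /* one entry per process */

    static void fake_modex_send(int my_rank, const char *addr)
    {
        modex_store[my_rank] = addr;              /* publish my endpoint info */
    }

    static const char *fake_modex_recv(int proc)
    {
        return modex_store[proc];                 /* look up by process id */
    }

    int main(void)
    {
        fake_modex_send(0, "nic-addr-of-rank-0");
        fake_modex_send(1, "nic-addr-of-rank-1");

        int my_rank = 0, peer_rank = 1;

        /* wrong: keyed on the local process -- you get your own data back */
        printf("buggy lookup:   %s\n", fake_modex_recv(my_rank));

        /* right: keyed on the remote peer you are connecting to */
        printf("correct lookup: %s\n", fake_modex_recv(peer_rank));
        return 0;
    }

In a real BTL the equivalent check is whether the modex receive is keyed on the remote endpoint's proc (as in the ugni example Nathan linked) rather than on the local process.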