Re: [OMPI devel] RFC: Add ompi-top tool
This works for me. LAM had a similar tool to query daemons and find the current state of running MPI procs (although it didn't get top-like statistics of the apps).

On Dec 12, 2008, at 3:20 PM, Ralph Castain wrote:

WHAT: Add new tool to retrieve/monitor process stats

WHY: Several of us have had user requests to provide a convenient way of obtaining reports on memory usage and other typical stats from their MPI procs. The notion was to create a tool that would allow a user to specify multiple ranks (which could be on any number of nodes), and have the tool query mpirun to get the info. This would avoid the necessity of users remotely logging into multiple nodes to run top, ps, or other stat tools - and from having to use something heavy like Totalview for such a small purpose.

WHERE: Involves the following:
1. New opal framework "opal/mca/pstat" with components to support obtaining process stats from the different OSes. Note that application procs do -not- open this framework. The open/select functions are -only- in the orte_init procedures for the HNP and orteds. This is because an app would never have any reason to call this framework, so there is no reason to open it.
2. New "orte-top" tool (also available as ompi-top) that sends the top request to the specified mpirun and prints out the returned data. No fancy screen handling - just a basic printout.
3. Slight mods to orted_comm to receive and process the new cmd.
4. New cmd flag define added to orte/mca/odls/odls_types.h.
5. New base function added to orte/mca/odls/base/odls_base_default_fns.c to look up the specified child and call opal_pstat to get the info.

WHEN: I would like to do this before the holiday break, if possible, given that Sun, Cisco, and IU are all aware and supportive of this change. However, since a number of community members are tied up with the MPI Forum next week, I propose to see if there are any immediate concerns and, if so, wait until after the holiday to more thoroughly discuss them.

TIMEOUT: Dec 23

-- Jeff Squyres, Cisco Systems
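To illustrate item 1 above, here is a minimal sketch of what a per-OS stats component might expose. The struct fields, type names, and query signature are illustrative assumptions, not the actual opal/mca/pstat interface.

    /* Hypothetical sketch -- field names and the query signature are
     * assumptions, not the real opal/mca/pstat API. */
    #include <stdint.h>
    #include <sys/types.h>

    typedef struct {
        pid_t   pid;        /* process being queried */
        float   pcpu;       /* percent CPU, as top would report it */
        int64_t vsize_kb;   /* virtual memory size in KB */
        int64_t rss_kb;     /* resident set size in KB */
    } pstat_info_t;

    /* Each OS gets its own component implementing this query; a Linux
     * component would typically read /proc/<pid>/stat and /proc/<pid>/statm.
     * Only the HNP and the orteds open the framework, so only they service
     * the orte-top request for their local child procs. */
    typedef int (*pstat_query_fn_t)(pid_t pid, pstat_info_t *stats);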
Re: [OMPI devel] shared-memory allocations
On 12/12/08 8:21 PM, "Eugene Loh" wrote: > Richard Graham wrote: The memory allocation is intended to take into account that two separate procs may be touching the same memory, so the intent is to reduce cache conflicts (false sharing) > Got it. I'm totally fine with that. Separate cachelines. > and put the memory close to the process that is using it. > Problematic concept, but ... okay, I'll read on. > When the code first went in, there was no explicit memory affinity implemented, so first-touch was relied on to get the memory in the "correct" location. > > Okay. > If I remember correctly, the head and the tail are each written by a different process, and that is where the pointers and counters used to manage the fifo are maintained. They need to be close to the writer, and on separate cache lines, to avoid false sharing. > Why close to the writer (versus reader)? > > Anyhow, so far as I can tell, the 2d structure ompi_fifo_t fifo[receiver][sender] is organized by receiver. That is, the main ompi_fifo_t FIFO data structures are local to receivers. > > But then, each FIFO is initialized (that is, circular buffers and associated allocations) by senders. E.g., https://svn.open-mpi.org/trac/ompi/browser/branches/v1.3/ompi/mca/btl/sm/btl_sm.c?version=19785#L537 > In the call to ompi_fifo_init(), all the circular buffer (CB) data structures are allocated by the sender. On different cachelines -- even different pages -- but all by the sender. It does not make a difference who allocates it; what makes a difference is who touches it first. > > Specifically, one accesses the FIFO on the receiver side and then follows pointers to the sender's side. Doesn't matter if you're talking head, tail, or queue. > The queue itself is accessed most often by the reader, > You mean because it's polling often, but the writer writes only once? Yes - it is polling volatile memory, so it has to load from memory on every read. > so it should be closer to the reader. > Are there measurements to substantiate this? Seems to me that in a cache-based system, a reader could poll on a remote location all it wanted and there'd be traffic only if the cached copy were invalidated. Conceivably, a transfer could go cache-to-cache and not hit memory at all. I tried some measurements and found no difference for any location -- close to writer, close to reader, or far from both. > I honestly don't remember much about the wrapper - I would have to go back to the code to look at it. If we no longer allow multiple fifos per pair, the wrapper layer can go away - it is there to manage multiple fifos per pair. > > There is support for multiple circular buffers per FIFO. The code is there, but I believe Gleb disabled using multiple fifos, and added a list to hold pending messages, so now we are paying two overheads ... I could be wrong here, but am pretty sure I am not. I don't know if George has touched the code since. > As far as granularity of allocation - it needs to be large enough to accommodate the smallest shared memory hierarchy, so I suppose in the most general case this may be the tertiary cache? > > I don't get this. I understand how certain things should be on separate cachelines. Beyond that, we just figure out what should be local to a process and allocate all those things together. That takes us from 3*n*n allocations (and pages) to just n of them. Not sure what your point is here.
The cost per process is linear in the total number of processes, so overall the cost scales as the number of procs squared. This was designed for small SMPs, to reduce coordination costs between processes, and where memory costs are not large. One can go to very simple schemes that are constant with respect to memory footprint, but then pay the cost of multiple writers to a single queue - this is what LA-MPI did. > No reason not to allocate objects that need to be associated with the same process on the same page, as long as one avoids false sharing. > Got it. > So it seems like each process could have all of its receive fifos on the same page, and these could also share the page with either the heads or the tails of each queue. Yes, this makes sense. Rich > I will propose some specifics and run them by y'all. I think I know enough to get started. Thanks for the comments.
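A minimal sketch of the layout being discussed (struct names, sizes, and the allocator are illustrative assumptions, not the actual btl/sm code): each receiver makes one page-aligned allocation holding all n of its per-sender FIFO headers, with the two FIFO indices padded onto separate cache lines so that the two processes writing them never share a line. First touch by the receiver then places the whole block on the receiver's NUMA node, and the allocation count drops from 3*n*n to n.

    /* Illustrative sketch only -- not the real ompi_fifo_t or sm allocator. */
    #include <stdint.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define CACHE_LINE 64

    typedef struct {
        /* index written by one side of the pair; its own cache line */
        volatile uint32_t head;
        char pad0[CACHE_LINE - sizeof(uint32_t)];
        /* index written by the other side; its own cache line */
        volatile uint32_t tail;
        char pad1[CACHE_LINE - sizeof(uint32_t)];
        /* pointer to the circular buffer of queue entries */
        void * volatile *queue;
        char pad2[CACHE_LINE - sizeof(void *)];
    } shm_fifo_t;

    /* One page-aligned allocation per receiver holding its n per-sender
     * FIFO headers, instead of several allocations per (sender, receiver)
     * pair.  Real code would carve this out of the mmap'd shared segment;
     * aligned_alloc stands in for that here. */
    static shm_fifo_t *alloc_my_fifos(int n_procs)
    {
        size_t page = (size_t) sysconf(_SC_PAGESIZE);
        size_t len  = (size_t) n_procs * sizeof(shm_fifo_t);
        len = (len + page - 1) & ~(page - 1);        /* round up to a whole page */
        shm_fifo_t *f = aligned_alloc(page, len);
        for (int i = 0; f != NULL && i < n_procs; i++) {
            f[i].head = f[i].tail = 0;               /* first touch by the receiver */
            f[i].queue = NULL;
        }
        return f;
    }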
Re: [OMPI devel] Fwd: [OMPI users] Onesided + derived datatypes
Brian, I found a second problem with rebuilding the datatype on the remote. Originally, the displacements were wrongly computed. This is now fixed. However, the data at the end of the fence is still not correct on the remote. I can confirm that the packed message contains only 0 instead of the real values, but I couldn't figure out how these 0s got there. The pack function works correctly for the MPI_Send function, so I don't see any reason not to do the same for the MPI_Put. As you're the one-sided guy in ompi, can you take a look at the MPI_Put to see why the data is incorrect? george.

On Dec 11, 2008, at 19:14, Brian Barrett wrote: I think that's a reasonable solution. However, the words "not it" come to mind. Sorry, but I have way too much on my plate this month. By the way, in case no one noticed, I had e-mailed my findings to devel. Someone might want to reply to Dorian's e-mail on users. Brian

On Dec 11, 2008, at 2:31 PM, George Bosilca wrote: Brian, You're right, the datatype is being too cautious with the boundaries when detecting the overlap. There is no good solution to detect the overlap except parsing the whole memory layout to check the status of every predefined type. As one can imagine, this is a very expensive operation. This is the reason I preferred to use the true extent and the size of the data to try to detect the overlap. This approach is a lot faster, but has poor accuracy. The best solution I can think of in the short term is to completely remove the overlap check. This will have absolutely no impact on the way we pack the data, but can lead to unexpected results when we unpack and the data overlap. But I guess this can be considered a user error, as the MPI standard clearly states that the result of such an operation is ... unexpected. george.

On Dec 10, 2008, at 22:20, Brian Barrett wrote: Hi all - I looked into this, and it appears to be datatype related. If the displacements are set to 3, 2, 1, 0, the datatype will fail the type checks for one-sided because is_overlapped() returns 1 for the datatype. My reading of the standard seems to indicate this should not be. I haven't looked into the problems with the displacements set to 0, 1, 2, 3, but I'm guessing it has something to do with the reverse problem. This looks like a datatype issue, so it's out of my realm of expertise. Can someone else take a look? Brian

Begin forwarded message: From: doriankrause Date: December 10, 2008 4:07:55 PM MST To: us...@open-mpi.org Subject: [OMPI users] Onesided + derived datatypes Reply-To: Open MPI Users

Hi List, I have an MPI program which uses one-sided communication with derived datatypes (MPI_Type_create_indexed_block). I developed the code with MPICH2 and unfortunately didn't think about trying it out with OpenMPI. Now that I'm "porting" the application to OpenMPI I'm facing some problems. On most machines I get a SIGSEGV in MPI_Win_fence; sometimes an invalid datatype shows up. I ran the program in Valgrind and didn't get anything valuable. Since I can't see a reason for this problem (at least if I understand the standard correctly), I wrote the attached test program. Here are my experiences: * If I compile without ONESIDED defined, everything works and V1 and V2 give the same results. * If I compile with ONESIDED and V2 defined (MPI_Type_contiguous) it works. * ONESIDED + V1 + O2: No errors but obviously nothing is sent? (Am I right in assuming that V1+O2 and V2 should be equivalent?)
* ONESIDED + V1 + O1:
[m02:03115] *** An error occurred in MPI_Put
[m02:03115] *** on win
[m02:03115] *** MPI_ERR_TYPE: invalid datatype
[m02:03115] *** MPI_ERRORS_ARE_FATAL (goodbye)
I didn't get a segfault as in the "real life example", but if ompitest.cc is correct it means that OpenMPI is buggy when it comes to one-sided communication and (some) derived datatypes, so it is probably not a problem in my code. I'm using OpenMPI-1.2.8 with the newest gcc 4.3.2, but the same behaviour can be seen with gcc-3.3.1 and intel 10.1. Please correct me if ompitest.cc contains errors. Otherwise I would be glad to hear how I should report these problems to the developers (if they don't read this). Thanks + best regards Dorian
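Dorian's ompitest.cc is not included in the archive; the following is a minimal sketch along the lines he describes (buffer size, displacements, and ranks are illustrative assumptions): an MPI_Put through an MPI_Type_create_indexed_block target datatype between two MPI_Win_fence calls, with the reversed displacements that trip the is_overlapped() check discussed above.

    /* Sketch of a reproducer in the spirit of ompitest.cc (not the original). */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        int buf[4] = {0, 0, 0, 0};            /* target window contents */
        int src[4] = {1, 2, 3, 4};            /* origin data */
        int displs[4] = {3, 2, 1, 0};         /* reversed order: the case that
                                                 makes is_overlapped() return 1 */
        MPI_Datatype dtype;
        MPI_Win win;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        if (size < 2) { MPI_Finalize(); return 0; }   /* needs two procs */

        MPI_Type_create_indexed_block(4, 1, displs, MPI_INT, &dtype);
        MPI_Type_commit(&dtype);

        MPI_Win_create(buf, (MPI_Aint)(4 * sizeof(int)), sizeof(int),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        MPI_Win_fence(0, win);
        if (rank == 0) {
            /* put 4 contiguous ints into rank 1's window, scattered by dtype */
            MPI_Put(src, 4, MPI_INT, 1, 0, 1, dtype, win);
        }
        MPI_Win_fence(0, win);

        if (rank == 1)
            printf("buf = %d %d %d %d\n", buf[0], buf[1], buf[2], buf[3]);

        MPI_Win_free(&win);
        MPI_Type_free(&dtype);
        MPI_Finalize();
        return 0;
    }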
Re: [OMPI devel] Fwd: [OMPI users] Onesided + derived datatypes
Sorry, I really won't have time to look until after Christmas. I'll put it on the to-do list, but that's as soon as it has a prayer of reaching the top. Brian
Re: [OMPI devel] Fwd: [OMPI users] Onesided + derived datatypes
No problem-o. George -- can you please file a bug?
Re: [OMPI devel] shared-memory allocations
Richard Graham wrote: Yes - it is polling volatile memory, so has to load from memory on every read. Actually, it will poll in cache, and only load from memory when the cache coherency protocol invalidates the cache line. Volatile semantics only prevent compiler optimizations. It does not matter much where the pages are (closer to reader or writer) on NUMAs, as long as they are equally distributed among all sockets (i.e., the choice is consistent). Cache prefetching is slightly more efficient on the local socket, so closer to the reader may be a bit better. Patrick
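A small illustration of Patrick's point (a hypothetical snippet, not taken from the sm BTL): volatile only forbids the compiler from caching the value in a register, so a load is re-issued on every iteration, but the hardware still satisfies those loads from the local cache until the writer's store invalidates the line.

    /* Illustrative spin-wait loop. */
    #include <stdint.h>

    static inline uint32_t wait_for_change(volatile const uint32_t *flag,
                                           uint32_t last_seen)
    {
        uint32_t v;
        /* 'volatile' forces a fresh load each iteration (no register caching),
         * but each load normally hits the local cache; bus/interconnect
         * traffic occurs only when the writer invalidates the cache line. */
        while ((v = *flag) == last_seen) {
            /* spin */
        }
        return v;
    }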
Re: [OMPI devel] shared-memory allocations
To expand slightly on Patrick's last comment: > Cache prefetching is slightly more efficient on local socket, so closer to reader may be a bit better. Ideally one polls from cache, but in the event that the line is evicted, the next poll after the eviction will pay a lower cost if the memory is near to the reader. -Paul

-- Paul H. Hargrove, phhargr...@lbl.gov, Future Technologies Group, HPC Research Department, Lawrence Berkeley National Laboratory, Tel: +1-510-495-2352, Fax: +1-510-486-6900