Re: [OMPI users] "self scheduled" work & mpi receive???
Good points... I'll see if anything can be done to speed up the master. If we can shrink the number of MPI processes without hurting overall throughput, maybe I could save enough to fit another run on the freed cores. Thanks for the ideas! I was also worried about contention on the nodes, since I'm running multiple MPI processes on the same multi-core box. A typical run is 120 MPI processes on 5 nodes, each with 24 cores. I may play a little with the "--bynode" parameter to see if this has any (significant) effect. THANXS amb

-Original Message- From: users-boun...@open-mpi.org on behalf of Richard Treumann Sent: Fri 9/24/2010 9:16 AM To: Open MPI Users Subject: Re: [OMPI users] "self scheduled" work & mpi receive???

Amb It sounds like you have more workers than you can keep fed. Workers are finishing up and requesting their next assignment but sit idle because there are so many other idle workers too. Load balance does not really matter if the choke point is the master. The work is being done as fast as the master can hand it out. Consider using fewer workers and seeing if your load balance improves and your total thruput stays the same. If you want to use all the workers you have efficiently, you need to find a way to make the master deliver assignments as fast as workers finish them. Compute processes do not care about fairness. Having half the processes busy 100% of the time and the other half idle vs. having all the processes busy 50% of the time gives the same thruput, and the hard workers will not complain. Dick Treumann - MPI Team IBM Systems & Technology Group Dept X2ZA / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601 Tele (845) 433-7846 Fax (845) 433-8363

From: Mikael Lavoie To: Open MPI Users Date: 09/23/2010 05:08 PM Subject: Re: [OMPI users] "self scheduled" work & mpi receive??? Sent by: users-boun...@open-mpi.org

Hi Ambrose, I'm interested in your work. I have an app to convert myself and I don't know the MPI structure and syntax well enough to do it... So if you want to share your app, I'd be interested in taking a look at it!! Thanks and have a nice day!! Mikael Lavoie

2010/9/23 Lewis, Ambrose J. Hi All: I've written an openmpi program that "self schedules" the work. The master task is in a loop, chunking up an input stream and handing off jobs to worker tasks. At first the master gives the next job to the next highest rank. After all ranks have their first job, the master waits via an MPI receive call for the next free worker. The master parses out the rank from the MPI receive and sends the next job to this node. The jobs aren't all identical, so they run for slightly different durations based on the input data. When I plot a histogram of the number of jobs each worker performed, the lower MPI ranks are doing much more work than the higher ranks. For example, in a 120-process run, rank 1 did 32 jobs while rank 119 only did 2. My guess is that Open MPI returns the lowest rank from the MPI_Recv when I've got MPI_ANY_SOURCE set and multiple sends have happened since the last call. Is there a different Recv call to make that will spread out the data better? THANXS! amb
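For readers unfamiliar with the pattern being discussed, here is a minimal C sketch of such a self-scheduling master loop. It is not the original poster's code: the job descriptor, make_next_job(), JOB_LEN and the WORK_TAG/STOP_TAG values are purely illustrative. The key points are that MPI_Recv with MPI_ANY_SOURCE reports the finished worker's rank in status.MPI_SOURCE, and that MPI makes no fairness guarantee about which of several already-pending messages gets matched first, so a skewed per-rank job count by itself does not mean throughput was lost when the master is the bottleneck.

#include <mpi.h>

#define WORK_TAG 1
#define STOP_TAG 2
#define JOB_LEN  16   /* size of the (hypothetical) job descriptor */

/* Assumes njobs >= nworkers; make_next_job() is a placeholder. */
static void master(int nworkers, int njobs)
{
    int job[JOB_LEN] = {0};
    int result, sent = 0, done;
    MPI_Status status;

    /* Prime every worker with one job. */
    for (int rank = 1; rank <= nworkers; rank++, sent++) {
        /* make_next_job(job); */
        MPI_Send(job, JOB_LEN, MPI_INT, rank, WORK_TAG, MPI_COMM_WORLD);
    }

    /* Hand each remaining job to whichever worker reports in first. */
    for (done = 0; done < njobs; done++) {
        MPI_Recv(&result, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                 MPI_COMM_WORLD, &status);
        int free_worker = status.MPI_SOURCE;   /* rank of the now-idle worker */
        if (sent < njobs) {
            /* make_next_job(job); */
            MPI_Send(job, JOB_LEN, MPI_INT, free_worker, WORK_TAG, MPI_COMM_WORLD);
            sent++;
        } else {
            /* No work left: tell this worker to stop. */
            MPI_Send(job, 0, MPI_INT, free_worker, STOP_TAG, MPI_COMM_WORLD);
        }
    }
}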
Re: [OMPI users] Shared memory
I think the 'middle ground' approach can be simplified even further if the data file is in a shared device (e.g. NFS/Samba mount) that can be mounted at the same location of the file system tree on all nodes. I have never tried it, though and mmap()'ing a non-POSIX compliant file system such as Samba might have issues I am unaware of. However, I do not see why you should not be able to do this even if the file is being written to as long as you call msync() before using the mapped pages. Durga On Fri, Sep 24, 2010 at 12:31 PM, Eugene Lohwrote: > It seems to me there are two extremes. > > One is that you replicate the data for each process. This has the > disadvantage of consuming lots of memory "unnecessarily." > > Another extreme is that shared data is distributed over all processes. This > has the disadvantage of making at least some of the data less accessible, > whether in programming complexity and/or run-time performance. > > I'm not familiar with Global Arrays. I was somewhat familiar with HPF. I > think the natural thing to do with those programming models is to distribute > data over all processes, which may relieve the excessive memory consumption > you're trying to address but which may also just put you at a different > "extreme" of this spectrum. > > The middle ground I think might make most sense would be to share data only > within a node, but to replicate the data for each node. There are probably > multiple ways of doing this -- possibly even GA, I don't know. One way > might be to use one MPI process per node, with OMP multithreading within > each process|node. Or (and I thought this was the solution you were looking > for), have some idea which processes are collocal. Have one process per > node create and initialize some shared memory -- mmap, perhaps, or SysV > shared memory. Then, have its peers map the same shared memory into their > address spaces. > > You asked what source code changes would be required. It depends. If > you're going to mmap shared memory in on each node, you need to know which > processes are collocal. If you're willing to constrain how processes are > mapped to nodes, this could be easy. (E.g., "every 4 processes are > collocal".) If you want to discover dynamically at run time which are > collocal, it would be harder. The mmap stuff could be in a stand-alone > function of about a dozen lines. If the shared area is allocated as one > piece, substituting the single malloc() call with a call to your mmap > function should be simple. If you have many malloc()s you're trying to > replace, it's harder. > > Andrei Fokau wrote: > > The data are read from a file and processed before calculations begin, so I > think that mapping will not work in our case. > Global Arrays look promising indeed. As I said, we need to put just a part > of data to the shared section. John, do you (or may be other users) have an > experience of working with GA? > http://www.emsl.pnl.gov/docs/global/um/build.html > When GA runs with MPI: > MPI_Init(..) ! start MPI > GA_Initialize() ! start global arrays > MA_Init(..) ! start memory allocator > do work > GA_Terminate() ! tidy up global arrays > MPI_Finalize() ! tidy up MPI > ! exit program > On Fri, Sep 24, 2010 at 13:44, Reuti wrote: >> >> Am 24.09.2010 um 13:26 schrieb John Hearns: >> >> > On 24 September 2010 08:46, Andrei Fokau >> > wrote: >> >> We use a C-program which consumes a lot of memory per process (up to >> >> few >> >> GB), 99% of the data being the same for each process. 
So for us it >> >> would be >> >> quite reasonable to put that part of data in a shared memory. >> > >> > http://www.emsl.pnl.gov/docs/global/ >> > >> > Is this eny help? Apologies if I'm talking through my hat. >> >> I was also thinking of this when I read "data in a shared memory" (besides >> approaches like http://www.kerrighed.org/wiki/index.php/Main_Page). Wasn't >> this also one idea behind "High Performance Fortran" - running in parallel >> across nodes even without knowing that it's across nodes at all while >> programming and access all data like it's being local. > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users >
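To make the file-mapping idea above concrete, here is a hedged sketch of mapping a read-only data file into a process's address space; the path handling and minimal error checking are illustrative only. With MAP_SHARED the pages come from the page cache, so every process on a node that maps the same file shares one physical copy. Whether this behaves well on NFS or Samba, and whether msync() is needed when the file is still being written, is exactly the open question raised above.

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map an existing data file read-only; returns NULL on failure.
 * Each MPI process on a node can call this on the same path and the
 * kernel keeps only one copy of the pages per node. */
static void *map_data_file(const char *path, size_t *len)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return NULL; }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); close(fd); return NULL; }
    *len = (size_t) st.st_size;

    void *p = mmap(NULL, *len, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);                    /* the mapping remains valid after close */
    if (p == MAP_FAILED) { perror("mmap"); return NULL; }
    return p;
}

Each rank would then call something like map_data_file("/nfs/input/data.bin", &len) (a hypothetical path) in place of today's malloc()-and-read.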
Re: [OMPI users] [openib] segfault when using openib btl
Eloi Gaudry wrote: Terry, You were right, the error indeed seems to come from the message coalescing feature. If I turn it off using the "--mca btl_openib_use_message_coalescing 0", I'm not able to observe the "hdr->tag=0" error. There are some trac requests associated to very similar error (https://svn.open-mpi.org/trac/ompi/search?q=coalescing) but they are all closed (except https://svn.open-mpi.org/trac/ompi/ticket/2352 that might be related), aren't they ? What would you suggest Terry ? Interesting, though it looks to me like the segv in ticket 2352 would have happened on the send side instead of the receive side like you have. As to what to do next it would be really nice to have some sort of reproducer that we can try and debug what is really going on. The only other thing to do without a reproducer is to inspect the code on the send side to figure out what might make it generate at 0 hdr->tag. Or maybe instrument the send side to stop when it is about ready to send a 0 hdr->tag and see if we can see how the code got there. I might have some cycles to look at this Monday. --td Eloi On Friday 24 September 2010 16:00:26 Terry Dontje wrote: Eloi Gaudry wrote: Terry, No, I haven't tried any other values than P,65536,256,192,128 yet. The reason why is quite simple. I've been reading and reading again this thread to understand the btl_openib_receive_queues meaning and I can't figure out why the default values seem to induce the hdr- tag=0 issue (http://www.open-mpi.org/community/lists/users/2009/01/7808.php). Yeah, the size of the fragments and number of them really should not cause this issue. So I too am a little perplexed about it. Do you think that the default shared received queue parameters are erroneous for this specific Mellanox card ? Any help on finding the proper parameters would actually be much appreciated. I don't necessarily think it is the queue size for a specific card but more so the handling of the queues by the BTL when using certain sizes. At least that is one gut feel I have. In my mind the tag being 0 is either something below OMPI is polluting the data fragment or OMPI's internal protocol is some how getting messed up. I can imagine (no empirical data here) the queue sizes could change how the OMPI protocol sets things up. Another thing may be the coalescing feature in the openib BTL which tries to gang multiple messages into one packet when resources are running low. I can see where changing the queue sizes might affect the coalescing. So, it might be interesting to turn off the coalescing. You can do that by setting "--mca btl_openib_use_message_coalescing 0" in your mpirun line. If that doesn't solve the issue then obviously there must be something else going on :-). Note, the reason I am interested in this is I am seeing a similar error condition (hdr->tag == 0) on a development system. Though my failing case fails with np=8 using the connectivity test program which is mainly point to point and there are not a significant amount of data transfers going on either. --td Eloi On Friday 24 September 2010 14:27:07 you wrote: That is interesting. So does the number of processes affect your runs any. The times I've seen hdr->tag be 0 usually has been due to protocol issues. The tag should never be 0. Have you tried to do other receive_queue settings other than the default and the one you mention. I wonder if you did a combination of the two receive queues causes a failure or not. 
Something like P,128,256,192,128:P,65536,256,192,128 I am wondering if it is the first queuing definition causing the issue or possibly the SRQ defined in the default. --td Eloi Gaudry wrote: Hi Terry, The messages being send/received can be of any size, but the error seems to happen more often with small messages (as an int being broadcasted or allreduced). The failing communication differs from one run to another, but some spots are more likely to be failing than another. And as far as I know, there are always located next to a small message (an int being broadcasted for instance) communication. Other typical messages size are 10k but can be very much larger. I've been checking the hca being used, its' from mellanox (with vendor_part_id=26428). There is no receive_queues parameters associated to it. $ cat share/openmpi/mca-btl-openib-device-params.ini as well: [...] # A.k.a. ConnectX [Mellanox Hermon] vendor_id = 0x2c9,0x5ad,0x66a,0x8f1,0x1708,0x03ba,0x15b3 vendor_part_id = 25408,25418,25428,26418,26428,25448,26438,26448,26468,26478,26488 use_eager_rdma = 1 mtu = 2048 max_inline_data = 128 [..] $ ompi_info --param btl openib --parsable | grep receive_queues mca:btl:openib:param:btl_openib_receive_queues:value:P,128,256,192,128 :S ,2048,256,128,32:S,12288,256,128,32:S,65536,256,128,32
Re: [OMPI users] Shared memory
It seems to me there are two extremes. One is that you replicate the data for each process. This has the disadvantage of consuming lots of memory "unnecessarily." Another extreme is that shared data is distributed over all processes. This has the disadvantage of making at least some of the data less accessible, whether in programming complexity and/or run-time performance. I'm not familiar with Global Arrays. I was somewhat familiar with HPF. I think the natural thing to do with those programming models is to distribute data over all processes, which may relieve the excessive memory consumption you're trying to address but which may also just put you at a different "extreme" of this spectrum. The middle ground I think might make most sense would be to share data only within a node, but to replicate the data for each node. There are probably multiple ways of doing this -- possibly even GA, I don't know. One way might be to use one MPI process per node, with OMP multithreading within each process|node. Or (and I thought this was the solution you were looking for), have some idea which processes are collocal. Have one process per node create and initialize some shared memory -- mmap, perhaps, or SysV shared memory. Then, have its peers map the same shared memory into their address spaces. You asked what source code changes would be required. It depends. If you're going to mmap shared memory in on each node, you need to know which processes are collocal. If you're willing to constrain how processes are mapped to nodes, this could be easy. (E.g., "every 4 processes are collocal".) If you want to discover dynamically at run time which are collocal, it would be harder. The mmap stuff could be in a stand-alone function of about a dozen lines. If the shared area is allocated as one piece, substituting the single malloc() call with a call to your mmap function should be simple. If you have many malloc()s you're trying to replace, it's harder. Andrei Fokau wrote: The data are read from a file and processed before calculations begin, so I think that mapping will not work in our case. Global Arrays look promising indeed. As I said, we need to put just a part of data to the shared section. John, do you (or may be other users) have an experience of working with GA? http://www.emsl.pnl.gov/docs/global/um/build.html When GA runs with MPI: MPI_Init(..) ! start MPI GA_Initialize() ! start global arrays MA_Init(..) ! start memory allocator do work GA_Terminate() ! tidy up global arrays MPI_Finalize() ! tidy up MPI ! exit program On Fri, Sep 24, 2010 at 13:44, Reutiwrote: Am 24.09.2010 um 13:26 schrieb John Hearns: > On 24 September 2010 08:46, Andrei Fokau wrote: >> We use a C-program which consumes a lot of memory per process (up to few >> GB), 99% of the data being the same for each process. So for us it would be >> quite reasonable to put that part of data in a shared memory. > > http://www.emsl.pnl.gov/docs/global/ > > Is this eny help? Apologies if I'm talking through my hat. I was also thinking of this when I read "data in a shared memory" (besides approaches like http://www.kerrighed.org/wiki/index.php/Main_Page). Wasn't this also one idea behind "High Performance Fortran" - running in parallel across nodes even without knowing that it's across nodes at all while programming and access all data like it's being local.
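As a rough illustration of the "one process per node creates, its node-local peers attach" idea described above, the sketch below uses POSIX shared memory. The segment name, the size handling, and how the node "leader" is chosen (for example, the lowest rank per hostname) are assumptions, not part of any poster's code; a synchronization step such as MPI_Barrier on MPI_COMM_WORLD is still needed so peers do not map the segment before the leader has created and initialized it. On Linux, link with -lrt.

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* The designated leader on each node creates and sizes the segment;
 * every other process on that node attaches to the same name. */
static void *attach_node_segment(const char *name, size_t bytes, int i_am_leader)
{
    int fd = shm_open(name, i_am_leader ? (O_CREAT | O_RDWR) : O_RDWR, 0600);
    if (fd < 0) { perror("shm_open"); return NULL; }

    if (i_am_leader && ftruncate(fd, (off_t) bytes) != 0) {
        perror("ftruncate"); close(fd); return NULL;
    }

    void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);
    if (p == MAP_FAILED) { perror("mmap"); return NULL; }
    return p;   /* leader initializes the shared data here; peers just read it */
}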
Re: [OMPI users] [openib] segfault when using openib btl
Terry, You were right, the error indeed seems to come from the message coalescing feature. If I turn it off using the "--mca btl_openib_use_message_coalescing 0", I'm not able to observe the "hdr->tag=0" error. There are some trac requests associated to very similar error (https://svn.open-mpi.org/trac/ompi/search?q=coalescing) but they are all closed (except https://svn.open-mpi.org/trac/ompi/ticket/2352 that might be related), aren't they ? What would you suggest Terry ? Eloi On Friday 24 September 2010 16:00:26 Terry Dontje wrote: > Eloi Gaudry wrote: > > Terry, > > > > No, I haven't tried any other values than P,65536,256,192,128 yet. > > > > The reason why is quite simple. I've been reading and reading again this > > thread to understand the btl_openib_receive_queues meaning and I can't > > figure out why the default values seem to induce the hdr- > > > >> tag=0 issue > >> (http://www.open-mpi.org/community/lists/users/2009/01/7808.php). > > Yeah, the size of the fragments and number of them really should not > cause this issue. So I too am a little perplexed about it. > > > Do you think that the default shared received queue parameters are > > erroneous for this specific Mellanox card ? Any help on finding the > > proper parameters would actually be much appreciated. > > I don't necessarily think it is the queue size for a specific card but > more so the handling of the queues by the BTL when using certain sizes. > At least that is one gut feel I have. > > In my mind the tag being 0 is either something below OMPI is polluting > the data fragment or OMPI's internal protocol is some how getting messed > up. I can imagine (no empirical data here) the queue sizes could change > how the OMPI protocol sets things up. Another thing may be the > coalescing feature in the openib BTL which tries to gang multiple > messages into one packet when resources are running low. I can see > where changing the queue sizes might affect the coalescing. So, it > might be interesting to turn off the coalescing. You can do that by > setting "--mca btl_openib_use_message_coalescing 0" in your mpirun line. > > If that doesn't solve the issue then obviously there must be something > else going on :-). > > Note, the reason I am interested in this is I am seeing a similar error > condition (hdr->tag == 0) on a development system. Though my failing > case fails with np=8 using the connectivity test program which is mainly > point to point and there are not a significant amount of data transfers > going on either. > > --td > > > Eloi > > > > On Friday 24 September 2010 14:27:07 you wrote: > >> That is interesting. So does the number of processes affect your runs > >> any. The times I've seen hdr->tag be 0 usually has been due to protocol > >> issues. The tag should never be 0. Have you tried to do other > >> receive_queue settings other than the default and the one you mention. > >> > >> I wonder if you did a combination of the two receive queues causes a > >> failure or not. Something like > >> > >> P,128,256,192,128:P,65536,256,192,128 > >> > >> I am wondering if it is the first queuing definition causing the issue > >> or possibly the SRQ defined in the default. > >> > >> --td > >> > >> Eloi Gaudry wrote: > >>> Hi Terry, > >>> > >>> The messages being send/received can be of any size, but the error > >>> seems to happen more often with small messages (as an int being > >>> broadcasted or allreduced). 
The failing communication differs from one > >>> run to another, but some spots are more likely to be failing than > >>> another. And as far as I know, there are always located next to a > >>> small message (an int being broadcasted for instance) communication. > >>> Other typical messages size are > >>> > 10k but can be very much larger. > >>> > >>> I've been checking the hca being used, its' from mellanox (with > >>> vendor_part_id=26428). There is no receive_queues parameters associated > >>> to it. > >>> > >>> $ cat share/openmpi/mca-btl-openib-device-params.ini as well: > >>> [...] > >>> > >>> # A.k.a. ConnectX > >>> [Mellanox Hermon] > >>> vendor_id = 0x2c9,0x5ad,0x66a,0x8f1,0x1708,0x03ba,0x15b3 > >>> vendor_part_id = > >>> 25408,25418,25428,26418,26428,25448,26438,26448,26468,26478,26488 > >>> use_eager_rdma = 1 > >>> mtu = 2048 > >>> max_inline_data = 128 > >>> > >>> [..] > >>> > >>> $ ompi_info --param btl openib --parsable | grep receive_queues > >>> > >>> mca:btl:openib:param:btl_openib_receive_queues:value:P,128,256,192,128 > >>> :S ,2048,256,128,32:S,12288,256,128,32:S,65536,256,128,32 > >>> mca:btl:openib:param:btl_openib_receive_queues:data_source:default > >>> value mca:btl:openib:param:btl_openib_receive_queues:status:writable > >>> mca:btl:openib:param:btl_openib_receive_queues:help:Colon-delimited, > >>> comma delimited list of receive queues: P,4096,8,6,4:P,32768,8,6,4 > >>> mca:btl:openib:param:btl_openib_receive_queues:deprecated:no >
Re: [OMPI users] [openib] segfault when using openib btl
Terry, No, I haven't tried any other values than P,65536,256,192,128 yet. The reason why is quite simple. I've been reading and reading again this thread to understand the btl_openib_receive_queues meaning and I can't figure out why the default values seem to induce the "hdr->tag=0" issue (http://www.open-mpi.org/community/lists/users/2009/01/7808.php). Do you think that the default shared received queue parameters are erroneous for this specific Mellanox card ? Any help on finding the proper parameters would actually be much appreciated. Eloi On Friday 24 September 2010 14:27:07 you wrote: > That is interesting. So does the number of processes affect your runs > any. The times I've seen hdr->tag be 0 usually has been due to protocol > issues. The tag should never be 0. Have you tried to do other > receive_queue settings other than the default and the one you mention. > > I wonder if you did a combination of the two receive queues causes a > failure or not. Something like > > P,128,256,192,128:P,65536,256,192,128 > > I am wondering if it is the first queuing definition causing the issue or > possibly the SRQ defined in the default. > > --td > > Eloi Gaudry wrote: > > Hi Terry, > > > > The messages being send/received can be of any size, but the error seems > > to happen more often with small messages (as an int being broadcasted or > > allreduced). The failing communication differs from one run to another, > > but some spots are more likely to be failing than another. And as far as > > I know, there are always located next to a small message (an int being > > broadcasted for instance) communication. Other typical messages size are > > >10k but can be very much larger. > > > > I've been checking the hca being used, its' from mellanox (with > > vendor_part_id=26428). There is no receive_queues parameters associated > > to it. > > > > $ cat share/openmpi/mca-btl-openib-device-params.ini as well: > > [...] > > > > # A.k.a. ConnectX > > [Mellanox Hermon] > > vendor_id = 0x2c9,0x5ad,0x66a,0x8f1,0x1708,0x03ba,0x15b3 > > vendor_part_id = > > 25408,25418,25428,26418,26428,25448,26438,26448,26468,26478,26488 > > use_eager_rdma = 1 > > mtu = 2048 > > max_inline_data = 128 > > > > [..] > > > > $ ompi_info --param btl openib --parsable | grep receive_queues > > > > mca:btl:openib:param:btl_openib_receive_queues:value:P,128,256,192,128:S > > ,2048,256,128,32:S,12288,256,128,32:S,65536,256,128,32 > > mca:btl:openib:param:btl_openib_receive_queues:data_source:default > > value mca:btl:openib:param:btl_openib_receive_queues:status:writable > > mca:btl:openib:param:btl_openib_receive_queues:help:Colon-delimited, > > comma delimited list of receive queues: P,4096,8,6,4:P,32768,8,6,4 > > mca:btl:openib:param:btl_openib_receive_queues:deprecated:no > > > > I was wondering if these parameters (automatically computed at openib btl > > init for what I understood) were not incorrect in some way and I plugged > > some others values: "P,65536,256,192,128" (someone on the list used that > > values when encountering a different issue) . Since that, I haven't been > > able to observe the segfault (occuring as hrd->tag = 0 in > > btl_openib_component.c:2881) yet. > > > > Eloi > > > > > > /home/pp_fr/st03230/EG/Softs/openmpi-custom-1.4.2/bin/ > > > > On Thursday 23 September 2010 23:33:48 Terry Dontje wrote: > >> Eloi, I am curious about your problem. Can you tell me what size of job > >> it is? Does it always fail on the same bcast, or same process? 
> >> > >> Eloi Gaudry wrote: > >>> Hi Nysal, > >>> > >>> Thanks for your suggestions. > >>> > >>> I'm now able to get the checksum computed and redirected to stdout, > >>> thanks (I forgot the "-mca pml_base_verbose 5" option, you were > >>> right). I haven't been able to observe the segmentation fault (with > >>> hdr->tag=0) so far (when using pml csum) but I 'll let you know when I > >>> am. > >>> > >>> I've got two others question, which may be related to the error > >>> observed: > >>> > >>> 1/ does the maximum number of MPI_Comm that can be handled by OpenMPI > >>> somehow depends on the btl being used (i.e. if I'm using openib, may I > >>> use the same number of MPI_Comm object as with tcp) ? Is there > >>> something as MPI_COMM_MAX in OpenMPI ? > >>> > >>> 2/ the segfaults only appears during a mpi collective call, with very > >>> small message (one int is being broadcast, for instance) ; i followed > >>> the guidelines given at http://icl.cs.utk.edu/open- > >>> mpi/faq/?category=openfabrics#ib-small-message-rdma but the debug-build > >>> of OpenMPI asserts if I use a different min-size that 255. Anyway, if I > >>> deactivate eager_rdma, the segfaults remains. Does the openib btl > >>> handle very small message differently (even with eager_rdma > >>> deactivated) than tcp ? > >> > >> Others on the list does coalescing happen with non-eager_rdma? If so > >> then that would possibly be one difference between the
Re: [OMPI users] "self scheduled" work & mpi receive???
Amb It sounds like you have more workers than you can keep fed. Workers are finishing up and requesting their next assignment but sit idle because there are so many other idle workers too. Load balance does not really matter if the choke point is the master. The work is being done as fast as the master can hand it out. Consider using fewer workers and seeing if your load balance improves and your total thruput stays the same. If you want to use all the workers you have efficiently, you need to find a way to make the master deliver assignments as fast as workers finish them. Compute processes do not care about fairness. Having half the processes busy 100% of the time and the other half idle vs. having all the processes busy 50% of the time gives the same thruput and the hard workers will not complain. Dick Treumann - MPI Team IBM Systems & Technology Group Dept X2ZA / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601 Tele (845) 433-7846 Fax (845) 433-8363 From: Mikael LavoieTo: Open MPI Users Date: 09/23/2010 05:08 PM Subject: Re: [OMPI users] "self scheduled" work & mpi receive??? Sent by: users-boun...@open-mpi.org Hi Ambrose, I'm interested in you work, i have a app to convert for myself and i don't know enough the MPI structure and syntaxe to make it... So if you wanna share your app i'm interested in taking a look at it!! Thanks and have a nice day!! Mikael Lavoie 2010/9/23 Lewis, Ambrose J. Hi All: I’ve written an openmpi program that “self schedules” the work. The master task is in a loop chunking up an input stream and handing off jobs to worker tasks. At first the master gives the next job to the next highest rank. After all ranks have their first job, the master waits via an MPI receive call for the next free worker. The master parses out the rank from the MPI receive and sends the next job to this node. The jobs aren’t all identical, so they run for slightly different durations based on the input data. When I plot a histogram of the number of jobs each worker performed, the lower mpi ranks are doing much more work than the higher ranks. For example, in a 120 process run, rank 1 did 32 jobs while rank 119 only did 2. My guess is that openmpi returns the lowest rank from the MPI Recv when I’ve got MPI_ANY_SOURCE set and multiple sends have happened since the last call. Is there a different Recv call to make that will spread out the data better? THANXS! amb ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] Display in terminal of error message using throw std::runtime_error on distant node...
That is already an answer that makes sense. I understand that it is really not a trivial issue. I have seen other recent threads about "running on crashed nodes", and that the Open MPI team is working hard on it. We will wait, and will be glad to test the first versions when they are released (I understand it will take some time). Thanks for this quick reply, Olivier 2010/9/24 Jeff Squyres > Open MPI's fault tolerance is still somewhat rudimentary; it's a complex > topic within the entire scope of MPI. There has been much research into MPI > and fault tolerance over the years; the MPI Forum itself is grappling with > terms and definitions that make sense. It's by no means a "solved" problem. > > It's unfortunately unsurprising that Open MPI may hang in the case of a > node crash. I wish that I had a better answer for you, but I don't. :-\ > > > On Sep 24, 2010, at 3:36 AM, Olivier Riff wrote: > > > Hello, > > > > My question concerns the display of error message generated by a throw > std::runtime_error("Explicit error message"). > > I am launching on a terminal an openMPI program on several machines > using: > > mpirun -v -machinefile MyMachineFile.txt MyProgram. > > I am wondering why I cannot see an error message displayed on the > terminal when one of my distant node (meaning not the node where the > terminal is used) is crashing. I was expecting that following try catch > could also generates a display in the terminal: > > try {...My code where a crash happens... } > > { > > throw std::runtime_error( "Explicit error message" ); > > } > > > > Generally, my problem is that one of the node crashes and the global > application waits forever data from this node. On the terminal, nothing is > displayed indicating that the node has crashed and generated a useful > information of the crash nature. > > > > ( I don't think these information are relevant here, but just in case: I > am using openMPI 1.4.2, on a Mandriva 2008 system ) > > > > Thanks in advance for any help/info/advice. > > > > Olivier > -- > Jeff Squyres > jsquy...@cisco.com > For corporate legal information go to: > http://www.cisco.com/web/about/doing_business/legal/cri/

Re: [OMPI users] Shared memory
The data are read from a file and processed before calculations begin, so I think that mapping will not work in our case. Global Arrays look promising indeed. As I said, we need to put just a part of the data in the shared section. John, do you (or maybe other users) have experience working with GA? http://www.emsl.pnl.gov/docs/global/um/build.html

*When GA runs with MPI:*
MPI_Init(..)      ! start MPI
GA_Initialize()   ! start global arrays
MA_Init(..)       ! start memory allocator
   do work
GA_Terminate()    ! tidy up global arrays
MPI_Finalize()    ! tidy up MPI
! exit program

On Fri, Sep 24, 2010 at 13:44, Reuti wrote: > On 24.09.2010 at 13:26, John Hearns wrote: > > > On 24 September 2010 08:46, Andrei Fokau > wrote: > >> We use a C-program which consumes a lot of memory per process (up to few > >> GB), 99% of the data being the same for each process. So for us it would > be > >> quite reasonable to put that part of data in a shared memory. > > > > http://www.emsl.pnl.gov/docs/global/ > > > > Is this any help? Apologies if I'm talking through my hat. > > I was also thinking of this when I read "data in a shared memory" (besides > approaches like http://www.kerrighed.org/wiki/index.php/Main_Page). Wasn't > this also one idea behind "High Performance Fortran" - running in parallel > across nodes even without knowing that it's across nodes at all while > programming and access all data like it's being local. > > -- Reuti
Re: [OMPI users] How to know which process is running on which core?
I completely neglected to mention that you could also use hwloc (Hardware Locality), a small utility library for learning topology-kinds of things (including if you're bound, where you're bound, etc.). Hwloc is a sub-project of Open MPI: http://www.open-mpi.org/projects/hwloc/ Open MPI uses hwloc internally, but you could also link your application against hwloc and call its C functions to get information about your process' locality, etc. On Sep 24, 2010, at 8:14 AM, Jeff Squyres wrote: > On the OMPI SVN trunk, we have an "Open MPI extension" call named > OMPI_Affinity_str(). Below is an excerpt from the man page. If this is > desirable, we can probably get it into 1.5.1. > > - > > NAME > OMPI_Affinity_str - Obtain prettyprint strings of processor affinity > information for this process > > > SYNTAX > C Syntax > #include > #include > > int OMPI_Affinity_str(char ompi_bound[OMPI_AFFINITY_STRING_MAX], > char current_binding[OMPI_AFFINITY_STRING_MAX], > char exists[OMPI_AFFINITY_STRING_MAX]) > > [snip] > OUTPUT PARAMETERS > ompi_bound > A prettyprint string describing what processor(s) Open MPI > bound this process to, or a string indicating that Open MPI > did not bind this process. > > current_binding > A prettyprint string describing what processor(s) this > process is currently bound to, or a string indicating that > the process is bound to all available processors (and is > therefore considered "unbound"). > > existsA prettyprint string describing the available sockets and > cores on this host. > > > > > On Sep 23, 2010, at 11:10 PM, Ralph Castain wrote: > >> You mean via an API of some kind? Not through an MPI call, but you can do it >> (though your code will wind up OMPI-specific). Look at the OMPI source code >> in opal/mca/paffinity/paffinity.h and you'll see the necessary calls as well >> as some macros to help parse the results. >> >> Depending upon what version you are using, there may also be a function in >> opal/mca/paffinity/base/base.h to pretty-print that info for you. I believe >> it may only be in the developer's trunk right now, or it may have made it >> into the 1.5.0 release candidate >> >> >> On Thu, Sep 23, 2010 at 11:24 AM, Fernando Saez>> wrote: >> Hi all, I'm new in the list. I don't know if this post has been treated >> before. >> >> My question is: >> >> Is there a way in the OMPI library to report which process is running >> on which core in a SMP system? I need to know processor affinity for >> optimizations issues. >> >> Regards >> >> Fernando Saez >> >> >> >> ___ >> users mailing list >> us...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/users >> >> ___ >> users mailing list >> us...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/users > > > -- > Jeff Squyres > jsquy...@cisco.com > For corporate legal information go to: > http://www.cisco.com/web/about/doing_business/legal/cri/ > > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users -- Jeff Squyres jsquy...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
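A minimal sketch of the hwloc side of this, assuming the hwloc 1.x C API and linking with -lhwloc: it loads the machine topology and prints the CPU set the calling process is currently bound to. The exact output format is illustrative.

#include <hwloc.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    hwloc_bitmap_t set = hwloc_bitmap_alloc();
    if (hwloc_get_cpubind(topo, set, HWLOC_CPUBIND_PROCESS) == 0) {
        char *s;
        hwloc_bitmap_asprintf(&s, set);   /* e.g. "0x00000003" */
        printf("currently bound to cpuset %s\n", s);
        free(s);
    } else {
        printf("could not determine the current binding\n");
    }

    hwloc_bitmap_free(set);
    hwloc_topology_destroy(topo);
    return 0;
}

In an MPI program each rank would do the same and print its rank alongside the cpuset to see the process-to-core placement.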
Re: [OMPI users] [openib] segfault when using openib btl
That is interesting. So does the number of processes affect your runs any. The times I've seen hdr->tag be 0 usually has been due to protocol issues. The tag should never be 0. Have you tried to do other receive_queue settings other than the default and the one you mention. I wonder if you did a combination of the two receive queues causes a failure or not. Something like P,128,256,192,128:P,65536,256,192,128 I am wondering if it is the first queuing definition causing the issue or possibly the SRQ defined in the default. --td Eloi Gaudry wrote: Hi Terry, The messages being send/received can be of any size, but the error seems to happen more often with small messages (as an int being broadcasted or allreduced). The failing communication differs from one run to another, but some spots are more likely to be failing than another. And as far as I know, there are always located next to a small message (an int being broadcasted for instance) communication. Other typical messages size are >10k but can be very much larger. I've been checking the hca being used, its' from mellanox (with vendor_part_id=26428). There is no receive_queues parameters associated to it. $ cat share/openmpi/mca-btl-openib-device-params.ini as well: [...] # A.k.a. ConnectX [Mellanox Hermon] vendor_id = 0x2c9,0x5ad,0x66a,0x8f1,0x1708,0x03ba,0x15b3 vendor_part_id = 25408,25418,25428,26418,26428,25448,26438,26448,26468,26478,26488 use_eager_rdma = 1 mtu = 2048 max_inline_data = 128 [..] $ ompi_info --param btl openib --parsable | grep receive_queues mca:btl:openib:param:btl_openib_receive_queues:value:P,128,256,192,128:S,2048,256,128,32:S,12288,256,128,32:S,65536,256,128,32 mca:btl:openib:param:btl_openib_receive_queues:data_source:default value mca:btl:openib:param:btl_openib_receive_queues:status:writable mca:btl:openib:param:btl_openib_receive_queues:help:Colon-delimited, comma delimited list of receive queues: P,4096,8,6,4:P,32768,8,6,4 mca:btl:openib:param:btl_openib_receive_queues:deprecated:no I was wondering if these parameters (automatically computed at openib btl init for what I understood) were not incorrect in some way and I plugged some others values: "P,65536,256,192,128" (someone on the list used that values when encountering a different issue) . Since that, I haven't been able to observe the segfault (occuring as hrd->tag = 0 in btl_openib_component.c:2881) yet. Eloi /home/pp_fr/st03230/EG/Softs/openmpi-custom-1.4.2/bin/ On Thursday 23 September 2010 23:33:48 Terry Dontje wrote: Eloi, I am curious about your problem. Can you tell me what size of job it is? Does it always fail on the same bcast, or same process? Eloi Gaudry wrote: Hi Nysal, Thanks for your suggestions. I'm now able to get the checksum computed and redirected to stdout, thanks (I forgot the "-mca pml_base_verbose 5" option, you were right). I haven't been able to observe the segmentation fault (with hdr->tag=0) so far (when using pml csum) but I 'll let you know when I am. I've got two others question, which may be related to the error observed: 1/ does the maximum number of MPI_Comm that can be handled by OpenMPI somehow depends on the btl being used (i.e. if I'm using openib, may I use the same number of MPI_Comm object as with tcp) ? Is there something as MPI_COMM_MAX in OpenMPI ? 
2/ the segfaults only appears during a mpi collective call, with very small message (one int is being broadcast, for instance) ; i followed the guidelines given at http://icl.cs.utk.edu/open- mpi/faq/?category=openfabrics#ib-small-message-rdma but the debug-build of OpenMPI asserts if I use a different min-size that 255. Anyway, if I deactivate eager_rdma, the segfaults remains. Does the openib btl handle very small message differently (even with eager_rdma deactivated) than tcp ? Others on the list does coalescing happen with non-eager_rdma? If so then that would possibly be one difference between the openib btl and tcp aside from the actual protocol used. is there a way to make sure that large messages and small messages are handled the same way ? Do you mean so they all look like eager messages? How large of messages are we talking about here 1K, 1M or 10M? --td Regards, Eloi On Friday 17 September 2010 17:57:17 Nysal Jan wrote: Hi Eloi, Create a debug build of OpenMPI (--enable-debug) and while running with the csum PML add "-mca pml_base_verbose 5" to the command line. This will print the checksum details for each fragment sent over the wire. I'm guessing it didnt catch anything because the BTL failed. The checksum verification is done in the PML, which the BTL calls via a callback function. In your case the PML callback is never called because the hdr->tag is invalid. So enabling checksum tracing also might not be of much use. Is it the first Bcast that fails or the nth Bcast and what is the message size? I'm not sure what could be the problem at this
Re: [OMPI users] Running on crashing nodes
As one of the Open MPI developers actively working on the MPI layer stabilization/recover feature set, I don't think we can give you a specific timeframe for availability, especially availability in a stable release. Once the initial functionality is finished, we will open it up for user testing by making a public branch available. After addressing the concerns highlighted by public testing, we will attempt to work this feature into the mainline trunk and eventual release. Unfortunately it is difficult to assess the time needed to go through these development stages. What I can tell you is that the work to this point on the MPI layer is looking promising, and that as soon as we feel that the code is ready we will make it available to the public for further testing. -- Josh On Sep 24, 2010, at 3:37 AM, Andrei Fokau wrote: > Ralph, could you tell us when this functionality will be available in the > stable version? A rough estimate will be fine. > > > On Fri, Sep 24, 2010 at 01:24, Ralph Castainwrote: > In a word, no. If a node crashes, OMPI will abort the currently-running job > if it had processes on that node. There is no current ability to "ride-thru" > such an event. > > That said, there is work being done to support "ride-thru". Most of that is > in the current developer's code trunk, and more is coming, but I wouldn't > consider it production-quality just yet. > > Specifically, the code that does what you specify below is done and works. It > is recovery of the MPI job itself (collectives, lost messages, etc.) that > remains to be completed. > > > On Thu, Sep 23, 2010 at 7:22 AM, Andrei Fokau > wrote: > Dear users, > > Our cluster has a number of nodes which have high probability to crash, so it > happens quite often that calculations stop due to one node getting down. May > be you know if it is possible to block the crashed nodes during run-time when > running with OpenMPI? I am asking about principal possibility to program such > behavior. Does OpenMPI allow such dynamic checking? The scheme I am curious > about is the following: > > 1. A code starts its tasks via mpirun on several nodes > 2. At some moment one node gets down > 3. The code realizes that the node is down (the results are lost) and > excludes it from the list of nodes to run its tasks on > 4. At later moment the user restarts the crashed node > 5. The code notices that the node is up again, and puts it back to the list > of active nodes > > > Regards, > Andrei > > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users > > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users > > Joshua Hursey Postdoctoral Research Associate Oak Ridge National Laboratory http://www.cs.indiana.edu/~jjhursey
Re: [OMPI users] Display in terminal of error message using throw std::runtime_error on distant node...
Open MPI's fault tolerance is still somewhat rudimentary; it's a complex topic within the entire scope of MPI. There has been much research into MPI and fault tolerance over the years; the MPI Forum itself is grappling with terms and definitions that make sense. It's by no means a "solved" problem. It's unfortunately unsurprising that Open MPI may hang in the case of a node crash. I wish that I had a better answer for you, but I don't. :-\ On Sep 24, 2010, at 3:36 AM, Olivier Riff wrote: > Hello, > > My question concerns the display of error message generated by a throw > std::runtime_error("Explicit error message"). > I am launching on a terminal an openMPI program on several machines using: > mpirun -v -machinefile MyMachineFile.txt MyProgram. > I am wondering why I cannot see an error message displayed on the terminal > when one of my distant node (meaning not the node where the terminal is used) > is crashing. I was expecting that following try catch could also generates a > display in the terminal: > try {...My code where a crash happens... } > { > throw std::runtime_error( "Explicit error message" ); > } > > Generally, my problem is that one of the node crashes and the global > application waits forever data from this node. On the terminal, nothing is > displayed indicating that the node has crashed and generated a useful > information of the crash nature. > > ( I don't think these information are relevant here, but just in case: I am > using openMPI 1.4.2, on a Mandriva 2008 system ) > > Thanks in advance for any help/info/advice. > > Olivier > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users -- Jeff Squyres jsquy...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI users] How to know which process is running on which core?
On the OMPI SVN trunk, we have an "Open MPI extension" call named OMPI_Affinity_str(). Below is an excerpt from the man page. If this is desirable, we can probably get it into 1.5.1. - NAME OMPI_Affinity_str - Obtain prettyprint strings of processor affinity information for this process SYNTAX C Syntax #include #include int OMPI_Affinity_str(char ompi_bound[OMPI_AFFINITY_STRING_MAX], char current_binding[OMPI_AFFINITY_STRING_MAX], char exists[OMPI_AFFINITY_STRING_MAX]) [snip] OUTPUT PARAMETERS ompi_bound A prettyprint string describing what processor(s) Open MPI bound this process to, or a string indicating that Open MPI did not bind this process. current_binding A prettyprint string describing what processor(s) this process is currently bound to, or a string indicating that the process is bound to all available processors (and is therefore considered "unbound"). existsA prettyprint string describing the available sockets and cores on this host. On Sep 23, 2010, at 11:10 PM, Ralph Castain wrote: > You mean via an API of some kind? Not through an MPI call, but you can do it > (though your code will wind up OMPI-specific). Look at the OMPI source code > in opal/mca/paffinity/paffinity.h and you'll see the necessary calls as well > as some macros to help parse the results. > > Depending upon what version you are using, there may also be a function in > opal/mca/paffinity/base/base.h to pretty-print that info for you. I believe > it may only be in the developer's trunk right now, or it may have made it > into the 1.5.0 release candidate > > > On Thu, Sep 23, 2010 at 11:24 AM, Fernando Saez> wrote: > Hi all, I'm new in the list. I don't know if this post has been treated > before. > > My question is: > > Is there a way in the OMPI library to report which process is running > on which core in a SMP system? I need to know processor affinity for > optimizations issues. > > Regards > > Fernando Saez > > > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users -- Jeff Squyres jsquy...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
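Following the signature in the man-page excerpt above, a call would look roughly like the sketch below. The mpi-ext.h header name is an assumption (the include names were omitted in the excerpt), and the extension is only present in Open MPI builds that ship it, so this is illustrative rather than portable MPI code.

#include <stdio.h>
#include <mpi.h>
#include <mpi-ext.h>   /* assumed header for the Open MPI extensions */

int main(int argc, char **argv)
{
    char ompi_bound[OMPI_AFFINITY_STRING_MAX];
    char current_binding[OMPI_AFFINITY_STRING_MAX];
    char exists[OMPI_AFFINITY_STRING_MAX];
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    OMPI_Affinity_str(ompi_bound, current_binding, exists);
    printf("rank %d: bound by Open MPI to: %s\n", rank, ompi_bound);
    printf("rank %d: currently bound to:   %s\n", rank, current_binding);
    printf("rank %d: sockets/cores here:   %s\n", rank, exists);

    MPI_Finalize();
    return 0;
}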
Re: [OMPI users] Shared memory
On 24.09.2010 at 13:26, John Hearns wrote: > On 24 September 2010 08:46, Andrei Fokau wrote: >> We use a C-program which consumes a lot of memory per process (up to few >> GB), 99% of the data being the same for each process. So for us it would be >> quite reasonable to put that part of data in a shared memory. > > http://www.emsl.pnl.gov/docs/global/ > > Is this any help? Apologies if I'm talking through my hat. I was also thinking of this when I read "data in a shared memory" (besides approaches like http://www.kerrighed.org/wiki/index.php/Main_Page). Wasn't this also one idea behind "High Performance Fortran" - running in parallel across nodes even without knowing, while programming, that it's across nodes at all, and accessing all data as if it were local? -- Reuti
Re: [OMPI users] Shared memory
On 24 September 2010 08:46, Andrei Fokau wrote: > We use a C-program which consumes a lot of memory per process (up to few > GB), 99% of the data being the same for each process. So for us it would be > quite reasonable to put that part of data in a shared memory. http://www.emsl.pnl.gov/docs/global/ Is this any help? Apologies if I'm talking through my hat.
Re: [OMPI users] Shared memory
Is the data coming from a read-only file? In that case, a better way might be to memory map that file in the root process and share the map pointer in all the slave threads. This, like shared memory, will work only for processes within a node, of course. On Fri, Sep 24, 2010 at 3:46 AM, Andrei Fokau wrote: > We use a C-program which consumes a lot of memory per process (up to few > GB), 99% of the data being the same for each process. So for us it would be > quite reasonable to put that part of data in a shared memory. > In the source code, the memory is allocated via malloc() function. What > would it require for us to change in the source code to be able to put that > repeating data in a shared memory? > The code is normally run on several nodes.
[OMPI users] Shared memory
We use a C-program which consumes a lot of memory per process (up to a few GB), 99% of the data being the same for each process, so for us it would be quite reasonable to put that part of the data in shared memory. In the source code, the memory is allocated via the malloc() function. What would we have to change in the source code to be able to put that repeated data in shared memory? The code is normally run on several nodes.
Re: [OMPI users] Running on crashing nodes
Ralph, could you tell us when this functionality will be available in the stable version? A rough estimate will be fine. On Fri, Sep 24, 2010 at 01:24, Ralph Castainwrote: > In a word, no. If a node crashes, OMPI will abort the currently-running job > if it had processes on that node. There is no current ability to "ride-thru" > such an event. > > That said, there is work being done to support "ride-thru". Most of that is > in the current developer's code trunk, and more is coming, but I wouldn't > consider it production-quality just yet. > > Specifically, the code that does what you specify below is done and works. > It is recovery of the MPI job itself (collectives, lost messages, etc.) that > remains to be completed. > > > On Thu, Sep 23, 2010 at 7:22 AM, Andrei Fokau < > andrei.fo...@neutron.kth.se> wrote: > >> Dear users, >> >> Our cluster has a number of nodes which have high probability to crash, so >> it happens quite often that calculations stop due to one node getting down. >> May be you know if it is possible to block the crashed nodes during run-time >> when running with OpenMPI? I am asking about principal possibility to >> program such behavior. Does OpenMPI allow such dynamic checking? The scheme >> I am curious about is the following: >> >> 1. A code starts its tasks via mpirun on several nodes >> 2. At some moment one node gets down >> 3. The code realizes that the node is down (the results are lost) and >> excludes it from the list of nodes to run its tasks on >> 4. At later moment the user restarts the crashed node >> 5. The code notices that the node is up again, and puts it back to the >> list of active nodes >> >> >> Regards, >> Andrei >> >> >> ___ >> users mailing list >> us...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/users >> > > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users >
[OMPI users] Display in terminal of error message using throw std::runtime_error on distant node...
Hello, My question concerns the display of error messages generated by a throw std::runtime_error("Explicit error message"). I launch an openMPI program on several machines from a terminal using: mpirun -v -machinefile MyMachineFile.txt MyProgram. I am wondering why I cannot see an error message displayed on the terminal when one of my distant nodes (meaning not the node where the terminal is used) crashes. I was expecting that the following try/catch could also generate a display in the terminal: try {...My code where a crash happens... } { throw std::runtime_error( "Explicit error message" ); } Generally, my problem is that one of the nodes crashes and the global application waits forever for data from this node. On the terminal, nothing is displayed indicating that the node has crashed, and no useful information about the nature of the crash is given. (I don't think this information is relevant here, but just in case: I am using openMPI 1.4.2 on a Mandriva 2008 system.) Thanks in advance for any help/info/advice. Olivier
Re: [OMPI users] [openib] segfault when using openib btl
Hi Terry, The messages being send/received can be of any size, but the error seems to happen more often with small messages (as an int being broadcasted or allreduced). The failing communication differs from one run to another, but some spots are more likely to be failing than another. And as far as I know, there are always located next to a small message (an int being broadcasted for instance) communication. Other typical messages size are >10k but can be very much larger. I've been checking the hca being used, its' from mellanox (with vendor_part_id=26428). There is no receive_queues parameters associated to it. $ cat share/openmpi/mca-btl-openib-device-params.ini as well: [...] # A.k.a. ConnectX [Mellanox Hermon] vendor_id = 0x2c9,0x5ad,0x66a,0x8f1,0x1708,0x03ba,0x15b3 vendor_part_id = 25408,25418,25428,26418,26428,25448,26438,26448,26468,26478,26488 use_eager_rdma = 1 mtu = 2048 max_inline_data = 128 [..] $ ompi_info --param btl openib --parsable | grep receive_queues mca:btl:openib:param:btl_openib_receive_queues:value:P,128,256,192,128:S,2048,256,128,32:S,12288,256,128,32:S,65536,256,128,32 mca:btl:openib:param:btl_openib_receive_queues:data_source:default value mca:btl:openib:param:btl_openib_receive_queues:status:writable mca:btl:openib:param:btl_openib_receive_queues:help:Colon-delimited, comma delimited list of receive queues: P,4096,8,6,4:P,32768,8,6,4 mca:btl:openib:param:btl_openib_receive_queues:deprecated:no I was wondering if these parameters (automatically computed at openib btl init for what I understood) were not incorrect in some way and I plugged some others values: "P,65536,256,192,128" (someone on the list used that values when encountering a different issue) . Since that, I haven't been able to observe the segfault (occuring as hrd->tag = 0 in btl_openib_component.c:2881) yet. Eloi /home/pp_fr/st03230/EG/Softs/openmpi-custom-1.4.2/bin/ On Thursday 23 September 2010 23:33:48 Terry Dontje wrote: > Eloi, I am curious about your problem. Can you tell me what size of job > it is? Does it always fail on the same bcast, or same process? > > Eloi Gaudry wrote: > > Hi Nysal, > > > > Thanks for your suggestions. > > > > I'm now able to get the checksum computed and redirected to stdout, > > thanks (I forgot the "-mca pml_base_verbose 5" option, you were right). > > I haven't been able to observe the segmentation fault (with hdr->tag=0) > > so far (when using pml csum) but I 'll let you know when I am. > > > > I've got two others question, which may be related to the error observed: > > > > 1/ does the maximum number of MPI_Comm that can be handled by OpenMPI > > somehow depends on the btl being used (i.e. if I'm using openib, may I > > use the same number of MPI_Comm object as with tcp) ? Is there something > > as MPI_COMM_MAX in OpenMPI ? > > > > 2/ the segfaults only appears during a mpi collective call, with very > > small message (one int is being broadcast, for instance) ; i followed > > the guidelines given at http://icl.cs.utk.edu/open- > > mpi/faq/?category=openfabrics#ib-small-message-rdma but the debug-build > > of OpenMPI asserts if I use a different min-size that 255. Anyway, if I > > deactivate eager_rdma, the segfaults remains. Does the openib btl handle > > very small message differently (even with eager_rdma deactivated) than > > tcp ? > > Others on the list does coalescing happen with non-eager_rdma? If so > then that would possibly be one difference between the openib btl and > tcp aside from the actual protocol used. 
> > > is there a way to make sure that large messages and small messages are > > handled the same way ? > > Do you mean so they all look like eager messages? How large of messages > are we talking about here 1K, 1M or 10M? > > --td > > > Regards, > > Eloi > > > > On Friday 17 September 2010 17:57:17 Nysal Jan wrote: > >> Hi Eloi, > >> Create a debug build of OpenMPI (--enable-debug) and while running with > >> the csum PML add "-mca pml_base_verbose 5" to the command line. This > >> will print the checksum details for each fragment sent over the wire. > >> I'm guessing it didnt catch anything because the BTL failed. The > >> checksum verification is done in the PML, which the BTL calls via a > >> callback function. In your case the PML callback is never called > >> because the hdr->tag is invalid. So enabling checksum tracing also > >> might not be of much use. Is it the first Bcast that fails or the nth > >> Bcast and what is the message size? I'm not sure what could be the > >> problem at this moment. I'm afraid you will have to debug the BTL to > >> find out more. > >> > >> --Nysal > >> > >> On Fri, Sep 17, 2010 at 4:39 PM, Eloi Gaudrywrote: > >>> Hi Nysal, > >>> > >>> thanks for your response. > >>> > >>> I've been unable so far to write a test case that could illustrate the > >>> hdr->tag=0 error. > >>> Actually, I'm only observing this issue when running an internode > >>>
Re: [OMPI users] How to know which process is running on which core?
You mean via an API of some kind? Not through an MPI call, but you can do it (though your code will wind up OMPI-specific). Look at the OMPI source code in opal/mca/paffinity/paffinity.h and you'll see the necessary calls, as well as some macros to help parse the results. Depending upon what version you are using, there may also be a function in opal/mca/paffinity/base/base.h to pretty-print that info for you. I believe it may only be in the developer's trunk right now, or it may have made it into the 1.5.0 release candidate. On Thu, Sep 23, 2010 at 11:24 AM, Fernando Saez wrote: > Hi all, I'm new on the list. I don't know if this post has been treated > before. > > My question is: > > Is there a way in the OMPI library to report which process is running > on which core in an SMP system? I need to know processor affinity for > optimization issues. > > Regards > > Fernando Saez