Re: [OMPI users] Running on crashing nodes
I have successfully used a Perl program to start mpirun and record its PID. The monitor can then watch the output from MPI and terminate the mpirun command with a series of kills or something if it is having trouble. One way of doing this is to prefix all legal output from your MPI program with a known short string; if the monitor does not see this string prefixed on a line, it can terminate MPI, check the available nodes, and recast the job accordingly.

Hope this helps,
Randolph

--- On Fri, 24/9/10, Joshua Hursey wrote:

From: Joshua Hursey
Subject: Re: [OMPI users] Running on crashing nodes
To: "Open MPI Users"
Received: Friday, 24 September, 2010, 10:18 PM

As one of the Open MPI developers actively working on the MPI layer stabilization/recovery feature set, I don't think we can give you a specific timeframe for availability, especially availability in a stable release. Once the initial functionality is finished, we will open it up for user testing by making a public branch available. After addressing the concerns highlighted by public testing, we will attempt to work this feature into the mainline trunk and an eventual release. Unfortunately, it is difficult to assess the time needed to go through these development stages. What I can tell you is that the work to this point on the MPI layer is looking promising, and that as soon as we feel the code is ready we will make it available to the public for further testing.

-- Josh

On Sep 24, 2010, at 3:37 AM, Andrei Fokau wrote:

> Ralph, could you tell us when this functionality will be available in the stable version? A rough estimate will be fine.
>
> On Fri, Sep 24, 2010 at 01:24, Ralph Castain wrote:
>> In a word, no. If a node crashes, OMPI will abort the currently-running job if it had processes on that node. There is no current ability to "ride-thru" such an event.
>>
>> That said, there is work being done to support "ride-thru". Most of that is in the current developer's code trunk, and more is coming, but I wouldn't consider it production-quality just yet.
>>
>> Specifically, the code that does what you specify below is done and works. It is recovery of the MPI job itself (collectives, lost messages, etc.) that remains to be completed.
>>
>> On Thu, Sep 23, 2010 at 7:22 AM, Andrei Fokau wrote:
>>> Dear users,
>>>
>>> Our cluster has a number of nodes which have a high probability of crashing, so it happens quite often that calculations stop due to one node going down. Do you know if it is possible to exclude the crashed nodes at run time when running with OpenMPI? I am asking about the principal possibility of programming such behavior. Does OpenMPI allow such dynamic checking? The scheme I am curious about is the following:
>>>
>>> 1. A code starts its tasks via mpirun on several nodes
>>> 2. At some moment one node goes down
>>> 3. The code realizes that the node is down (the results are lost) and excludes it from the list of nodes to run its tasks on
>>> 4. At a later moment the user restarts the crashed node
>>> 5. The code notices that the node is up again, and puts it back in the list of active nodes
>>>
>>> Regards,
>>> Andrei

Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://www.cs.indiana.edu/~jjhursey
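For readers who want to adapt the watchdog idea Randolph describes above, here is a minimal sketch in C (he used Perl). The "OK:" output prefix, the mpirun command line, and the application name are illustrative assumptions, not details from his setup:

/* Watchdog sketch: start mpirun, record its PID, and kill the job
 * if a line of output lacks the agreed-upon prefix. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <signal.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

int main(void)
{
    int fd[2];
    if (pipe(fd) != 0) { perror("pipe"); return 1; }

    pid_t pid = fork();                 /* record mpirun's PID */
    if (pid == 0) {                     /* child: run the MPI job */
        dup2(fd[1], STDOUT_FILENO);
        close(fd[0]); close(fd[1]);
        execlp("mpirun", "mpirun", "-np", "4", "./my_mpi_app", (char *)NULL);
        _exit(127);                     /* exec failed */
    }
    close(fd[1]);

    FILE *out = fdopen(fd[0], "r");
    char line[4096];
    while (fgets(line, sizeof(line), out)) {
        if (strncmp(line, "OK:", 3) != 0) {   /* unexpected output */
            kill(pid, SIGTERM);               /* series of kills */
            sleep(5);
            kill(pid, SIGKILL);
            /* site-specific: check which nodes are still alive,
               rebuild the hostfile, and resubmit the job here */
            break;
        }
        fputs(line, stdout);                  /* pass good lines through */
    }
    waitpid(pid, NULL, 0);
    return 0;
}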
Re: [OMPI users] Memory affinity
On 9/27/2010 2:50 PM, David Singleton wrote:
> On 09/28/2010 06:52 AM, Tim Prince wrote:
>> On 9/27/2010 12:21 PM, Gabriele Fatigati wrote:
>>> Hi Tim, I have read that link, but I haven't understood whether enabling processor affinity also enables memory affinity, because it is written that "Note that memory affinity support is enabled only when processor affinity is enabled". Can I set processor affinity without memory affinity? This is my question.
>>>
>>> 2010/9/27 Tim Prince
>>>> On 9/27/2010 9:01 AM, Gabriele Fatigati wrote:
>>>>> If OpenMPI is numa-compiled, is memory affinity enabled by default? Because I didn't find a memory-affinity-alone (or similar) parameter to set to 1.
>>>> The FAQ http://www.open-mpi.org/faq/?category=tuning#using-paffinity has a useful introduction to affinity. It's available in a default build, but not enabled by default.
>>
>> Memory affinity is implied by processor affinity. Your system libraries are set up so as to cause any memory allocated to be made local to the processor, if possible. That's one of the primary benefits of processor affinity. Not being an expert in openmpi, I assume, in the absence of further easily accessible documentation, that there is no useful explicit way to disable maffinity while using paffinity on platforms other than the specified legacy platforms.
>
> Memory allocation policy really needs to be independent of processor binding policy. The default memory policy (memory affinity) of "attempt to allocate to the NUMA node of the cpu that made the allocation request, but fall back as needed" is flawed in a number of situations. This is true even when MPI jobs are given dedicated access to processors. A common one is where the local NUMA node is full of pagecache pages (from the checkpoint of the last job to complete). For those sites that support suspend/resume based scheduling, NUMA nodes will generally contain pages from suspended jobs. Ideally, the new (suspending) job should suffer a little bit of paging overhead (pushing out the suspended job) to get ideal memory placement for the next 6 or however many hours of execution.
>
> An mbind (MPOL_BIND) policy of binding to the one local NUMA node will not work in the case of one process requiring more memory than that local NUMA node. One scenario is a master-slave setup where you might want:
>
>   master (rank 0) bound to processor 0 but not memory bound
>   slave (rank i) bound to processor i and memory bound to the local memory of processor i
>
> They really are independent requirements.
>
> Cheers,
> David

Interesting; I agree with those of your points on which I have enough experience to have an opinion. However, the original question was not whether it would be desirable to have independent memory affinity, but whether it is currently possible within openmpi to avoid memory placements being influenced by processor affinity. I have seen the case you mention, where performance of a long job suffers because the state of memory from a previous job results in an abnormal number of allocations falling over to other NUMA nodes, but I don't know the practical solution.

-- Tim Prince
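For illustration, David's master/slave scheme can be expressed directly with the Linux affinity and libnuma interfaces, independently of MPI's own affinity options. This is only a sketch under stated assumptions: a single-node run with one rank per core, NUMA nodes resolved via numa_node_of_cpu(), and linking with -lnuma:

/* Independent CPU binding and memory binding, per rank. */
#define _GNU_SOURCE
#include <sched.h>
#include <numa.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Bind every rank to CPU <rank> (assumes one rank per core,
       single node; a real job would use a node-local rank). */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(rank, &set);
    sched_setaffinity(0, sizeof(set), &set);

    /* Memory-bind the slaves only; rank 0 (the master) keeps the
       default local-preferred allocation policy. */
    if (rank != 0 && numa_available() != -1) {
        struct bitmask *nodes = numa_allocate_nodemask();
        numa_bitmask_setbit(nodes, numa_node_of_cpu(rank));
        numa_set_membind(nodes);        /* strict MPOL_BIND for this rank */
        numa_free_nodemask(nodes);
    }

    /* ... application work ... */
    MPI_Finalize();
    return 0;
}

The point of the sketch is exactly David's: the two bindings are separate calls and need not be applied together.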
[OMPI users] mpi_in_place not working in mpi_allreduce
Dear all:

I ran this simple Fortran code and got an unexpected result:

!
program reduce
  implicit none
  include 'mpif.h'
  integer :: ierr, rank
  real*8  :: send(5)
  call mpi_init(ierr)
  call mpi_comm_rank(mpi_comm_world,rank,ierr)
  send = real(rank)
  print *, rank,':',send
  call mpi_allreduce(MPI_IN_PLACE,send,size(send),mpi_real8,mpi_sum,mpi_comm_world,ierr)
  print *, rank,'#',send
  call mpi_finalize(ierr)
end program reduce
!

When running with 3 processes:

mpirun -np 3 reduce

The result I'm expecting is the sum of all 3 vectors, but instead I got:

 0 : 0. 0. 0. 0. 0.
 2 : 2. 2. 2. 2. 2.
 1 : 1. 1. 1. 1. 1.
 0 # 0. 0. 0. 0. 0.
 1 # 0. 0. 0. 0. 0.
 2 # 0. 0. 0. 0. 0.

During compilation and running there were no errors or warnings. I installed OpenMPI via fink, and I believe fink somehow messed up the installation. Instead of installing MPI from source (which takes hours on my machine), I would like to know if there is a better way to find out what the problem is, so that I could fix my current installation rather than reinstall MPI from scratch.

-- David Zhang
University of California, San Diego
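One inexpensive cross-check before rebuilding anything: MPI_IN_PLACE has to be translated specially by Open MPI's Fortran bindings, so running the equivalent reduction through the C bindings isolates which layer is at fault. A sketch of the same test in C; if this prints sums of 3 while the Fortran version prints zeros, the fink build's Fortran layer is the likely culprit:

/* C translation of the Fortran test above, as a cross-check. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, i;
    double send[5];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (i = 0; i < 5; i++)
        send[i] = (double)rank;

    printf("%d : %g %g %g %g %g\n", rank,
           send[0], send[1], send[2], send[3], send[4]);

    /* In-place reduction: send acts as both input and result. */
    MPI_Allreduce(MPI_IN_PLACE, send, 5, MPI_DOUBLE, MPI_SUM,
                  MPI_COMM_WORLD);

    printf("%d # %g %g %g %g %g\n", rank,
           send[0], send[1], send[2], send[3], send[4]);

    MPI_Finalize();
    return 0;
}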
Re: [OMPI users] Memory affinity
On 09/28/2010 06:52 AM, Tim Prince wrote:
> On 9/27/2010 12:21 PM, Gabriele Fatigati wrote:
>> Hi Tim, I have read that link, but I haven't understood whether enabling processor affinity also enables memory affinity, because it is written that "Note that memory affinity support is enabled only when processor affinity is enabled". Can I set processor affinity without memory affinity? This is my question.
>>
>> 2010/9/27 Tim Prince
>>> On 9/27/2010 9:01 AM, Gabriele Fatigati wrote:
>>>> If OpenMPI is numa-compiled, is memory affinity enabled by default? Because I didn't find a memory-affinity-alone (or similar) parameter to set to 1.
>>> The FAQ http://www.open-mpi.org/faq/?category=tuning#using-paffinity has a useful introduction to affinity. It's available in a default build, but not enabled by default.
>
> Memory affinity is implied by processor affinity. Your system libraries are set up so as to cause any memory allocated to be made local to the processor, if possible. That's one of the primary benefits of processor affinity. Not being an expert in openmpi, I assume, in the absence of further easily accessible documentation, that there is no useful explicit way to disable maffinity while using paffinity on platforms other than the specified legacy platforms.

Memory allocation policy really needs to be independent of processor binding policy. The default memory policy (memory affinity) of "attempt to allocate to the NUMA node of the cpu that made the allocation request, but fall back as needed" is flawed in a number of situations. This is true even when MPI jobs are given dedicated access to processors. A common one is where the local NUMA node is full of pagecache pages (from the checkpoint of the last job to complete). For those sites that support suspend/resume based scheduling, NUMA nodes will generally contain pages from suspended jobs. Ideally, the new (suspending) job should suffer a little bit of paging overhead (pushing out the suspended job) to get ideal memory placement for the next 6 or however many hours of execution.

An mbind (MPOL_BIND) policy of binding to the one local NUMA node will not work in the case of one process requiring more memory than that local NUMA node. One scenario is a master-slave setup where you might want:

  master (rank 0) bound to processor 0 but not memory bound
  slave (rank i) bound to processor i and memory bound to the local memory of processor i

They really are independent requirements.

Cheers,
David
Re: [OMPI users] Memory affinity
On 9/27/2010 12:21 PM, Gabriele Fatigati wrote:
> Hi Tim, I have read that link, but I haven't understood whether enabling processor affinity also enables memory affinity, because it is written that "Note that memory affinity support is enabled only when processor affinity is enabled". Can I set processor affinity without memory affinity? This is my question.
>
> 2010/9/27 Tim Prince
>> On 9/27/2010 9:01 AM, Gabriele Fatigati wrote:
>>> If OpenMPI is numa-compiled, is memory affinity enabled by default? Because I didn't find a memory-affinity-alone (or similar) parameter to set to 1.
>> The FAQ http://www.open-mpi.org/faq/?category=tuning#using-paffinity has a useful introduction to affinity. It's available in a default build, but not enabled by default.

Memory affinity is implied by processor affinity. Your system libraries are set up so as to cause any memory allocated to be made local to the processor, if possible. That's one of the primary benefits of processor affinity. Not being an expert in openmpi, I assume, in the absence of further easily accessible documentation, that there is no useful explicit way to disable maffinity while using paffinity on platforms other than the specified legacy platforms.

-- Tim Prince
Re: [OMPI users] Memory affinity
Hi Tim,

I have read that link, but I haven't understood whether enabling processor affinity also enables memory affinity, because it is written that "Note that memory affinity support is enabled only when processor affinity is enabled". Can I set processor affinity without memory affinity? This is my question.

2010/9/27 Tim Prince
> On 9/27/2010 9:01 AM, Gabriele Fatigati wrote:
>> If OpenMPI is numa-compiled, is memory affinity enabled by default? Because I didn't find a memory-affinity-alone (or similar) parameter to set to 1.
>
> The FAQ http://www.open-mpi.org/faq/?category=tuning#using-paffinity has a useful introduction to affinity. It's available in a default build, but not enabled by default.
>
> If you mean something other than this, explanation is needed as part of your question. taskset or numactl might be relevant, if you require more detailed control.
>
> -- Tim Prince

-- 
Ing. Gabriele Fatigati
Parallel programmer
CINECA Systems & Tecnologies Department
Supercomputing Group
Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy
www.cineca.it  Tel: +39 051 6171722
g.fatigati [AT] cineca.it
Re: [OMPI users] Memory affinity
On 9/27/2010 9:01 AM, Gabriele Fatigati wrote:
> If OpenMPI is numa-compiled, is memory affinity enabled by default? Because I didn't find a memory-affinity-alone (or similar) parameter to set to 1.

The FAQ http://www.open-mpi.org/faq/?category=tuning#using-paffinity has a useful introduction to affinity. It's available in a default build, but not enabled by default.

If you mean something other than this, explanation is needed as part of your question. taskset or numactl might be relevant, if you require more detailed control.

-- Tim Prince
[OMPI users] error on mpiexec
Hi,

I have compiled open-mpi 1.4.2 and use it with boost-mpi. I can compile and run my first example. If I run it without mpiexec, everything works fine. If I run it with mpiexec -np 1 or 2, I get messages like:

[node:05126] [[582,0],0] ORTE_ERROR_LOG: Error in file ess_hnp_module.c at line 230
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ras_base_open failed
  --> Returned value Error (-1) instead of ORTE_SUCCESS
  orte_ess_set_name failed
  --> Returned value Error (-1) instead of ORTE_SUCCESS
[node:05125] [[581,0],0] ORTE_ERROR_LOG: Error in file orted/orted_main.c at line 325
[node:05124] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on the local node in file ess_singleton_module.c at line 381
[node:05124] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on the local node in file ess_singleton_module.c at line 143
[node:05124] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on the local node in file runtime/orte_init.c at line 132
--------------------------------------------------------------------------
*** The MPI_Init() function was called before MPI_INIT was invoked.
*** This is disallowed by the MPI standard.

OpenMPI and Boost are installed in a directory under /opt and set with the environment variables. I'm using the first example at http://www.boost.org/doc/libs/1_44_0/doc/html/mpi/tutorial.html

I'm not sure if the problem is the Boost call or the MPI configuration. Does anyone have some ideas for solving my problem?

Thanks a lot
Phil
Re: [OMPI users] Memory affinity
Sorry, I meant: is memory affinity enabled by default when setting mprocessor_affinity=1 in a numa-compiled OpenMPI?

2010/9/27 Gabriele Fatigati
> Dear OpenMPI users,
>
> If OpenMPI is numa-compiled, is memory affinity enabled by default? Because I didn't find a memory-affinity-alone (or similar) parameter to set to 1.
>
> Thanks a lot.
>
> -- 
> Ing. Gabriele Fatigati
> Parallel programmer
> CINECA Systems & Tecnologies Department
> Supercomputing Group
> Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy
> www.cineca.it  Tel: +39 051 6171722
> g.fatigati [AT] cineca.it

-- 
Ing. Gabriele Fatigati
Parallel programmer
CINECA Systems & Tecnologies Department
Supercomputing Group
Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy
www.cineca.it  Tel: +39 051 6171722
g.fatigati [AT] cineca.it
[OMPI users] Memory affinity
Dear OpenMPI users,

If OpenMPI is numa-compiled, is memory affinity enabled by default? Because I didn't find a memory-affinity-alone (or similar) parameter to set to 1.

Thanks a lot.

-- 
Ing. Gabriele Fatigati
Parallel programmer
CINECA Systems & Tecnologies Department
Supercomputing Group
Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy
www.cineca.it  Tel: +39 051 6171722
g.fatigati [AT] cineca.it
Re: [OMPI users] [openib] segfault when using openib btl
Ok, there were no 0 value tags in your files. Are you running this with no eager RDMA? If not, can you set the following options: "-mca btl_openib_use_eager_rdma 0 -mca btl_openib_max_eager_rdma 0 -mca btl_openib_flags 1"?

thanks,
--td

Eloi Gaudry wrote:
> Terry,
>
> Please find enclosed the requested check outputs (using the -output-filename stdout.tag.null option). I'm displaying frag->hdr->tag here.
>
> Eloi
>
> On Monday 27 September 2010 16:29:12 Terry Dontje wrote:
>> Eloi, sorry, can you print out frag->hdr->tag? Unfortunately, from your last email I think it will still all have non-zero values. If that ends up being the case then there must be something odd with the descriptor pointer to the fragment.
>>
>> --td
>>
>> Eloi Gaudry wrote:
>>> Terry,
>>>
>>> Please find enclosed the requested check outputs (using the -output-filename stdout.tag.null option).
>>>
>>> For information, Nysal in his first message referred to ompi/mca/pml/ob1/pml_ob1_hdr.h and said that the hdr->tag value was wrong on the receiving side:
>>>
>>> #define MCA_PML_OB1_HDR_TYPE_MATCH (MCA_BTL_TAG_PML + 1)
>>> #define MCA_PML_OB1_HDR_TYPE_RNDV  (MCA_BTL_TAG_PML + 2)
>>> #define MCA_PML_OB1_HDR_TYPE_RGET  (MCA_BTL_TAG_PML + 3)
>>> #define MCA_PML_OB1_HDR_TYPE_ACK   (MCA_BTL_TAG_PML + 4)
>>> #define MCA_PML_OB1_HDR_TYPE_NACK  (MCA_BTL_TAG_PML + 5)
>>> #define MCA_PML_OB1_HDR_TYPE_FRAG  (MCA_BTL_TAG_PML + 6)
>>> #define MCA_PML_OB1_HDR_TYPE_GET   (MCA_BTL_TAG_PML + 7)
>>> #define MCA_PML_OB1_HDR_TYPE_PUT   (MCA_BTL_TAG_PML + 8)
>>> #define MCA_PML_OB1_HDR_TYPE_FIN   (MCA_BTL_TAG_PML + 9)
>>>
>>> and in ompi/mca/btl/btl.h:
>>>
>>> #define MCA_BTL_TAG_PML 0x40
>>>
>>> Eloi
>>>
>>> On Monday 27 September 2010 14:36:59 Terry Dontje wrote:
>>>> I am thinking of checking the value of *frag->hdr right before the return in the post_send function in ompi/mca/btl/openib/btl_openib_endpoint.h. It is line 548 in the trunk: https://svn.open-mpi.org/source/xref/ompi-trunk/ompi/mca/btl/openib/btl_openib_endpoint.h#548
>>>>
>>>> --td
>>>>
>>>> Eloi Gaudry wrote:
>>>>> Hi Terry,
>>>>>
>>>>> Do you have any patch that I could apply to be able to do so? I'm working remotely on a cluster (with a terminal) and I cannot use any parallel debugger or sequential debugger (with a call to xterm...). I can track the frag->hdr->tag value in ompi/mca/btl/openib/btl_openib_component.c::handle_wc in the SEND/RDMA_WRITE case, but this is all I can think of alone.
>>>>>
>>>>> You'll find a stacktrace (receive side) in this thread (10th or 11th message) but it might be pointless.
>>>>>
>>>>> Regards,
>>>>> Eloi
>>>>>
>>>>> On Monday 27 September 2010 11:43:55 Terry Dontje wrote:
>>>>>> So it sounds like coalescing is not your issue and that the problem has something to do with the queue sizes. It would be helpful if we could detect the hdr->tag == 0 issue on the sending side and get at least a stack trace. There is something really odd going on here.
>>>>>>
>>>>>> --td
>>>>>>
>>>>>> Eloi Gaudry wrote:
>>>>>>> Hi Terry,
>>>>>>>
>>>>>>> I'm sorry to say that I might have missed a point here.
>>>>>>>
>>>>>>> I've lately been relaunching all previously failing computations with the message coalescing feature switched off, and I saw the same hdr->tag=0 error several times, always during a collective call (MPI_Comm_create, MPI_Allreduce and MPI_Broadcast, so far). And as soon as I switched to the peer queue option I was previously using (--mca btl_openib_receive_queues P,65536,256,192,128 instead of using --mca btl_openib_use_message_coalescing 0), all computations ran flawlessly.
>>>>>>>
>>>>>>> As for the reproducer, I've already tried to write something but I haven't succeeded so far at reproducing the hdr->tag=0 issue with it.
>>>>>>>
>>>>>>> Eloi
>>>>>>>
>>>>>>> On 24/09/2010 18:37, Terry Dontje wrote:
>>>>>>>> Eloi Gaudry wrote:
>>>>>>>>> Terry,
>>>>>>>>>
>>>>>>>>> You were right, the error indeed seems to come from the message coalescing feature. If I turn it off using "--mca btl_openib_use_message_coalescing 0", I'm not able to observe the "hdr->tag=0" error.
>>>>>>>>>
>>>>>>>>> There are some trac requests associated with very similar errors (https://svn.open-mpi.org/trac/ompi/search?q=coalescing) but they are all closed (except https://svn.open-mpi.org/trac/ompi/ticket/2352, which might be related), aren't they? What would you suggest, Terry?
>>>>>>>>
>>>>>>>> Interesting, though it looks to me like the segv in ticket 2352 would have happened on the send side instead of the receive side like you have. As to what to do next, it would be really nice to have some sort of reproducer that we can try to debug what is really going on. The only other thing to do without a reproducer is to inspect the code on the send side to figure out what might make it generate a 0 hdr->tag. Or maybe instrument the send side to stop when it is about ready to send a 0 hdr->tag and see if we can see how the code got there.
>>>>>>>>
>>>>>>>> I might have some cycles to look at this Monday.
>>>>>>>>
>>>>>>>> --td
>>>>>>>>
>>>>>>>>> Eloi
>>>>>>>>>
>>>>>>>>> On Friday 24 September 2010 16:00:26 Terry Dontje wrote:
>>>>>>>>>> Eloi Gaudry wrote:
>>>>>>>>>>> Terry,
>>>>>>>>>>>
>>>>>>>>>>> No, I haven't tried any other values than P,65536,256,192,128 yet. The reason why is quite simple: I've been reading and reading again this thread to understand the btl_openib_receive_queues meaning and I can't figure out why the default values seem to induce the hdr->tag=0 issue (http://www.open-mpi.org/community/lists/users/2009/01/7808.php).
Re: [OMPI users] [openib] segfault when using openib btl
Terry,

Please find enclosed the requested check outputs (using the -output-filename stdout.tag.null option). I'm displaying frag->hdr->tag here.

Eloi

On Monday 27 September 2010 16:29:12 Terry Dontje wrote:
> Eloi, sorry, can you print out frag->hdr->tag?
>
> Unfortunately, from your last email I think it will still all have non-zero values. If that ends up being the case then there must be something odd with the descriptor pointer to the fragment.
>
> --td
>
> Eloi Gaudry wrote:
>> Terry,
>>
>> Please find enclosed the requested check outputs (using the -output-filename stdout.tag.null option).
>>
>> For information, Nysal in his first message referred to ompi/mca/pml/ob1/pml_ob1_hdr.h and said that the hdr->tag value was wrong on the receiving side:
>>
>> #define MCA_PML_OB1_HDR_TYPE_MATCH (MCA_BTL_TAG_PML + 1)
>> #define MCA_PML_OB1_HDR_TYPE_RNDV  (MCA_BTL_TAG_PML + 2)
>> #define MCA_PML_OB1_HDR_TYPE_RGET  (MCA_BTL_TAG_PML + 3)
>> #define MCA_PML_OB1_HDR_TYPE_ACK   (MCA_BTL_TAG_PML + 4)
>> #define MCA_PML_OB1_HDR_TYPE_NACK  (MCA_BTL_TAG_PML + 5)
>> #define MCA_PML_OB1_HDR_TYPE_FRAG  (MCA_BTL_TAG_PML + 6)
>> #define MCA_PML_OB1_HDR_TYPE_GET   (MCA_BTL_TAG_PML + 7)
>> #define MCA_PML_OB1_HDR_TYPE_PUT   (MCA_BTL_TAG_PML + 8)
>> #define MCA_PML_OB1_HDR_TYPE_FIN   (MCA_BTL_TAG_PML + 9)
>>
>> and in ompi/mca/btl/btl.h:
>>
>> #define MCA_BTL_TAG_PML 0x40
>>
>> Eloi
>>
>> On Monday 27 September 2010 14:36:59 Terry Dontje wrote:
>>> I am thinking of checking the value of *frag->hdr right before the return in the post_send function in ompi/mca/btl/openib/btl_openib_endpoint.h. It is line 548 in the trunk: https://svn.open-mpi.org/source/xref/ompi-trunk/ompi/mca/btl/openib/btl_openib_endpoint.h#548
>>>
>>> --td
>>>
>>> Eloi Gaudry wrote:
>>>> Hi Terry,
>>>>
>>>> Do you have any patch that I could apply to be able to do so? I'm working remotely on a cluster (with a terminal) and I cannot use any parallel debugger or sequential debugger (with a call to xterm...). I can track the frag->hdr->tag value in ompi/mca/btl/openib/btl_openib_component.c::handle_wc in the SEND/RDMA_WRITE case, but this is all I can think of alone.
>>>>
>>>> You'll find a stacktrace (receive side) in this thread (10th or 11th message) but it might be pointless.
>>>>
>>>> Regards,
>>>> Eloi
>>>>
>>>> On Monday 27 September 2010 11:43:55 Terry Dontje wrote:
>>>>> So it sounds like coalescing is not your issue and that the problem has something to do with the queue sizes. It would be helpful if we could detect the hdr->tag == 0 issue on the sending side and get at least a stack trace. There is something really odd going on here.
>>>>>
>>>>> --td
>>>>>
>>>>> Eloi Gaudry wrote:
>>>>>> Hi Terry,
>>>>>>
>>>>>> I'm sorry to say that I might have missed a point here.
>>>>>>
>>>>>> I've lately been relaunching all previously failing computations with the message coalescing feature switched off, and I saw the same hdr->tag=0 error several times, always during a collective call (MPI_Comm_create, MPI_Allreduce and MPI_Broadcast, so far). And as soon as I switched to the peer queue option I was previously using (--mca btl_openib_receive_queues P,65536,256,192,128 instead of using --mca btl_openib_use_message_coalescing 0), all computations ran flawlessly.
>>>>>>
>>>>>> As for the reproducer, I've already tried to write something but I haven't succeeded so far at reproducing the hdr->tag=0 issue with it.
>>>>>>
>>>>>> Eloi
>>>>>>
>>>>>> On 24/09/2010 18:37, Terry Dontje wrote:
>>>>>>> Eloi Gaudry wrote:
>>>>>>>> Terry,
>>>>>>>>
>>>>>>>> You were right, the error indeed seems to come from the message coalescing feature. If I turn it off using "--mca btl_openib_use_message_coalescing 0", I'm not able to observe the "hdr->tag=0" error.
>>>>>>>>
>>>>>>>> There are some trac requests associated with very similar errors (https://svn.open-mpi.org/trac/ompi/search?q=coalescing) but they are all closed (except https://svn.open-mpi.org/trac/ompi/ticket/2352, which might be related), aren't they? What would you suggest Terry?
>>>>>>>
>>>>>>> Interesting, though it looks to me like the segv in ticket 2352 would have happened on the send side instead of the receive side like you have. As to what to do next, it would be really nice to have some sort of reproducer that we can try to debug what is really going on. The only other thing to do without a reproducer is to inspect the code on the send side to figure out what might make it generate a 0 hdr->tag. Or maybe instrument the send side to stop when it is about ready to send a 0 hdr->tag and see if we can see how the code got there.
Re: [OMPI users] [openib] segfault when using openib btl
Eloi, sorry, can you print out frag->hdr->tag?

Unfortunately, from your last email I think it will still all have non-zero values. If that ends up being the case then there must be something odd with the descriptor pointer to the fragment.

--td

Eloi Gaudry wrote:
> Terry,
>
> Please find enclosed the requested check outputs (using the -output-filename stdout.tag.null option).
>
> For information, Nysal in his first message referred to ompi/mca/pml/ob1/pml_ob1_hdr.h and said that the hdr->tag value was wrong on the receiving side:
>
> #define MCA_PML_OB1_HDR_TYPE_MATCH (MCA_BTL_TAG_PML + 1)
> #define MCA_PML_OB1_HDR_TYPE_RNDV  (MCA_BTL_TAG_PML + 2)
> #define MCA_PML_OB1_HDR_TYPE_RGET  (MCA_BTL_TAG_PML + 3)
> #define MCA_PML_OB1_HDR_TYPE_ACK   (MCA_BTL_TAG_PML + 4)
> #define MCA_PML_OB1_HDR_TYPE_NACK  (MCA_BTL_TAG_PML + 5)
> #define MCA_PML_OB1_HDR_TYPE_FRAG  (MCA_BTL_TAG_PML + 6)
> #define MCA_PML_OB1_HDR_TYPE_GET   (MCA_BTL_TAG_PML + 7)
> #define MCA_PML_OB1_HDR_TYPE_PUT   (MCA_BTL_TAG_PML + 8)
> #define MCA_PML_OB1_HDR_TYPE_FIN   (MCA_BTL_TAG_PML + 9)
>
> and in ompi/mca/btl/btl.h:
>
> #define MCA_BTL_TAG_PML 0x40
>
> Eloi
>
> On Monday 27 September 2010 14:36:59 Terry Dontje wrote:
>> I am thinking of checking the value of *frag->hdr right before the return in the post_send function in ompi/mca/btl/openib/btl_openib_endpoint.h. It is line 548 in the trunk: https://svn.open-mpi.org/source/xref/ompi-trunk/ompi/mca/btl/openib/btl_openib_endpoint.h#548
>>
>> --td
>>
>> Eloi Gaudry wrote:
>>> Hi Terry,
>>>
>>> Do you have any patch that I could apply to be able to do so? I'm working remotely on a cluster (with a terminal) and I cannot use any parallel debugger or sequential debugger (with a call to xterm...). I can track the frag->hdr->tag value in ompi/mca/btl/openib/btl_openib_component.c::handle_wc in the SEND/RDMA_WRITE case, but this is all I can think of alone.
>>>
>>> You'll find a stacktrace (receive side) in this thread (10th or 11th message) but it might be pointless.
>>>
>>> Regards,
>>> Eloi
>>>
>>> On Monday 27 September 2010 11:43:55 Terry Dontje wrote:
>>>> So it sounds like coalescing is not your issue and that the problem has something to do with the queue sizes. It would be helpful if we could detect the hdr->tag == 0 issue on the sending side and get at least a stack trace. There is something really odd going on here.
>>>>
>>>> --td
>>>>
>>>> Eloi Gaudry wrote:
>>>>> Hi Terry,
>>>>>
>>>>> I'm sorry to say that I might have missed a point here.
>>>>>
>>>>> I've lately been relaunching all previously failing computations with the message coalescing feature switched off, and I saw the same hdr->tag=0 error several times, always during a collective call (MPI_Comm_create, MPI_Allreduce and MPI_Broadcast, so far). And as soon as I switched to the peer queue option I was previously using (--mca btl_openib_receive_queues P,65536,256,192,128 instead of using --mca btl_openib_use_message_coalescing 0), all computations ran flawlessly.
>>>>>
>>>>> As for the reproducer, I've already tried to write something but I haven't succeeded so far at reproducing the hdr->tag=0 issue with it.
>>>>>
>>>>> Eloi
>>>>>
>>>>> On 24/09/2010 18:37, Terry Dontje wrote:
>>>>>> Eloi Gaudry wrote:
>>>>>>> Terry,
>>>>>>>
>>>>>>> You were right, the error indeed seems to come from the message coalescing feature. If I turn it off using "--mca btl_openib_use_message_coalescing 0", I'm not able to observe the "hdr->tag=0" error.
>>>>>>>
>>>>>>> There are some trac requests associated with very similar errors (https://svn.open-mpi.org/trac/ompi/search?q=coalescing) but they are all closed (except https://svn.open-mpi.org/trac/ompi/ticket/2352, which might be related), aren't they? What would you suggest Terry?
>>>>>>
>>>>>> Interesting, though it looks to me like the segv in ticket 2352 would have happened on the send side instead of the receive side like you have. As to what to do next, it would be really nice to have some sort of reproducer that we can try to debug what is really going on. The only other thing to do without a reproducer is to inspect the code on the send side to figure out what might make it generate a 0 hdr->tag. Or maybe instrument the send side to stop when it is about ready to send a 0 hdr->tag and see if we can see how the code got there.
>>>>>>
>>>>>> I might have some cycles to look at this Monday.
>>>>>>
>>>>>> --td
>>>>>>
>>>>>>> Eloi
>>>>>>>
>>>>>>> On Friday 24 September 2010 16:00:26 Terry Dontje wrote:
>>>>>>>> Eloi Gaudry wrote:
>>>>>>>>> Terry,
>>>>>>>>>
>>>>>>>>> No, I haven't tried any other values than P,65536,256,192,128 yet. The reason why is quite simple: I've been reading and reading again this thread to understand the btl_openib_receive_queues meaning and I can't figure out why the default values seem to induce the hdr->tag=0 issue (http://www.open-mpi.org/community/lists/users/2009/01/7808.php).
>>>>>>>>
>>>>>>>> Yeah, the size of the fragments and number of them really should not cause this issue. So I too am a little perplexed about it.
>>>>>>>>
>>>>>>>>> Do you think that the default shared received queue parameters are erroneous for this specific Mellanox card? Any help on finding the proper parameters would actually be much appreciated.
Re: [OMPI users] Porting Open MPI to ARM: How essential is the opal_sys_timer_get_cycles() function?
On Sep 23, 2010, at 1:24 PM, Ken Mighell wrote:

> Would a hack written in C suffice?

Assembly is always better, but C should be fine. If you really want to, you could write it in C and let the compiler generate optimized assembly for you.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
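For illustration only (this is not the actual OPAL implementation): one portable C stand-in for a cycle counter reads a monotonic clock and reports nanoseconds, letting callers scale to cycles if they need to. The opal_timer_t typedef here merely mirrors OPAL's 64-bit timer type; older glibc needs -lrt for clock_gettime:

/* C fallback sketch for a cycle-counter function. */
#include <stdint.h>
#include <stdio.h>
#include <time.h>

typedef uint64_t opal_timer_t;   /* stand-in for OPAL's timer type */

static inline opal_timer_t opal_sys_timer_get_cycles(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    /* Returns nanoseconds, not true cycles; callers can scale by an
       estimated clock frequency if cycle units are required. */
    return (opal_timer_t)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

int main(void)
{
    opal_timer_t t0 = opal_sys_timer_get_cycles();
    opal_timer_t t1 = opal_sys_timer_get_cycles();
    printf("delta = %llu ns\n", (unsigned long long)(t1 - t0));
    return 0;
}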
Re: [OMPI users] [openib] segfault when using openib btl
Terry,

Please find enclosed the requested check outputs (using the -output-filename stdout.tag.null option).

For information, Nysal in his first message referred to ompi/mca/pml/ob1/pml_ob1_hdr.h and said that the hdr->tag value was wrong on the receiving side:

#define MCA_PML_OB1_HDR_TYPE_MATCH (MCA_BTL_TAG_PML + 1)
#define MCA_PML_OB1_HDR_TYPE_RNDV  (MCA_BTL_TAG_PML + 2)
#define MCA_PML_OB1_HDR_TYPE_RGET  (MCA_BTL_TAG_PML + 3)
#define MCA_PML_OB1_HDR_TYPE_ACK   (MCA_BTL_TAG_PML + 4)
#define MCA_PML_OB1_HDR_TYPE_NACK  (MCA_BTL_TAG_PML + 5)
#define MCA_PML_OB1_HDR_TYPE_FRAG  (MCA_BTL_TAG_PML + 6)
#define MCA_PML_OB1_HDR_TYPE_GET   (MCA_BTL_TAG_PML + 7)
#define MCA_PML_OB1_HDR_TYPE_PUT   (MCA_BTL_TAG_PML + 8)
#define MCA_PML_OB1_HDR_TYPE_FIN   (MCA_BTL_TAG_PML + 9)

and in ompi/mca/btl/btl.h:

#define MCA_BTL_TAG_PML 0x40

Eloi

On Monday 27 September 2010 14:36:59 Terry Dontje wrote:
> I am thinking of checking the value of *frag->hdr right before the return in the post_send function in ompi/mca/btl/openib/btl_openib_endpoint.h. It is line 548 in the trunk: https://svn.open-mpi.org/source/xref/ompi-trunk/ompi/mca/btl/openib/btl_openib_endpoint.h#548
>
> --td
>
> Eloi Gaudry wrote:
>> Hi Terry,
>>
>> Do you have any patch that I could apply to be able to do so? I'm working remotely on a cluster (with a terminal) and I cannot use any parallel debugger or sequential debugger (with a call to xterm...). I can track the frag->hdr->tag value in ompi/mca/btl/openib/btl_openib_component.c::handle_wc in the SEND/RDMA_WRITE case, but this is all I can think of alone.
>>
>> You'll find a stacktrace (receive side) in this thread (10th or 11th message) but it might be pointless.
>>
>> Regards,
>> Eloi
>>
>> On Monday 27 September 2010 11:43:55 Terry Dontje wrote:
>>> So it sounds like coalescing is not your issue and that the problem has something to do with the queue sizes. It would be helpful if we could detect the hdr->tag == 0 issue on the sending side and get at least a stack trace. There is something really odd going on here.
>>>
>>> --td
>>>
>>> Eloi Gaudry wrote:
>>>> Hi Terry,
>>>>
>>>> I'm sorry to say that I might have missed a point here.
>>>>
>>>> I've lately been relaunching all previously failing computations with the message coalescing feature switched off, and I saw the same hdr->tag=0 error several times, always during a collective call (MPI_Comm_create, MPI_Allreduce and MPI_Broadcast, so far). And as soon as I switched to the peer queue option I was previously using (--mca btl_openib_receive_queues P,65536,256,192,128 instead of using --mca btl_openib_use_message_coalescing 0), all computations ran flawlessly.
>>>>
>>>> As for the reproducer, I've already tried to write something but I haven't succeeded so far at reproducing the hdr->tag=0 issue with it.
>>>>
>>>> Eloi
>>>>
>>>> On 24/09/2010 18:37, Terry Dontje wrote:
>>>>> Eloi Gaudry wrote:
>>>>>> Terry,
>>>>>>
>>>>>> You were right, the error indeed seems to come from the message coalescing feature. If I turn it off using "--mca btl_openib_use_message_coalescing 0", I'm not able to observe the "hdr->tag=0" error.
>>>>>>
>>>>>> There are some trac requests associated with very similar errors (https://svn.open-mpi.org/trac/ompi/search?q=coalescing) but they are all closed (except https://svn.open-mpi.org/trac/ompi/ticket/2352, which might be related), aren't they? What would you suggest Terry?
>>>>>
>>>>> Interesting, though it looks to me like the segv in ticket 2352 would have happened on the send side instead of the receive side like you have. As to what to do next, it would be really nice to have some sort of reproducer that we can try to debug what is really going on. The only other thing to do without a reproducer is to inspect the code on the send side to figure out what might make it generate a 0 hdr->tag. Or maybe instrument the send side to stop when it is about ready to send a 0 hdr->tag and see if we can see how the code got there.
>>>>>
>>>>> I might have some cycles to look at this Monday.
>>>>>
>>>>> --td
>>>>>
>>>>>> Eloi
>>>>>>
>>>>>> On Friday 24 September 2010 16:00:26 Terry Dontje wrote:
>>>>>>> Eloi Gaudry wrote:
>>>>>>>> Terry,
>>>>>>>>
>>>>>>>> No, I haven't tried any other values than P,65536,256,192,128 yet.
>>>>>>>>
>>>>>>>> The reason why is quite simple. I've been reading and reading again this thread to understand the btl_openib_receive_queues meaning and I can't figure out why the default values seem to induce the hdr->tag=0 issue (http://www.open-mpi.org/community/lists/users/2009/01/7808.php).
>>>>>>>
>>>>>>> Yeah, the size of the fragments and number of them really should not cause this issue. So I too am a little perplexed about it.
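The definitions Eloi quotes imply that every legal PML tag lies in the range 0x41 through 0x49, so a tag of 0 cannot come from a well-formed header; it can only mean a corrupted or uninitialized fragment. A trivial standalone check, with the values copied from the headers quoted above:

/* Enumerate the legal PML OB1 tags; 0 is never among them. */
#include <stdio.h>

#define MCA_BTL_TAG_PML            0x40
#define MCA_PML_OB1_HDR_TYPE_MATCH (MCA_BTL_TAG_PML + 1)  /* 0x41 */
#define MCA_PML_OB1_HDR_TYPE_FIN   (MCA_BTL_TAG_PML + 9)  /* 0x49 */

int main(void)
{
    unsigned char tag;
    for (tag = MCA_PML_OB1_HDR_TYPE_MATCH; tag <= MCA_PML_OB1_HDR_TYPE_FIN; tag++)
        printf("valid PML tag: 0x%02x\n", tag);
    printf("seen on the wire: 0x00 -> not a legal PML tag\n");
    return 0;
}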
Re: [OMPI users] [openib] segfault when using openib btl
I am thinking of checking the value of *frag->hdr right before the return in the post_send function in ompi/mca/btl/openib/btl_openib_endpoint.h. It is line 548 in the trunk: https://svn.open-mpi.org/source/xref/ompi-trunk/ompi/mca/btl/openib/btl_openib_endpoint.h#548

--td

Eloi Gaudry wrote:
> Hi Terry,
>
> Do you have any patch that I could apply to be able to do so? I'm working remotely on a cluster (with a terminal) and I cannot use any parallel debugger or sequential debugger (with a call to xterm...). I can track the frag->hdr->tag value in ompi/mca/btl/openib/btl_openib_component.c::handle_wc in the SEND/RDMA_WRITE case, but this is all I can think of alone.
>
> You'll find a stacktrace (receive side) in this thread (10th or 11th message) but it might be pointless.
>
> Regards,
> Eloi
>
> On Monday 27 September 2010 11:43:55 Terry Dontje wrote:
>> So it sounds like coalescing is not your issue and that the problem has something to do with the queue sizes. It would be helpful if we could detect the hdr->tag == 0 issue on the sending side and get at least a stack trace. There is something really odd going on here.
>>
>> --td
>>
>> Eloi Gaudry wrote:
>>> Hi Terry,
>>>
>>> I'm sorry to say that I might have missed a point here.
>>>
>>> I've lately been relaunching all previously failing computations with the message coalescing feature switched off, and I saw the same hdr->tag=0 error several times, always during a collective call (MPI_Comm_create, MPI_Allreduce and MPI_Broadcast, so far). And as soon as I switched to the peer queue option I was previously using (--mca btl_openib_receive_queues P,65536,256,192,128 instead of using --mca btl_openib_use_message_coalescing 0), all computations ran flawlessly.
>>>
>>> As for the reproducer, I've already tried to write something but I haven't succeeded so far at reproducing the hdr->tag=0 issue with it.
>>>
>>> Eloi
>>>
>>> On 24/09/2010 18:37, Terry Dontje wrote:
>>>> Eloi Gaudry wrote:
>>>>> Terry,
>>>>>
>>>>> You were right, the error indeed seems to come from the message coalescing feature. If I turn it off using "--mca btl_openib_use_message_coalescing 0", I'm not able to observe the "hdr->tag=0" error.
>>>>>
>>>>> There are some trac requests associated with very similar errors (https://svn.open-mpi.org/trac/ompi/search?q=coalescing) but they are all closed (except https://svn.open-mpi.org/trac/ompi/ticket/2352, which might be related), aren't they? What would you suggest Terry?
>>>>
>>>> Interesting, though it looks to me like the segv in ticket 2352 would have happened on the send side instead of the receive side like you have. As to what to do next, it would be really nice to have some sort of reproducer that we can try to debug what is really going on. The only other thing to do without a reproducer is to inspect the code on the send side to figure out what might make it generate a 0 hdr->tag. Or maybe instrument the send side to stop when it is about ready to send a 0 hdr->tag and see if we can see how the code got there.
>>>>
>>>> I might have some cycles to look at this Monday.
>>>>
>>>> --td
>>>>
>>>>> Eloi
>>>>>
>>>>> On Friday 24 September 2010 16:00:26 Terry Dontje wrote:
>>>>>> Eloi Gaudry wrote:
>>>>>>> Terry,
>>>>>>>
>>>>>>> No, I haven't tried any other values than P,65536,256,192,128 yet.
>>>>>>>
>>>>>>> The reason why is quite simple. I've been reading and reading again this thread to understand the btl_openib_receive_queues meaning and I can't figure out why the default values seem to induce the hdr->tag=0 issue (http://www.open-mpi.org/community/lists/users/2009/01/7808.php).
>>>>>>
>>>>>> Yeah, the size of the fragments and number of them really should not cause this issue. So I too am a little perplexed about it.
>>>>>>
>>>>>>> Do you think that the default shared received queue parameters are erroneous for this specific Mellanox card? Any help on finding the proper parameters would actually be much appreciated.
>>>>>>
>>>>>> I don't necessarily think it is the queue size for a specific card but more so the handling of the queues by the BTL when using certain sizes. At least that is one gut feel I have.
>>>>>>
>>>>>> In my mind the tag being 0 is either something below OMPI polluting the data fragment or OMPI's internal protocol somehow getting messed up. I can imagine (no empirical data here) that the queue sizes could change how the OMPI protocol sets things up. Another thing may be the coalescing feature in the openib BTL, which tries to gang multiple messages into one packet when resources are running low. I can see where changing the queue sizes might affect the coalescing. So, it might be interesting to turn off the coalescing. You can do that by setting "--mca btl_openib_use_message_coalescing 0" in your mpirun line.
>>>>>>
>>>>>> If that doesn't solve the issue then obviously there must be something else going on :-).
>>>>>>
>>>>>> Note, the reason I am interested in this is that I am seeing a similar error condition (hdr->tag == 0) on a development system. Though my failing case fails with np=8 using the connectivity test program, which is mainly point to point and there are not a significant amount of data transfers going on either.
>>>>>>
>>>>>> --td
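A self-contained mock of the send-side check Terry is proposing; the real fragment and header types live in the openib BTL and differ in detail from these stand-ins, so treat this only as the shape of the instrumentation:

/* Mock of a zero-tag trap placed just before post_send() returns. */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

struct mock_hdr  { uint8_t tag; };           /* stand-in for the BTL header    */
struct mock_frag { struct mock_hdr *hdr; };  /* stand-in for the send fragment */

static void check_tag_before_post(const struct mock_frag *frag)
{
    if (0 == frag->hdr->tag) {
        fprintf(stderr, "about to post a fragment with hdr->tag == 0\n");
        assert(0 != frag->hdr->tag);  /* abort: the core gives the send-side stack */
    }
}

int main(void)
{
    struct mock_hdr  hdr  = { 0x41 };  /* MCA_PML_OB1_HDR_TYPE_MATCH */
    struct mock_frag frag = { &hdr };
    check_tag_before_post(&frag);      /* passes; a zero tag would abort here */
    return 0;
}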
Re: [OMPI users] [openib] segfault when using openib btl
Hi Terry,

Do you have any patch that I could apply to be able to do so? I'm working remotely on a cluster (with a terminal) and I cannot use any parallel debugger or sequential debugger (with a call to xterm...). I can track the frag->hdr->tag value in ompi/mca/btl/openib/btl_openib_component.c::handle_wc in the SEND/RDMA_WRITE case, but this is all I can think of alone.

You'll find a stacktrace (receive side) in this thread (10th or 11th message) but it might be pointless.

Regards,
Eloi

On Monday 27 September 2010 11:43:55 Terry Dontje wrote:
> So it sounds like coalescing is not your issue and that the problem has something to do with the queue sizes. It would be helpful if we could detect the hdr->tag == 0 issue on the sending side and get at least a stack trace. There is something really odd going on here.
>
> --td
>
> Eloi Gaudry wrote:
>> Hi Terry,
>>
>> I'm sorry to say that I might have missed a point here.
>>
>> I've lately been relaunching all previously failing computations with the message coalescing feature switched off, and I saw the same hdr->tag=0 error several times, always during a collective call (MPI_Comm_create, MPI_Allreduce and MPI_Broadcast, so far). And as soon as I switched to the peer queue option I was previously using (--mca btl_openib_receive_queues P,65536,256,192,128 instead of using --mca btl_openib_use_message_coalescing 0), all computations ran flawlessly.
>>
>> As for the reproducer, I've already tried to write something but I haven't succeeded so far at reproducing the hdr->tag=0 issue with it.
>>
>> Eloi
>>
>> On 24/09/2010 18:37, Terry Dontje wrote:
>>> Eloi Gaudry wrote:
>>>> Terry,
>>>>
>>>> You were right, the error indeed seems to come from the message coalescing feature. If I turn it off using "--mca btl_openib_use_message_coalescing 0", I'm not able to observe the "hdr->tag=0" error.
>>>>
>>>> There are some trac requests associated with very similar errors (https://svn.open-mpi.org/trac/ompi/search?q=coalescing) but they are all closed (except https://svn.open-mpi.org/trac/ompi/ticket/2352, which might be related), aren't they? What would you suggest Terry?
>>>
>>> Interesting, though it looks to me like the segv in ticket 2352 would have happened on the send side instead of the receive side like you have. As to what to do next, it would be really nice to have some sort of reproducer that we can try to debug what is really going on. The only other thing to do without a reproducer is to inspect the code on the send side to figure out what might make it generate a 0 hdr->tag. Or maybe instrument the send side to stop when it is about ready to send a 0 hdr->tag and see if we can see how the code got there.
>>>
>>> I might have some cycles to look at this Monday.
>>>
>>> --td
>>>
>>>> Eloi
>>>>
>>>> On Friday 24 September 2010 16:00:26 Terry Dontje wrote:
>>>>> Eloi Gaudry wrote:
>>>>>> Terry,
>>>>>>
>>>>>> No, I haven't tried any other values than P,65536,256,192,128 yet.
>>>>>>
>>>>>> The reason why is quite simple. I've been reading and reading again this thread to understand the btl_openib_receive_queues meaning and I can't figure out why the default values seem to induce the hdr->tag=0 issue (http://www.open-mpi.org/community/lists/users/2009/01/7808.php).
>>>>>
>>>>> Yeah, the size of the fragments and number of them really should not cause this issue. So I too am a little perplexed about it.
>>>>>
>>>>>> Do you think that the default shared received queue parameters are erroneous for this specific Mellanox card? Any help on finding the proper parameters would actually be much appreciated.
>>>>>
>>>>> I don't necessarily think it is the queue size for a specific card but more so the handling of the queues by the BTL when using certain sizes. At least that is one gut feel I have.
>>>>>
>>>>> In my mind the tag being 0 is either something below OMPI polluting the data fragment or OMPI's internal protocol somehow getting messed up. I can imagine (no empirical data here) that the queue sizes could change how the OMPI protocol sets things up. Another thing may be the coalescing feature in the openib BTL, which tries to gang multiple messages into one packet when resources are running low. I can see where changing the queue sizes might affect the coalescing. So, it might be interesting to turn off the coalescing. You can do that by setting "--mca btl_openib_use_message_coalescing 0" in your mpirun line.
>>>>>
>>>>> If that doesn't solve the issue then obviously there must be something else going on :-).
>>>>>
>>>>> Note, the reason I am interested in this is that I am seeing a similar error condition (hdr->tag == 0) on a development system.
Re: [OMPI users] [openib] segfault when using openib btl
So it sounds like coalescing is not your issue and that the problem has something to do with the queue sizes. It would be helpful if we could detect the hdr->tag == 0 issue on the sending side and get at least a stack trace. There is something really odd going on here.

--td

Eloi Gaudry wrote:
> Hi Terry,
>
> I'm sorry to say that I might have missed a point here.
>
> I've lately been relaunching all previously failing computations with the message coalescing feature switched off, and I saw the same hdr->tag=0 error several times, always during a collective call (MPI_Comm_create, MPI_Allreduce and MPI_Broadcast, so far). And as soon as I switched to the peer queue option I was previously using (--mca btl_openib_receive_queues P,65536,256,192,128 instead of using --mca btl_openib_use_message_coalescing 0), all computations ran flawlessly.
>
> As for the reproducer, I've already tried to write something but I haven't succeeded so far at reproducing the hdr->tag=0 issue with it.
>
> Eloi
>
> On 24/09/2010 18:37, Terry Dontje wrote:
>> Eloi Gaudry wrote:
>>> Terry,
>>>
>>> You were right, the error indeed seems to come from the message coalescing feature. If I turn it off using "--mca btl_openib_use_message_coalescing 0", I'm not able to observe the "hdr->tag=0" error.
>>>
>>> There are some trac requests associated with very similar errors (https://svn.open-mpi.org/trac/ompi/search?q=coalescing) but they are all closed (except https://svn.open-mpi.org/trac/ompi/ticket/2352, which might be related), aren't they? What would you suggest Terry?
>>
>> Interesting, though it looks to me like the segv in ticket 2352 would have happened on the send side instead of the receive side like you have. As to what to do next, it would be really nice to have some sort of reproducer that we can try to debug what is really going on. The only other thing to do without a reproducer is to inspect the code on the send side to figure out what might make it generate a 0 hdr->tag. Or maybe instrument the send side to stop when it is about ready to send a 0 hdr->tag and see if we can see how the code got there.
>>
>> I might have some cycles to look at this Monday.
>>
>> --td
>>
>>> Eloi
>>>
>>> On Friday 24 September 2010 16:00:26 Terry Dontje wrote:
>>>> Eloi Gaudry wrote:
>>>>> Terry,
>>>>>
>>>>> No, I haven't tried any other values than P,65536,256,192,128 yet.
>>>>>
>>>>> The reason why is quite simple. I've been reading and reading again this thread to understand the btl_openib_receive_queues meaning and I can't figure out why the default values seem to induce the hdr->tag=0 issue (http://www.open-mpi.org/community/lists/users/2009/01/7808.php).
>>>>
>>>> Yeah, the size of the fragments and number of them really should not cause this issue. So I too am a little perplexed about it.
>>>>
>>>>> Do you think that the default shared received queue parameters are erroneous for this specific Mellanox card? Any help on finding the proper parameters would actually be much appreciated.
>>>>
>>>> I don't necessarily think it is the queue size for a specific card but more so the handling of the queues by the BTL when using certain sizes. At least that is one gut feel I have.
>>>>
>>>> In my mind the tag being 0 is either something below OMPI polluting the data fragment or OMPI's internal protocol somehow getting messed up. I can imagine (no empirical data here) that the queue sizes could change how the OMPI protocol sets things up. Another thing may be the coalescing feature in the openib BTL, which tries to gang multiple messages into one packet when resources are running low. I can see where changing the queue sizes might affect the coalescing. So, it might be interesting to turn off the coalescing. You can do that by setting "--mca btl_openib_use_message_coalescing 0" in your mpirun line.
>>>>
>>>> If that doesn't solve the issue then obviously there must be something else going on :-).
>>>>
>>>> Note, the reason I am interested in this is that I am seeing a similar error condition (hdr->tag == 0) on a development system. Though my failing case fails with np=8 using the connectivity test program, which is mainly point to point and there are not a significant amount of data transfers going on either.
>>>>
>>>> --td
>>>>
>>>>> Eloi
>>>>>
>>>>> On Friday 24 September 2010 14:27:07 you wrote:
>>>>>> That is interesting. So does the number of processes affect your runs any? The times I've seen hdr->tag be 0 it usually has been due to protocol issues. The tag should never be 0. Have you tried receive_queue settings other than the default and the one you mention? I wonder whether a combination of the two receive queues causes a failure or not, something like P,128,256,192,128:P,65536,256,192,128. I am wondering if it is the first queuing definition causing the issue or possibly the SRQ defined in the default.
>>>>>>
>>>>>> --td
>>>>>>
>>>>>> Eloi Gaudry wrote:
>>>>>>> Hi Terry,
>>>>>>>
>>>>>>> The messages being sent/received can be of any size, but the error seems to happen more often with small messages (as an int being broadcast or allreduced).