Re: [OMPI users] [openib] segfault when using openib btl

2010-09-29 Thread Terry Dontje
In some of the testing Eloi did earlier, he disabled eager RDMA and still saw the issue. --td Shamis, Pavel wrote: Terry, Ishai Rabinovitz is HPC team manager (I added him to CC) Eloi, Back to the issue. I have seen a very similar issue a long time ago on some hardware platforms that support

Re: [OMPI users] [openib] segfault when using openib btl

2010-09-29 Thread Eloi Gaudry
Pasha, Thanks for your help. I'm not aware of such memory configuration on the new cluster of our customer (each computing node is running the Red-Hat 5.x operating system on Intel X5570 processors). Anyway, I've already tried to deactivate eager_rdma, but this wouldn't solve the hdr->tag=0

Re: [OMPI users] [openib] segfault when using openib btl

2010-09-29 Thread Shamis, Pavel
Terry, Ishai Rabinovitz is HPC team manager (I added him to CC) Eloi, Back to the issue. I have seen a very similar issue a long time ago on some hardware platforms that support relaxed ordering memory operations. If I remember correctly it was some IBM platform. Do you know if relaxed memory ordering

Re: [OMPI users] [openib] segfault when using openib btl

2010-09-29 Thread Terry Dontje
Pasha, do you by any chance know who at Mellanox might be responsible for OMPI working? --td Eloi Gaudry wrote: Hi Nysal, Terry, Thanks for your input on this issue. I'll follow your advice. Do you know any Mellanox developer I may discuss with, preferably someone who has spent some time

Re: [OMPI users] [openib] segfault when using openib btl

2010-09-29 Thread Eloi Gaudry
Hi Nysal, Terry, Thanks for your input on this issue. I'll follow your advice. Do you know any Mellanox developer I may discuss with, preferably someone who has spent some time inside the openib btl ? Regards, Eloi On 29/09/2010 06:01, Nysal Jan wrote: Hi Eloi, We discussed this issue

Re: [OMPI users] [openib] segfault when using openib btl

2010-09-29 Thread Nysal Jan
> Hi Nysal, > > Thanks for your suggestions. >

Re: [OMPI users] [openib] segfault when using openib btl

2010-09-27 Thread Terry Dontje
this error. Regards, Eloi -- Eloi Gaudry Free Field Technologies Company Website: http://www.fft.be Company Phone: +32 10 487 959 -- Forwarded message -- From: Eloi Gaudry <e...@fft.be> To: Open MPI Users <us...@open-mpi.org>

Re: [OMPI users] [openib] segfault when using openib btl

2010-09-27 Thread Eloi Gaudry
> option, you were right). I haven't been able to observe the segmentation fault (with hdr->tag=0) so far (when using pml csum) but I'll let

Re: [OMPI users] [openib] segfault when using openib btl

2010-09-27 Thread Terry Dontje
this error. Regards, Eloi -- Eloi Gaudry Free Field Technologies Company Website: http://www.fft.be Company Phone: +32 10 487 959 -- Forwarded message -- From: Eloi Gaudry <e...@fft.be> To: Open MPI Users <us...@open-mpi.org> Date: Wed, 15 Sep 2010 16:

Re: [OMPI users] [openib] segfault when using openib btl

2010-09-27 Thread Eloi Gaudry
> instance); I followed the guidelines given at http://icl.cs.utk.edu/open-mpi/faq/?category=openfabrics#ib-small-message-rdma but the

Re: [OMPI users] [openib] segfault when using openib btl

2010-09-27 Thread Eloi Gaudry
> instance); I followed the guidelines given at http://icl.cs.utk.edu/open-mpi/faq/?category=openfabrics#ib-small-message-rdma but the

Re: [OMPI users] [openib] segfault when using openib btl

2010-09-24 Thread Terry Dontje
Eloi Gaudry <e...@fft.be> To: Open MPI Users <us...@open-mpi.org> Date: Wed, 15 Sep 2010 16:27:43 +0200 Subject: Re: [OMPI users] [openib] segfault when using openib btl Hi, I was wondering if anybody got a chance to have a look at this issue. Regards, Eloi On Wednesday 18 August 2010 0

Re: [OMPI users] [openib] segfault when using openib btl

2010-09-24 Thread Eloi Gaudry
> ConnectX IB DDR, PCIe 2.0 2.5GT/s, rev a0) with our time-domain software. > > I checked, double-checked, and rechecked again every MPI use perf

Re: [OMPI users] [openib] segfault when using openib btl

2010-09-24 Thread Eloi Gaudry
> I've just used the "-mca pml csum" option and I haven't seen any related messages (when hdr->tag=0 and the segfault occurs). > Any suggestion? > > Regards,

Re: [OMPI users] [openib] segfault when using openib btl

2010-09-24 Thread Terry Dontje
at this error. Regards, Eloi -- Eloi Gaudry Free Field Technologies Company Website: http://www.fft.be Company Phone: +32 10 487 959 -- Forwarded message -- From: Eloi Gaudry <e...@fft.be> To: Open MPI Users <us...@open-mpi.org> Date: We

Re: [OMPI users] [openib] segfault when using openib btl

2010-09-24 Thread Eloi Gaudry
> y it at the receiver side for any data corruption. You can try using it to see if it is able to catch anything. > > Regards > --Nysal
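
The idea behind a checksumming layer, as described in the quoted advice (compute a checksum on the sending side, recompute and compare it on the receiving side), can be sketched roughly as below. This is only an illustration under stated assumptions: the Fletcher-16 checksum, the `sketch_msg_t` layout and the `sketch_*` function names are hypothetical stand-ins chosen for brevity, not what the csum PML actually does internally.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fletcher-16 over a byte buffer (stand-in checksum, chosen for brevity). */
uint16_t sketch_fletcher16(const uint8_t *data, size_t len)
{
    uint16_t s1 = 0, s2 = 0;
    for (size_t i = 0; i < len; i++) {
        s1 = (uint16_t)((s1 + data[i]) % 255);
        s2 = (uint16_t)((s2 + s1) % 255);
    }
    return (uint16_t)((s2 << 8) | s1);
}

/* Hypothetical message: a small header carrying the checksum, then the payload. */
typedef struct {
    uint16_t csum;
    size_t   len;
    uint8_t  payload[64];
} sketch_msg_t;

/* Sender side: copy the payload and record its checksum in the header.
 * Returns 0 on success, -1 if the payload does not fit. */
int sketch_pack(sketch_msg_t *msg, const void *data, size_t len)
{
    if (len > sizeof(msg->payload)) {
        return -1;
    }
    memcpy(msg->payload, data, len);
    msg->len = len;
    msg->csum = sketch_fletcher16(msg->payload, len);
    return 0;
}

/* Receiver side: recompute the checksum and compare it with the header.
 * Returns 1 if the payload matches, 0 if corruption is detected. */
int sketch_verify(const sketch_msg_t *msg)
{
    if (msg->len > sizeof(msg->payload)) {
        return 0;
    }
    return sketch_fletcher16(msg->payload, msg->len) == msg->csum;
}
```

A check like this only catches corruption of the bytes it covers, so it would stay silent if the damage happened elsewhere, for instance in a header field that is not part of the checksummed region.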

Re: [OMPI users] [openib] segfault when using openib btl

2010-09-23 Thread Terry Dontje
.@fft.be> To: Open MPI Users <us...@open-mpi.org> Date: Wed, 15 Sep 2010 16:27:43 +0200 Subject: Re: [OMPI users] [openib] segfault when using openib btl Hi, I was wondering if anybody got a chance to have a look at this issue. Regards, Eloi On Wednesday 18 August 2010 09:16

Re: [OMPI users] [openib] segfault when using openib btl

2010-09-22 Thread Eloi Gaudry
> On Thu, Sep 16, 2010 at 3:48 PM, Eloi Gaudry <e...@fft.be> wrote: > > Hi Nysal, > > > > I'm sorry to interrupt, but I was wondering if you had a chance to look at this

Re: [OMPI users] [openib] segfault when using openib btl

2010-09-17 Thread Eloi Gaudry
> Regards, > Eloi > > -- > Eloi Gaudry > Free Field Technologies > Company Website: http://www.fft.be > Company Phone: +32 10 487 959 > > -- Forwarded

Re: [OMPI users] [openib] segfault when using openib btl

2010-09-17 Thread Nysal Jan
fft.be > Company Phone: +32 10 487 959 > > > -- Forwarded message -- > From: Eloi Gaudry <e...@fft.be> > To: Open MPI Users <us...@open-mpi.org> > Date: Wed, 15 Sep 2010 16:27:43 +0200 > Subject: Re: [OMPI users] [openib] segfault when using openib

Re: [OMPI users] [openib] segfault when using openib btl

2010-09-15 Thread Eloi Gaudry
Hi, I was wondering if anybody got a chance to have a look at this issue. Regards, Eloi On Wednesday 18 August 2010 09:16:26 Eloi Gaudry wrote: > Hi Jeff, > > Please find enclosed the output (valgrind.out.gz) from > /opt/openmpi-debug-1.4.2/bin/orterun -np 2 --host pbn11,pbn10 --mca btl >

Re: [OMPI users] [openib] segfault when using openib btl

2010-08-20 Thread Eloi Gaudry
Hi Jeff, here is the valgrind output when using OpenMPI-1.5rc5, just in case. Thanks, Eloi On Wednesday 18 August 2010 23:01:49 Jeff Squyres wrote: > On Aug 17, 2010, at 12:32 AM, Eloi Gaudry wrote: > > would it help if i use the upcoming 1.5 version of openmpi ? i read that > > a huge effort

Re: [OMPI users] [openib] segfault when using openib btl

2010-08-18 Thread Jeff Squyres
On Aug 17, 2010, at 12:32 AM, Eloi Gaudry wrote: > would it help if i use the upcoming 1.5 version of openmpi ? i read that a > huge effort has been done to clean-up the valgrind output ? but maybe that > this doesn't > concern this btl (for the reasons you mentioned). I do not believe that

Re: [OMPI users] [openib] segfault when using openib btl

2010-08-18 Thread Eloi Gaudry
Hi Jeff, Please find enclosed the output (valgrind.out.gz) from /opt/openmpi-debug-1.4.2/bin/orterun -np 2 --host pbn11,pbn10 --mca btl openib,self --display-map --verbose --mca mpi_warn_on_fork 0 --mca btl_openib_want_fork_support 0 -tag-output /opt/valgrind-3.5.0/bin/valgrind --tool=memcheck

Re: [OMPI users] [openib] segfault when using openib btl

2010-08-17 Thread Eloi Gaudry
Hi Nysal, There is only one thread invoking MPI functions in our applications. Other threads are related to flexlm protection routines and some self-diagnostics routines that don't use any MPI functions. I built a version of our application, just to be sure, without any other thread that the

Re: [OMPI users] [openib] segfault when using openib btl

2010-08-17 Thread Nysal Jan
Hi Eloi, >Do you think that a thread race condition could explain the hdr->tag value ? Are there multiple threads invoking MPI functions in your application? The openib BTL is not yet thread safe in the 1.4 release series. There have been improvements to openib BTL thread safety in 1.5, but it is

Re: [OMPI users] [openib] segfault when using openib btl

2010-08-17 Thread Eloi Gaudry
Hi Nysal, This is what I was wondering: whether hdr->tag was expected to be null or not. I'll soon send a valgrind output to the list, hoping this could help to locate an invalid memory access and to understand why reg->cbfunc / hdr->tag are null. Do you think that a thread race condition

Re: [OMPI users] [openib] segfault when using openib btl

2010-08-17 Thread Eloi Gaudry
On Monday 16 August 2010 19:14:47 Jeff Squyres wrote: > On Aug 16, 2010, at 10:05 AM, Eloi Gaudry wrote: > > I did run our application through valgrind but it couldn't find any > > "Invalid write": there is a bunch of "Invalid read" (I'm using 1.4.2 > > with the suppression file), "Use of

Re: [OMPI users] [openib] segfault when using openib btl

2010-08-16 Thread Nysal Jan
The value of hdr->tag seems wrong. In ompi/mca/pml/ob1/pml_ob1_hdr.h
#define MCA_PML_OB1_HDR_TYPE_MATCH (MCA_BTL_TAG_PML + 1)
#define MCA_PML_OB1_HDR_TYPE_RNDV (MCA_BTL_TAG_PML + 2)
#define MCA_PML_OB1_HDR_TYPE_RGET (MCA_BTL_TAG_PML + 3)
#define MCA_PML_OB1_HDR_TYPE_ACK
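
To make the point about the tag concrete, here is a minimal sketch of why a tag of 0 would end in a call through a null function pointer. It assumes a simplified model in which receive callbacks sit in a table indexed by the tag; the `sketch_*` names, the table layout and the 0x40 base value for MCA_BTL_TAG_PML are assumptions made for illustration, not the actual openib BTL code. Only the three `+ 1`, `+ 2`, `+ 3` definitions come from the message above.

```c
#include <stddef.h>
#include <stdint.h>

/* Base value assumed for illustration; only the offsets below are quoted
 * from the thread. */
#define MCA_BTL_TAG_PML 0x40
#define MCA_PML_OB1_HDR_TYPE_MATCH (MCA_BTL_TAG_PML + 1)
#define MCA_PML_OB1_HDR_TYPE_RNDV  (MCA_BTL_TAG_PML + 2)
#define MCA_PML_OB1_HDR_TYPE_RGET  (MCA_BTL_TAG_PML + 3)

#define SKETCH_TAG_MAX 256

typedef void (*sketch_cb_fn)(const void *payload, size_t len, void *cbdata);

/* One registration per tag: a callback and its user data. */
typedef struct {
    sketch_cb_fn cbfunc;
    void *cbdata;
} sketch_reg_t;

/* Zero-initialised: every tag that was never registered has cbfunc == NULL. */
sketch_reg_t sketch_table[SKETCH_TAG_MAX];

/* Register a callback for a tag. Returns 0 on success, -1 for a NULL callback. */
int sketch_register(uint8_t tag, sketch_cb_fn fn, void *cbdata)
{
    if (fn == NULL) {
        return -1;
    }
    sketch_table[tag].cbfunc = fn;
    sketch_table[tag].cbdata = cbdata;
    return 0;
}

/* Dispatch an incoming fragment by tag. A defensive version checks the
 * pointer; calling reg->cbfunc unconditionally with tag 0 would jump to
 * address 0. Returns 0 if a callback ran, -1 if none is registered. */
int sketch_handle_incoming(uint8_t tag, const void *payload, size_t len)
{
    const sketch_reg_t *reg = &sketch_table[tag];
    if (reg->cbfunc == NULL) {
        return -1;
    }
    reg->cbfunc(payload, len, reg->cbdata);
    return 0;
}

/* Example callback: counts deliveries through the int that cbdata points to. */
void sketch_count_cb(const void *payload, size_t len, void *cbdata)
{
    (void)payload;
    (void)len;
    *(int *)cbdata += 1;
}
```

Under this model every valid PML tag is strictly greater than the base value, so a header whose tag reads 0 can only come from a corrupted or misread fragment, and its table slot has no callback.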

Re: [OMPI users] [openib] segfault when using openib btl

2010-08-16 Thread Eloi Gaudry
Hi Jeff, Thanks for your reply. I did run our application through valgrind but it couldn't find any "Invalid write": there is a bunch of "Invalid read" (I'm using 1.4.2 with the suppression file), "Use of uninitialized bytes" and "Conditional jump depending on uninitialized bytes" in

Re: [OMPI users] [openib] segfault when using openib btl

2010-08-16 Thread Jeff Squyres
Sorry for the delay in replying. Odd; the values of the callback function pointer should never be 0. This seems to suggest some kind of memory corruption is occurring. I don't know if it's possible, because the stack trace looks like you're calling through python, but can you run this

Re: [OMPI users] [openib] segfault when using openib btl

2010-08-10 Thread Eloi Gaudry
Hi, sorry, i just forgot to add the values of the function parameters:
(gdb) print reg->cbdata
$1 = (void *) 0x0
(gdb) print openib_btl->super
$2 = {btl_component = 0x2b341edd7380, btl_eager_limit = 12288, btl_rndv_eager_limit = 12288, btl_max_send_size = 65536, btl_rdma_pipeline_send_length =

Re: [OMPI users] [openib] segfault when using openib btl

2010-08-10 Thread Eloi Gaudry
Hi, Here is the output of a core file generated during a segmentation fault observed during a collective call (using openib):
#0 0x in ?? ()
(gdb) where
#0 0x in ?? ()
#1 0x2aedbc4e05f4 in btl_openib_handle_incoming (openib_btl=0x1902f9b0, ep=0x1908a1c0,

Re: [OMPI users] [openib] segfault when using openib btl

2010-07-16 Thread Eloi Gaudry
Hi Edgar, The only difference I could observe was that the segmentation fault sometimes appeared later during the parallel computation. I'm running out of ideas here. I wish I could use the "--mca coll tuned" with "--mca self,sm,tcp" so that I could check that the issue is not somehow limited

Re: [OMPI users] [openib] segfault when using openib btl

2010-07-15 Thread Edgar Gabriel
On 7/15/2010 10:18 AM, Eloi Gaudry wrote: > hi edgar, > > thanks for the tips, I'm gonna try this option as well. the segmentation > fault i'm observing always happened during a collective communication > indeed... > does it basically switch all collective communication to basic mode, right ? >

Re: [OMPI users] [openib] segfault when using openib btl

2010-07-15 Thread Eloi Gaudry
hi edgar, thanks for the tips, I'm gonna try this option as well. the segmentation fault i'm observing always happened during a collective communication indeed... does it basically switch all collective communication to basic mode, right ? sorry for my ignorance, but what's a NCA ? thanks,

Re: [OMPI users] [openib] segfault when using openib btl

2010-07-15 Thread Edgar Gabriel
you could try first to use the algorithms in the basic module, e.g. mpirun -np x --mca coll basic ./mytest and see whether this makes a difference. I used to observe sometimes a (similar ?) problem in the openib btl triggered from the tuned collective component, in cases where the ofed libraries

Re: [OMPI users] [openib] segfault when using openib btl

2010-07-15 Thread Eloi Gaudry
hi Rolf, unfortunately, i couldn't get rid of that annoying segmentation fault when selecting another bcast algorithm. i'm now going to replace MPI_Bcast with a naive implementation (using MPI_Send and MPI_Recv) and see if that helps. regards, éloi On Wednesday 14 July 2010 10:59:53 Eloi

Re: [OMPI users] [openib] segfault when using openib btl

2010-07-14 Thread Eloi Gaudry
Hi Rolf, thanks for your input. You're right, I missed the coll_tuned_use_dynamic_rules option. I'll check if the segmentation fault disappears when using the basic bcast linear algorithm using the proper command line you provided. Regards, Eloi On Tuesday 13 July 2010 20:39:59 Rolf

Re: [OMPI users] [openib] segfault when using openib btl

2010-07-13 Thread Rolf vandeVaart
Hi Eloi: To select the different bcast algorithms, you need to add an extra mca parameter that tells the library to use dynamic selection. --mca coll_tuned_use_dynamic_rules 1 One way to make sure you are typing this in correctly is to use it with ompi_info. Do the following: ompi_info -mca

Re: [OMPI users] [openib] segfault when using openib btl

2010-07-13 Thread Eloi Gaudry
Hi, I've found that "--mca coll_tuned_bcast_algorithm 1" allowed me to switch to the basic linear algorithm. Anyway, whatever the algorithm used, the segmentation fault remains. Could anyone give some advice on ways to diagnose the issue I'm facing? Regards, Eloi On Monday 12 July 2010

Re: [OMPI users] [openib] segfault when using openib btl

2010-07-12 Thread Eloi Gaudry
Hi, I'm focusing on the MPI_Bcast routine that seems to randomly segfault when using the openib btl. I'd like to know if there is any way to make OpenMPI switch to a different algorithm than the default one being selected for MPI_Bcast. Thanks for your help, Eloi On Friday 02 July 2010

[OMPI users] [openib] segfault when using openib btl

2010-07-02 Thread Eloi Gaudry
Hi, I'm observing a random segmentation fault during an internode parallel computation involving the openib btl and OpenMPI-1.4.2 (the same issue can be observed with OpenMPI-1.3.3). mpirun (Open MPI) 1.4.2 Report bugs to http://www.open-mpi.org/community/help/ [pbn08:02624] *** Process