In some of the testing Eloi did earlier, he disabled eager RDMA and still saw the issue.
--td
Pasha,
Thanks for your help.
I'm not aware of such a memory configuration on our customer's new cluster (each compute node runs the Red Hat 5.x operating system on Intel X5570 processors).
Anyway, I've already tried deactivating eager_rdma, but this didn't solve the hdr->tag=0 issue.
Terry,
Ishai Rabinovitz is HPC team manager (I added him to CC)
Eloi,
Back to the issue. I have seen a very similar issue a long time ago on some hardware platforms that support relaxed ordering memory operations. If I remember correctly, it was some IBM platform.
Do you know if relaxed memory ordering
Pasha, do you by any chance know who at Mellanox might be responsible for OMPI work?
--td
Eloi Gaudry wrote:
Hi Nysal, Terry,
Thanks for your input on this issue.
I'll follow your advice. Do you know any Mellanox developer I could discuss this with, preferably someone who has spent some time inside the openib btl?
Regards,
Eloi
On 29/09/2010 06:01, Nysal Jan wrote:
Hi Eloi,
We discussed this issue
Hi Nysal,
Thanks for your suggestions.
Regards,
Eloi
--
Eloi Gaudry
Free Field Technologies
Company Website: http://www.fft.be
Company Phone: +32 10 487 959
option, you were right). I haven't been able to observe the segmentation fault (with hdr->tag=0) so far (when using pml csum) but I'll let
instance); i followed the guidelines given at http://icl.cs.utk.edu/open-mpi/faq/?category=openfabrics#ib-small-message-rdma but the
From: Eloi Gaudry <e...@fft.be>
To: Open MPI Users <us...@open-mpi.org>
Date: Wed, 15 Sep 2010 16:27:43 +0200
Subject: Re: [OMPI users] [openib] segfault when using openib btl
Hi,
I was wondering if anybody got a chance to have a look at this
issue.
Regards,
Eloi
On Wednesday 18 August 2010 0
ConnectX IB DDR, PCIe 2.0 2.5GT/s, rev a0) with our time-domain software.
I checked, double-checked, and rechecked again every MPI use perf
I've just used the "-mca pml csum" option and I haven't seen any related messages (when hdr->tag=0 and the segfault occurs). Any suggestion?
Regards,
verify it at the receiver side for any data corruption. You can try using it to see if it is able to catch anything.
Regards
--Nysal
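For reference, enabling the csum PML is just an extra MCA parameter on the command line; a minimal illustrative invocation (the executable name and process count are placeholders, not from the thread) would be:
mpirun --mca pml csum -np 2 ./myapp
The csum PML checksums each message on the send side and verifies it on the receive side, so a mismatch gets reported instead of corrupted data being passed silently to the application.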
On Thu, Sep 16, 2010 at 3:48 PM, Eloi Gaudry <e...@fft.be> wrote:
Hi Nysal,
I'm sorry to interrupt, but I was wondering if you had a chance to look at this
Regards,
Eloi
On Wednesday 18 August 2010 09:16:26 Eloi Gaudry wrote:
> Hi Jeff,
>
> Please find enclosed the output (valgrind.out.gz) from
> /opt/openmpi-debug-1.4.2/bin/orterun -np 2 --host pbn11,pbn10 --mca btl
>
Hi Jeff,
here is the valgrind output when using OpenMPI-1.5rc5, just in case.
Thanks,
Eloi
On Wednesday 18 August 2010 23:01:49 Jeff Squyres wrote:
On Aug 17, 2010, at 12:32 AM, Eloi Gaudry wrote:
> would it help if i use the upcoming 1.5 version of openmpi? i read that a huge effort has been done to clean up the valgrind output? but maybe this doesn't concern this btl (for the reasons you mentioned).
I do not believe that
Hi Jeff,
Please find enclosed the output (valgrind.out.gz) from
/opt/openmpi-debug-1.4.2/bin/orterun -np 2 --host pbn11,pbn10 --mca btl
openib,self --display-map --verbose --mca mpi_warn_on_fork 0 --mca
btl_openib_want_fork_support 0 -tag-output /opt/valgrind-3.5.0/bin/valgrind
--tool=memcheck
Hi Nysal,
There is only one thread invoking MPI functions in our application. The other threads are related to FlexLM protection routines and some self-diagnostic routines that don't use any MPI functions. I built a version of our application, just to be sure, without any other thread than the
Hi Eloi,
> Do you think that a thread race condition could explain the hdr->tag value?
Are there multiple threads invoking MPI functions in your application? The
openib BTL is not yet thread safe in the 1.4 release series. There have been
improvements to openib BTL thread safety in 1.5, but it is
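For completeness, an application that genuinely needs several threads to call MPI concurrently has to request that level of support at initialization; a minimal sketch using the standard MPI API (illustrative only, not taken from Eloi's code) looks like this:
/* Request full multi-threaded support and check what the library provides. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) {
        /* e.g. with the 1.4 openib btl, concurrent MPI calls from
           several threads are not guaranteed to be safe */
        fprintf(stderr, "MPI_THREAD_MULTIPLE not provided (got %d)\n", provided);
    }
    MPI_Finalize();
    return 0;
}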
Hi Nysal,
This is what I was wondering, whether hdr->tag was expected to be null or not. I'll soon send a valgrind output to the list, hoping this could help locate an invalid memory access and understand why reg->cbfunc / hdr->tag are null.
Do you think that a thread race condition
On Monday 16 August 2010 19:14:47 Jeff Squyres wrote:
> On Aug 16, 2010, at 10:05 AM, Eloi Gaudry wrote:
> > I did run our application through valgrind but it couldn't find any
> > "Invalid write": there is a bunch of "Invalid read" (I'm using 1.4.2
> > with the suppression file), "Use of
The value of hdr->tag seems wrong.
In ompi/mca/pml/ob1/pml_ob1_hdr.h
#define MCA_PML_OB1_HDR_TYPE_MATCH (MCA_BTL_TAG_PML + 1)
#define MCA_PML_OB1_HDR_TYPE_RNDV (MCA_BTL_TAG_PML + 2)
#define MCA_PML_OB1_HDR_TYPE_RGET (MCA_BTL_TAG_PML + 3)
#define MCA_PML_OB1_HDR_TYPE_ACK (MCA_BTL_TAG_PML + 4)
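To make the failure mode concrete, here is a rough sketch of how a BTL-style receive path dispatches on hdr->tag. This is not the actual Open MPI source; frag_t, am_callback_t and btl_registry are invented names for illustration. The point is that only the PML tags above ever get a callback registered, so a header whose tag has been zeroed leads to a lookup that yields a NULL cbfunc, and calling it jumps to address 0x0, consistent with the backtrace reported later in this thread.
/* Illustrative sketch only -- not Open MPI code. */
#include <stdint.h>
#include <stdio.h>

typedef struct { uint8_t tag; } frag_t;
typedef void (*am_cbfunc_t)(frag_t *frag, void *cbdata);
typedef struct { am_cbfunc_t cbfunc; void *cbdata; } am_callback_t;

/* One slot per possible tag; only the PML tags (MATCH, RNDV, ...) are
   ever registered, so slot 0 remains { NULL, NULL }. */
static am_callback_t btl_registry[256];

static void handle_incoming(frag_t *frag)
{
    am_callback_t *reg = &btl_registry[frag->tag];
    if (reg->cbfunc == NULL) {
        /* A corrupted header (tag == 0) ends up here; without this check
           the call below would jump to address 0x0 and segfault. */
        fprintf(stderr, "no callback registered for tag %u\n", (unsigned) frag->tag);
        return;
    }
    reg->cbfunc(frag, reg->cbdata);
}

int main(void)
{
    frag_t bad = { 0 };      /* tag zeroed, as in the reported crash */
    handle_incoming(&bad);   /* prints the diagnostic instead of crashing */
    return 0;
}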
Hi Jeff,
Thanks for your reply.
I did run our application through valgrind but it couldn't find any
"Invalid write": there is a bunch of "Invalid read" (I'm using 1.4.2
with the suppression file), "Use of uninitialized bytes" and
"Conditional jump depending on uninitialized bytes" in
Sorry for the delay in replying.
Odd; the value of the callback function pointer should never be 0. This seems
to suggest some kind of memory corruption is occurring.
I don't know if it's possible, because the stack trace looks like you're
calling through python, but can you run this
Hi,
sorry, i just forgot to add the values of the function parameters:
(gdb) print reg->cbdata
$1 = (void *) 0x0
(gdb) print openib_btl->super
$2 = {btl_component = 0x2b341edd7380, btl_eager_limit = 12288,
btl_rndv_eager_limit = 12288, btl_max_send_size = 65536,
btl_rdma_pipeline_send_length =
Hi,
Here is the output of a core file generated during a segmentation fault
observed during a collective call (using openib):
#0 0x in ?? ()
(gdb) where
#0 0x in ?? ()
#1 0x2aedbc4e05f4 in btl_openib_handle_incoming (openib_btl=0x1902f9b0,
ep=0x1908a1c0,
Hi Edgar,
The only difference I could observe was that the segmentation fault sometimes appeared later during the parallel computation.
I'm running out of ideas here. I wish I could use "--mca coll tuned" with "--mca btl self,sm,tcp" so that I could check that the issue is not somehow limited
On 7/15/2010 10:18 AM, Eloi Gaudry wrote:
hi edgar,
thanks for the tips, I'm gonna try this option as well. the segmentation fault I'm observing always happens during a collective communication indeed...
it basically switches all collective communications to the basic mode, right?
sorry for my ignorance, but what's a NCA?
thanks,
you could try first to use the algorithms in the basic module, e.g.
mpirun -np x --mca coll basic ./mytest
and see whether this makes a difference. I used to sometimes observe a
(similar?) problem in the openib btl triggered from the tuned
collective component, in cases where the ofed libraries
hi Rolf,
unfortunately, i couldn't get rid of that annoying segmentation fault when
selecting another bcast algorithm.
i'm now going to replace MPI_Bcast with a naive implementation (using MPI_Send
and MPI_Recv) and see if that helps.
regards,
éloi
On Wednesday 14 July 2010 10:59:53 Eloi
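For reference, a naive linear broadcast of the kind described above could look like the following sketch (illustrative only, not Eloi's actual replacement code):
/* Sketch of a naive linear broadcast built on MPI_Send/MPI_Recv, as a
   stand-in for MPI_Bcast while debugging. Illustrative only. */
#include <mpi.h>

static int naive_bcast(void *buf, int count, MPI_Datatype type,
                       int root, MPI_Comm comm)
{
    int rank, size, err;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    if (rank == root) {
        /* Root sends the buffer to every other rank, one at a time. */
        for (int peer = 0; peer < size; ++peer) {
            if (peer == root) continue;
            err = MPI_Send(buf, count, type, peer, 0, comm);
            if (err != MPI_SUCCESS) return err;
        }
    } else {
        /* Everyone else receives directly from the root. */
        err = MPI_Recv(buf, count, type, root, 0, comm, MPI_STATUS_IGNORE);
        if (err != MPI_SUCCESS) return err;
    }
    return MPI_SUCCESS;
}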
Hi Rolf,
thanks for your input. You're right, I missed the coll_tuned_use_dynamic_rules option.
I'll check whether the segmentation fault disappears when using the basic linear bcast algorithm with the command line you provided.
Regards,
Eloi
On Tuesday 13 July 2010 20:39:59 Rolf
Hi Eloi:
To select the different bcast algorithms, you need to add an extra mca
parameter that tells the library to use dynamic selection.
--mca coll_tuned_use_dynamic_rules 1
One way to make sure you are typing this in correctly is to use it with
ompi_info. Do the following:
ompi_info -mca
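Putting the two parameters together on an mpirun command line would look roughly like this (process count and executable are placeholders, not from the thread):
mpirun -np 16 --mca coll_tuned_use_dynamic_rules 1 --mca coll_tuned_bcast_algorithm 1 ./myapp
Here coll_tuned_bcast_algorithm 1 selects the basic linear broadcast, as Eloi notes below.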
Hi,
I've found that "--mca coll_tuned_bcast_algorithm 1" allowed me to switch to the basic linear algorithm.
Anyway, whatever the algorithm used, the segmentation fault remains.
Could anyone give some advice on ways to diagnose the issue I'm facing?
Regards,
Eloi
On Monday 12 July 2010
Hi,
I'm focusing on the MPI_Bcast routine that seems to randomly segfault when
using the openib btl.
I'd like to know if there is any way to make OpenMPI switch to a different
algorithm than the default one being selected for MPI_Bcast.
Thanks for your help,
Eloi
On Friday 02 July 2010
Hi,
I'm observing a random segmentation fault during an internode parallel
computation involving the openib btl and OpenMPI-1.4.2 (the same issue
can be observed with OpenMPI-1.3.3).
mpirun (Open MPI) 1.4.2
Report bugs to http://www.open-mpi.org/community/help/
[pbn08:02624] *** Process