In some of the testing Eloi did earlier, he disabled eager RDMA and still saw the issue.
--td
Pasha,
Thanks for your help.
I'm not aware of such a memory configuration on our customer's new cluster (each compute node runs the Red Hat 5.x operating system on Intel X5570 processors).
Anyway, I've already tried deactivating eager_rdma, but this didn't solve the hdr->tag=0 issue.
Terry,
Ishai Rabinovitz is HPC team manager (I added him to CC)
Eloi,
Back to the issue. I have seen a very similar issue a long time ago on some hardware platforms that support relaxed ordering memory operations. If I remember correctly, it was some IBM platform.
Do you know if relaxed memory ordering
Pasha, do you by any chance know who at Mellanox might be responsible for OMPI work?
--td
Eloi Gaudry wrote:
Hi Nysal, Terry,
Thanks for your input on this issue.
I'll follow your advice. Do you know any Mellanox developer I could discuss this with, preferably someone who has spent some time inside the openib btl?
Regards,
Eloi
On 29/09/2010 06:01, Nysal Jan wrote:
Hi Eloi,
We discussed this issue
Hi Nysal,
Thanks for your suggestions.
Regards,
Eloi
--
Eloi Gaudry
Free Field Technologies
Company Website: http://www.fft.be
Company Phone: +32 10 487 959
option, you were right). I haven't been able to observe the segmentation fault (with hdr->tag=0) so far (when using pml csum) but I'll let
instance); i followed the guidelines given at http://icl.cs.utk.edu/open-mpi/faq/?category=openfabrics#ib-small-message-rdma but the
From: Eloi Gaudry <e...@fft.be>
To: Open MPI Users <us...@open-mpi.org>
Date: Wed, 15 Sep 2010 16:27:43 +0200
Subject: Re: [OMPI users] [openib] segfault when using openib btl
Hi,
I was wondering if anybody got a chance to have a look at this
issue.
Regards,
Eloi
On Wednesday 18 August 2010 0
ConnectX IB DDR, PCIe 2.0 2.5GT/s, rev a0) with our time-domain software.
I checked, double-checked, and rechecked again every MPI use perf
I've just used the "-mca pml csum" option and I haven't seen any related messages (when hdr->tag=0 and the segfault occurs). Any suggestion?
Regards,
verify it at the receiver side for any data corruption. You can try using it to see if it is able to catch anything.
Regards
--Nysal
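For reference, enabling the csum PML is just an extra MCA parameter on the command line; a minimal illustrative invocation (the executable name and process count are placeholders, not from the thread) would be:
mpirun --mca pml csum -np 2 ./myapp
The csum PML checksums each message on the send side and verifies it on the receive side, so a mismatch gets reported instead of corrupted data being passed silently to the application.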
On Thu, Sep 16, 2010 at 3:48 PM, Eloi Gaudry <e...@fft.be> wrote:
Hi Nysal,
I'm sorry to interrupt, but I was wondering if you had a chance to look at this
Regards,
Eloi
On Wednesday 18 August 2010 09:16:26 Eloi Gaudry wrote:
> Hi Jeff,
>
> Please find enclosed the output (valgrind.out.gz) from
> /opt/openmpi-debug-1.4.2/bin/orterun -np 2 --host pbn11,pbn10 --mca btl
>
Hi Jeff,
here is the valgrind output when using OpenMPI-1.5rc5, just in case.
Thanks,
Eloi
On Wednesday 18 August 2010 23:01:49 Jeff Squyres wrote:
On Aug 17, 2010, at 12:32 AM, Eloi Gaudry wrote:
> would it help if i use the upcoming 1.5 version of openmpi? i read that a huge effort has been done to clean up the valgrind output? but maybe this doesn't concern this btl (for the reasons you mentioned).
I do not believe that
Hi Jeff,
Please find enclosed the output (valgrind.out.gz) from
/opt/openmpi-debug-1.4.2/bin/orterun -np 2 --host pbn11,pbn10 --mca btl
openib,self --display-map --verbose --mca mpi_warn_on_fork 0 --mca
btl_openib_want_fork_support 0 -tag-output /opt/valgrind-3.5.0/bin/valgrind
--tool=memcheck
Hi Nysal,
There is only one thread invoking MPI functions in our application. The other threads are related to FlexLM protection routines and some self-diagnostic routines that don't use any MPI functions. I built a version of our application, just to be sure, without any other thread than the
Hi Eloi,
> Do you think that a thread race condition could explain the hdr->tag value?
Are there multiple threads invoking MPI functions in your application? The
openib BTL is not yet thread safe in the 1.4 release series. There have been
improvements to openib BTL thread safety in 1.5, but it is
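For completeness, an application that genuinely needs several threads to call MPI concurrently has to request that level of support at initialization; a minimal sketch using the standard MPI API (illustrative only, not taken from Eloi's code) looks like this:
/* Request full multi-threaded support and check what the library provides. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) {
        /* e.g. with the 1.4 openib btl, concurrent MPI calls from
           several threads are not guaranteed to be safe */
        fprintf(stderr, "MPI_THREAD_MULTIPLE not provided (got %d)\n", provided);
    }
    MPI_Finalize();
    return 0;
}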
Hi Nysal,
This is what I was wondering, whether hdr->tag was expected to be null or not. I'll soon send a valgrind output to the list, hoping this could help locate an invalid memory access and understand why reg->cbfunc / hdr->tag are null.
Do you think that a thread race condition
On Monday 16 August 2010 19:14:47 Jeff Squyres wrote:
> On Aug 16, 2010, at 10:05 AM, Eloi Gaudry wrote:
> > I did run our application through valgrind but it couldn't find any
> > "Invalid write": there is a bunch of "Invalid read" (I'm using 1.4.2
> > with the suppression file), "Use of
The value of hdr->tag seems wrong.
In ompi/mca/pml/ob1/pml_ob1_hdr.h
#define MCA_PML_OB1_HDR_TYPE_MATCH (MCA_BTL_TAG_PML + 1)
#define MCA_PML_OB1_HDR_TYPE_RNDV (MCA_BTL_TAG_PML + 2)
#define MCA_PML_OB1_HDR_TYPE_RGET (MCA_BTL_TAG_PML + 3)
#define MCA_PML_OB1_HDR_TYPE_ACK (MCA_BTL_TAG_PML + 4)
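To make the failure mode concrete, here is a rough sketch of how a BTL-style receive path dispatches on hdr->tag. This is not the actual Open MPI source; frag_t, am_callback_t and btl_registry are invented names for illustration. The point is that only the PML tags above ever get a callback registered, so a header whose tag has been zeroed leads to a lookup that yields a NULL cbfunc, and calling it jumps to address 0x0, consistent with the backtrace reported later in this thread.
/* Illustrative sketch only -- not Open MPI code. */
#include <stdint.h>
#include <stdio.h>

typedef struct { uint8_t tag; } frag_t;
typedef void (*am_cbfunc_t)(frag_t *frag, void *cbdata);
typedef struct { am_cbfunc_t cbfunc; void *cbdata; } am_callback_t;

/* One slot per possible tag; only the PML tags (MATCH, RNDV, ...) are
   ever registered, so slot 0 remains { NULL, NULL }. */
static am_callback_t btl_registry[256];

static void handle_incoming(frag_t *frag)
{
    am_callback_t *reg = &btl_registry[frag->tag];
    if (reg->cbfunc == NULL) {
        /* A corrupted header (tag == 0) ends up here; without this check
           the call below would jump to address 0x0 and segfault. */
        fprintf(stderr, "no callback registered for tag %u\n", (unsigned) frag->tag);
        return;
    }
    reg->cbfunc(frag, reg->cbdata);
}

int main(void)
{
    frag_t bad = { 0 };      /* tag zeroed, as in the reported crash */
    handle_incoming(&bad);   /* prints the diagnostic instead of crashing */
    return 0;
}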
Hi Jeff,
Thanks for your reply.
I did run our application through valgrind but it couldn't find any
"Invalid write": there is a bunch of "Invalid read" (I'm using 1.4.2
with the suppression file), "Use of uninitialized bytes" and
"Conditional jump depending on uninitialized bytes" in
Sorry for the delay in replying.
Odd; the value of the callback function pointer should never be 0. This seems
to suggest some kind of memory corruption is occurring.
I don't know if it's possible, because the stack trace looks like you're
calling through python, but can you run this
Hi,
sorry, i just forgot to add the values of the function parameters:
(gdb) print reg->cbdata
$1 = (void *) 0x0
(gdb) print openib_btl->super
$2 = {btl_component = 0x2b341edd7380, btl_eager_limit = 12288,
btl_rndv_eager_limit = 12288, btl_max_send_size = 65536,
btl_rdma_pipeline_send_length =
Hi,
Here is the output of a core file generated during a segmentation fault
observed during a collective call (using openib):
#0 0x in ?? ()
(gdb) where
#0 0x in ?? ()
#1 0x2aedbc4e05f4 in btl_openib_handle_incoming (openib_btl=0x1902f9b0,
ep=0x1908a1c0,
Hi Edgar,
The only difference I could observe was that the segmentation fault sometimes appeared later during the parallel computation.
I'm running out of ideas here. I wish I could use "--mca coll tuned" with "--mca btl self,sm,tcp" so that I could check that the issue is not somehow limited
On 7/15/2010 10:18 AM, Eloi Gaudry wrote:
hi edgar,
thanks for the tips, I'm gonna try this option as well. the segmentation fault I'm observing always happens during a collective communication indeed...
it basically switches all collective communications to the basic mode, right?
sorry for my ignorance, but what's a NCA?
thanks,
you could try first to use the algorithms in the basic module, e.g.
mpirun -np x --mca coll basic ./mytest
and see whether this makes a difference. I used to sometimes observe a
(similar?) problem in the openib btl triggered from the tuned
collective component, in cases where the ofed libraries
hi Rolf,
unfortunately, i couldn't get rid of that annoying segmentation fault when
selecting another bcast algorithm.
i'm now going to replace MPI_Bcast with a naive implementation (using MPI_Send
and MPI_Recv) and see if that helps.
regards,
éloi
On Wednesday 14 July 2010 10:59:53 Eloi
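For reference, a naive linear broadcast of the kind described above could look like the following sketch (illustrative only, not Eloi's actual replacement code):
/* Sketch of a naive linear broadcast built on MPI_Send/MPI_Recv, as a
   stand-in for MPI_Bcast while debugging. Illustrative only. */
#include <mpi.h>

static int naive_bcast(void *buf, int count, MPI_Datatype type,
                       int root, MPI_Comm comm)
{
    int rank, size, err;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    if (rank == root) {
        /* Root sends the buffer to every other rank, one at a time. */
        for (int peer = 0; peer < size; ++peer) {
            if (peer == root) continue;
            err = MPI_Send(buf, count, type, peer, 0, comm);
            if (err != MPI_SUCCESS) return err;
        }
    } else {
        /* Everyone else receives directly from the root. */
        err = MPI_Recv(buf, count, type, root, 0, comm, MPI_STATUS_IGNORE);
        if (err != MPI_SUCCESS) return err;
    }
    return MPI_SUCCESS;
}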
Hi Rolf,
thanks for your input. You're right, I missed the coll_tuned_use_dynamic_rules option.
I'll check whether the segmentation fault disappears when using the basic linear bcast algorithm with the command line you provided.
Regards,
Eloi
On Tuesday 13 July 2010 20:39:59 Rolf
Hi Eloi:
To select the different bcast algorithms, you need to add an extra mca
parameter that tells the library to use dynamic selection.
--mca coll_tuned_use_dynamic_rules 1
One way to make sure you are typing this in correctly is to use it with
ompi_info. Do the following:
ompi_info -mca
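Putting the two parameters together on an mpirun command line would look roughly like this (process count and executable are placeholders, not from the thread):
mpirun -np 16 --mca coll_tuned_use_dynamic_rules 1 --mca coll_tuned_bcast_algorithm 1 ./myapp
Here coll_tuned_bcast_algorithm 1 selects the basic linear broadcast, as Eloi notes below.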
Hi,
I've found that "--mca coll_tuned_bcast_algorithm 1" allowed me to switch to the basic linear algorithm.
Anyway, whatever the algorithm used, the segmentation fault remains.
Could anyone give some advice on ways to diagnose the issue I'm facing?
Regards,
Eloi
On Monday 12 July 2010
Hi,
I'm focusing on the MPI_Bcast routine that seems to randomly segfault when
using the openib btl.
I'd like to know if there is any way to make OpenMPI switch to a different
algorithm than the default one being selected for MPI_Bcast.
Thanks for your help,
Eloi
On Friday 02 July 2010
Hi,
I'm observing a random segmentation fault during an internode parallel
computation involving the openib btl and OpenMPI-1.4.2 (the same issue
can be observed with OpenMPI-1.3.3).
mpirun (Open MPI) 1.4.2
Report bugs to http://www.open-mpi.org/community/help/
[pbn08:02624] *** Process