Re: [OMPI users] CentOS 6.3 & OpenMPI 1.6.3

2012-12-06 Thread Yevgeny Kliteynik
Joseph, Indeed, there was a problem in the MXM rpm. The fixed MXM has been published at the same location: http://mellanox.com/downloads/hpc/mxm/v1.1/mxm-latest.tar -- YK On 12/4/2012 9:20 AM, Joseph Farran wrote: > Hi Mike. > > Removed the old mxm, downloaded and installed: > > /tmp/mxm/v

Re: [OMPI users] OpenMPI 1.6.3 and Memory Issues

2012-11-29 Thread Yevgeny Kliteynik
You can also set these parameters in /etc/modprobe.conf: options mlx4_core log_num_mtt=24 log_mtts_per_seg=1 -- YK On 11/30/2012 2:12 AM, Yevgeny Kliteynik wrote: > On 11/30/2012 12:47 AM, Joseph Farran wrote: >> I'll assume: /etc/modprobe.d/mlx4_en.conf > > Add thes

Re: [OMPI users] OpenMPI 1.6.3 and Memory Issues

2012-11-29 Thread Yevgeny Kliteynik
On 11/30/2012 12:47 AM, Joseph Farran wrote: > I'll assume: /etc/modprobe.d/mlx4_en.conf Add these to /etc/modprobe.d/mofed.conf: options mlx4_core log_num_mtt=24 options mlx4_core log_mtts_per_seg=1 And then restart the driver. You need to do it on all the machines. -- YK > > On 11/29/2012 0
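For reference, a minimal sketch of what the resulting file and driver restart would look like (the restart command is an assumption; on MLNX_OFED installs the init script is typically openibd, and the exact path may vary by distribution):

  # /etc/modprobe.d/mofed.conf
  # Raise the number of MTT entries so more memory can be registered for RDMA
  options mlx4_core log_num_mtt=24
  options mlx4_core log_mtts_per_seg=1

  # Reload the mlx4 driver on every node to pick up the new values
  /etc/init.d/openibd restart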

Re: [OMPI users] CentOS 6.3 & OpenMPI 1.6.3

2012-11-29 Thread Yevgeny Kliteynik
Joseph, On 11/29/2012 11:50 PM, Joseph Farran wrote: > make[2]: Entering directory > `/data/apps/sources/openmpi-1.6.3/ompi/mca/mtl/mxm' > CC mtl_mxm.lo > CC mtl_mxm_cancel.lo > CC mtl_mxm_component.lo > CC mtl_mxm_endpoint.lo > CC mtl_mxm_probe.lo > CC mtl_mxm_recv.lo > CC mtl_mxm_send.lo > CCLD

Re: [OMPI users] CentOS 6.3 & OpenMPI 1.6.3

2012-11-28 Thread Yevgeny Kliteynik
On 11/28/2012 10:52 AM, Pavel Mezentsev wrote: > You can try downloading and installing a fresher version of MXM from mellanox > web site. There was a thread on the list with the same problem, you can > search for it. Indeed, that OFED version comes with an older version of MXM. You can get the new

Re: [OMPI users] Infiniband performance Problem and stalling

2012-09-10 Thread Yevgeny Kliteynik

Re: [OMPI users] Infiniband performance Problem and stalling

2012-09-09 Thread Yevgeny Kliteynik
Randolph, On 9/7/2012 7:43 AM, Randolph Pullen wrote: > Yevgeny, > The ibstat results: > CA 'mthca0' > CA type: MT25208 (MT23108 compat mode) What you have is an InfiniHost III HCA, which is a 4x SDR card. This card has a theoretical peak of 10 Gb/s, which is 1 GB/s after IB bit coding. > And more interest

Re: [OMPI users] Infiniband performance Problem and stalling

2012-09-06 Thread Yevgeny Kliteynik

Re: [OMPI users] OMPI 1.6.x Hang on khugepaged 100% CPU time

2012-09-05 Thread Yevgeny Kliteynik
On 9/4/2012 7:21 PM, Yong Qin wrote: > On Tue, Sep 4, 2012 at 5:42 AM, Yevgeny Kliteynik > wrote: >> On 8/30/2012 10:28 PM, Yong Qin wrote: >>> On Thu, Aug 30, 2012 at 5:12 AM, Jeff Squyres wrote: >>>> On Aug 29, 2012, at 2:25 PM, Yong Qin wrote: >>>>

Re: [OMPI users] OMPI 1.6.x Hang on khugepaged 100% CPU time

2012-09-04 Thread Yevgeny Kliteynik
On 8/30/2012 10:28 PM, Yong Qin wrote: > On Thu, Aug 30, 2012 at 5:12 AM, Jeff Squyres wrote: >> On Aug 29, 2012, at 2:25 PM, Yong Qin wrote: >> >>> This issue has been observed on OMPI 1.6 and 1.6.1 with openib btl but >>> not on 1.4.5 (tcp btl is always fine). The application is VASP and >>> onl

Re: [OMPI users] Infiniband performance Problem and stalling

2012-09-02 Thread Yevgeny Kliteynik
Randolph, Some clarification on the setup: "Melanox III HCA 10G cards" - are those ConnectX 3 cards configured to Ethernet? That is, when you're using openib BTL, you mean RoCE, right? Also, have you had a chance to try some newer OMPI release? Any 1.6.x would do. -- YK On 8/31/2012 10:53 AM,

Re: [OMPI users] InfiniBand path migration not working

2012-03-11 Thread Yevgeny Kliteynik
Hi, I just noticed that my previous mail bounced, but it doesn't matter. Please ignore it if you got it anyway - I re-read the thread and there is a much simpler way to do it. If you want to check whether LID L is reachable through HCA H from port P, you can run this command: smpquery --Ca H
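A sketch of how such a check might look with the standard infiniband-diags tool (the HCA name, port number, LID, and the nodeinfo operation are illustrative placeholders, not values from this thread):

  # Query NodeInfo of LID 23 through local HCA mlx4_0, port 1;
  # if the LID is not reachable via that port, the query fails or times out
  smpquery --Ca mlx4_0 --Port 1 nodeinfo 23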

Re: [OMPI users] openib btl and MPI_THREAD_MULTIPLE

2012-01-25 Thread Yevgeny Kliteynik
On 24-Jan-12 5:59 PM, Ronald Heerema wrote: > I was wondering if anyone can comment on the current state of support for the > openib btl when MPI_THREAD_MULTIPLE is enabled. Short version - it's not supported. Longer version - no one really spent time on testing it and fixing all the places where

Re: [OMPI users] IB Memory Requirements, adjusting for reduced memory consumption

2012-01-15 Thread Yevgeny Kliteynik
On 13-Jan-12 12:23 AM, Nathan Hjelm wrote: > I would start by adjusting btl_openib_receive_queues . The default uses > a per-peer QP which can eat up a lot of memory. I recommend using no > per-peer and several shared receive queues. > We use S,4096,1024:S,12288,512:S,65536,512 And here's the FAQ
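A sketch of passing that setting on the command line (the BTL list and process count are assumptions; the queue specification is the one quoted above):

  mpirun --mca btl openib,sm,self \
         --mca btl_openib_receive_queues S,4096,1024:S,12288,512:S,65536,512 \
         -np 64 ./my_app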

Re: [OMPI users] Problems when running open-MPI on OFED

2011-12-29 Thread Yevgeny Kliteynik
Hi, Does OMPI with IMB work OK on the official OFED release? Do the usual ibv performance tests (ibv_rc_*) work on your customized OFED? -- YK On 29-Dec-11 9:34 AM, Venkateswara Rao Dokku wrote: > Hi, > We tried running the Intel Benchmarks(IMB_3.2) on the customized > OFED(that was build
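The verbs-level test mentioned above is typically run as a server/client pair, for example (the hostname and device name are placeholders):

  # on one node (server side)
  ibv_rc_pingpong -d mlx4_0
  # on a second node, pointing at the first
  ibv_rc_pingpong -d mlx4_0 node01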

Re: [OMPI users] Error launching w/ 1.5.3 on IB mthca nodes

2011-12-19 Thread Yevgeny Kliteynik
Hi, > By any chance is it a particular node (or pair of nodes) this seems to > happen with? No. I've got 40 nodes total with this hardware configuration, and the problem has been seen on most/all nodes at one time or another. It doesn't seem, based on the limited numb

Re: [OMPI users] Error launching w/ 1.5.3 on IB mthca nodes

2011-12-18 Thread Yevgeny Kliteynik
On 16-Dec-11 4:28 AM, Jeff Squyres wrote: > Very strange. I have a lot of older mthca-based HCAs in my Cisco MPI test > cluster, and I don't see these kinds of problems. > > Mellanox -- any ideas? So if I understand it right, you have a mixed cluster - some machines with ConnectX family HCAs (ml

Re: [OMPI users] problem running with RoCE over 10GbE

2011-10-06 Thread Yevgeny Kliteynik
On 05-Oct-11 3:41 PM, Jeff Squyres wrote: > On Oct 5, 2011, at 9:35 AM, Yevgeny Kliteynik wrote: > >>> Yevgeny -- can you check that out? >> >> Yep, indeed - configure doesn't abort when "--enable-openib-rdmacm" >> is provided and "rdma/rdma

Re: [OMPI users] problem running with RoCE over 10GbE

2011-10-05 Thread Yevgeny Kliteynik
On 05-Oct-11 3:15 PM, Jeff Squyres wrote: >> You shouldn't use the "--enable-openib-rdmacm" option - rdmacm >> support is enabled by default, providing librdmacm is found on >> the machine. > > Actually, this might be a configure bug. We have lots of other configure > options that, even if "foo"

Re: [OMPI users] problem running with RoCE over 10GbE

2011-10-05 Thread Yevgeny Kliteynik
Jeff, On 01-Oct-11 1:01 AM, Konz, Jeffrey (SSA Solution Centers) wrote: > Encountered a problem when trying to run OpenMPI 1.5.4 with RoCE over 10GbE > fabric. > > Got this run time error: > > An invalid CPC name was specified via the btl_openib_cpc_include MCA > parameter. > >Local host:
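For RoCE the connection setup is normally done through the rdmacm CPC (the usual choice per the Open MPI FAQ), so the parameter in that error message would typically be set along these lines (the BTL list and process count are assumptions):

  mpirun --mca btl openib,sm,self \
         --mca btl_openib_cpc_include rdmacm \
         -np 16 ./my_app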

Re: [OMPI users] RE : RE : Latency of 250 microseconds with Open-MPI 1.4.3, Mellanox Infiniband and 256 MPI ranks

2011-09-26 Thread Yevgeny Kliteynik
On 26-Sep-11 11:27 AM, Yevgeny Kliteynik wrote: > On 22-Sep-11 12:09 AM, Jeff Squyres wrote: >> On Sep 21, 2011, at 4:24 PM, Sébastien Boisvert wrote: >> >>>> What happens if you run 2 ibv_rc_pingpong's on each node? Or N >>>> ibv_rc_pingpongs? >

Re: [OMPI users] RE : RE : Latency of 250 microseconds with Open-MPI 1.4.3, Mellanox Infiniband and 256 MPI ranks

2011-09-26 Thread Yevgeny Kliteynik
On 22-Sep-11 12:09 AM, Jeff Squyres wrote: > On Sep 21, 2011, at 4:24 PM, Sébastien Boisvert wrote: > >>> What happens if you run 2 ibv_rc_pingpong's on each node? Or N >>> ibv_rc_pingpongs? >> >> With 11 ibv_rc_pingpong's >> >> http://pastebin.com/85sPcA47 >> >> Code to do that => https://gist

Re: [OMPI users] Latency of 250 microseconds with Open-MPI 1.4.3, Mellanox Infiniband and 256 MPI ranks

2011-09-20 Thread Yevgeny Kliteynik
Hi Sébastien, If I understand you correctly, you are running your application on two different MPIs on two different clusters with two different IB vendors. Could you make a comparison more "apples to apples"-ish? For instance: - run the same version of Open MPI on both clusters - run the same

Re: [OMPI users] OpenIB error messages: reporting the default or telling you what's happening?

2011-09-19 Thread Yevgeny Kliteynik
On 14-Sep-11 12:59 PM, Jeff Squyres wrote: > On Sep 13, 2011, at 6:33 PM, kevin.buck...@ecs.vuw.ac.nz wrote: > >> there have been two runs of jobs that invoked the mpirun using these >> OpenMPI parameter setting flags (basically, these mimic what I have >> in the global config file) >> >> -mca btl

Re: [OMPI users] Infiniband Error

2011-09-12 Thread Yevgeny Kliteynik
This means that you have some problem on that node, and it's probably unrelated to Open MPI. Bad cable? Bad port? FW/driver in some bad state? Do other IB performance tests work OK on this node? Try rebooting the node. -- YK On 12-Sep-11 7:52 AM, Ahsan Ali wrote: > Hello all > > I am getting fol

Re: [OMPI users] btl_openib_ipaddr_include broken in 1.4.4rc2?

2011-09-04 Thread Yevgeny Kliteynik
On 30-Aug-11 4:50 PM, Michael Shuey wrote: > I'm using RoCE (or rather, attempting to) and need to select a > non-default GID to get my traffic properly classified. You probably saw it, but just making sure: http://www.open-mpi.org/faq/?category=openfabrics#ompi-over-roce > Both 1.4.4rc2 > and 1.
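A sketch of selecting the GID by its associated IP subnet, in the spirit of the FAQ entry linked above (the subnet, BTL list, and process count are placeholders):

  mpirun --mca btl openib,sm,self \
         --mca btl_openib_ipaddr_include "192.168.1.0/24" \
         -np 16 ./my_app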

Re: [OMPI users] ConnectX with InfiniHost IB HCAs

2011-08-27 Thread Yevgeny Kliteynik
Egor, If updating OFED doesn't solve the problem (and I kinda have the feeling that it does), you might want to try this mailing list for IB interoperability questions: linux-r...@vger.kernel.org -- YK On 26-Aug-11 4:42 PM, Shamis, Pavel wrote: > You may try to update your OFED version. I think

Re: [OMPI users] Open MPI 1.4: [connect/btl_openib_connect_oob.c:463:qp_create_one] error creating qp errno says Cannot allocate memory

2011-08-02 Thread Yevgeny Kliteynik
Application Performance Tools Group > Computer Science and Math Division > Oak Ridge National Laboratory > > > > > > > On Aug 1, 2011, at 11:41 AM, Yevgeny Kliteynik wrote: > >> Hi, >> >> Please try running OMPI with XRC: >> >> m

Re: [OMPI users] Open MPI 1.4: [connect/btl_openib_connect_oob.c:463:qp_create_one] error creating qp errno says Cannot allocate memory

2011-08-01 Thread Yevgeny Kliteynik
Hi, Please try running OMPI with XRC: mpirun --mca btl openib... --mca btl_openib_receive_queues X,128,256,192,128:X,2048,256,128,32:X,12288,256,128,32:X,65536,256,128,32 ... XRC (eXtended Reliable Connection) decreases the memory consumption of Open MPI by decreasing the number of QPs per machine. I

Re: [OMPI users] InfiniBand, different OpenFabrics transport types

2011-07-14 Thread Yevgeny Kliteynik
On 11-Jul-11 5:23 PM, Bill Johnstone wrote: > Hi Yevgeny and list, > > - Original Message - > >> From: Yevgeny Kliteynik > >> I'll check the MCA_BTL_OPENIB_TRANSPORT_UNKNOWN thing and get back to you. > > Thank you. That's interesting... T

Re: [OMPI users] Error-Open MPI over Infiniband: polling LP CQ with status LOCAL LENGTH ERROR

2011-07-10 Thread Yevgeny Kliteynik
Hi Yiguang, On 08-Jul-11 4:38 PM, ya...@adina.com wrote: > Hi all, > > The message says : > > [[17549,1],0][btl_openib_component.c:3224:handle_wc] from > gulftown to: gulftown error polling LP CQ with status LOCAL > LENGTH ERROR status number 1 for wr_id 492359816 opcode > 32767 vendor error 10

Re: [OMPI users] InfiniBand, different OpenFabrics transport types

2011-07-10 Thread Yevgeny Kliteynik
Hi Bill, On 08-Jul-11 7:59 PM, Bill Johnstone wrote: > Hello, and thanks for the reply. > > > > - Original Message - >> From: Jeff Squyres >> Sent: Thursday, July 7, 2011 5:14 PM >> Subject: Re: [OMPI users] InfiniBand, different OpenFabrics transport types >> >> On Jun 28, 2011, at 1:4

Re: [OMPI users] Fwd: gadget2 infiniband openmpi hang

2011-05-29 Thread Yevgeny Kliteynik
Gretchen, Could you please send a stack trace of the processes (with padb/gdb) when the application hangs? Does the same problem persist at small scale (2-3 nodes)? What is the minimal setup that reproduces the problem? -- YK > > -- Forwarded message -- > From: *Gretchen* mailto:umassastroh..

Re: [OMPI users] alltoall messages > 2^26

2011-05-29 Thread Yevgeny Kliteynik
Michael, Could you try to run this again with the "--mca mpi_leave_pinned 0" parameter? I suspect that this might be due to a message size problem - MPI tries to do RDMA with a message bigger than what the HCA supports. -- YK On 11-Apr-11 7:44 PM, Michael Di Domenico wrote: > Here's a chunk of code that

Re: [OMPI users] printf and scanf problem of C code compiled with Open MPI

2011-03-31 Thread Yevgeny Kliteynik
You can explicitly specify the type of buffering that you want to get with the setvbuf() C function. It can be block-buffered, line-buffered, or unbuffered. Stdout is line-buffered by default. To make it unbuffered, you need something like this: setvbuf(stdout, NULL, _IONBF, 0) -- YK On 30-Mar-11
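A minimal, self-contained C sketch of that suggestion (the prompt/echo logic is illustrative only):

  #include <stdio.h>

  int main(void)
  {
      /* Make stdout unbuffered so output appears immediately,
         even when it is redirected through mpirun */
      setvbuf(stdout, NULL, _IONBF, 0);

      int n;
      printf("Enter a number: ");
      if (scanf("%d", &n) == 1)
          printf("Got %d\n", n);
      return 0;
  }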