[OMPI users] OpenIB problems

2007-11-21 Thread Brock Palen
We have a user whose code keeps failing at a similar point in the code. The errors (below) would make me think it's a fabric problem, but ibcheckerrors is not returning any issues. He is using openmpi-1.2.0 with OFED on RHEL4. Far field AIM propagators require(MB):1.441955566406250 Arra

[OMPI users] openib problems

2008-01-10 Thread Brock Palen
We just updated RHEL4 a few days back, and now we get the following errors when trying to run on InfiniBand nodes with openmpi-1.2.3 and openmpi-1.2.0: [0,1,1]: OpenIB on host nyx397 was unable to find any HCAs. Another transport will be used instead, although this may result in lower performan

Re: [OMPI users] OpenIB problems

2007-11-21 Thread Brian Dobbins
Hi Brock, We have a user whose code keeps failing at a similar point in the code. The errors (below) would make me think it's a fabric problem, but ibcheckerrors is not returning any issues. He is using openmpi-1.2.0 with OFED on RHEL4. Strangely enough, I hit this exact problem about ha

Re: [OMPI users] OpenIB problems

2007-11-21 Thread Andrew Friedley
If this is what I think it is, try using this MCA parameter: -mca btl_openib_ib_timeout 20 If this fixes it, then it's an issue in the IB fabric itself, though I don't fully understand what's going on. Someone else might be able to explain in more detail. Andrew Brian Dobbins wrote: Hi Brock
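
For reference, a minimal sketch of two ways the suggested parameter might be set. The application name `./my_app` and the process count are placeholders that do not come from the thread. The second form assumes Open MPI's convention of reading MCA parameters from environment variables prefixed with `OMPI_MCA_`.

```shell
# Option 1: pass the parameter on the mpirun command line.
# (./my_app and -np 16 are hypothetical placeholders.)
#   mpirun -np 16 -mca btl_openib_ib_timeout 20 ./my_app

# Option 2: export it so that later mpirun invocations in this shell see it.
export OMPI_MCA_btl_openib_ib_timeout=20
echo "btl_openib_ib_timeout=${OMPI_MCA_btl_openib_ib_timeout}"
```

The command-line form affects a single run, which is probably the safer choice while the underlying fabric issue is still being investigated.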

Re: [OMPI users] OpenIB problems

2007-11-21 Thread Brock Palen
Hi Brock, We have a user whose code keeps failing at a similar point in the code. The errors (below) would make me think it's a fabric problem, but ibcheckerrors is not returning any issues. He is using openmpi-1.2.0 with OFED on RHEL4. Strangely enough, I hit this exact problem about half an

Re: [OMPI users] OpenIB problems

2007-11-21 Thread Brock Palen
Thanks. We have asked the user to try that and let us know if it fails. I will let the list know if this works. Brock Palen Center for Advanced Computing bro...@umich.edu (734)936-1985 On Nov 21, 2007, at 3:39 PM, Andrew Friedley wrote: If this is what I think it is, try using this MCA par

Re: [OMPI users] OpenIB problems

2007-11-21 Thread Brian Dobbins
Hi Andrew, Brock, and everyone else, Andrew Friedley wrote: If this is what I think it is, try using this MCA parameter: -mca btl_openib_ib_timeout 20 Just FYI, in addition to the above, I retried using the gigabit links ('--mca btl tcp,self', right?) and that failed too, so at least in /m
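
As a sketch of the test described above, restricting a run to the TCP and self transports might look like the following. The application name and process count are hypothetical, and the environment-variable form assumes the `OMPI_MCA_` prefix convention.

```shell
# Restrict Open MPI to the TCP and self (loopback) BTLs, so openib is not used.
# (./my_app and -np 16 are hypothetical placeholders.)
#   mpirun -np 16 --mca btl tcp,self ./my_app

# Equivalent environment-variable form for the current shell:
export OMPI_MCA_btl=tcp,self
echo "btl=${OMPI_MCA_btl}"
```

If a job fails the same way over TCP, the InfiniBand fabric is a less likely culprit for that particular failure.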

Re: [OMPI users] OpenIB problems

2007-11-27 Thread Brock Palen
On Nov 21, 2007, at 3:39 PM, Andrew Friedley wrote: If this is what I think it is, try using this MCA parameter: -mca btl_openib_ib_timeout 20 The user used this option and it allowed the run to complete. You say it's an issue with the fabric; ibshowerrors does not show any problems. Its to

Re: [OMPI users] OpenIB problems

2007-11-27 Thread Andrew Friedley
Brock Palen wrote: On Nov 21, 2007, at 3:39 PM, Andrew Friedley wrote: If this is what I think it is, try using this MCA parameter: -mca btl_openib_ib_timeout 20 The user used this option and it allowed the run to complete. You say it's an issue with the fabric; ibshowerrors does not show any

Re: [OMPI users] OpenIB problems

2007-11-27 Thread Brock Palen
On Nov 27, 2007, at 10:49 AM, Andrew Friedley wrote: Brock Palen wrote: On Nov 21, 2007, at 3:39 PM, Andrew Friedley wrote: If this is what I think it is, try using this MCA parameter: -mca btl_openib_ib_timeout 20 The user used this option and it allowed the run to complete. You say it's

Re: [OMPI users] OpenIB problems

2007-11-27 Thread Andrew Friedley
Brock Palen wrote: On Nov 27, 2007, at 10:49 AM, Andrew Friedley wrote: Brock Palen wrote: On Nov 21, 2007, at 3:39 PM, Andrew Friedley wrote: If this is what I think it is, try using this MCA parameter: -mca btl_openib_ib_timeout 20 The user used this option and it allowed the run to com

Re: [OMPI users] OpenIB problems

2007-11-27 Thread Brock Palen
What would be a place to look? Should this just be the default then for OMPI? ompi_info shows the default as 10 seconds. Is that right, 'seconds'? The other IB guys can probably answer better than I can -- I'm not an expert in this part of IB (or really any part I guess :). Not sure why a

Re: [OMPI users] OpenIB problems

2007-11-27 Thread Andrew Friedley
Brock Palen wrote: What would be a place to look? Should this just be the default then for OMPI? ompi_info shows the default as 10 seconds. Is that right, 'seconds'? The other IB guys can probably answer better than I can -- I'm not an expert in this part of IB (or really any part I guess :).

Re: [OMPI users] OpenIB problems

2007-11-27 Thread Brock Palen
OK, I will open a case with Cisco. Brock Palen Center for Advanced Computing bro...@umich.edu (734)936-1985 On Nov 27, 2007, at 4:19 PM, Andrew Friedley wrote: Brock Palen wrote: What would be a place to look? Should this just be default then for OMPI? ompi_info shows the default as 10

Re: [OMPI users] OpenIB problems

2007-11-27 Thread Jeff Squyres
Sorry for jumping in late; the holiday and other travel prevented me from getting to all my mail recently... :-\ Have you checked the counters on the subnet manager to see if any other errors are occurring? It might be good to clear all the counters, run the job, and see if the counters a

Re: [OMPI users] OpenIB problems

2007-11-27 Thread Jeff Squyres
BTW, Andrew is correct about the unit for btl_openib_ib_timeout and that the value is simply passed down to the verbs library when making an IB connection. Open MPI does nothing else with that value; it's an IBTA-defined value. The help message was wrong on the 1.2 branch for a while; I th
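
To make the unit concrete, here is a small sketch under the assumption that the value is the IBTA local ACK timeout exponent, meaning the actual timeout is 4.096 microseconds multiplied by 2 raised to the configured value. The function name `ib_ack_timeout_us` is made up for illustration and is not part of Open MPI or OFED.

```shell
# Convert a btl_openib_ib_timeout exponent into microseconds,
# assuming timeout = 4.096 us * 2^value (IBTA local ACK timeout).
ib_ack_timeout_us() {
    awk -v n="$1" 'BEGIN { printf "%.3f\n", 4.096 * 2^n }'
}

# Values mentioned in this thread: the default (10), 14, 16, and 20.
for v in 10 14 16 20; do
    printf '%s -> %s us\n' "$v" "$(ib_ack_timeout_us "$v")"
done
```

Under that assumption the default of 10 is roughly 4.2 milliseconds and a value of 20 is roughly 4.3 seconds, so each step of one doubles the wait.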

Re: [OMPI users] OpenIB problems

2007-11-28 Thread Jeff Squyres
Roland thought that the default value of 10 might be a bit too low and that tuning it to be higher, particularly in apps that pound on a single port, would probably be acceptable. Tuning up to 20 is probably a bit overkill. On Nov 27, 2007, at 3:54 PM, Jeff Squyres wrote: BTW, Andrew is c

Re: [OMPI users] OpenIB problems

2007-11-28 Thread Andrew Friedley
What value do you suggest then? I know I've seen the problem persist at values of 14 and 16, and would rather be certain that this isn't going to kill the job that just sat in the queue for a week. Andrew Jeff Squyres wrote: Roland thought that the default value of 10 might be a bit too low a

Re: [OMPI users] OpenIB problems

2007-11-28 Thread Ogden, Jeffry Brandon
On Wednesday, November 28, 2007, at 9:36 AM, Andrew Friedley wrote: What value do you suggest then? I know I've seen th

Re: [OMPI users] OpenIB problems

2007-11-28 Thread Brock Palen
Jeff, thanks for all the replies. Hate to admit it, but at the moment we can't log onto the switch. But the ibcheckerrors command returns nothing out of bounds, and I think that command also checks the switch ports. Thanks, we will do some tests. Brock Palen Center for Advanced Computing bro...@u

Re: [OMPI users] OpenIB problems

2007-11-29 Thread Neeraj Chourasia
Hi Guys, An alternative for the THREAD_MULTIPLE problem is to pass the option --mca mpi_leave_pinned 1 to mpirun. This will ensure a single RDMA operation, as opposed to splitting the data into chunks of the maximum RDMA size (default 1 MB). If your data size is small, say below 1 MB, the program will run well with THREAD_MULTIPLE. P
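
A small dry-run sketch of how such an option might be assembled into an mpirun command line. The helper `mpirun_cmd` and the application name `./my_app` are hypothetical. The helper only prints the command and does not launch anything.

```shell
# Print (do not run) an mpirun command line built from name=value MCA pairs.
# Usage: mpirun_cmd <app> <name=value>...
mpirun_cmd() {
    app=$1
    shift
    printf 'mpirun'
    for kv in "$@"; do
        # Split each pair at the first '=' into parameter name and value.
        printf ' --mca %s %s' "${kv%%=*}" "${kv#*=}"
    done
    printf ' %s\n' "$app"
}

mpirun_cmd ./my_app mpi_leave_pinned=1
# prints: mpirun --mca mpi_leave_pinned 1 ./my_app
```

Printing the command first makes it easy to check the exact options before submitting a long-queued job.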

Re: [OMPI users] openib problems

2008-01-10 Thread Jeff Squyres
This can mean that you have a user-level libibverbs and kernel mismatch. Do any of the OFED sample programs work properly, or perhaps the ibv_devinfo program? (ibv_devinfo should query the HCAs on your host and list the status of all the ports) On Jan 10, 2008, at 2:33 PM, Brock Palen wr
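
A rough sketch of the check suggested above: run ibv_devinfo and count how many ports it reports as active. It assumes that ibv_devinfo prints a `PORT_ACTIVE` state for each active port. The helper name `count_active_ports` is made up for illustration.

```shell
# Count ports reported as active in ibv_devinfo-style text read from stdin.
# grep -c exits non-zero when the count is 0, so "|| true" keeps this quiet.
count_active_ports() {
    grep -c 'PORT_ACTIVE' || true
}

if command -v ibv_devinfo >/dev/null 2>&1; then
    ibv_devinfo | count_active_ports
else
    echo "ibv_devinfo not found in PATH" >&2
fi
```

A count of zero on a node that should have a working HCA would be consistent with the library and kernel mismatch described above, although other causes are possible.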

Re: [OMPI users] openib problems

2008-01-12 Thread rahmani
Hi, add the following line to /etc/openmpi-mca-params.conf: btl=^openib - Original Message - From: "Jeff Squyres" To: "Open MPI Users" Sent: Friday, January 11, 2008 12:32:10 AM (GMT+0330) Asia/Tehran Subject: Re: [OMPI users] openib problems This can mean that y
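
A minimal sketch of applying that setting from a script. The helper `disable_openib_btl` is hypothetical, and the file path is taken as an argument because the location of the MCA parameter file depends on the installation. Note that `btl=^openib` excludes the openib transport altogether, so this silences the warning and does not restore InfiniBand.

```shell
# Append "btl=^openib" to an MCA parameter file unless a btl line already exists.
# An existing btl setting is left untouched and should be reviewed by hand.
disable_openib_btl() {
    conf=$1
    mkdir -p "$(dirname "$conf")"
    touch "$conf"
    grep -q '^btl[[:space:]]*=' "$conf" || echo 'btl=^openib' >> "$conf"
}

# Example (path as given in the message above; may need root privileges):
#   disable_openib_btl /etc/openmpi-mca-params.conf
```

The check for an existing line keeps the helper safe to run more than once.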