On Wed, Aug 25, 2010 at 12:14 PM, Jeff Squyres wrote:
> It would simplify testing if you could get all the eth0's to be of one type
> and on the same subnet, and the same for eth1.
>
> Once you do that, try using just one of the networks by telling OMPI to use
> only one of the devices, something like this:
>
> mpirun --mca btl_tcp_if_include eth0 ...

Thanks Jeff! Just tried the exact test that you suggested.

[rpnabar
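For reference, a hedged sketch of the full command line this kind of test amounts to (process count, host file and IMB binary path are placeholders; the second form assumes each node has only lo, eth0 and eth1, so excluding is equivalent to including, and lo is listed because setting an exclude list overrides the default):

    # restrict Open MPI's TCP traffic to the 10GigE port only
    mpirun -np 256 --hostfile ./hosts --mca btl tcp,sm,self \
           --mca btl_tcp_if_include eth0 ./IMB-MPI1 bcast

    # same thing from the other direction: exclude loopback and the 1GigE port
    mpirun -np 256 --hostfile ./hosts --mca btl tcp,sm,self \
           --mca btl_tcp_if_exclude lo,eth1 ./IMB-MPI1 bcast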
On Wed, Aug 25, 2010 at 6:41 AM, John Hearns wrote:
> You could sort that out with udev rules on each machine.
Sure. I'd always wanted consistent names for the eth interfaces when I
set up the cluster but I couldn't get udev to co-operate. Maybe this
time! Let me try.
> Look in the directory /et
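For what it's worth, a minimal sketch of the kind of rule John is pointing at (the MAC addresses are placeholders, one such file is needed per node, and the file name follows the usual persistent-net convention):

    # /etc/udev/rules.d/70-persistent-net.rules
    # 10GigE port
    SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:1b:21:aa:bb:cc", NAME="eth0"
    # 1GigE port
    SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:1e:68:11:22:33", NAME="eth1"

With MAC-based rules like these the names stay stable across reboots regardless of driver probe order.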
On Thu, Aug 19, 2010 at 9:03 PM, Rahul Nabar wrote:
> --
> gather:
> NP256 hangs
> NP128 hangs
> NP64 hangs
> NP32 OK
>
> Note: "gather" always hangs at the following line of the test:
> #bytes #repetitions
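The NP sweep above maps naturally onto IMB's -npmin option, so one job can reproduce the whole ladder (host file and binary path are placeholders; -npmin makes IMB run the benchmark at npmin, 2*npmin, ... up to the full -np):

    mpirun -np 256 --hostfile ./hosts ./IMB-MPI1 -npmin 32 gather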
On 24 August 2010 18:58, Rahul Nabar wrote:
> There are a few unusual things about the cluster. We are using a
> 10GigE ethernet fabric. Each node has dual eth adapters. One 1GigE and
> the other 10GigE. These are on separate subnets although the order of
> the eth interfaces is variable. i.e. 10G
On Tue, Aug 24, 2010 at 4:58 PM, Jeff Squyres wrote:
> Are all the eth0's on one subnet and all the eth2's on a different subnet?
>
> Or are all eth0's and eth2's all on the same subnet?
Thanks Jeff! Different subnets. All 10GigE's are on 192.168.x.x and
all 1GigE's are on 10.0.x.x
e.g.
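A hedged sketch of how one might double-check, node by node, which interface name actually carries which fabric (hostnames are placeholders; ethtool may need root on some systems):

    for h in node001 node002; do
        echo "== $h =="
        ssh $h 'ip -4 -o addr show; ethtool eth0 | grep Speed; ethtool eth1 | grep Speed'
    done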
On Mon, Aug 23, 2010 at 9:43 PM, Richard Treumann wrote:
> Bugs are always a possibility but unless there is something very unusual
> about the cluster and interconnect or this is an unstable version of MPI, it
> seems very unlikely this use of MPI_Bcast with so few tasks and only a 1/2
> MB messa

My MPI version is 1.4.1. This isn't the latest but still fairly
recent. So I assume th
On Mon, Aug 23, 2010 at 8:39 PM, Randolph Pullen
wrote:
>
> I have had a similar load related problem with Bcast.
Thanks Randolph! That's interesting to know! What was the hardware you
were using? Does your bcast fail at the exact same point too?
>
> I don't know what caused it though. With thi
On Mon, Aug 23, 2010 at 6:39 PM, Richard Treumann wrote:
> It is hard to imagine how a total data load of 41,943,040 bytes could be a
> problem. That is really not much data. By the time the BCAST is done, each
> task (except root) will have received a single half meg message from one
> sender. Th
Network saturation could produce arbitrarily long delays, but the total data load
we are talking about is really small. It is the responsibility of an MPI
library to do one of the following:
1) Use a reliable message protocol for each message (e.g. Infiniband RC or
TCP/IP)
2) detect lost packets and
On Sun, Aug 22, 2010 at 9:57 PM, Randolph Pullen <
randolph_pul...@yahoo.com.au> wrote:
> It's a long shot but could it be related to the total data volume?
> i.e. 524288 * 80 = 41943040 bytes active in the cluster
>
> Can you exceed this 41943040 data volume with a smaller message repeated
> more
Subject: [OMPI users] IMB-MPI broadcast test stalls for large core counts:
debug ideas?
To: "Open MPI Users"
Received: Friday, 20 August, 2010, 12:03 PM
My Intel IMB-MPI tests stall, but only in very specific cases: larger
packet sizes + large core counts. It only happens for the bcast, gather and
exchange tests, and only for the larger core counts (~256 cores). Other
tests like pingpong and sendrecv run fine even with larger core
counts.
e.g. This bcast tes
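On the debug-ideas front, one hedged suggestion is to rerun with Open MPI's BTL verbosity turned up so it logs which BTL components it selects and its TCP connection attempts (the level is arbitrary; host file and binary path are placeholders):

    mpirun -np 256 --hostfile ./hosts --mca btl_base_verbose 30 ./IMB-MPI1 bcast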