Re: [OMPI users] Using physical numbering in a rankfile

2012-02-02 Thread teng ma
I made a mistake in my previous reply. You can write it in either of these two ways:
rank 0=host1 slot=0
rank 1=host1 slot=2
rank 2=host1 slot=4
rank 3=host1 slot=6
rank 4=host1 slot=1
rank 5=host1 slot=3
rank 6=host1 slot=5
rank 7=host1 slot=7

or

rank 0=host1 slot=0:0
rank 1=host1 slot=0:1
rank 2=host1 slot=0:2
rank 3=host1 slot=0:3
rank 4=host1 slot=1:0
rank 5=host1 slot=1:1
rank 6=host1 slot=1:2
rank 7=host1 slot=1:3
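
To double-check where each rank actually lands, you can run a small test like
the sketch below (just one way to verify; it is Linux-specific and uses
sched_getaffinity, so adapt it as needed).  Each rank prints the OS processor
numbers (the P# values shown by lstopo) it is allowed to run on:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <mpi.h>

/* Each rank prints the physical PUs (OS processor numbers) it is bound to. */
int main(int argc, char **argv)
{
    cpu_set_t mask;
    char buf[256] = "";
    int rank, pu, off = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (sched_getaffinity(0, sizeof(mask), &mask) == 0) {
        for (pu = 0; pu < CPU_SETSIZE && off < (int)sizeof(buf) - 8; pu++)
            if (CPU_ISSET(pu, &mask))
                off += snprintf(buf + off, sizeof(buf) - off, " %d", pu);
        printf("rank %d bound to PU(s):%s\n", rank, buf);
    }

    MPI_Finalize();
    return 0;
}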

Teng


On Thu, Feb 2, 2012 at 12:17 PM, teng ma  wrote:

> Just remove the "p" in your rankfile, like:
>
> rank 0=host1 slot=0:0
> rank 1=host1 slot=0:2
> rank 2=host1 slot=0:4
> rank 3=host1 slot=0:6
> rank 4=host1 slot=1:1
> rank 5=host1 slot=1:3
> rank 6=host1 slot=1:5
> rank 7=host1 slot=1:7
>
> Teng
>
> 2012/2/2 François Tessier 
>
>>  Hello,
>>
>> I need to use a rankfile with Open MPI 1.5.4 to do some tests on a basic
>> architecture. I'm using a node for which lstopo returns this:
>>
>> 
>> Machine (24GB)
>>   NUMANode L#0 (P#0 12GB)
>> Socket L#0 + L3 L#0 (8192KB)
>>   L2 L#0 (256KB) + L1 L#0 (32KB) + Core L#0 + PU L#0 (P#0)
>>   L2 L#1 (256KB) + L1 L#1 (32KB) + Core L#1 + PU L#1 (P#2)
>>   L2 L#2 (256KB) + L1 L#2 (32KB) + Core L#2 + PU L#2 (P#4)
>>   L2 L#3 (256KB) + L1 L#3 (32KB) + Core L#3 + PU L#3 (P#6)
>> HostBridge L#0
>>   PCIBridge
>> PCI 8086:10c9
>>   Net L#0 "eth0"
>> PCI 8086:10c9
>>   Net L#1 "eth1"
>>   PCIBridge
>> PCI 15b3:673c
>>   Net L#2 "ib0"
>>   Net L#3 "ib1"
>>   OpenFabrics L#4 "mlx4_0"
>>   PCIBridge
>> PCI 102b:0522
>>   PCI 8086:3a22
>> Block L#5 "sda"
>> Block L#6 "sdb"
>> Block L#7 "sdc"
>> Block L#8 "sdd"
>>   NUMANode L#1 (P#1 12GB) + Socket L#1 + L3 L#1 (8192KB)
>> L2 L#4 (256KB) + L1 L#4 (32KB) + Core L#4 + PU L#4 (P#1)
>> L2 L#5 (256KB) + L1 L#5 (32KB) + Core L#5 + PU L#5 (P#3)
>> L2 L#6 (256KB) + L1 L#6 (32KB) + Core L#6 + PU L#6 (P#5)
>> L2 L#7 (256KB) + L1 L#7 (32KB) + Core L#7 + PU L#7 (P#7)
>> 
>>
>> And I would like to use the physical numbering. To do that, I created a
>> rankfile like this:
>>
>> rank 0=host1 slot=p0:0
>> rank 1=host1 slot=p0:2
>> rank 2=host1 slot=p0:4
>> rank 3=host1 slot=p0:6
>> rank 4=host1 slot=p1:1
>> rank 5=host1 slot=p1:3
>> rank 6=host1 slot=p1:5
>> rank 7=host1 slot=p1:7
>>
>> But when I run my job with "*mpiexec -np 8 --rankfile rankfile ./foo*",
>> I encounter this error:
>>
>> *Specified slot list: p0:4
>> Error: Not found
>>
>> This could mean that a non-existent processor was specified, or
>> that the specification had improper syntax.*
>>
>>
>> Do you know what I did wrong?
>>
>> Best regards,
>>
>> François
>>
>> --
>> ___
>> François TESSIER
>> PhD Student at University of Bordeaux
>> Tel : 0033.5.24.57.41.52    francois.tess...@inria.fr
>>
>>
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>
>
>
> --
> | Teng Ma  Univ. of Tennessee |
> | t...@cs.utk.edu    Knoxville, TN |
> | http://web.eecs.utk.edu/~tma/   |
>



-- 
| Teng Ma  Univ. of Tennessee |
| t...@cs.utk.edu    Knoxville, TN |
| http://web.eecs.utk.edu/~tma/   |


Re: [OMPI users] Using physical numbering in a rankfile

2012-02-02 Thread teng ma
Just remove the "p" in your rankfile, like:

rank 0=host1 slot=0:0
rank 1=host1 slot=0:2
rank 2=host1 slot=0:4
rank 3=host1 slot=0:6
rank 4=host1 slot=1:1
rank 5=host1 slot=1:3
rank 6=host1 slot=1:5
rank 7=host1 slot=1:7

Teng

2012/2/2 François Tessier 

>  Hello,
>
> I need to use a rankfile with Open MPI 1.5.4 to do some tests on a basic
> architecture. I'm using a node for which lstopo returns this:
>
> 
> Machine (24GB)
>   NUMANode L#0 (P#0 12GB)
> Socket L#0 + L3 L#0 (8192KB)
>   L2 L#0 (256KB) + L1 L#0 (32KB) + Core L#0 + PU L#0 (P#0)
>   L2 L#1 (256KB) + L1 L#1 (32KB) + Core L#1 + PU L#1 (P#2)
>   L2 L#2 (256KB) + L1 L#2 (32KB) + Core L#2 + PU L#2 (P#4)
>   L2 L#3 (256KB) + L1 L#3 (32KB) + Core L#3 + PU L#3 (P#6)
> HostBridge L#0
>   PCIBridge
> PCI 8086:10c9
>   Net L#0 "eth0"
> PCI 8086:10c9
>   Net L#1 "eth1"
>   PCIBridge
> PCI 15b3:673c
>   Net L#2 "ib0"
>   Net L#3 "ib1"
>   OpenFabrics L#4 "mlx4_0"
>   PCIBridge
> PCI 102b:0522
>   PCI 8086:3a22
> Block L#5 "sda"
> Block L#6 "sdb"
> Block L#7 "sdc"
> Block L#8 "sdd"
>   NUMANode L#1 (P#1 12GB) + Socket L#1 + L3 L#1 (8192KB)
> L2 L#4 (256KB) + L1 L#4 (32KB) + Core L#4 + PU L#4 (P#1)
> L2 L#5 (256KB) + L1 L#5 (32KB) + Core L#5 + PU L#5 (P#3)
> L2 L#6 (256KB) + L1 L#6 (32KB) + Core L#6 + PU L#6 (P#5)
> L2 L#7 (256KB) + L1 L#7 (32KB) + Core L#7 + PU L#7 (P#7)
> 
>
> And I would like to use the physical numbering. To do that, I created a
> rankfile like this:
>
> rank 0=host1 slot=p0:0
> rank 1=host1 slot=p0:2
> rank 2=host1 slot=p0:4
> rank 3=host1 slot=p0:6
> rank 4=host1 slot=p1:1
> rank 5=host1 slot=p1:3
> rank 6=host1 slot=p1:5
> rank 7=host1 slot=p1:7
>
> But when I run my job with "*mpiexec -np 8 --rankfile rankfile ./foo*", I
> encounter this error:
>
> *Specified slot list: p0:4
> Error: Not found
>
> This could mean that a non-existent processor was specified, or
> that the specification had improper syntax.*
>
>
> Do you know what I did wrong?
>
> Best regards,
>
> François
>
> --
> ___
> François TESSIER
> PhD Student at University of Bordeaux
> Tel : 0033.5.24.57.41.52    francois.tess...@inria.fr
>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>



-- 
| Teng Ma  Univ. of Tennessee |
| t...@cs.utk.edu    Knoxville, TN |
| http://web.eecs.utk.edu/~tma/   |


Re: [OMPI users] Strange TCP latency results on Amazon EC2

2012-01-12 Thread teng ma
Is it possible your EC2 cluster has another "unknown", crappy Ethernet
card (e.g. a 1 Gb card)?  For small messages, the traffic may take different
paths in NPtcp than it does in MPI over NPmpi.
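
If you want to rule out the benchmark tools themselves, a bare-bones MPI
ping-pong like the sketch below measures the small-message latency directly
(message size and repetition counts are arbitrary choices; run it with -np 2):

/* Minimal ping-pong: half of the average round-trip time for a small
 * message approximates the one-way latency. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    enum { MSG = 8, WARMUP = 100, REPS = 1000 };
    char buf[MSG] = {0};
    int rank, i;
    double t0 = 0.0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (i = 0; i < WARMUP + REPS; i++) {
        if (i == WARMUP) {              /* start timing after warm-up */
            MPI_Barrier(MPI_COMM_WORLD);
            t0 = MPI_Wtime();
        }
        if (rank == 0) {
            MPI_Send(buf, MSG, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, MSG, MPI_BYTE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, MSG, MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, MSG, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();

    if (rank == 0)
        printf("one-way latency ~ %.1f us\n", (t1 - t0) / REPS / 2.0 * 1e6);

    MPI_Finalize();
    return 0;
}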

Teng Ma

On Thu, Jan 12, 2012 at 10:28 AM, Roberto Rey  wrote:

> Thanks for your reply!
>
> I'm using the TCP BTL because I don't have any other option on Amazon with 10
> Gbit Ethernet.
>
> I also tried with MPICH2 1.4 and I got 60 microseconds... so I am very
> confused about it...
>
> Regarding hyperthreading and process binding settings... I am using only
> one MPI process on each node (2 nodes for a classical ping-pong latency
> benchmark). I don't know how that could affect this test... but I will try
> anything that anyone suggests.
>
> 2012/1/12 Jeff Squyres 
>
>> Hi Roberto.
>>
>> We've had strange reports of performance from EC2 before; it's actually
>> been on my to-do list to go check this out in detail.  I made contact with
>> the EC2 folks at Supercomputing late last year.  They've hooked me up with
>> some credits on EC2 to go check out what's happening, but the pent-up email
>> deluge from the Christmas vacation and my travel to the MPI Forum this week
>> prevented me from testing yet.
>>
>> I hope to be able to get time to test Open MPI on EC2 next week and see
>> what's going on.
>>
>> It's very strange to me that Open MPI is getting *better* than raw TCP
>> performance.  I don't have an immediate explanation for that -- if you're
>> using the TCP BTL, then OMPI should be using TCP sockets, just like netpipe
>> and the others.
>>
>> You *might* want to check hyperthreading and process binding settings in
>> all your tests.
>>
>>
>> On Jan 12, 2012, at 7:04 AM, Roberto Rey wrote:
>>
>> > Hi again,
>> >
>> > Today I was trying another TCP benchmark included in the hpcbench
>> suite, and with a ping-pong test I'm also getting 100us of latency. Then I
>> tried netperf and got the same result.
>> >
>> > So, in summary, I'm measuring TCP latency with message sizes between
>> 1-32 bytes:
>> >
>> > Netperf over TCP -> 100us
>> > Netpipe over TCP (NPtcp)-> 100us
>> > HPCbench over TCP-> 100us
>> > Netpipe over OpenMPI (NPmpi) -> 60us
>> > HPCBench over OpenMPI -> 60us
>> >
>> > Any clues?
>> >
>> > Thanks a lot!
>> >
>> > 2012/1/10 Roberto Rey 
>> > Hi,
>> >
>> > I'm running some tests on EC2 cluster instances with 10 Gigabit
>> Ethernet hardware and I'm getting strange latency results with Netpipe and
>> OpenMPI.
>> >
>> > If I run Netpipe over OpenMPI (NPmpi) I get a network latency around 60
>> microseconds for small messages (less than 2kbytes). However, when I run
>> Netpipe over TCP (NPtcp) I always get around 100 microseconds. For bigger
>> messages everything seems to be OK.
>> >
>> > I'm using the BTL TCP in OpenMPI, so I can't understand why OpenMPI
>> outperforms raw TCP performance for small messages (40us of difference). I
>> also have run the PingPong test from the Intel MPI Benchmarks and the
>> latency results for OpenMPI are very similar (60us) to those obtained with
>> NPmpi
>> >
>> > Can OpenMPI outperform Netpipe over TCP? Why? Is OpenMPI  doing any
>> optimization in BTL TCP?
>> >
>> > The results for OpenMPI aren't so good but we must take into account
>> the network virtualization overhead under Xen
>> >
>> > Thanks for your reply
>> >
>> >
>> >
>> > --
>> > Roberto Rey Expósito
>> > ___
>> > users mailing list
>> > us...@open-mpi.org
>> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>>
>> --
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>
>
>
> --
> Roberto Rey Expósito
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>



-- 
| Teng Ma  Univ. of Tennessee |
| t...@cs.utk.edu    Knoxville, TN |
| http://web.eecs.utk.edu/~tma/   |


Re: [OMPI users] MPI_Allgather problem

2011-12-09 Thread teng ma
I guess your output is interleaved from different ranks.  You can add the rank
inside the print to tell them apart, like this:

(void) printf("rank %d: gathered[%d].node = %d\n", rank, i,
gathered[i].node);

From my side, I did not see anything wrong with your code in Open MPI
1.4.3.  After I added the rank, the output is:
rank 5: gathered[0].node = 0
rank 5: gathered[1].node = 1
rank 5: gathered[2].node = 2
rank 5: gathered[3].node = 3
rank 5: gathered[4].node = 4
rank 5: gathered[5].node = 5
rank 3: gathered[0].node = 0
rank 3: gathered[1].node = 1
rank 3: gathered[2].node = 2
rank 3: gathered[3].node = 3
rank 3: gathered[4].node = 4
rank 3: gathered[5].node = 5
rank 1: gathered[0].node = 0
rank 1: gathered[1].node = 1
rank 1: gathered[2].node = 2
rank 1: gathered[3].node = 3
rank 1: gathered[4].node = 4
rank 1: gathered[5].node = 5
rank 0: gathered[0].node = 0
rank 0: gathered[1].node = 1
rank 0: gathered[2].node = 2
rank 0: gathered[3].node = 3
rank 0: gathered[4].node = 4
rank 0: gathered[5].node = 5
rank 4: gathered[0].node = 0
rank 4: gathered[1].node = 1
rank 4: gathered[2].node = 2
rank 4: gathered[3].node = 3
rank 4: gathered[4].node = 4
rank 4: gathered[5].node = 5
rank 2: gathered[0].node = 0
rank 2: gathered[1].node = 1
rank 2: gathered[2].node = 2
rank 2: gathered[3].node = 3
rank 2: gathered[4].node = 4
rank 2: gathered[5].node = 5

Is that what you expected?
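
For reference, here is a self-contained version of that test with the rank
added to the print (a sketch: I assume numnodes in your code corresponds to
size and FREE to plain free):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

struct _TEST_ALL_GATHER {
    int node;
};

int main(int argc, char **argv)
{
    int size, rank, i;
    struct _TEST_ALL_GATHER local, *gathered;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    gathered = malloc(size * sizeof(*gathered));
    local.node = rank;

    /* Each rank contributes sizeof(struct) raw bytes, as in the original test. */
    MPI_Allgather(&local, sizeof(struct _TEST_ALL_GATHER), MPI_BYTE,
                  gathered, sizeof(struct _TEST_ALL_GATHER), MPI_BYTE,
                  MPI_COMM_WORLD);

    for (i = 0; i < size; i++)
        printf("rank %d: gathered[%d].node = %d\n", rank, i, gathered[i].node);

    free(gathered);
    MPI_Finalize();
    return 0;
}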

On Fri, Dec 9, 2011 at 12:03 PM, Brett Tully wrote:

> Dear all,
>
> I have not used OpenMPI much before, but am maintaining a large legacy
> application. We noticed a bug to do with a call to MPI_Allgather as
> summarised in this post to Stackoverflow:
> http://stackoverflow.com/questions/8445398/mpi-allgather-produces-inconsistent-results
>
> In the process of looking further into the problem, I noticed that the
> following function results in strange behaviour.
>
> void test_all_gather() {
>
> struct _TEST_ALL_GATHER {
> int node;
> };
>
> int ierr, size, rank;
> ierr = MPI_Comm_size(MPI_COMM_WORLD, &size);
> ierr = MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>
> struct _TEST_ALL_GATHER local;
> struct _TEST_ALL_GATHER *gathered;
>
> gathered = (struct _TEST_ALL_GATHER*) malloc(size * sizeof(*gathered));
>
> local.node = rank;
>
> MPI_Allgather(&local, sizeof(struct _TEST_ALL_GATHER), MPI_BYTE,
> gathered, sizeof(struct _TEST_ALL_GATHER), MPI_BYTE,
> MPI_COMM_WORLD);
>
> int i;
> for (i = 0; i < numnodes; ++i) {
> (void) printf("gathered[%d].node = %d\n", i, gathered[i].node);
> }
>
> FREE(gathered);
> }
>
> At one point, this function printed the following:
> gathered[0].node = 2
> gathered[1].node = 3
> gathered[2].node = 2
> gathered[3].node = 3
> gathered[4].node = 4
> gathered[5].node = 5
>
> Can anyone suggest a place to start looking into why this might be
> happening? There is a section of the code that calls MPI_Comm_split, but I
> am not sure if that is related...
>
> Running on Ubuntu 11.10 and a summary of ompi_info:
> Package: Open MPI buildd@allspice Distribution
> Open MPI: 1.4.3
> Open MPI SVN revision: r23834
> Open MPI release date: Oct 05, 2010
> Open RTE: 1.4.3
> Open RTE SVN revision: r23834
> Open RTE release date: Oct 05, 2010
> OPAL: 1.4.3
> OPAL SVN revision: r23834
> OPAL release date: Oct 05, 2010
> Ident string: 1.4.3
> Prefix: /usr
> Configured architecture: x86_64-pc-linux-gnu
> Configure host: allspice
> Configured by: buildd
>
> Thanks!
> Brett
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>



-- 
| Teng Ma  Univ. of Tennessee |
| t...@cs.utk.edu    Knoxville, TN |
| http://web.eecs.utk.edu/~tma/   |


Re: [OMPI users] qp memory allocation problem

2011-09-12 Thread Teng Ma
>> "fork()" system
>> call or not (0 = no, 1 = yes).
>>   Note that this value does NOT indicate whether
>> the system being run on supports "fork()" with
>> OpenFabrics applications
>>   or not.
>>  MCA btl: parameter "btl_openib_exclusivity" (current
>> value: "1024", data source: default value)
>>   BTL exclusivity (must be >= 0)
>>  MCA btl: parameter "btl_openib_flags" (current value:
>> "310", data source: default value)
>>   BTL bit flags (general flags: SEND=1, PUT=2,
>> GET=4, SEND_INPLACE=8, RDMA_MATCHED=64,
>> HETEROGENEOUS_RDMA=256; flags
>>   only used by the "dr" PML (ignored by others):
>> ACK=16, CHECKSUM=32, RDMA_COMPLETION=128)
>>  MCA btl: parameter "btl_openib_rndv_eager_limit"
>> (current value: "12288", data source: default value)
>>   Size (in bytes) of "phase 1" fragment sent for
>> all large messages (must be >= 0 and <=
>> eager_limit)
>>  MCA btl: parameter "btl_openib_eager_limit" (current
>> value: "12288", data source: default value)
>>   Maximum size (in bytes) of "short" messages
>> (must be >= 1).
>>  MCA btl: parameter "btl_openib_max_send_size" (current
>> value: "65536", data source: default value)
>>   Maximum size (in bytes) of a single "phase 2"
>> fragment of a long message when using the
>> pipeline protocol (must be >=
>>   1)
>>  MCA btl: parameter
>> "btl_openib_rdma_pipeline_send_length" (current value:
>> "1048576", data source: default value)
>>   Length of the "phase 2" portion of a large
>> message (in bytes) when using the pipeline
>> protocol.  This part of the
>>   message will be split into fragments of size
>> max_send_size and sent using send/receive
>> semantics (must be >= 0; only
>>   relevant when the PUT flag is set)
>>  MCA btl: parameter "btl_openib_rdma_pipeline_frag_size"
>> (current value: "1048576", data source: default value)
>>   Maximum size (in bytes) of a single "phase 3"
>> fragment from a long message when using the
>> pipeline protocol.  These
>>   fragments will be sent using RDMA semantics
>> (must be >= 1; only relevant when the PUT flag
>> is set)
>>  MCA btl: parameter "btl_openib_min_rdma_pipeline_size"
>> (current value: "262144", data source: default value)
>>   Messages smaller than this size (in bytes)
>> will not use the RDMA pipeline protocol.
>> Instead, they will be split into
>>   fragments of max_send_size and sent using
>> send/receive semantics (must be >=0, and is
>> automatically adjusted up to at
>>   least
>> (eager_limit+btl_rdma_pipeline_send_length);
>> only relevant when the PUT flag is set)
>>  MCA btl: parameter "btl_openib_bandwidth" (current
>> value: "800", data source: default value)
>>   Approximate maximum bandwidth of
>> interconnect(must be >= 1)
>>  MCA btl: parameter "btl_openib_latency" (current value:
>> "10", data source: default value)
>>   Approximate latency of interconnect (must be
>> >= 0)
>>  MCA btl: parameter "btl_openib_receive_queues" (current
>> value:
>>   
>> "P,128,256,192,128:S,2048,256,128,32:S,12288,256,128,32:S,65536,256,128,32",
>> data source: default value)
>>   Colon-delimited, comma delimited list of
>> receive queues: P,4096,8,6,4:P,32768,8,6,4
>>  MCA btl: parameter "btl_openib_if_include" (current
>> value: , data source: default value)
>>   Comma-delimited list of devices/ports to be
>> used (e.g. "mthca0,mthca1:2"; empty value
>> means to use all ports found).
>>       Mutually exclusive with btl_openib_if_exclude.
>>  MCA btl: param

Re: [OMPI users] Problem with MPI_BARRIER

2011-09-08 Thread Teng Ma
If barrier/time/barrier/time fixes the numbers in each measurement, that
means the computation above/below your barrier is not really "synchronized":
its cost differs from process to process.  On the 2nd/3rd/... round, the
times at which processes enter the barrier are very spread out, maybe
anywhere in the range [1, 1400], so the barrier itself becomes a huge
overhead in your code.
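
In code, the barrier/time/barrier/time measurement looks like this sketch
(the timer placement is the point; everything else is placeholder):

/* The first barrier absorbs the skew from the preceding work, so the second
 * (timed) barrier measures roughly the cost of the barrier itself. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    double t0, dt;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* ... your computation would go here ... */

    MPI_Barrier(MPI_COMM_WORLD);   /* synchronize first */
    t0 = MPI_Wtime();
    MPI_Barrier(MPI_COMM_WORLD);   /* the barrier you actually want to time */
    dt = MPI_Wtime() - t0;

    printf("rank %d: barrier %.1f us\n", rank, dt * 1e6);

    MPI_Finalize();
    return 0;
}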

Teng

> do
>
> barrier/time/barrier/time
>
> and run your code again.
>
> Teng
>> I will check that, but as I said in my first email, this strange behaviour
>> happens only in one place in my code.
>> I have the same time/barrier/time procedure in other places (in the same
>> code) and it works perfectly.
>>
>> At one place I have the following output (sorted)
>> <00>(0) CAST GHOST DATA1 LOOP 1 barrier   1.0
>> <00>(0) CAST GHOST DATA1 LOOP 1 barrier   1.0
>> <00>(0) CAST GHOST DATA1 LOOP 1 barrier  14.2
>> <00>(0) CAST GHOST DATA1 LOOP 1 barrier  16.3
>> <00>(0) CAST GHOST DATA1 LOOP 1 barrier  25.1
>> <00>(0) CAST GHOST DATA1 LOOP 1 barrier  28.4
>> <00>(0) CAST GHOST DATA1 LOOP 1 barrier  32.6
>> <00>(0) CAST GHOST DATA1 LOOP 1 barrier  35.3
>> .
>> .
>> .
>> <00>(0) CAST GHOST DATA1 LOOP 1 barrier  90.1
>> <00>(0) CAST GHOST DATA1 LOOP 1 barrier  96.3
>> <00>(0) CAST GHOST DATA1 LOOP 1 barrier  99.5
>> <00>(0) CAST GHOST DATA1 LOOP 1 barrier 101.2
>> <00>(0) CAST GHOST DATA1 LOOP 1 barrier 119.3
>> <00>(0) CAST GHOST DATA1 LOOP 1 barrier 169.3
>>
>> but in the place that concerns me I have (sorted)
>> <00>(0) CAST GHOST DATA2 LOOP 2 barrier1386.9
>> <00>(0) CAST GHOST DATA2 LOOP 2 barrier1401.5
>> <00>(0) CAST GHOST DATA2 LOOP 2 barrier1412.9
>> <00>(0) CAST GHOST DATA2 LOOP 2 barrier1414.1
>> <00>(0) CAST GHOST DATA2 LOOP 2 barrier1419.6
>> <00>(0) CAST GHOST DATA2 LOOP 2 barrier1428.1
>> <00>(0) CAST GHOST DATA2 LOOP 2 barrier1430.4
>> .
>> .
>> .
>> <00>(0) CAST GHOST DATA2 LOOP 2 barrier1632.7
>> <00>(0) CAST GHOST DATA2 LOOP 2 barrier1635.7
>> <00>(0) CAST GHOST DATA2 LOOP 2 barrier1660.6
>> <00>(0) CAST GHOST DATA2 LOOP 2 barrier1685.1
>> <00>(0) CAST GHOST DATA2 LOOP 2 barrier1699.2
>>
>>
>> These are the same units...
>> You see that in the first place, the "time" to "hit/wait/leave" can be
>> very small compared to the last output...
>>
>>
>> Le 8 sept. 2011 à 16:35, Teng Ma a écrit :
>>
>>> You'd better check process-core binding in your case.  It looks to me
>>> like P0 and P1 are on the same node and P2 is on another node, which makes
>>> the ack to P0/P1 go through shared memory and the ack to P2 go through the
>>> network.  1000x is very possible: sm latency can be about 0.03 microseconds,
>>> while Ethernet latency is about 20-30 microseconds.
>>> latency is about 20-30 microsec.
>>>
>>> Just my guess..
>>>
>>> Teng
>>>> Thanks,
>>>>
>>>> I understand this but the delays that I measure are huge compared to a
>>>> classical ack procedure... (1000x more)
>>>> And this is repeatable: as far as I understand it, this shows that the
>>>> network is not involved.
>>>>
>>>> Ghislain.
>>>>
>>>>
>>>> Le 8 sept. 2011 à 16:16, Teng Ma a écrit :
>>>>
>>>>> I guess you forgot to count the "leaving time" (fan-out).  When everyone
>>>>> hits the barrier, it still needs an "ack" to leave.  And remember that in
>>>>> most cases the leader process will send out the "acks" sequentially.  It's
>>>>> very possible:
>>>>>
>>>>> P0 barrier time = 29 + send/recv ack 0
>>>>> P1 barrier time = 14 + send ack 0  + send/recv ack 1
>>>>> P2 barrier time = 0 + send ack 0 + send ack 1 + send/recv ack 2
>>>>>
>>>>> That's your measure time.
>>>>>
>>>>> Teng
>>>>>> This problem has nothing to do with stdout...
>>>>>>
>>>>>> Example with 3 processes:
>>>>>>
>>>>>> P0 hits barrier at t=12
>>>>>> P1 hits barrier at t=27
>>>>>> P2 hits barrier at t=41
>>>>>>
>>>>>> In this situation:
>>>

Re: [OMPI users] Problem with MPI_BARRIER

2011-09-08 Thread Teng Ma
do

barrier/time/barrier/time

and run your code again.

Teng
> I will check that, but as I said in my first email, this strange behaviour
> happens only in one place in my code.
> I have the same time/barrier/time procedure in other places (in the same
> code) and it works perfectly.
>
> At one place I have the following output (sorted)
> <00>(0) CAST GHOST DATA1 LOOP 1 barrier   1.0
> <00>(0) CAST GHOST DATA1 LOOP 1 barrier   1.0
> <00>(0) CAST GHOST DATA1 LOOP 1 barrier  14.2
> <00>(0) CAST GHOST DATA1 LOOP 1 barrier  16.3
> <00>(0) CAST GHOST DATA1 LOOP 1 barrier  25.1
> <00>(0) CAST GHOST DATA1 LOOP 1 barrier  28.4
> <00>(0) CAST GHOST DATA1 LOOP 1 barrier  32.6
> <00>(0) CAST GHOST DATA1 LOOP 1 barrier  35.3
> .
> .
> .
> <00>(0) CAST GHOST DATA1 LOOP 1 barrier  90.1
> <00>(0) CAST GHOST DATA1 LOOP 1 barrier  96.3
> <00>(0) CAST GHOST DATA1 LOOP 1 barrier  99.5
> <00>(0) CAST GHOST DATA1 LOOP 1 barrier 101.2
> <00>(0) CAST GHOST DATA1 LOOP 1 barrier 119.3
> <00>(0) CAST GHOST DATA1 LOOP 1 barrier 169.3
>
> but in the place that concerns me I have (sorted)
> <00>(0) CAST GHOST DATA2 LOOP 2 barrier1386.9
> <00>(0) CAST GHOST DATA2 LOOP 2 barrier1401.5
> <00>(0) CAST GHOST DATA2 LOOP 2 barrier1412.9
> <00>(0) CAST GHOST DATA2 LOOP 2 barrier1414.1
> <00>(0) CAST GHOST DATA2 LOOP 2 barrier1419.6
> <00>(0) CAST GHOST DATA2 LOOP 2 barrier1428.1
> <00>(0) CAST GHOST DATA2 LOOP 2 barrier1430.4
> .
> .
> .
> <00>(0) CAST GHOST DATA2 LOOP 2 barrier1632.7
> <00>(0) CAST GHOST DATA2 LOOP 2 barrier1635.7
> <00>(0) CAST GHOST DATA2 LOOP 2 barrier1660.6
> <00>(0) CAST GHOST DATA2 LOOP 2 barrier1685.1
> <00>(0) CAST GHOST DATA2 LOOP 2 barrier1699.2
>
>
> These are the same units...
> You see that in the first place, the "time" to "hit/wait/leave" can be
> very small compared to the last output...
>
>
> Le 8 sept. 2011 à 16:35, Teng Ma a écrit :
>
>> You'd better check process-core binding in your case.  It looks to me like
>> P0 and P1 are on the same node and P2 is on another node, which makes the
>> ack to P0/P1 go through shared memory and the ack to P2 go through the
>> network.  1000x is very possible: sm latency can be about 0.03 microseconds,
>> while Ethernet latency is about 20-30 microseconds.
>>
>> Just my guess..
>>
>> Teng
>>> Thanks,
>>>
>>> I understand this but the delays that I measure are huge compared to a
>>> classical ack procedure... (1000x more)
>>> And this is repeatable: as far as I understand it, this shows that the
>>> network is not involved.
>>>
>>> Ghislain.
>>>
>>>
>>> Le 8 sept. 2011 à 16:16, Teng Ma a écrit :
>>>
>>>> I guess you forgot to count the "leaving time" (fan-out).  When everyone
>>>> hits the barrier, it still needs an "ack" to leave.  And remember that in
>>>> most cases the leader process will send out the "acks" sequentially.  It's
>>>> very possible:
>>>>
>>>> P0 barrier time = 29 + send/recv ack 0
>>>> P1 barrier time = 14 + send ack 0  + send/recv ack 1
>>>> P2 barrier time = 0 + send ack 0 + send ack 1 + send/recv ack 2
>>>>
>>>> That's your measure time.
>>>>
>>>> Teng
>>>>> This problem has nothing to do with stdout...
>>>>>
>>>>> Example with 3 processes:
>>>>>
>>>>> P0 hits barrier at t=12
>>>>> P1 hits barrier at t=27
>>>>> P2 hits barrier at t=41
>>>>>
>>>>> In this situation:
>>>>> P0 waits 41-12 = 29
>>>>> P1 waits 41-27 = 14
>>>>> P2 waits 41-41 = 00
>>>>
>>>>
>>>>
>>>>> So I should see something  like (no ordering is expected):
>>>>> barrier_time = 14
>>>>> barrier_time = 00
>>>>> barrier_time = 29
>>>>>
>>>>> But what I see is much more like
>>>>> barrier_time = 22
>>>>> barrier_time = 29
>>>>> barrier_time = 25
>>>>>
>>>>> See? No process has a barrier_time equal to zero !!!
>>>>>
>>>>>
>>>>>
>>>>> Le 8 sept. 2011 à 14:55, Jeff Squyres a écrit :
>>>>>
>

Re: [OMPI users] Problem with MPI_BARRIER

2011-09-08 Thread Teng Ma
You'd better check process-core binding in your case.  It looks to me like
P0 and P1 are on the same node and P2 is on another node, which makes the
ack to P0/P1 go through shared memory and the ack to P2 go through the
network.  1000x is very possible: sm latency can be about 0.03 microseconds,
while Ethernet latency is about 20-30 microseconds.

Just my guess..
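
A quick way to check is to have every rank report the node it runs on and
compare that with the measured barrier times.  Something like this sketch:

/* Each rank reports its host name, so you can see which processes share a node. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    char host[MPI_MAX_PROCESSOR_NAME];
    int rank, len;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(host, &len);
    printf("rank %d runs on %s\n", rank, host);
    MPI_Finalize();
    return 0;
}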

Teng
> Thanks,
>
> I understand this but the delays that I measure are huge compared to a
> classical ack procedure... (1000x more)
> And this is repeatable: as far as I understand it, this shows that the
> network is not involved.
>
> Ghislain.
>
>
> Le 8 sept. 2011 à 16:16, Teng Ma a écrit :
>
>> I guess you forgot to count the "leaving time" (fan-out).  When everyone
>> hits the barrier, it still needs an "ack" to leave.  And remember that in
>> most cases the leader process will send out the "acks" sequentially.  It's
>> very possible:
>>
>> P0 barrier time = 29 + send/recv ack 0
>> P1 barrier time = 14 + send ack 0  + send/recv ack 1
>> P2 barrier time = 0 + send ack 0 + send ack 1 + send/recv ack 2
>>
>> That's your measure time.
>>
>> Teng
>>> This problem has nothing to do with stdout...
>>>
>>> Example with 3 processes:
>>>
>>> P0 hits barrier at t=12
>>> P1 hits barrier at t=27
>>> P2 hits barrier at t=41
>>>
>>> In this situation:
>>> P0 waits 41-12 = 29
>>> P1 waits 41-27 = 14
>>> P2 waits 41-41 = 00
>>
>>
>>
>>> So I should see something  like (no ordering is expected):
>>> barrier_time = 14
>>> barrier_time = 00
>>> barrier_time = 29
>>>
>>> But what I see is much more like
>>> barrier_time = 22
>>> barrier_time = 29
>>> barrier_time = 25
>>>
>>> See? No process has a barrier_time equal to zero !!!
>>>
>>>
>>>
>>> Le 8 sept. 2011 à 14:55, Jeff Squyres a écrit :
>>>
>>>> The order in which you see stdout printed from mpirun is not
>>>> necessarily
>>>> reflective of what order things were actually printers.  Remember that
>>>> the stdout from each MPI process needs to flow through at least 3
>>>> processes and potentially across the network before it is actually
>>>> displayed on mpirun's stdout.
>>>>
>>>> MPI process -> local Open MPI daemon -> mpirun -> printed to mpirun's
>>>> stdout
>>>>
>>>> Hence, the ordering of stdout can get transposed.
>>>>
>>>>
>>>> On Sep 8, 2011, at 8:49 AM, Ghislain Lartigue wrote:
>>>>
>>>>> Thank you for this explanation but indeed this confirms that the LAST
>>>>> process that hits the barrier should go through nearly
>>>>> instantaneously
>>>>> (except for the broadcast time for the acknowledgment signal).
>>>>> And this is not what happens in my code : EVERY process waits for a
>>>>> very long time before going through the barrier (thousands of times
>>>>> more than a broadcast)...
>>>>>
>>>>>
>>>>> Le 8 sept. 2011 à 14:26, Jeff Squyres a écrit :
>>>>>
>>>>>> Order in which processes hit the barrier is only one factor in the
>>>>>> time it takes for that process to finish the barrier.
>>>>>>
>>>>>> An easy way to think of a barrier implementation is a "fan in/fan
>>>>>> out"
>>>>>> model.  When each nonzero rank process calls MPI_BARRIER, it sends a
>>>>>> message saying "I have hit the barrier!" (it usually sends it to its
>>>>>> parent in a tree of all MPI processes in the communicator, but you
>>>>>> can
>>>>>> simplify this model and consider that it sends it to rank 0).  Rank
>>>>>> 0
>>>>>> collects all of these messages.  When it has messages from all
>>>>>> processes in the communicator, it sends out "ok, you can leave the
>>>>>> barrier now" messages (again, it's usually via a tree distribution,
>>>>>> but you can pretend that it directly, linearly sends a message to
>>>>>> each
>>>>>> peer process in the communicator).
>>>>>>
>>>>>> Hence, the time that any individual process spends in the
>>>>>> communicator
>>>>>> is relative to when every other process ent

Re: [OMPI users] Problem with MPI_BARRIER

2011-09-08 Thread Teng Ma
>>>>> PS:
>>>>> This small code behaves perfectly in other parts of my code...
>>>>>
>>>>>
>>>>> _______
>>>>> users mailing list
>>>>> us...@open-mpi.org
>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>
>>>>
>>>> --
>>>> Jeff Squyres
>>>> jsquy...@cisco.com
>>>> For corporate legal information go to:
>>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>>
>>>>
>>>> ___
>>>> users mailing list
>>>> us...@open-mpi.org
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>
>>>
>>>
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>>
>> --
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


| Teng Ma  Univ. of Tennessee |
| t...@cs.utk.edu    Knoxville, TN |
| http://web.eecs.utk.edu/~tma/   |