Re: [OMPI users] Using physical numbering in a rankfile
I made a mistake in the previous reply. You can write it either of two ways here:

rank 0=host1 slot=0
rank 1=host1 slot=2
rank 2=host1 slot=4
rank 3=host1 slot=6
rank 4=host1 slot=1
rank 5=host1 slot=3
rank 6=host1 slot=5
rank 7=host1 slot=7

or

rank 0=host1 slot=0:0
rank 1=host1 slot=0:1
rank 2=host1 slot=0:2
rank 3=host1 slot=0:3
rank 4=host1 slot=1:0
rank 5=host1 slot=1:1
rank 6=host1 slot=1:2
rank 7=host1 slot=1:3

Teng

On Thu, Feb 2, 2012 at 12:17 PM, teng ma wrote:

> Just remove the p in your rankfile, like:
>
> rank 0=host1 slot=0:0
> rank 1=host1 slot=0:2
> rank 2=host1 slot=0:4
> rank 3=host1 slot=0:6
> rank 4=host1 slot=1:1
> rank 5=host1 slot=1:3
> rank 6=host1 slot=1:5
> rank 7=host1 slot=1:7
>
> Teng
>
> 2012/2/2 François Tessier
>
>> Hello,
>>
>> I need to use a rankfile with Open MPI 1.5.4 to do some tests on a basic
>> architecture. I'm using a node for which lstopo returns this:
>>
>> Machine (24GB)
>>   NUMANode L#0 (P#0 12GB)
>>     Socket L#0 + L3 L#0 (8192KB)
>>       L2 L#0 (256KB) + L1 L#0 (32KB) + Core L#0 + PU L#0 (P#0)
>>       L2 L#1 (256KB) + L1 L#1 (32KB) + Core L#1 + PU L#1 (P#2)
>>       L2 L#2 (256KB) + L1 L#2 (32KB) + Core L#2 + PU L#2 (P#4)
>>       L2 L#3 (256KB) + L1 L#3 (32KB) + Core L#3 + PU L#3 (P#6)
>>     HostBridge L#0
>>       PCIBridge
>>         PCI 8086:10c9
>>           Net L#0 "eth0"
>>         PCI 8086:10c9
>>           Net L#1 "eth1"
>>       PCIBridge
>>         PCI 15b3:673c
>>           Net L#2 "ib0"
>>           Net L#3 "ib1"
>>           OpenFabrics L#4 "mlx4_0"
>>       PCIBridge
>>         PCI 102b:0522
>>       PCI 8086:3a22
>>         Block L#5 "sda"
>>         Block L#6 "sdb"
>>         Block L#7 "sdc"
>>         Block L#8 "sdd"
>>   NUMANode L#1 (P#1 12GB) + Socket L#1 + L3 L#1 (8192KB)
>>     L2 L#4 (256KB) + L1 L#4 (32KB) + Core L#4 + PU L#4 (P#1)
>>     L2 L#5 (256KB) + L1 L#5 (32KB) + Core L#5 + PU L#5 (P#3)
>>     L2 L#6 (256KB) + L1 L#6 (32KB) + Core L#6 + PU L#6 (P#5)
>>     L2 L#7 (256KB) + L1 L#7 (32KB) + Core L#7 + PU L#7 (P#7)
>>
>> I would like to use the physical numbering. To do that, I created a
>> rankfile like this:
>>
>> rank 0=host1 slot=p0:0
>> rank 1=host1 slot=p0:2
>> rank 2=host1 slot=p0:4
>> rank 3=host1 slot=p0:6
>> rank 4=host1 slot=p1:1
>> rank 5=host1 slot=p1:3
>> rank 6=host1 slot=p1:5
>> rank 7=host1 slot=p1:7
>>
>> But when I run my job with "mpiexec -np 8 --rankfile rankfile ./foo",
>> I encounter this error:
>>
>> Specified slot list: p0:4
>> Error: Not found
>>
>> This could mean that a non-existent processor was specified, or
>> that the specification had improper syntax.
>>
>> Do you know what I did wrong?
>>
>> Best regards,
>>
>> François
>>
>> --
>> François TESSIER
>> PhD Student at University of Bordeaux
>> Tel : 0033.5.24.57.41.52
>> francois.tess...@inria.fr

--
| Teng Ma                        Univ. of Tennessee |
| t...@cs.utk.edu                 Knoxville, TN      |
| http://web.eecs.utk.edu/~tma/                     |
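If you want to double-check where each rank actually lands, a minimal sketch
like the following may help (my own illustration, not from the thread; it
assumes Linux and glibc's sched_getcpu(), and the file/program names are made
up):

/* where_am_i.c -- print the OS/physical CPU each rank is running on */
#define _GNU_SOURCE
#include <sched.h>   /* sched_getcpu() (glibc, Linux-specific) */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    /* sched_getcpu() reports the OS CPU number, i.e. the P# shown by lstopo */
    printf("rank %d is on CPU %d\n", rank, sched_getcpu());
    MPI_Finalize();
    return 0;
}

Running it under the same rankfile (e.g. "mpiexec -np 8 --rankfile rankfile
./where_am_i") and comparing the printed CPU numbers against the PU P# values
from lstopo should confirm whether the placement is the one you intended.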
Re: [OMPI users] Using physical numbering in a rankfile
Just remove the p in your rankfile, like:

rank 0=host1 slot=0:0
rank 1=host1 slot=0:2
rank 2=host1 slot=0:4
rank 3=host1 slot=0:6
rank 4=host1 slot=1:1
rank 5=host1 slot=1:3
rank 6=host1 slot=1:5
rank 7=host1 slot=1:7

Teng

2012/2/2 François Tessier

> Hello,
>
> I need to use a rankfile with Open MPI 1.5.4 to do some tests on a basic
> architecture. I'm using a node for which lstopo returns this:
>
> Machine (24GB)
>   NUMANode L#0 (P#0 12GB)
>     Socket L#0 + L3 L#0 (8192KB)
>       L2 L#0 (256KB) + L1 L#0 (32KB) + Core L#0 + PU L#0 (P#0)
>       L2 L#1 (256KB) + L1 L#1 (32KB) + Core L#1 + PU L#1 (P#2)
>       L2 L#2 (256KB) + L1 L#2 (32KB) + Core L#2 + PU L#2 (P#4)
>       L2 L#3 (256KB) + L1 L#3 (32KB) + Core L#3 + PU L#3 (P#6)
>     HostBridge L#0
>       PCIBridge
>         PCI 8086:10c9
>           Net L#0 "eth0"
>         PCI 8086:10c9
>           Net L#1 "eth1"
>       PCIBridge
>         PCI 15b3:673c
>           Net L#2 "ib0"
>           Net L#3 "ib1"
>           OpenFabrics L#4 "mlx4_0"
>       PCIBridge
>         PCI 102b:0522
>       PCI 8086:3a22
>         Block L#5 "sda"
>         Block L#6 "sdb"
>         Block L#7 "sdc"
>         Block L#8 "sdd"
>   NUMANode L#1 (P#1 12GB) + Socket L#1 + L3 L#1 (8192KB)
>     L2 L#4 (256KB) + L1 L#4 (32KB) + Core L#4 + PU L#4 (P#1)
>     L2 L#5 (256KB) + L1 L#5 (32KB) + Core L#5 + PU L#5 (P#3)
>     L2 L#6 (256KB) + L1 L#6 (32KB) + Core L#6 + PU L#6 (P#5)
>     L2 L#7 (256KB) + L1 L#7 (32KB) + Core L#7 + PU L#7 (P#7)
>
> I would like to use the physical numbering. To do that, I created a
> rankfile like this:
>
> rank 0=host1 slot=p0:0
> rank 1=host1 slot=p0:2
> rank 2=host1 slot=p0:4
> rank 3=host1 slot=p0:6
> rank 4=host1 slot=p1:1
> rank 5=host1 slot=p1:3
> rank 6=host1 slot=p1:5
> rank 7=host1 slot=p1:7
>
> But when I run my job with "mpiexec -np 8 --rankfile rankfile ./foo",
> I encounter this error:
>
> Specified slot list: p0:4
> Error: Not found
>
> This could mean that a non-existent processor was specified, or
> that the specification had improper syntax.
>
> Do you know what I did wrong?
>
> Best regards,
>
> François
>
> --
> François TESSIER
> PhD Student at University of Bordeaux
> Tel : 0033.5.24.57.41.52
> francois.tess...@inria.fr

--
| Teng Ma                        Univ. of Tennessee |
| t...@cs.utk.edu                 Knoxville, TN      |
| http://web.eecs.utk.edu/~tma/                     |
Re: [OMPI users] Strange TCP latency results on Amazon EC2
Is it possible your EC2 cluster has another "unknown" (and slower) Ethernet
card, e.g. a 1Gb Ethernet card? For small messages, they may go through
different paths in NPtcp and in MPI over NPmpi.

Teng Ma

On Thu, Jan 12, 2012 at 10:28 AM, Roberto Rey wrote:

> Thanks for your reply!
>
> I'm using the TCP BTL because I don't have any other option on Amazon with
> 10 Gbit Ethernet.
>
> I also tried with MPICH2 1.4 and I got 60 microseconds... so I am very
> confused about it...
>
> Regarding hyperthreading and process binding settings: I am using only one
> MPI process on each node (2 nodes for a classical ping-pong latency
> benchmark). I don't know how that could affect this test, but I can try
> anything anyone suggests.
>
> 2012/1/12 Jeff Squyres
>
>> Hi Roberto.
>>
>> We've had strange reports of performance from EC2 before; it's actually
>> been on my to-do list to go check this out in detail. I made contact
>> with the EC2 folks at Supercomputing late last year. They've hooked me
>> up with some credits on EC2 to go check out what's happening, but the
>> pent-up email deluge from the Christmas vacation and my travel to the
>> MPI Forum this week prevented me from testing yet.
>>
>> I hope to be able to get time to test Open MPI on EC2 next week and see
>> what's going on.
>>
>> It's very strange to me that Open MPI is getting *better* than raw TCP
>> performance. I don't have an immediate explanation for that -- if you're
>> using the TCP BTL, then OMPI should be using TCP sockets, just like
>> NetPIPE and the others.
>>
>> You *might* want to check hyperthreading and process binding settings in
>> all your tests.
>>
>> On Jan 12, 2012, at 7:04 AM, Roberto Rey wrote:
>>
>>> Hi again,
>>>
>>> Today I was trying with another TCP benchmark included in the hpcbench
>>> suite, and with a ping-pong test I'm also getting 100us of latency.
>>> Then I tried with netperf, with the same result.
>>>
>>> So, in summary, I'm measuring TCP latency with message sizes between
>>> 1-32 bytes:
>>>
>>> Netperf over TCP -> 100us
>>> Netpipe over TCP (NPtcp) -> 100us
>>> HPCbench over TCP -> 100us
>>> Netpipe over Open MPI (NPmpi) -> 60us
>>> HPCbench over Open MPI -> 60us
>>>
>>> Any clues?
>>>
>>> Thanks a lot!
>>>
>>> 2012/1/10 Roberto Rey
>>>
>>> Hi,
>>>
>>> I'm running some tests on EC2 cluster instances with 10 Gigabit
>>> Ethernet hardware and I'm getting strange latency results with Netpipe
>>> and Open MPI.
>>>
>>> If I run Netpipe over Open MPI (NPmpi) I get a network latency around
>>> 60 microseconds for small messages (less than 2 kbytes). However, when
>>> I run Netpipe over TCP (NPtcp) I always get around 100 microseconds.
>>> For bigger messages everything seems to be OK.
>>>
>>> I'm using the TCP BTL in Open MPI, so I can't understand why Open MPI
>>> outperforms raw TCP performance for small messages (a 40us difference).
>>> I have also run the PingPong test from the Intel MPI Benchmarks, and
>>> the latency results for Open MPI are very similar (60us) to those
>>> obtained with NPmpi.
>>>
>>> Can Open MPI outperform Netpipe over TCP? Why? Is Open MPI doing any
>>> optimization in the TCP BTL?
>>>
>>> The results for Open MPI aren't so good either, but we must take into
>>> account the network virtualization overhead under Xen.
>>>
>>> Thanks for your reply
>>>
>>> --
>>> Roberto Rey Expósito
>>
>> --
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>
> --
> Roberto Rey Expósito

--
| Teng Ma                        Univ. of Tennessee |
| t...@cs.utk.edu                 Knoxville, TN      |
| http://web.eecs.utk.edu/~tma/                     |
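For anyone who wants to reproduce the MPI side of the comparison without
NetPIPE, here is a minimal ping-pong latency sketch (my own illustration, not
from the thread; message size, iteration count and warm-up count are arbitrary
choices):

/* pingpong.c -- 2-process latency test; run with: mpirun -np 2 ./pingpong */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    const int iters = 10000, warmup = 1000, msg_size = 8;  /* small message */
    char buf[8] = {0};
    int rank;
    double t0 = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int i = 0; i < iters + warmup; i++) {
        if (i == warmup) t0 = MPI_Wtime();   /* start timing after warm-up */
        if (rank == 0) {
            MPI_Send(buf, msg_size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, msg_size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, msg_size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, msg_size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }

    if (rank == 0) {
        /* half the average round-trip time = one-way latency */
        printf("one-way latency: %.2f us\n", (MPI_Wtime() - t0) / iters / 2.0 * 1e6);
    }

    MPI_Finalize();
    return 0;
}

Running the same binary once with the TCP BTL forced ("mpirun --mca btl
tcp,self -np 2 ...") can help rule out an unexpected transport being picked up.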
Re: [OMPI users] MPI_Allgather problem
I guess your output is from different ranks. You can add the rank info inside
the print statement to tell, like this:

    (void) printf("rank %d: gathered[%d].node = %d\n", rank, i, gathered[i].node);

From my side, I did not see anything wrong with your code on Open MPI 1.4.3.
After I add the rank, the output is:

rank 5: gathered[0].node = 0
rank 5: gathered[1].node = 1
rank 5: gathered[2].node = 2
rank 5: gathered[3].node = 3
rank 5: gathered[4].node = 4
rank 5: gathered[5].node = 5
rank 3: gathered[0].node = 0
rank 3: gathered[1].node = 1
rank 3: gathered[2].node = 2
rank 3: gathered[3].node = 3
rank 3: gathered[4].node = 4
rank 3: gathered[5].node = 5
rank 1: gathered[0].node = 0
rank 1: gathered[1].node = 1
rank 1: gathered[2].node = 2
rank 1: gathered[3].node = 3
rank 1: gathered[4].node = 4
rank 1: gathered[5].node = 5
rank 0: gathered[0].node = 0
rank 0: gathered[1].node = 1
rank 0: gathered[2].node = 2
rank 0: gathered[3].node = 3
rank 0: gathered[4].node = 4
rank 0: gathered[5].node = 5
rank 4: gathered[0].node = 0
rank 4: gathered[1].node = 1
rank 4: gathered[2].node = 2
rank 4: gathered[3].node = 3
rank 4: gathered[4].node = 4
rank 4: gathered[5].node = 5
rank 2: gathered[0].node = 0
rank 2: gathered[1].node = 1
rank 2: gathered[2].node = 2
rank 2: gathered[3].node = 3
rank 2: gathered[4].node = 4
rank 2: gathered[5].node = 5

Is that what you expected?

On Fri, Dec 9, 2011 at 12:03 PM, Brett Tully wrote:

> Dear all,
>
> I have not used Open MPI much before, but am maintaining a large legacy
> application. We noticed a bug to do with a call to MPI_Allgather, as
> summarised in this post to Stack Overflow:
> http://stackoverflow.com/questions/8445398/mpi-allgather-produces-inconsistent-results
>
> In the process of looking further into the problem, I noticed that the
> following function results in strange behaviour.
>
> void test_all_gather() {
>
>     struct _TEST_ALL_GATHER {
>         int node;
>     };
>
>     int ierr, size, rank;
>     ierr = MPI_Comm_size(MPI_COMM_WORLD, &size);
>     ierr = MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>
>     struct _TEST_ALL_GATHER local;
>     struct _TEST_ALL_GATHER *gathered;
>
>     gathered = (struct _TEST_ALL_GATHER*) malloc(size * sizeof(*gathered));
>
>     local.node = rank;
>
>     MPI_Allgather(&local, sizeof(struct _TEST_ALL_GATHER), MPI_BYTE,
>                   gathered, sizeof(struct _TEST_ALL_GATHER), MPI_BYTE,
>                   MPI_COMM_WORLD);
>
>     int i;
>     for (i = 0; i < size; ++i) {
>         (void) printf("gathered[%d].node = %d\n", i, gathered[i].node);
>     }
>
>     free(gathered);
> }
>
> At one point, this function printed the following:
>
> gathered[0].node = 2
> gathered[1].node = 3
> gathered[2].node = 2
> gathered[3].node = 3
> gathered[4].node = 4
> gathered[5].node = 5
>
> Can anyone suggest a place to start looking into why this might be
> happening? There is a section of the code that calls MPI_Comm_split, but
> I am not sure if that is related...
>
> Running on Ubuntu 11.10; a summary of ompi_info:
>
>                 Package: Open MPI buildd@allspice Distribution
>                Open MPI: 1.4.3
>   Open MPI SVN revision: r23834
>   Open MPI release date: Oct 05, 2010
>                Open RTE: 1.4.3
>   Open RTE SVN revision: r23834
>   Open RTE release date: Oct 05, 2010
>                    OPAL: 1.4.3
>       OPAL SVN revision: r23834
>       OPAL release date: Oct 05, 2010
>            Ident string: 1.4.3
>                  Prefix: /usr
> Configured architecture: x86_64-pc-linux-gnu
>          Configure host: allspice
>           Configured by: buildd
>
> Thanks!
> Brett

--
| Teng Ma                        Univ. of Tennessee |
| t...@cs.utk.edu                 Knoxville, TN      |
| http://web.eecs.utk.edu/~tma/                     |
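To reproduce Teng's check outside the legacy application, here is a minimal,
self-contained sketch of the same test with the rank-labelled print (my own
illustration; since the struct carries a single int, gathering sizeof(struct)
bytes per rank is equivalent to gathering one int per rank):

/* allgather_test.c -- standalone version of the test, with the rank in the output */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

struct test_entry { int node; };

int main(int argc, char **argv)
{
    int size, rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    struct test_entry local = { rank };
    struct test_entry *gathered = malloc(size * sizeof(*gathered));

    /* every rank contributes sizeof(struct test_entry) raw bytes */
    MPI_Allgather(&local, sizeof(local), MPI_BYTE,
                  gathered, sizeof(local), MPI_BYTE, MPI_COMM_WORLD);

    for (int i = 0; i < size; ++i)
        printf("rank %d: gathered[%d].node = %d\n", rank, i, gathered[i].node);

    free(gathered);
    MPI_Finalize();
    return 0;
}

If every rank prints 0..size-1 in order here, the Allgather itself is behaving
and the problem more likely lies elsewhere in the application, for example in
the communicator produced by the MPI_Comm_split call Brett mentions.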
Re: [OMPI users] qp memory allocation problem
>>          " system call or not (0 = no, 1 = yes).
>>          Note that this value does NOT indicate whether the system being
>>          run on supports "fork()" with OpenFabrics applications or not.
>> MCA btl: parameter "btl_openib_exclusivity" (current value: "1024",
>>          data source: default value)
>>          BTL exclusivity (must be >= 0)
>> MCA btl: parameter "btl_openib_flags" (current value: "310", data
>>          source: default value)
>>          BTL bit flags (general flags: SEND=1, PUT=2, GET=4,
>>          SEND_INPLACE=8, RDMA_MATCHED=64, HETEROGENEOUS_RDMA=256; flags
>>          only used by the "dr" PML (ignored by others): ACK=16,
>>          CHECKSUM=32, RDMA_COMPLETION=128)
>> MCA btl: parameter "btl_openib_rndv_eager_limit" (current value:
>>          "12288", data source: default value)
>>          Size (in bytes) of "phase 1" fragment sent for all large
>>          messages (must be >= 0 and <= eager_limit)
>> MCA btl: parameter "btl_openib_eager_limit" (current value: "12288",
>>          data source: default value)
>>          Maximum size (in bytes) of "short" messages (must be >= 1).
>> MCA btl: parameter "btl_openib_max_send_size" (current value: "65536",
>>          data source: default value)
>>          Maximum size (in bytes) of a single "phase 2" fragment of a
>>          long message when using the pipeline protocol (must be >= 1)
>> MCA btl: parameter "btl_openib_rdma_pipeline_send_length" (current
>>          value: "1048576", data source: default value)
>>          Length of the "phase 2" portion of a large message (in bytes)
>>          when using the pipeline protocol. This part of the message will
>>          be split into fragments of size max_send_size and sent using
>>          send/receive semantics (must be >= 0; only relevant when the
>>          PUT flag is set)
>> MCA btl: parameter "btl_openib_rdma_pipeline_frag_size" (current value:
>>          "1048576", data source: default value)
>>          Maximum size (in bytes) of a single "phase 3" fragment from a
>>          long message when using the pipeline protocol. These fragments
>>          will be sent using RDMA semantics (must be >= 1; only relevant
>>          when the PUT flag is set)
>> MCA btl: parameter "btl_openib_min_rdma_pipeline_size" (current value:
>>          "262144", data source: default value)
>>          Messages smaller than this size (in bytes) will not use the
>>          RDMA pipeline protocol. Instead, they will be split into
>>          fragments of max_send_size and sent using send/receive
>>          semantics (must be >= 0, and is automatically adjusted up to at
>>          least (eager_limit+btl_rdma_pipeline_send_length); only
>>          relevant when the PUT flag is set)
>> MCA btl: parameter "btl_openib_bandwidth" (current value: "800", data
>>          source: default value)
>>          Approximate maximum bandwidth of interconnect (must be >= 1)
>> MCA btl: parameter "btl_openib_latency" (current value: "10", data
>>          source: default value)
>>          Approximate latency of interconnect (must be >= 0)
>> MCA btl: parameter "btl_openib_receive_queues" (current value:
>>          "P,128,256,192,128:S,2048,256,128,32:S,12288,256,128,32:S,65536,256,128,32",
>>          data source: default value)
>>          Colon-delimited, comma-delimited list of receive queues:
>>          P,4096,8,6,4:P,32768,8,6,4
>> MCA btl: parameter "btl_openib_if_include" (current value: , data
>>          source: default value)
>>          Comma-delimited list of devices/ports to be used (e.g.
>>          "mthca0,mthca1:2"; empty value means to use all ports found).
>>          Mutually exclusive with btl_openib_if_exclude.
>> MCA btl: param
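The dump above lists the openib BTL parameters that govern how much memory is
committed to receive queues/QPs per peer. If QP memory allocation is the
issue, the usual knob is btl_openib_receive_queues; a hedged example, reusing
the illustrative value from the parameter description above rather than a
recommended setting, would be something like:

  mpirun --mca btl_openib_receive_queues P,4096,8,6,4:P,32768,8,6,4 -np 8 ./foo

Shared receive queues (the S entries in the default value) generally consume
less memory per peer than per-peer (P) queues, so shifting queues from P to S
is one way to reduce QP memory usage.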
Re: [OMPI users] Problem with MPI_BARRIER
If barrier/time/barrier/time solves your problem in each measurement, that
means the computation above/below your barrier is not that well
"synchronized": its overhead differs from process to process. On the
2nd/3rd/... round, the times at which processes enter the barrier are too
spread out, maybe ranging over [1, 1400]. This barrier becomes a huge
overhead in your code.

Teng

> do
>
> barrier/time/barrier/time
>
> and run your code again.
>
> Teng
>
>> I will check that, but as I said in the first email, this strange
>> behaviour happens only in one place in my code.
>> I have the same time/barrier/time procedure in other places (in the same
>> code) and it works perfectly.
>>
>> At one place I have the following output (sorted):
>>
>> <00>(0) CAST GHOST DATA1 LOOP 1 barrier    1.0
>> <00>(0) CAST GHOST DATA1 LOOP 1 barrier    1.0
>> <00>(0) CAST GHOST DATA1 LOOP 1 barrier   14.2
>> <00>(0) CAST GHOST DATA1 LOOP 1 barrier   16.3
>> <00>(0) CAST GHOST DATA1 LOOP 1 barrier   25.1
>> <00>(0) CAST GHOST DATA1 LOOP 1 barrier   28.4
>> <00>(0) CAST GHOST DATA1 LOOP 1 barrier   32.6
>> <00>(0) CAST GHOST DATA1 LOOP 1 barrier   35.3
>> .
>> .
>> .
>> <00>(0) CAST GHOST DATA1 LOOP 1 barrier   90.1
>> <00>(0) CAST GHOST DATA1 LOOP 1 barrier   96.3
>> <00>(0) CAST GHOST DATA1 LOOP 1 barrier   99.5
>> <00>(0) CAST GHOST DATA1 LOOP 1 barrier  101.2
>> <00>(0) CAST GHOST DATA1 LOOP 1 barrier  119.3
>> <00>(0) CAST GHOST DATA1 LOOP 1 barrier  169.3
>>
>> but in the place that concerns me I have (sorted):
>>
>> <00>(0) CAST GHOST DATA2 LOOP 2 barrier 1386.9
>> <00>(0) CAST GHOST DATA2 LOOP 2 barrier 1401.5
>> <00>(0) CAST GHOST DATA2 LOOP 2 barrier 1412.9
>> <00>(0) CAST GHOST DATA2 LOOP 2 barrier 1414.1
>> <00>(0) CAST GHOST DATA2 LOOP 2 barrier 1419.6
>> <00>(0) CAST GHOST DATA2 LOOP 2 barrier 1428.1
>> <00>(0) CAST GHOST DATA2 LOOP 2 barrier 1430.4
>> .
>> .
>> .
>> <00>(0) CAST GHOST DATA2 LOOP 2 barrier 1632.7
>> <00>(0) CAST GHOST DATA2 LOOP 2 barrier 1635.7
>> <00>(0) CAST GHOST DATA2 LOOP 2 barrier 1660.6
>> <00>(0) CAST GHOST DATA2 LOOP 2 barrier 1685.1
>> <00>(0) CAST GHOST DATA2 LOOP 2 barrier 1699.2
>>
>> These are the same units...
>> You see that in the first place, the "time" to "hit/wait/leave" can be
>> very small compared to the last output...
>>
>> On 8 Sept 2011, at 16:35, Teng Ma wrote:
>>
>>> You'd better check process-core binding in your case. It looks to me
>>> like P0 and P1 are on the same node and P2 is on another node, which
>>> makes the ack to P0/P1 go through shared memory and the ack to P2 go
>>> through the network.
>>> 1000x is very possible: sm latency can be about 0.03 microsec;
>>> ethernet latency is about 20-30 microsec.
>>>
>>> Just my guess..
>>>
>>> Teng
>>>
>>>> Thanks,
>>>>
>>>> I understand this, but the delays that I measure are huge compared to
>>>> a classical ack procedure... (1000x more)
>>>> And this is repeatable: as far as I understand it, this shows that
>>>> the network is not involved.
>>>>
>>>> Ghislain.
>>>>
>>>> On 8 Sept 2011, at 16:16, Teng Ma wrote:
>>>>
>>>>> I guess you forget to count the "leaving time" (fan-out). When
>>>>> everyone hits the barrier, it still needs an "ack" to leave. And
>>>>> remember that in most cases the leader process will send out the
>>>>> "acks" sequentially. It's very possible that:
>>>>>
>>>>> P0 barrier time = 29 + send/recv ack 0
>>>>> P1 barrier time = 14 + send ack 0 + send/recv ack 1
>>>>> P2 barrier time = 0 + send ack 0 + send ack 1 + send/recv ack 2
>>>>>
>>>>> That's your measured time.
>>>>>
>>>>> Teng
>>>>>
>>>>>> This problem has nothing to do with stdout...
>>>>>>
>>>>>> Example with 3 processes:
>>>>>>
>>>>>> P0 hits barrier at t=12
>>>>>> P1 hits barrier at t=27
>>>>>> P2 hits barrier at t=41
>>>>>>
>>>>>> In this situation:
>>>>>> P0 waits 41-12 = 29
>>>>>> P1 waits 41-27 = 14
>>>>>> P2 waits 41-41 = 00
Re: [OMPI users] Problem with MPI_BARRIER
do

barrier/time/barrier/time

and run your code again.

Teng

> I will check that, but as I said in the first email, this strange
> behaviour happens only in one place in my code.
> I have the same time/barrier/time procedure in other places (in the same
> code) and it works perfectly.
>
> At one place I have the following output (sorted):
>
> <00>(0) CAST GHOST DATA1 LOOP 1 barrier    1.0
> <00>(0) CAST GHOST DATA1 LOOP 1 barrier    1.0
> <00>(0) CAST GHOST DATA1 LOOP 1 barrier   14.2
> <00>(0) CAST GHOST DATA1 LOOP 1 barrier   16.3
> <00>(0) CAST GHOST DATA1 LOOP 1 barrier   25.1
> <00>(0) CAST GHOST DATA1 LOOP 1 barrier   28.4
> <00>(0) CAST GHOST DATA1 LOOP 1 barrier   32.6
> <00>(0) CAST GHOST DATA1 LOOP 1 barrier   35.3
> .
> .
> .
> <00>(0) CAST GHOST DATA1 LOOP 1 barrier   90.1
> <00>(0) CAST GHOST DATA1 LOOP 1 barrier   96.3
> <00>(0) CAST GHOST DATA1 LOOP 1 barrier   99.5
> <00>(0) CAST GHOST DATA1 LOOP 1 barrier  101.2
> <00>(0) CAST GHOST DATA1 LOOP 1 barrier  119.3
> <00>(0) CAST GHOST DATA1 LOOP 1 barrier  169.3
>
> but in the place that concerns me I have (sorted):
>
> <00>(0) CAST GHOST DATA2 LOOP 2 barrier 1386.9
> <00>(0) CAST GHOST DATA2 LOOP 2 barrier 1401.5
> <00>(0) CAST GHOST DATA2 LOOP 2 barrier 1412.9
> <00>(0) CAST GHOST DATA2 LOOP 2 barrier 1414.1
> <00>(0) CAST GHOST DATA2 LOOP 2 barrier 1419.6
> <00>(0) CAST GHOST DATA2 LOOP 2 barrier 1428.1
> <00>(0) CAST GHOST DATA2 LOOP 2 barrier 1430.4
> .
> .
> .
> <00>(0) CAST GHOST DATA2 LOOP 2 barrier 1632.7
> <00>(0) CAST GHOST DATA2 LOOP 2 barrier 1635.7
> <00>(0) CAST GHOST DATA2 LOOP 2 barrier 1660.6
> <00>(0) CAST GHOST DATA2 LOOP 2 barrier 1685.1
> <00>(0) CAST GHOST DATA2 LOOP 2 barrier 1699.2
>
> These are the same units...
> You see that in the first place, the "time" to "hit/wait/leave" can be
> very small compared to the last output...
>
> On 8 Sept 2011, at 16:35, Teng Ma wrote:
>
>> You'd better check process-core binding in your case. It looks to me
>> like P0 and P1 are on the same node and P2 is on another node, which
>> makes the ack to P0/P1 go through shared memory and the ack to P2 go
>> through the network.
>> 1000x is very possible: sm latency can be about 0.03 microsec; ethernet
>> latency is about 20-30 microsec.
>>
>> Just my guess..
>>
>> Teng
>>
>>> Thanks,
>>>
>>> I understand this, but the delays that I measure are huge compared to
>>> a classical ack procedure... (1000x more)
>>> And this is repeatable: as far as I understand it, this shows that the
>>> network is not involved.
>>>
>>> Ghislain.
>>>
>>> On 8 Sept 2011, at 16:16, Teng Ma wrote:
>>>
>>>> I guess you forget to count the "leaving time" (fan-out). When
>>>> everyone hits the barrier, it still needs an "ack" to leave. And
>>>> remember that in most cases the leader process will send out the
>>>> "acks" sequentially. It's very possible that:
>>>>
>>>> P0 barrier time = 29 + send/recv ack 0
>>>> P1 barrier time = 14 + send ack 0 + send/recv ack 1
>>>> P2 barrier time = 0 + send ack 0 + send ack 1 + send/recv ack 2
>>>>
>>>> That's your measured time.
>>>>
>>>> Teng
>>>>
>>>>> This problem has nothing to do with stdout...
>>>>>
>>>>> Example with 3 processes:
>>>>>
>>>>> P0 hits barrier at t=12
>>>>> P1 hits barrier at t=27
>>>>> P2 hits barrier at t=41
>>>>>
>>>>> In this situation:
>>>>> P0 waits 41-12 = 29
>>>>> P1 waits 41-27 = 14
>>>>> P2 waits 41-41 = 00
>>>>>
>>>>> So I should see something like (no ordering is expected):
>>>>> barrier_time = 14
>>>>> barrier_time = 00
>>>>> barrier_time = 29
>>>>>
>>>>> But what I see is much more like
>>>>> barrier_time = 22
>>>>> barrier_time = 29
>>>>> barrier_time = 25
>>>>>
>>>>> See? No process has a barrier_time equal to zero !!!
>>>>>
>>>>> On 8 Sept 2011, at 14:55, Jeff Squyres wrote:
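For reference, here is a minimal sketch of the barrier/time/barrier/time
measurement Teng suggests (my own illustration; the program and label are
made up):

/* barrier_probe.c -- time only the barrier itself: synchronize first, then measure */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* ... the (possibly imbalanced) application work would go here ... */

    MPI_Barrier(MPI_COMM_WORLD);   /* 1st barrier: absorb the arrival imbalance */
    double t0 = MPI_Wtime();
    MPI_Barrier(MPI_COMM_WORLD);   /* 2nd barrier: this is the one being timed  */
    double t1 = MPI_Wtime();

    printf("rank %d: barrier %10.1f\n", rank, (t1 - t0) * 1.0e6);  /* microseconds */

    MPI_Finalize();
    return 0;
}

If the second barrier is cheap while the original single measurement stays
around 1400, the cost being measured is arrival imbalance, not the barrier
itself.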
Re: [OMPI users] Problem with MPI_BARRIER
You'd better check process-core binding in your case. It looks to me like P0
and P1 are on the same node and P2 is on another node, which makes the ack to
P0/P1 go through shared memory and the ack to P2 go through the network.
1000x is very possible: sm latency can be about 0.03 microsec; ethernet
latency is about 20-30 microsec.

Just my guess..

Teng

> Thanks,
>
> I understand this, but the delays that I measure are huge compared to a
> classical ack procedure... (1000x more)
> And this is repeatable: as far as I understand it, this shows that the
> network is not involved.
>
> Ghislain.
>
> On 8 Sept 2011, at 16:16, Teng Ma wrote:
>
>> I guess you forget to count the "leaving time" (fan-out). When everyone
>> hits the barrier, it still needs an "ack" to leave. And remember that in
>> most cases the leader process will send out the "acks" sequentially.
>> It's very possible that:
>>
>> P0 barrier time = 29 + send/recv ack 0
>> P1 barrier time = 14 + send ack 0 + send/recv ack 1
>> P2 barrier time = 0 + send ack 0 + send ack 1 + send/recv ack 2
>>
>> That's your measured time.
>>
>> Teng
>>
>>> This problem has nothing to do with stdout...
>>>
>>> Example with 3 processes:
>>>
>>> P0 hits barrier at t=12
>>> P1 hits barrier at t=27
>>> P2 hits barrier at t=41
>>>
>>> In this situation:
>>> P0 waits 41-12 = 29
>>> P1 waits 41-27 = 14
>>> P2 waits 41-41 = 00
>>>
>>> So I should see something like (no ordering is expected):
>>> barrier_time = 14
>>> barrier_time = 00
>>> barrier_time = 29
>>>
>>> But what I see is much more like
>>> barrier_time = 22
>>> barrier_time = 29
>>> barrier_time = 25
>>>
>>> See? No process has a barrier_time equal to zero !!!
>>>
>>> On 8 Sept 2011, at 14:55, Jeff Squyres wrote:
>>>
>>>> The order in which you see stdout printed from mpirun is not
>>>> necessarily reflective of the order in which things were actually
>>>> printed. Remember that the stdout from each MPI process needs to flow
>>>> through at least 3 processes and potentially across the network
>>>> before it is actually displayed on mpirun's stdout:
>>>>
>>>> MPI process -> local Open MPI daemon -> mpirun -> printed to mpirun's
>>>> stdout
>>>>
>>>> Hence, the ordering of stdout can get transposed.
>>>>
>>>> On Sep 8, 2011, at 8:49 AM, Ghislain Lartigue wrote:
>>>>
>>>>> Thank you for this explanation, but indeed this confirms that the
>>>>> LAST process that hits the barrier should go through nearly
>>>>> instantaneously (except for the broadcast time of the acknowledgment
>>>>> signal).
>>>>> And this is not what happens in my code: EVERY process waits for a
>>>>> very long time before going through the barrier (thousands of times
>>>>> longer than a broadcast)...
>>>>>
>>>>> On 8 Sept 2011, at 14:26, Jeff Squyres wrote:
>>>>>
>>>>>> The order in which processes hit the barrier is only one factor in
>>>>>> the time it takes for a process to finish the barrier.
>>>>>>
>>>>>> An easy way to think of a barrier implementation is a "fan in/fan
>>>>>> out" model. When each nonzero-rank process calls MPI_BARRIER, it
>>>>>> sends a message saying "I have hit the barrier!" (it usually sends
>>>>>> it to its parent in a tree of all MPI processes in the
>>>>>> communicator, but you can simplify this model and consider that it
>>>>>> sends it to rank 0). Rank 0 collects all of these messages. When it
>>>>>> has messages from all processes in the communicator, it sends out
>>>>>> "ok, you can leave the barrier now" messages (again, it's usually
>>>>>> via a tree distribution, but you can pretend that it directly,
>>>>>> linearly sends a message to each peer process in the communicator).
>>>>>>
>>>>>> Hence, the time that any individual process spends in the
>>>>>> communicator is relative to when every other process ent
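To make the fan-in/fan-out model above concrete, here is a hedged sketch of
the simplified (linear, non-tree) variant Jeff describes: rank 0 collects one
message from every other rank, then releases everyone. This is only an
illustration of the model, not Open MPI's actual implementation:

/* linear_barrier.c -- illustrative linear fan-in/fan-out barrier */
#include <mpi.h>

static void linear_barrier(MPI_Comm comm)
{
    int rank, size, token = 0;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    if (rank == 0) {
        /* fan-in: wait for "I have hit the barrier!" from every other rank */
        for (int peer = 1; peer < size; peer++)
            MPI_Recv(&token, 1, MPI_INT, peer, 0, comm, MPI_STATUS_IGNORE);
        /* fan-out: tell everyone "ok, you can leave the barrier now" */
        for (int peer = 1; peer < size; peer++)
            MPI_Send(&token, 1, MPI_INT, peer, 1, comm);
    } else {
        MPI_Send(&token, 1, MPI_INT, 0, 0, comm);                     /* fan-in  */
        MPI_Recv(&token, 1, MPI_INT, 0, 1, comm, MPI_STATUS_IGNORE);  /* fan-out */
    }
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    linear_barrier(MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}

Because the release messages go out one peer at a time, even the last process
to arrive still pays for the fan-out, which is why no rank ever measures a
barrier time of exactly zero.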
Re: [OMPI users] Problem with MPI_BARRIER
>>>>> PS:
>>>>> This small code behaves perfectly in other parts of my code...

--
| Teng Ma                        Univ. of Tennessee |
| t...@cs.utk.edu                 Knoxville, TN      |
| http://web.eecs.utk.edu/~tma/                     |