Re: [OMPI devel] kernel 2.6.23 vs 2.6.24 - communication/wait times
On Apr 22, 2010, at 10:08 AM, Rainer Keller wrote:
> Hello Oliver, thanks for the update. Just my $0.02: the upcoming Open MPI v1.5 will warn users if their session directory is on NFS (or Lustre).

... or panfs :-)

Samuel K. Gutierrez

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] kernel 2.6.23 vs 2.6.24 - communication/wait times
Hello Oliver,
thanks for the update. Just my $0.02: the upcoming Open MPI v1.5 will warn users if their session directory is on NFS (or Lustre).

Best regards,
Rainer

On Thursday 22 April 2010 11:37:48 am Oliver Geisler wrote:
> To sum up and give an update:
> [...]
> Thank you guys for your help.
> oli

--
Rainer Keller, PhD        Tel: +1 (865) 241-6293
Oak Ridge National Lab    Fax: +1 (865) 241-4811
PO Box 2008 MS 6164       Email: kel...@ornl.gov
Oak Ridge, TN 37831-2008  AIM/Skype: rusraink
Re: [OMPI devel] kernel 2.6.23 vs 2.6.24 - communication/wait times
Oliver,

Thank you for this summary insight. This substantially affects the structural design of software implementations, which points to a new analysis "opportunity" in our software.

Ken Lloyd

-----Original Message-----
From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org] On Behalf Of Oliver Geisler
Sent: Thursday, April 22, 2010 9:38 AM
To: Open MPI Developers
Subject: Re: [OMPI devel] kernel 2.6.23 vs 2.6.24 - communication/wait times

> To sum up and give an update:
> [...]

--
This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean.
Re: [OMPI devel] kernel 2.6.23 vs 2.6.24 - communication/wait times
To sum up and give an update:

The extended communication times seen with shared-memory communication between openmpi processes are caused by the openmpi session directory lying on the network via NFS.

The problem is resolved by setting up a ramdisk or mounting a tmpfs on each diskless node. By setting the MCA parameter orte_tmpdir_base to point at that mountpoint, shared-memory communication and its backing files are kept local, decreasing the communication times by orders of magnitude.

The relation of the problem to the kernel version is not really resolved, but it is perhaps not "the problem" in this respect. My benchmark now runs fine on a single node with 4 CPUs, kernel 2.6.33.1 and openmpi 1.4.1. Running on multiple nodes I still see higher (TCP) communication times than I would expect, but that requires some deeper research (e.g. collisions on the network) and should probably be posted to a new thread.

Thank you guys for your help.

oli
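The fix summarized above boils down to a few commands on each diskless node. This is only a sketch: the mountpoint name and the 256m size are illustrative and not taken from the thread; orte_tmpdir_base is the parameter named above.

```shell
# Sketch of the fix described in this summary, run on a diskless node.
# Mountpoint name and size are illustrative assumptions.
mkdir -p /mnt/ompi-tmp
mount -t tmpfs -o size=256m tmpfs /mnt/ompi-tmp

# Point Open MPI's session directory at the tmpfs. Any of the usual
# MCA parameter mechanisms works; pick one:
mpirun --mca orte_tmpdir_base /mnt/ompi-tmp -np 4 ./my_app      # command line
export OMPI_MCA_orte_tmpdir_base=/mnt/ompi-tmp                  # environment
echo "orte_tmpdir_base = /mnt/ompi-tmp" \
  >> /etc/openmpi/openmpi-mca-params.conf                       # param file
```

With any of these, the shared-memory backing files land in local RAM instead of going over NFS.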
Re: [OMPI devel] kernel 2.6.23 vs 2.6.24 - communication/wait times
On Monday 19 April 2010, Oliver Geisler wrote:
> How do I set the OMPI session directory to point it to a ramdisk?
>
> And another question:
> What would be a good size for the ram disk? One general value was
> supposed by the FAQ with 128MB, but what is your experience?

I just wanted to add that space not used on a tmpfs (mount -t tmpfs ...) is not wasted. You can have an 8G tmpfs mounted, but if you only use 100M, that's how much RAM it uses.

/Peter
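Peter's point about lazy allocation can be seen directly with df. A sketch, assuming root privileges; the mountpoint and sizes are made up for the example.

```shell
# tmpfs only consumes RAM for what is actually stored in it.
# Mountpoint and sizes here are illustrative.
mkdir -p /mnt/big
mount -t tmpfs -o size=8g tmpfs /mnt/big
df -h /mnt/big    # Size 8.0G, Used 0 -- no RAM committed yet

dd if=/dev/zero of=/mnt/big/file bs=1M count=100
df -h /mnt/big    # Used ~100M -- only what was actually written
```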
Re: [OMPI devel] kernel 2.6.23 vs 2.6.24 - communication/wait times
Ralph Castain wrote:
> On Apr 19, 2010, at 9:12 AM, Oliver Geisler wrote:
>> How do I set the OMPI session directory to point it to a ramdisk?
>
> Set the MCA param orte_tmpdir_base to point at the ramdisk using any of the MCA parameter methods (cmd line, envar, default mca param file). I'll add that to http://www.open-mpi.org/faq/?category=sm#where-sm-file
>
>> And another question:
>> What would be a good size for the ram disk?
>
> I don't recall the default shared memory size per process, but you can get that value from ompi_info --param btl sm. Take that value, multiply by your expected ppn, and then give yourself a fudge factor.

Sizing proportionately to the number of processes was a poor heuristic, and starting in 1.3.2 we don't employ it any more. In all likelihood, the default size of the shared-memory backing file will be set by mpool_sm_min_size: 64 Mbytes. Try "ompi_info --param mpool sm". There's some other stuff in addition to this backing file, so you need a little fudge factor. Probably 128 MB will be enough for the shared-memory stuff.
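The sizing advice above reduces to simple arithmetic. A sketch, assuming the 64 MB mpool_sm_min_size default cited in the thread and a 2x fudge factor (the factor itself is an assumption, not from the thread):

```shell
# Back-of-envelope ramdisk sizing per the note above.
# backing_mb: assumed 64 MB mpool_sm_min_size default cited in the thread.
# fudge: assumed 2x headroom for the other session-directory contents.
backing_mb=64
fudge=2
ramdisk_mb=$(( backing_mb * fudge ))
echo "${ramdisk_mb} MB"    # prints "128 MB"
```

This lands on the same 128 MB figure the FAQ suggests.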
Re: [OMPI devel] kernel 2.6.23 vs 2.6.24 - communication/wait times
On Apr 19, 2010, at 9:12 AM, Oliver Geisler wrote:
> I just browsed FAQ and "ompi_info --param all all", but didn't find the answer:
> How do I set the OMPI session directory to point it to a ramdisk?

Set the MCA param orte_tmpdir_base to point at the ramdisk using any of the MCA parameter methods (cmd line, envar, default mca param file).

> And another question:
> What would be a good size for the ram disk? One general value was
> supposed by the FAQ with 128MB, but what is your experience?

I don't recall the default shared memory size per process, but you can get that value from ompi_info --param btl sm. Take that value, multiply by your expected ppn, and then give yourself a fudge factor.
Re: [OMPI devel] kernel 2.6.23 vs 2.6.24 - communication/wait times
> Ah, that could do it. Open MPI's shared memory files are under /tmp. So if /tmp is NFS, you could get extremely high latencies because of dirty page writes out through NFS.
>
> You don't necessarily have to make /tmp disk-full -- if you just make OMPI's session directories go into a ramdisk instead of to NFS, that should also be sufficient.

I just browsed the FAQ and "ompi_info --param all all", but didn't find the answer:
How do I set the OMPI session directory to point it to a ramdisk?

And another question:
What would be a good size for the ramdisk? One general value suggested by the FAQ is 128MB, but what is your experience?
(maybe a large topic by itself, so I have to try it out, I guess)

Thanks a lot.

Oli
Re: [OMPI devel] kernel 2.6.23 vs 2.6.24 - communication/wait times
On Apr 12, 2010, at 11:10 AM, Oliver Geisler wrote:
>> Is the /tmp filesystem on NFS by any chance?
>
> Yes, /tmp is on NFS .. those are diskless nodes all without disks and no swap space mounted.

Ah, that could do it. Open MPI's shared memory files are under /tmp. So if /tmp is NFS, you could get extremely high latencies because of dirty page writes out through NFS.

You don't necessarily have to make /tmp disk-full -- if you just make OMPI's session directories go into a ramdisk instead of to NFS, that should also be sufficient.

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
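The dirty-page theory above can be sanity-checked outside MPI by timing small synchronous writes to the NFS-backed /tmp versus a RAM-backed filesystem. A sketch: the assumption here is that /dev/shm exists and is tmpfs (true on most Linux distributions); adjust the paths for your nodes.

```shell
# Rough check of the dirty-page theory, independent of MPI: time small
# fsync'd writes to NFS-backed /tmp vs. a RAM-backed filesystem.
# Assumes /dev/shm is a tmpfs mount; adjust paths as needed.
time dd if=/dev/zero of=/tmp/ddtest bs=4k count=1000 conv=fsync
time dd if=/dev/zero of=/dev/shm/ddtest bs=4k count=1000 conv=fsync
rm -f /tmp/ddtest /dev/shm/ddtest
```

If the first dd is orders of magnitude slower, the shared-memory backing file under /tmp would suffer the same way.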
Re: [OMPI devel] kernel 2.6.23 vs 2.6.24 - communication/wait times
In that scenario, you need to set the session directories to point somewhere other than /tmp. I believe you will find that in our FAQs, as this has been a recurring problem. The shared memory backing file resides in the session directory tree, so if that is NFS-mounted, your performance will stink. People with that setup generally point the session dir at a ramdisk area, but anywhere in RAM will do.

On Apr 12, 2010, at 9:10 AM, Oliver Geisler wrote:
>> Is the /tmp filesystem on NFS by any chance?
>
> Yes, /tmp is on NFS .. those are diskless nodes all without disks and no swap space mounted.
> [...]
Re: [OMPI devel] kernel 2.6.23 vs 2.6.24 - communication/wait times
Quoting Ashley Pittman:

>> Why is shared-memory performance about four orders of magnitude slower than it should be? The processes are communicating via memory that's shared by having the processes all mmap the same file into their address spaces. Is it possible that with the newer kernels, operations to that shared file are going all the way out to disk?
>
> Is the /tmp filesystem on NFS by any chance?

Yes, /tmp is on NFS .. those are diskless nodes, all without disks and no swap space mounted.

Maybe I should set up one of the nodes with a disk, so I could try the difference.

(Sorry, but I may return results next week, since I am out of office right now)

Thanks
oli
Re: [OMPI devel] kernel 2.6.23 vs 2.6.24 - communication/wait times
On 10/04/10 06:59, Oliver Geisler wrote:
> This is the results of skampi pt2pt, first with shared memory allowed,
> second shared memory excluded.

For what it's worth, I can't replicate those results on an AMD Shanghai cluster running a 2.6.32 kernel and Open-MPI 1.4.1. Here is what I see (run under Torque, selecting 2 cores on the same node, so no need to specify -np):

$ mpirun --mca btl self,sm,tcp ./skampi -i ski/skampi_pt2pt.ski
# begin result "Pingpong_Send_Recv"
count=    1      4    2.0  0.0  16    2.0    1.8
count=    2      8    2.1  0.0  16    2.1    1.8
count=    3     12    2.1  0.1   8    2.0    2.0
count=    4     16    2.1  0.1   8    2.0    2.0
count=    6     24    2.0  0.0  16    2.0    1.8
count=    8     32    2.9  0.0  16    2.7    2.4
count=   11     44    2.3  0.1  16    2.2    2.0
count=   16     64    2.2  0.1  16    2.1    2.0
count=   23     92    2.7  0.2  16    2.6    2.1
count=   32    128    2.5  0.1  16    2.5    2.1
count=   45    180    3.0  0.0  16    2.8    2.6
count=   64    256    3.1  0.0   8    3.0    2.5
count=   91    364    3.1  0.0   8    3.0    3.0
count=  128    512    3.4  0.2  16    3.3    3.0
count=  181    724    4.1  0.0  16    4.0    4.1
count=  256   1024    5.0  0.0   8    4.5    4.5
count=  362   1448    6.0  0.0  16    5.8    5.7
count=  512   2048    7.7  0.1  16    7.3    7.6
count=  724   2896   10.0  0.0   8   10.0    9.8
count= 1024   4096   12.3  0.1  16   12.1   12.0
count= 1448   5792   13.8  0.2   8   13.5   13.4
count= 2048   8192   18.1  0.0  16   17.9   18.1
count= 2896  11584   25.0  0.0  16   24.9   25.0
count= 4096  16384   34.2  0.1  16   34.0   34.2
# end result "Pingpong_Send_Recv"
# duration = 0.00 sec

$ mpirun --mca btl tcp,self ./skampi -i ski/skampi_pt2pt.ski
# begin result "Pingpong_Send_Recv"
count=    1      4   21.2  1.0  16   20.1   17.8
count=    2      8   20.8  1.0  16   20.6   16.7
count=    3     12   20.2  0.9  16   19.0   17.1
count=    4     16   19.9  1.0  16   19.0   17.0
count=    6     24   21.1  1.1  16   20.6   17.0
count=    8     32   20.0  1.0  16   18.8   17.1
count=   11     44   20.9  0.8  16   20.0   17.1
count=   16     64   21.7  1.1  16   20.5   17.6
count=   23     92   21.7  1.0  16   20.0   18.5
count=   32    128   21.6  1.0  16   20.5   18.5
count=   45    180   22.0  1.0  16   20.9   19.0
count=   64    256   21.8  0.7  16   20.5   20.2
count=   91    364   20.5  0.3  16   19.8   19.1
count=  128    512   18.5  0.3   8   17.5   18.1
count=  181    724   19.3  0.2   8   19.1   19.0
count=  256   1024   20.3  0.3  16   19.7   20.0
count=  362   1448   22.1  0.3  16   21.2   21.4
count=  512   2048   24.2  0.3  16   23.7   23.2
count=  724   2896   24.8  0.5   8   24.0   24.0
count= 1024   4096   26.8  0.2  16   26.1   26.3
count= 1448   5792   31.6  0.3  16   30.4   31.5
count= 2048   8192   38.0  0.6  16   37.3   37.1
count= 2896  11584   52.1  1.4  16   49.1   50.8
count= 4096  16384   93.8  1.1  16   81.1   91.5
# end result "Pingpong_Send_Recv"
# duration = 0.02 sec

cheers,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
Re: [OMPI devel] kernel 2.6.23 vs 2.6.24 - communication/wait times
On 10/04/10 15:12, Bogdan Costescu wrote:
> Have there been any process scheduler changes in the newer kernels?

Are there ever kernels where that doesn't get tweaked? ;-)

> I'm not sure that they could explain four orders of magnitude differences though...

One idea that comes to mind would be to run the child processes under strace -c, as that will monitor all the system calls and report how long is spent in which. By running a comparison with 2.6.23 and 2.6.24, you might get a pointer to which syscall(s) are taking longer.

Alternatively, if you want to get fancy, you could do a git bisection between 2.6.23 and 2.6.24 to track down the commit that introduced the regression.

To be honest, it'd be interesting to see whether the issue still manifests on a recent kernel; if so, then perhaps we might be able to get the kernel developers interested (though they will likely ask for a bisection too).

cheers!
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
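The two diagnostics suggested above look roughly like this; the benchmark command and the kernel-tree path are placeholders, and the v2.6.23/v2.6.24 tags are the known-good/known-bad releases from this thread.

```shell
# Per-syscall timing of the benchmark (run once on each kernel version
# and diff the summaries). "./benchmark" is a placeholder name.
strace -c -f -o syscalls.txt mpirun -np 4 ./benchmark

# Bisecting the kernel between the known-good and known-bad releases.
cd linux                 # placeholder path to a kernel git tree
git bisect start
git bisect bad v2.6.24
git bisect good v2.6.23
# ...build, boot, and test each candidate the bisection checks out,
# then mark it and repeat until the offending commit is found:
git bisect good          # or: git bisect bad
```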
Re: [OMPI devel] kernel 2.6.23 vs 2.6.24 - communication/wait times
On 10 Apr 2010, at 04:51, Eugene Loh wrote: > Why is shared-memory performance about four orders of magnitude slower than > it should be? The processes are communicating via memory that's shared by > having the processes all mmap the same file into their address spaces. Is it > possible that with the newer kernels, operations to that shared file are > going all the way out to disk? Maybe you don't know the answer, but > hopefully someone on this mail list can provide some insight. Is the /tmp filesystem on NFS by any chance? Ashley, -- Ashley Pittman, Bath, UK. Padb - A parallel job inspection tool for cluster computing http://padb.pittman.org.uk
Re: [OMPI devel] kernel 2.6.23 vs 2.6.24 - communication/wait times
Bogdan Costescu wrote:
> On Sat, Apr 10, 2010 at 5:51 AM, Eugene Loh wrote:
>> Why is shared-memory performance about four orders of magnitude slower than it should be?
>
> Have there been any process scheduler changes in the newer kernels? I'm not sure that they could explain four orders of magnitude differences though...

Plus, the TCP numbers seem okay. So, it doesn't feel like that or process-binding issues.
Re: [OMPI devel] kernel 2.6.23 vs 2.6.24 - communication/wait times
On Sat, Apr 10, 2010 at 5:51 AM, Eugene Loh wrote: > Why is shared-memory performance about four orders of magnitude slower than > it should be? Have there been any process scheduler changes in the newer kernels ? I'm not sure that they could explain four orders of magnitude differences though... Cheers, Bogdan
Re: [OMPI devel] kernel 2.6.23 vs 2.6.24 - communication/wait times
Oliver Geisler wrote:
> This is the results of skampi pt2pt, first with shared memory allowed, second shared memory excluded.

Thanks for the data. The TCP results are not very interesting... they look reasonable. The shared-memory data is rather straightforward: the results are just plain ridiculously bad. The results for "eager" messages (messages shorter than 4 Kbytes) are around 12 millisec. The results for "rendezvous" messages (longer than 4 Kbytes: signal the receiver, wait for an acknowledgement, then send the message) are about 30 millisec. I was also curious about "long-message bandwidth", but since SKaMPI is only going up to 16 Kbyte messages, we can't really tell.

But maybe all that is irrelevant. Why is shared-memory performance about four orders of magnitude slower than it should be? The processes are communicating via memory that's shared by having the processes all mmap the same file into their address spaces. Is it possible that with the newer kernels, operations to that shared file are going all the way out to disk? Maybe you don't know the answer, but hopefully someone on this mail list can provide some insight.
Re: [OMPI devel] kernel 2.6.23 vs 2.6.24 - communication/wait times
Sorry for replying late. Unfortunately I am not a "full time administrator", and I am going to a conference next week, so please be patient with my replies.

On 4/7/2010 6:56 PM, Eugene Loh wrote:
> Oliver Geisler wrote:
>> Using netpipe and comparing tcp and mpi communication I get the following results:
>> TCP is much faster than MPI, approx. by factor 12
>
> Faster? 12x? I don't understand the following:
>
>> e.g a packet size of 4096 bytes deliveres in
>> 97.11 usec with NPtcp and
>> 15338.98 usec with NPmpi
>
> This implies NPtcp is 160x faster than NPmpi.

The ratio of the NPmpi to the NPtcp times has a mean value of about 60 for small packet sizes (<4kB), a maximum of 160 at 4kB (it was a bad value to pick out in the first place), then drops to 40 for packet sizes of about 16kB and further below 20 for packets larger than 100kB.

>> or
>> packet size 262kb
>> 0.05268801 sec NPtcp
>> 0.00254560 sec NPmpi
>
> This implies NPtcp is 20x slower than NPmpi.

Sorry, my fault ... vice versa, it should read:

packet size 262kb
0.00254560 sec NPtcp
0.05268801 sec NPmpi

>> Further our benchmark started with "--mca btl tcp,self" runs with short communication times, even using kernel 2.6.33.1
>> Is there a way to see what type of communication is actually selected?
>> Can anybody imagine why shared memory leads to these problems?
>
> Okay, so it's a shared-memory performance problem since:
>
> 1) You get better performance when you exclude sm explicitly with "--mca btl tcp,self".
> 2) You get better performance when you exclude sm by distributing one process per node (an observation you made relatively early in this thread).
> 3) TCP is faster than MPI (which is presumably using sm).
>
> Can you run a pingpong test as a function of message length for two processes in a way that demonstrates the problem?
> For example, if you're comfortable with SKaMPI, just look at Pingpong_Send_Recv and let's see what performance looks like as a function of message length. Maybe this is a short-message-latency problem.

These are the results of skampi pt2pt, first with shared memory allowed, second with shared memory excluded. It doesn't look to me as if the long message times are related to short messages. Including hosts over ethernet results in higher communication times, which are equal to those when I ping the host (a hundred+ milliseconds).

mpirun --mca btl self,sm,tcp -np 2 ./skampi -i ski/skampi_pt2pt.ski
# begin result "Pingpong_Send_Recv"
count=    1      4  12756.0  307.4  16  11555.3  11011.2
count=    2      8   9902.8  629.0  16   9615.4   8601.0
count=    3     12  12547.5  881.0  16  12233.1  11229.2
count=    4     16  12087.2  829.6  16  11610.6  10478.6
count=    6     24  13634.4  352.1  16  11247.8  12621.9
count=    8     32  13835.8  282.2  16  11091.7  12944.6
count=   11     44  13328.9  864.6  16  12095.6  11977.0
count=   16     64  13195.2  432.3  16  11460.4  10051.9
count=   23     92  13849.3  532.5  16  12476.9  12998.1
count=   32    128  14202.2  436.4  16  11923.8  12977.4
count=   45    180  14026.3  637.7  16  13042.5  12767.8
count=   64    256  13475.8  466.7  16  11720.4  12521.3
count=   91    364  14015.0  406.1  16  13300.4  12881.6
count=  128    512  13481.3  870.6  16  11187.7  12070.6
count=  181    724  10697.1   98.4  16  10697.1   9520.1
count=  256   1024  14120.8  602.1  16  13988.2  11349.9
count=  362   1448  15718.2  582.3  16  14468.2  12535.2
count=  512   2048  11214.9  749.1  16  11155.0   9928.5
count=  724   2896  15127.3  186.1  16  15127.3  10974.9
count= 1024   4096  34045.0  692.2  16  32963.6  31728.1
count= 1448   5792  29965.9  788.1  16  27997.8  27404.4
count= 2048   8192  30082.1  785.3  16  28023.9  29538.5
count= 2896  11584  32556.0  219.4  16  29312.2  32290.4
count= 4096  16384  24999.8  839.6  16  23422.0  23644.6
# end result "Pingpong_Send_Recv"
# duration = 10.15 sec

mpirun --mca btl tcp,self -np 2 ./skampi -i ski/skampi_pt2pt.ski
# begin result "Pingpong_Send_Recv"
count=    1      4   14.5  0.3  16   13.5   13.2
count=    2      8   13.5  0.2   8   12.9   12.4
count=    3     12   13.1  0.4  16   12.7   11.3
count=    4     16   13.9  0.4  16   12.7   13.0
count=    6     24   13.8  0.4  16   12.5   12.8
count=    8     32   13.8  0.4  16   12.7   13.0
count=   11     44   14.0  0.3  16   12.8   13.0
count=   16     64   13.5  0.5  16   12.3   12.4
count=   23     92   13.9  0.4  16   13.1   12.7
count=   32    128   14.8  0.1  16
Re: [OMPI devel] kernel 2.6.23 vs 2.6.24 - communication/wait times
Oliver Geisler wrote:
> Using netpipe and comparing tcp and mpi communication I get the following results:
> TCP is much faster than MPI, approx. by factor 12

Faster? 12x? I don't understand the following:

> e.g a packet size of 4096 bytes deliveres in
> 97.11 usec with NPtcp and
> 15338.98 usec with NPmpi

This implies NPtcp is 160x faster than NPmpi.

> or
> packet size 262kb
> 0.05268801 sec NPtcp
> 0.00254560 sec NPmpi

This implies NPtcp is 20x slower than NPmpi.

> Further our benchmark started with "--mca btl tcp,self" runs with short communication times, even using kernel 2.6.33.1
> Is there a way to see what type of communication is actually selected?
> Can anybody imagine why shared memory leads to these problems?

Okay, so it's a shared-memory performance problem since:

1) You get better performance when you exclude sm explicitly with "--mca btl tcp,self".
2) You get better performance when you exclude sm by distributing one process per node (an observation you made relatively early in this thread).
3) TCP is faster than MPI (which is presumably using sm).

Can you run a pingpong test as a function of message length for two processes in a way that demonstrates the problem? For example, if you're comfortable with SKaMPI, just look at Pingpong_Send_Recv and let's see what performance looks like as a function of message length. Maybe this is a short-message-latency problem.
Re: [OMPI devel] kernel 2.6.23 vs 2.6.24 - communication/wait times
On 4/6/2010 5:09 PM, Jeff Squyres wrote:
> On Apr 6, 2010, at 6:04 PM, Oliver Geisler wrote:
>> Further our benchmark started with "--mca btl tcp,self" runs with short communication times, even using kernel 2.6.33.1
>
> I'm not sure what this statement means (^^). Can you explain?

In the first place, we witnessed the problem when upgrading our hardware and thus had to upgrade the running kernel version in order to get the network cards running. I used a typical application that we use on the cluster (an in-house development) to benchmark old vs. new hardware. There I witnessed a performance drop instead of the expected increase.

Searching for the loss of performance, we figured out that the pure computation time on each data packet meets the expected increase due to the accelerated hardware, but communication times between the master and the slave processes increased largely. Furthermore, we narrowed the problem down to kernel versions greater than 2.6.23 (which we could not use, because the network cards aren't supported yet).

Now that I run the program with the mpirun option "--mca btl tcp,self", I achieve shortened communication times (and overall completion times as expected), even running on a new node with kernel version 2.6.33.1.

>> Is there a way to see what type of communication is actually selected?
>
> If "--mca btl tcp,self" is used, then TCP sockets are used for non-self communications (i.e., communications with peer MPI processes, regardless of location).
>
>> Can anybody imagine why shared memory leads to these problems?
>
> I'm not sure I understand this -- if "--mca btl tcp,self", shared memory is not used...?

When I use "--mca btl sm,self", I get the issue, so my guess is it has something to do with shared memory?

> re-reading your email, I'm wondering: did you run the NPmpi process with "--mca btl tcp,sm,self" (or no --mca btl param)? That might explain some of my confusion, above.
I ran NPmpi without an explicit mca-btl option, which should default to /etc/openmpi/openmpi-mca-params.conf with btl = self,sm,tcp

--
Oliver Geisler
TERRASYS Geophysics, Houston, TX
www.terrasysgeo.com
Re: [OMPI devel] kernel 2.6.23 vs 2.6.24 - communication/wait times
On Apr 6, 2010, at 6:04 PM, Oliver Geisler wrote:
> Using netpipe and comparing tcp and mpi communication I get the following results:
>
> TCP is much faster than MPI, approx. by factor 12
> e.g a packet size of 4096 bytes deliveres in
> 97.11 usec with NPtcp and
> 15338.98 usec with NPmpi
> or
> packet size 262kb
> 0.05268801 sec NPtcp
> 0.00254560 sec NPmpi

Well that's not good (for us). :-\

> Further our benchmark started with "--mca btl tcp,self" runs with short communication times, even using kernel 2.6.33.1

I'm not sure what this statement means (^^). Can you explain?

> Is there a way to see what type of communication is actually selected?

If "--mca btl tcp,self" is used, then TCP sockets are used for non-self communications (i.e., communications with peer MPI processes, regardless of location).

> Can anybody imagine why shared memory leads to these problems?

I'm not sure I understand this -- if "--mca btl tcp,self", shared memory is not used...?

re-reading your email, I'm wondering: did you run the NPmpi process with "--mca btl tcp,sm,self" (or no --mca btl param)? That might explain some of my confusion, above.

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI devel] kernel 2.6.23 vs 2.6.24 - communication/wait times
On 4/6/2010 2:54 PM, Jeff Squyres wrote:
> Sorry for the delay -- I just replied on the user list -- I think the
> first thing to do is to establish baseline networking performance and
> see if that is out of whack. If the underlying network is bad, then MPI
> performance will also be bad.

Using NetPipe to compare TCP and MPI communication, I get the following
results: TCP is much faster than MPI, approx. by a factor of 12. E.g., a
packet size of 4096 bytes is delivered in 97.11 usec with NPtcp but in
15338.98 usec with NPmpi; for a packet size of 262 kb: 0.05268801 sec
with NPtcp, 0.00254560 sec with NPmpi.

Further, our benchmark started with "--mca btl tcp,self" runs with short
communication times, even using kernel 2.6.33.1.

Is there a way to see what type of communication is actually selected?
Can anybody imagine why shared memory leads to these problems? Kernel
configuration?

Thanks, Jeff, for insisting upon testing network performance. Thanks to
all the others, too ;-)

oli

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel
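The NetPipe comparison above can be sketched as command lines. This is a
rough illustration, not taken from the thread: the host names are
placeholders, and the block only prints the commands, since an actual run
needs NetPipe built for both TCP and MPI on the nodes involved.

```shell
# Hypothetical hosts; NPtcp/NPmpi must be installed on both for a real run.
RECEIVER=node01
SENDER=node02

# Raw TCP: start "NPtcp" with no arguments on $RECEIVER first, then run
# the transmitter side on $SENDER:
echo "NPtcp -h $RECEIVER"

# The same benchmark over MPI, giving an apples-to-apples comparison:
echo "mpirun -np 2 --host $RECEIVER,$SENDER NPmpi"
```

Because NetPipe uses the same measurement loop for both transports, large
gaps between the two outputs point at the transport layer rather than the
benchmark itself.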
Re: [OMPI devel] kernel 2.6.23 vs 2.6.24 - communication/wait times
On Apr 6, 2010, at 4:29 PM, Oliver Geisler wrote:
> > Sorry for the delay -- I just replied on the user list -- I think the
> > first thing to do is to establish baseline networking performance and
> > see if that is out of whack. If the underlying network is bad, then
> > MPI performance will also be bad.
>
> Could make sense. With kernel 2.6.24 it seems a major change in the
> modules for Intel PCI-Express network cards was introduced.
> Does Open MPI use TCP communication, even if all processes are on the
> same local node?

It depends. :-)

The "--mca btl sm,self,tcp" option to mpirun tells Open MPI to use shared
memory, TCP, and process-loopback for MPI point-to-point communications.
Open MPI computes a reachability / priority map and uses the highest
priority plugin that is reachable for each peer MPI process. Meaning that
on each node, if you allow "sm" to be used, "sm" should be used for all
on-node communications.

If you had only said "--mca btl tcp,self", then you're only allowing Open
MPI to use TCP for all non-self MPI point-to-point communications.

The default -- if you don't specify "--mca btl" at all -- is for Open MPI
to figure it out automatically and use whatever networks it can find. In
your case, I'm guessing that it's pretty much identical to specifying
"--mca btl tcp,sm,self".

Another good raw TCP performance program that network wonks are familiar
with is netperf. NetPipe is nice because it allows an apples-to-apples
comparison of TCP and MPI (i.e., it's the same benchmark app that can use
either TCP or MPI [or several other] transports underneath). But netperf
might be a bit more familiar to those outside the HPC community.

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
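The three selection scenarios Jeff describes can be written out as mpirun
command lines; a sketch, with the application name as a placeholder. The
btl_base_verbose MCA parameter shown last makes the BTL framework log
which transport it picks per peer, which is one way to answer Oliver's
"what is actually selected?" question (assuming a 1.x-era build; the
block only prints the commands, since running them needs an Open MPI
install).

```shell
APP=./app   # placeholder for the real benchmark binary

echo "mpirun -np 4 --mca btl sm,self,tcp $APP"  # sm wins for on-node peers
echo "mpirun -np 4 --mca btl tcp,self $APP"     # forces TCP even on-node
echo "mpirun -np 4 $APP"                        # default: auto-select

# Inspect the available BTL components and their parameters:
echo "ompi_info --param btl all"

# Log the per-peer BTL selection at runtime:
echo "mpirun -np 4 --mca btl_base_verbose 30 $APP"
```

Comparing timings between the first two invocations isolates the shared
memory path: if only the sm variant is slow, the network is not the
culprit.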
Re: [OMPI devel] kernel 2.6.23 vs 2.6.24 - communication/wait times
On 4/6/2010 2:54 PM, Jeff Squyres wrote:
> Sorry for the delay -- I just replied on the user list -- I think the
> first thing to do is to establish baseline networking performance and
> see if that is out of whack. If the underlying network is bad, then MPI
> performance will also be bad.

Could make sense. With kernel 2.6.24 it seems a major change in the
modules for Intel PCI-Express network cards was introduced.

Does Open MPI use TCP communication, even if all processes are on the
same local node?
Re: [OMPI devel] kernel 2.6.23 vs 2.6.24 - communication/wait times
Sorry for the delay -- I just replied on the user list -- I think the
first thing to do is to establish baseline networking performance and see
if that is out of whack. If the underlying network is bad, then MPI
performance will also be bad.

On Apr 6, 2010, at 11:51 AM, Oliver Geisler wrote:
> I have tried version 1.4.1. Using default settings, I watched processes
> switching from core to core in "top" (with "f" + "j"). Then I tried
> --bind-to-core and explicitly --bind-to-none. All with the same result:
> ~20% cpu wait and a lot longer over-all computation times.

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI devel] kernel 2.6.23 vs 2.6.24 - communication/wait times
On 4/6/2010 10:11 AM, Rainer Keller wrote:
> Hello Oliver,
> Hmm, this is really a teaser...
> I haven't seen such a drastic behavior, and haven't read of any on the
> list.
>
> One thing, however, that might interfere is process binding.
> Could you make sure that processes are not bound to cores (the default
> in 1.4.1), with mpirun --bind-to-none?

I have tried version 1.4.1. Using default settings, I watched processes
switching from core to core in "top" (with "f" + "j"). Then I tried
--bind-to-core and explicitly --bind-to-none. All with the same result:
~20% cpu wait and a lot longer over-all computation times.

Thanks for the idea ...
Every input is helpful.

Oli
Re: [OMPI devel] kernel 2.6.23 vs 2.6.24 - communication/wait times
Hello Oliver,
Hmm, this is really a teaser... I haven't seen such a drastic behavior,
and haven't read of any on the list.

One thing, however, that might interfere is process binding. Could you
make sure that processes are not bound to cores (the default in 1.4.1),
with mpirun --bind-to-none?

Just an idea...

Regards,
Rainer

--
Rainer Keller, PhD       Tel: +1 (865) 241-6293
Oak Ridge National Lab   Fax: +1 (865) 241-4811
PO Box 2008 MS 6164      Email: kel...@ornl.gov
Oak Ridge, TN 37831-2008 AIM/Skype: rusraink
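Rainer's suggestion, spelled out as mpirun invocations; a sketch with a
placeholder application name. These binding options are the ones the 1.3/
1.4 series exposes on the mpirun command line; the block only prints the
commands, since running them requires an Open MPI install.

```shell
APP=./app   # placeholder for the real benchmark binary

# Explicitly disable binding, so the kernel scheduler may migrate ranks:
echo "mpirun -np 4 --bind-to-none $APP"

# The opposite experiment: pin each rank to one core:
echo "mpirun -np 4 --bind-to-core $APP"
```

Running the benchmark under both settings and comparing wall-clock times
tells you whether binding is contributing to the slowdown at all.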
[OMPI devel] kernel 2.6.23 vs 2.6.24 - communication/wait times
Hello Devel-List,

I am a little bit helpless about this matter. I already posted in the
user list; in case you don't read the users list, I post in here.

This is the original posting:

http://www.open-mpi.org/community/lists/users/2010/03/12474.php

Short:
Switching from kernel 2.6.23 to 2.6.24 (and up), using openmpi 1.2.7-rc2
(I know it is outdated, but it is in Debian stable, and I get the same
results with 1.4.1) increases communication times between processes
(essentially between one master and several slave processes). This is
regardless of whether the processes are local only or communication is
over ethernet.

Did anybody witness such a behavior?

Ideas what I should test for?

What additional information should I provide for you?

Thanks for your time

oli