Re: [OMPI users] Fwd: problem for multiple clusters using mpirun
Hello,

Is it possible to change the port number used for the MPI communication? I can see that my program uses port 4:

[karp:23756] btl: tcp: attempting to connect() to address 134.106.3.252 on port 4
[karp][[4612,1],0][btl_tcp_endpoint.c:655:mca_btl_tcp_endpoint_complete_connect] connect() to 134.106.3.252 failed: Connection refused (111)

In my case the ports from 1 to 1024 are reserved. MPI tries to use one of the reserved ports and gets the "Connection refused" error.

I will be very glad for any kind suggestions.

Regards.

On Mon, Mar 24, 2014 at 5:32 PM, Hamid Saeed wrote:
> Hello Jeff,
>
> Thanks for your cooperation.
>
> --mca btl_tcp_if_include br0
>
> worked out of the box.
>
> The problem was from the network administrator. The machines on the network side were halting the MPI, so cleaning and killing everything worked. :)
>
> Regards.
>
> On Mon, Mar 24, 2014 at 4:34 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
>> There is no "self" IP interface in the Linux kernel.
>>
>> Try using btl_tcp_if_include and list just the interface(s) that you want to use. From your prior email, I'm *guessing* it's just br2 (i.e., the 10.x address inside your cluster).
>>
>> Also, it looks like you didn't set up your SSH keys properly for logging in to remote nodes automatically.
>>
>> On Mar 24, 2014, at 10:56 AM, Hamid Saeed wrote:
>>
>>> Hello,
>>>
>>> I added the "self" BTL, e.g.:
>>>
>>> hsaeed@karp:~/Task4_mpi/scatterv$ mpirun -np 8 --mca btl ^openib --mca btl_tcp_if_exclude sm,self,lo,br0,br1,ib0,br2 --host karp,wirth ./scatterv
>>>
>>> Enter passphrase for key '/home/hsaeed/.ssh/id_rsa':
>>>
>>> ERROR:
>>>
>>> At least one pair of MPI processes are unable to reach each other for MPI communications. This means that no Open MPI device has indicated that it can be used to communicate between these processes. This is an error; Open MPI requires that all MPI processes be able to reach each other. This error can sometimes be the result of forgetting to specify the "self" BTL.
>>>
>>> Process 1 ([[15751,1],7]) is on host: wirth
>>> Process 2 ([[15751,1],0]) is on host: karp
>>> BTLs attempted: self sm
>>>
>>> Your MPI job is now going to abort; sorry.
>>>
>>> MPI_INIT has failed because at least one MPI process is unreachable from another. This *usually* means that an underlying communication plugin -- such as a BTL or an MTL -- has either not loaded or not allowed itself to be used. Your MPI job will now abort.
>>>
>>> You may wish to try to narrow down the problem:
>>>
>>> * Check the output of ompi_info to see which BTL/MTL plugins are available.
>>> * Run your application with MPI_THREAD_SINGLE.
>>> * Set the MCA parameter btl_base_verbose to 100 (or mtl_base_verbose, if using MTL-based communications) to see exactly which communication plugins were considered and/or discarded.
>>>
>>> [wirth:40329] *** An error occurred in MPI_Init
>>> [wirth:40329] *** on a NULL communicator
>>> [wirth:40329] *** Unknown error
>>> [wirth:40329] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
>>>
>>> An MPI process is aborting at a time when it cannot guarantee that all of its peer processes in the job will be killed properly. You should double check that everything has shut down cleanly.
>>>
>>> Reason: Before MPI_INIT completed
>>> Local host: wirth
>>> PID: 40329
>>>
>>> mpirun has exited due to process rank 7 with PID 40329 on node wirth exiting improperly. There are two reasons this could occur:
>>>
>>> 1. this process did not call "init" before exiting, but others in the job did. This can cause a job to hang indefinitely while it waits for all processes to call "init". By rule, if one process calls "init", then ALL processes must call "init" prior to termination.
>>>
>>> 2. this process called "init", but exited without calling "finalize". By rule, all processes that call "init" MUST call "finalize" prior to exiting or it will be considered an "abnormal termination".
>>>
>>> This may have caused other processes in the application to be terminated by signals sent by mpirun (as reported here).
>>>
>>> [karp:29513] 1 more process ha
Re: [OMPI users] Fwd: problem for multiple clusters using mpirun
Hi,

On 25.03.2014, at 08:34, Hamid Saeed wrote:
> Is it possible to change the port number for the MPI communication? I can see that my program uses port 4 for the MPI communication.
>
> [karp:23756] btl: tcp: attempting to connect() to address 134.106.3.252 on port 4
> [karp][[4612,1],0][btl_tcp_endpoint.c:655:mca_btl_tcp_endpoint_complete_connect] connect() to 134.106.3.252 failed: Connection refused (111)
>
> In my case the ports from 1 to 1024 are reserved. MPI tries to use one of the reserved ports and gets the "Connection refused" error.

There are certain parameters to set the range of used ports, but using any up to 1024 should not be the default:

http://www.open-mpi.org/community/lists/users/2011/11/17732.php

Are any of these set by accident beforehand by your environment?

-- Reuti
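Reuti's question about parameters being set beforehand can be checked directly from the shell. A minimal sketch of the usual places Open MPI picks up MCA settings; the install-prefix path shown is an assumption, adjust it to your own installation:

  # MCA parameters may be set through environment variables of the form OMPI_MCA_<name>
  env | grep OMPI_MCA_

  # ...or through per-user and system-wide parameter files
  cat ~/.openmpi/mca-params.conf
  cat /opt/openmpi/etc/openmpi-mca-params.conf    # assumed install prefix

  # ...and the effective value plus its source can be queried with ompi_info
  ompi_info --level 9 --param btl tcp --parsable | grep port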
Re: [OMPI users] Fwd: problem for multiple clusters using mpirun
Hello,

I am not sure what approach the MPI communication follows, but when I use

--mca btl_base_verbose 30

I observe the mentioned port:

[karp:23756] btl: tcp: attempting to connect() to address 134.106.3.252 on port 4
[karp][[4612,1],0][btl_tcp_endpoint.c:655:mca_btl_tcp_endpoint_complete_connect] connect() to 134.106.3.252 failed: Connection refused (111)

The information at http://www.open-mpi.org/community/lists/users/2011/11/17732.php is not enough; could you kindly explain how I can restrict the MPI communication to use ports starting from 1025, or to use a port somewhere around 59822?

Regards.

On Tue, Mar 25, 2014 at 9:15 AM, Reuti wrote:
> There are certain parameters to set the range of used ports, but using any up to 1024 should not be the default:
>
> http://www.open-mpi.org/community/lists/users/2011/11/17732.php
>
> Are any of these set by accident beforehand by your environment?
>
> -- Reuti
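The restriction Hamid is asking about can be expressed with the TCP BTL port parameters referenced in the linked post. A minimal sketch, reusing the hosts and interface from this thread: btl_tcp_port_min_v4 is the lowest port the TCP BTL will try to bind, and btl_tcp_port_range_v4 (an assumption here, not quoted in the thread) is the number of ports above that minimum it may use:

  mpirun -np 2 --host karp,wirth \
         --mca btl ^openib --mca btl_tcp_if_include br0 \
         --mca btl_tcp_port_min_v4 1025 \
         --mca btl_tcp_port_range_v4 64511 \
         ./a.out

With these settings the BTL should only try ports 1025 and above, staying clear of the reserved range below 1024.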
Re: [OMPI users] Fwd: problem for multiple clusters using mpirun
Hello,

Thanks, I figured out what the exact problem was in my case. Now I am using the following execution line; it directs the MPI communication ports to start from 1:

mpiexec -n 2 --host karp,wirth --mca btl ^openib --mca btl_tcp_if_include br0 --mca btl_tcp_port_min_v4 1 ./a.out

and everything works again.

Thanks. Best regards.

On Tue, Mar 25, 2014 at 10:23 AM, Hamid Saeed wrote:
> I am not sure what approach the MPI communication follows, but when I use --mca btl_base_verbose 30 I observe the mentioned port. How can I restrict the MPI communication to use ports starting from 1025, or a port somewhere around 59822?
Re: [OMPI users] problem for multiple clusters using mpirun
This is very odd -- the default value for btl_tcp_port_min_v4 is 1024. So unless you have overridden this value, you should not be getting a port less than 1024.

You can run this to see:

ompi_info --level 9 --param btl tcp --parsable | grep port_min_v4

Mine says this in a default 1.7.5 installation:

mca:btl:tcp:param:btl_tcp_port_min_v4:value:1024
mca:btl:tcp:param:btl_tcp_port_min_v4:source:default
mca:btl:tcp:param:btl_tcp_port_min_v4:status:writeable
mca:btl:tcp:param:btl_tcp_port_min_v4:level:2
mca:btl:tcp:param:btl_tcp_port_min_v4:help:The minimum port where the TCP BTL will try to bind (default 1024)
mca:btl:tcp:param:btl_tcp_port_min_v4:deprecated:no
mca:btl:tcp:param:btl_tcp_port_min_v4:type:int
mca:btl:tcp:param:btl_tcp_port_min_v4:disabled:false

On Mar 25, 2014, at 5:36 AM, Hamid Saeed wrote:
> Thanks, I figured out what the exact problem was in my case. Now I am using the following execution line; it directs the MPI communication ports to start from 1:
>
> mpiexec -n 2 --host karp,wirth --mca btl ^openib --mca btl_tcp_if_include br0 --mca btl_tcp_port_min_v4 1 ./a.out
>
> and everything works again.
Re: [OMPI users] OpenMPI-ROMIO-OrangeFS
Edgar Gabriel writes:

> I am still looking into the PVFS2 with ROMIO problem with the 1.6 series, where (as I mentioned yesterday) the problem I am having right now is that the data is wrong. Not sure what causes it, but since I have to teach this afternoon again, it might be Friday until I can dig into that.

Was there any progress with this? Otherwise, what version of PVFS2 is known to work with OMPI 1.6? Thanks.
Re: [OMPI users] OpenMPI-ROMIO-OrangeFS
Yes, the patch has been submitted to the 1.6 branch for review; I am not sure what the precise status of it is. The problems found are more or less independent of the PVFS2 version.

Thanks
Edgar

On 3/25/2014 7:32 AM, Dave Love wrote:
> Was there any progress with this? Otherwise, what version of PVFS2 is known to work with OMPI 1.6? Thanks.

--
Edgar Gabriel
Associate Professor
Parallel Software Technologies Lab  http://pstl.cs.uh.edu
Department of Computer Science, University of Houston
Philip G. Hoffman Hall, Room 524, Houston, TX 77204, USA
Tel: +1 (713) 743-3857   Fax: +1 (713) 743-3335
Re: [OMPI users] coll_ml_priority in openmpi-1.7.5
Yes, Nathan has a few coll/ml fixes queued up for 1.8.

On Mar 24, 2014, at 10:11 PM, tmish...@jcity.maeda.co.jp wrote:

> I ran our application using the final version of openmpi-1.7.5 again with coll_ml_priority = 90.
>
> Then coll/ml was actually activated and I got these error messages:
>
> [manage][[11217,1],0][coll_ml_lmngr.c:265:mca_coll_ml_lmngr_alloc] COLL-ML List manager is empty.
> [manage][[11217,1],0][coll_ml_allocation.c:47:mca_coll_ml_allocate_block] COLL-ML lmngr failed.
> [manage][[11217,1],0][coll_ml_module.c:532:ml_module_memory_initialization] COLL-ML mca_coll_ml_allocate_block exited with error.
>
> Unfortunately, coll/ml seems to still have some problems...
>
> And it also means coll/ml was not activated on my test run with coll_ml_priority = 27. So the slowdown was due to the expensive connectivity computation, as you pointed out, I guess.
>
> Tetsuya
>
>> On Mar 20, 2014, at 5:56 PM, tmish...@jcity.maeda.co.jp wrote:
>>
>>> Hi Ralph, congratulations on releasing the new openmpi-1.7.5.
>>>
>>> By the way, openmpi-1.7.5rc3 had been slowing down our application with smaller sizes of test data, where the time-consuming part of our application is a so-called sparse solver. It's negligible with medium or large data -- the more practical case -- so I had been deferring this problem.
>>>
>>> However, this slowdown disappears in the final version of openmpi-1.7.5. After some investigation, I found coll_ml caused this slowdown. The final version seems to set coll_ml_priority to zero again.
>>>
>>> Could you explain briefly the advantage of coll_ml? In what kind of situation is it effective, and so on?
>>
>> I'm not really the one to speak about coll/ml as I wasn't involved in it -- Nathan would be the one to ask. It is supposed to be significantly faster for most collectives, but I imagine it would depend on the precise collective being used and the size of the data. We did find and fix a number of problems right at the end (which is why we dropped the priority until we can better test/debug it), and so we might have hit something that was causing your slowdown.
>>
>>> In addition, I'm not sure why coll_ml is activated in openmpi-1.7.5rc3, although its priority is lower than tuned, as described in the message of changeset 30790: "We are initially setting the priority lower than tuned until this has had some time to soak in the trunk."
>>
>> Were you actually seeing coll/ml being used? It shouldn't have been. However, coll/ml was getting called during the collective initialization phase so it could set itself up, even if it wasn't being used. One part of its setup is a somewhat expensive connectivity computation -- one of our last-minute cleanups was removal of a static 1MB array in that procedure. Changing the priority to 0 completely disables the coll/ml component, thus removing it from even the initialization phase. My guess is that you were seeing a measurable "hit" by that procedure on your small data tests, which probably ran fairly quickly -- and not seeing it on the other tests because the setup time was swamped by the computation time.

--
Jeff Squyres
jsquy...@cisco.com
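For anyone who wants to check what coll/ml is doing on their own installation, the component's priority and its source can be queried, and the component can be excluded outright instead of relying on a zero priority. A minimal sketch; the process count and a.out are placeholders:

  # Show the current coll_ml priority and where it was set (default, env, file, ...)
  ompi_info --level 9 --param coll ml --parsable | grep priority

  # Run with the coll/ml component excluded entirely
  mpirun -np 8 --mca coll ^ml ./a.out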
Re: [OMPI users] OpenMPI-ROMIO-OrangeFS
Sorry -- we've been focusing on 1.7.5 and the impending 1.8 release; I probably won't be able to look at the v1.6 version in the next two weeks or so.

On Mar 25, 2014, at 9:09 AM, Edgar Gabriel wrote:
> Yes, the patch has been submitted to the 1.6 branch for review; I am not sure what the precise status of it is. The problems found are more or less independent of the PVFS2 version.

--
Jeff Squyres
jsquy...@cisco.com
Re: [OMPI users] OpenMPI-ROMIO-OrangeFS
On 03/25/2014 07:32 AM, Dave Love wrote:
> Edgar Gabriel writes:
>> I am still looking into the PVFS2 with ROMIO problem with the 1.6 series, where (as I mentioned yesterday) the problem I am having right now is that the data is wrong.
>
> Was there any progress with this? Otherwise, what version of PVFS2 is known to work with OMPI 1.6? Thanks.

Edgar, should I pick this up for MPICH, or was this fix specific to OpenMPI?

==rob

--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA
Re: [OMPI users] Help building/installing a working Open MPI 1.7.4 on OS X 10.9.2 with Free PGI Fortran
Got your output -- thanks. I'm pretty sure this is pointing to a Libtool bug.

Here's the interesting part -- it looks like Libtool simply isn't issuing the command to create the library (!). Check out this (annotated) output from "make V=1" on a Linux/gfortran box:

Making all in src
make[1]: Entering directory `/home/jsquyres/git/pgi-autotool-bug/src'

# Compile the fortran_foo.f90 file
/bin/sh ../libtool --tag=FC --mode=compile gfortran -g -O2 -c -o fortran_foo.lo fortran_foo.f90
libtool: compile: gfortran -g -O2 -c fortran_foo.f90 -fPIC -o .libs/fortran_foo.o

# Compile the fortran_bar.f90 file
/bin/sh ../libtool --tag=FC --mode=compile gfortran -g -O2 -c -o fortran_bar.lo fortran_bar.f90
libtool: compile: gfortran -g -O2 -c fortran_bar.f90 -fPIC -o .libs/fortran_bar.o

# Link the two into the libfortran_stuff.so library
/bin/sh ../libtool --tag=FC --mode=link gfortran -g -O2 -o libfortran_stuff.la -rpath /usr/local/lib fortran_foo.lo fortran_bar.lo
libtool: link: gfortran -shared -fPIC .libs/fortran_foo.o .libs/fortran_bar.o -O2 -Wl,-soname -Wl,libfortran_stuff.so.0 -o .libs/libfortran_stuff.so.0.0.0

# Make some handy sym links
libtool: link: (cd ".libs" && rm -f "libfortran_stuff.so.0" && ln -s "libfortran_stuff.so.0.0.0" "libfortran_stuff.so.0")
libtool: link: (cd ".libs" && rm -f "libfortran_stuff.so" && ln -s "libfortran_stuff.so.0.0.0" "libfortran_stuff.so")
libtool: link: ( cd ".libs" && rm -f "libfortran_stuff.la" && ln -s "../libfortran_stuff.la" "libfortran_stuff.la" )

Compare this to your "make V=1" output:

Making install in src

# Compile the fortran_foo.f90 file
/bin/sh ../libtool --tag=FC --mode=compile pgfortran -m64 -c -o fortran_foo.lo fortran_foo.f90
libtool: compile: pgfortran -m64 -c fortran_foo.f90 -o .libs/fortran_foo.o

# Compile the fortran_bar.f90 file
/bin/sh ../libtool --tag=FC --mode=compile pgfortran -m64 -c -o fortran_bar.lo fortran_bar.f90
libtool: compile: pgfortran -m64 -c fortran_bar.f90 -o .libs/fortran_bar.o

# Link the two into the libfortran_stuff.so library
/bin/sh ../libtool --tag=FC --mode=link pgfortran -m64 -m64 -o libfortran_stuff.la -rpath /Users/fortran/AutomakeBug/autobug14/lib fortran_foo.lo fortran_bar.lo

*** NOTICE THAT THERE'S NO COMMAND HERE TO MAKE THE LIBRARY!

# Make some handy sym links
libtool: link: (cd ".libs" && rm -f "libfortran_stuff.dylib" && ln -s "libfortran_stuff.0.dylib" "libfortran_stuff.dylib")
libtool: link: ( cd ".libs" && rm -f "libfortran_stuff.la" && ln -s "../libfortran_stuff.la" "libfortran_stuff.la" )

Time to send this bug report upstream.

On Mar 24, 2014, at 7:27 PM, Matt Thompson wrote:

> Jeff,
>
> I ran these commands:
>
> $ make clean
> $ make distclean
>
> (wanted to be extra sure!)
>
> $ ./configure CC=gcc CXX=g++ F77=pgfortran FC=pgfortran CFLAGS='-m64' CXXFLAGS='-m64' LDFLAGS='-m64' FCFLAGS='-m64' FFLAGS='-m64' --prefix=/Users/fortran/AutomakeBug/autobug14 |& tee configure.log
> $ make V=1 install |& tee makeV1install.log
>
> So find attached the config.log, configure.log, and makeV1install.log, which should have all the info you asked about.
>
> Matt
>
> PS: I just tried configure/make/make install with Open MPI 1.7.5, but the same error occurs, as expected. Hope springs eternal, you know?
>
> On Mon, Mar 24, 2014 at 6:48 PM, Jeff Squyres (jsquyres) wrote:
>> On Mar 24, 2014, at 6:34 PM, Matt Thompson wrote:
>>
>>> Sorry for the late reply. The answer is: no, 1.14.1 has not fixed the problem (and indeed, that's what my Mac is running):
>>>
>>> (28) $ make install |& tee makeinstall.log
>>> Making install in src
>>> ../config/install-sh -c -d '/Users/fortran/AutomakeBug/autobug14/lib'
>>> /bin/sh ../libtool --mode=install /usr/bin/install -c libfortran_stuff.la '/Users/fortran/AutomakeBug/autobug14/lib'
>>> libtool: install: /usr/bin/install -c .libs/libfortran_stuff.0.dylib /Users/fortran/AutomakeBug/autobug14/lib/libfortran_stuff.0.dylib
>>> install: .libs/libfortran_stuff.0.dylib: No such file or directory
>>> make[2]: *** [install-libLTLIBRARIES] Error 71
>>> make[1]: *** [install-am] Error 2
>>> make: *** [install-recursive] Error 1
>>>
>>> This is the output from either the am12 or am14 test. If you have any options you'd like me to try with this, let me know. (For example, is there a way to make autotools *more* verbose? I've always tried to make it less so!)
>>
>> OK. With the am14 tarball, please run:
>>
>> make clean
>>
>> And then run this:
>>
>> make V=1 install
>>
>> And then send the following:
>>
>> - configure stdout
>> - config.log file
>> - stdout/stderr from "make V=1 install"
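One way to collect more detail for that upstream report is to re-run just the failing link step by hand with libtool's shell tracing turned on. A minimal sketch that reuses the exact link command from the log above; --debug is standard GNU libtool and produces a lot of output, so it is captured to a file:

  cd src
  /bin/sh ../libtool --debug --tag=FC --mode=link pgfortran -m64 -m64 \
      -o libfortran_stuff.la -rpath /Users/fortran/AutomakeBug/autobug14/lib \
      fortran_foo.lo fortran_bar.lo > libtool-link-trace.log 2>&1

The trace shows which branch of libtool's link-mode logic is taken, which should help pin down why no library-creation command is emitted.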
Re: [OMPI users] OpenMPI-ROMIO-OrangeFS
Not sure, honestly. Basically, as suggested earlier in this email chain, I had to disable the PVFS2_IreadContig and PVFS2_IwriteContig routines in ad_pvfs2.c to make the tests pass. Otherwise the tests worked but produced wrong data. I did not, however, have the time to figure out what actually goes wrong under the hood.

Edgar

On 3/25/2014 9:21 AM, Rob Latham wrote:
> Edgar, should I pick this up for MPICH, or was this fix specific to OpenMPI?
Re: [OMPI users] OpenMPI-ROMIO-OrangeFS
Edgar Gabriel writes:

> Yes, the patch has been submitted to the 1.6 branch for review; I am not sure what the precise status of it is. The problems found are more or less independent of the PVFS2 version.

Thanks; I should have looked in the tracker.
[OMPI users] busy waiting and oversubscriptions
Even when "idle", MPI processes use all of the CPU. I thought I remembered someone saying that they will be low priority, and so not pose much of an obstacle to other uses of the CPU.

At any rate, my question is whether, if I have processes that spend most of their time waiting to receive a message, I can run more of them than I have physical cores without much slowdown. E.g., with 8 cores and 8 processes doing real work, can I add a couple of extra processes that mostly wait?

Does it make any difference if there's hyperthreading with, e.g., 16 virtual CPUs based on 8 physical ones? In general I try to limit to the number of physical cores.

Thanks.
Ross Boylan
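If it helps to experiment with this kind of oversubscribed setup, Open MPI has an MCA parameter that asks waiting (polling) processes to yield the CPU rather than spin at full priority. A minimal sketch; the hostfile contents, node name, and executable are placeholders, and depending on the Open MPI version running more ranks than declared slots may also need oversubscription to be allowed explicitly:

  # hostfile "myhosts" contains a line such as:  node1 slots=8
  # start 10 ranks on the 8 slots; waiting ranks call yield while they poll
  mpirun -np 10 --hostfile myhosts --mca mpi_yield_when_idle 1 ./a.out

Note that yielding only softens the busy-wait: the extra ranks will still show CPU usage, but they give the scheduler a chance to run the ranks doing real work.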