Re: [OMPI users] qsub error

2013-02-16 Thread Erik Nelson
yep, runs well now.

On Sat, Feb 16, 2013 at 6:50 AM, Jeff Squyres (jsquyres)  wrote:

> Glad you got it working!
>
> On Feb 15, 2013, at 6:53 PM, Erik Nelson  wrote:
>
> > I may have deleted any responses to this message. In any case, we
> > appear to have fixed the problem by installing a more recent version
> > of Open MPI.
> >
> >
> > On Thu, Feb 14, 2013 at 2:27 PM, Erik Nelson  wrote:
> >
> > I'm encountering an error using qsub that none of us can figure out.
> > MPI C++ programs seem to run fine when executed from the command line,
> > but for some reason when I submit them through the queue I get a
> > strange error message:
> >
> >
> >
> > [compute-3-12.local][[58672,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
> > connect() to 2002:8170:6c2f:b:21d:9ff:fefd:7d94 failed: Permission denied (13)
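> >
> > (For reference: errno 13 is EACCES, "Permission denied", and
> > 2002::/16 is the 6to4 IPv6 prefix, so the failing connect() is being
> > attempted over an IPv6 transition address, which may not be an
> > interface the cluster allows for node-to-node traffic.)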
> >
> >
> > The compute node 3-12 doesn't matter (the error can come from any of
> > the nodes, and I'm guessing that 3-12 is the parent node here).
> >
> > To check if there was some problem with my own code, I created a simple
> > 'hello world' program (see attached files).
> >
> > Again, the program runs fine from the command line but fails in qsub
> > with the same sort of error message.
> >
> > I have included (i) the code, (ii) the job script for qsub, and (iii)
> > the ".o" file from qsub for the "hello world" program.
> >
> > These don't look like MPI errors, but rather some conflict with, maybe,
> > secure communication across nodes.
> >
> > Is there something simple I can do to fix this?
> >
> > Thanks, Erik
> >
> > --
> > Erik Nelson
> >
> > Howard Hughes Medical Institute
> > 6001 Forest Park Blvd., Room ND10.124
> > Dallas, Texas 75235-9050
> >
> > p : 214 645 5981
> > f : 214 645 5948
> >
> >
> >
> > --
> > Erik Nelson
> >
> > Howard Hughes Medical Institute
> > 6001 Forest Park Blvd., Room ND10.124
> > Dallas, Texas 75235-9050
> >
> > p : 214 645 5981
> > f : 214 645 5948
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>



-- 
Erik Nelson

Howard Hughes Medical Institute
6001 Forest Park Blvd., Room ND10.124
Dallas, Texas 75235-9050

p : 214 645 5981
f : 214 645 5948


Re: [OMPI users] All_to_allv algorithm patch

2013-02-16 Thread Jeff Squyres (jsquyres)
Agreed on all of Hristo's points.  

Can you post this over on the devel list?  We've been too sloppy in the past 
about mixing what goes on the users list vs. the devel list.



On Feb 5, 2013, at 4:46 AM, "Iliev, Hristo"  wrote:

> Hi,
> 
> This is the users mailing list. There is a separate one for questions
> related to Open MPI development - de...@open-mpi.org.
> 
> Besides, why don't you open a ticket in the Open MPI Trac at
> https://svn.open-mpi.org/trac/ompi/ and post your patches against trunk
> there? My experience shows that even simple changes to the collective
> framework are of low importance to the OMPI development team, and the
> chances of such changes entering the 1.6.x branch are practically zero.
> By the way, the off-by-one issue was reported almost a year ago and is
> already fixed in trunk.
> 
> I believe the rationale behind the algorithm switch was given by either
> Jeff or George some time ago: the linear code does not scale to a large
> number of processes.
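>
> (For context: the basic linear algorithm posts on the order of
> 2 * comm_size non-blocking requests per process at once, so its
> resource usage grows with the communicator size, whereas the pairwise
> algorithm keeps only one send and one receive in flight per step.)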
> 
> Kind regards,
> Hristo
> 
>> -----Original Message-----
>> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org]
>> On Behalf Of Number Cruncher
>> Sent: Monday, February 04, 2013 5:19 PM
>> To: Open MPI Users
>> Subject: [OMPI users] All_to_allv algorithm patch
>> 
>> I'll try running this by the mailing list again before resigning myself
>> to maintaining this privately.
>> 
>> I've looked in detail at the two current MPI_Alltoallv algorithms and
>> wanted to raise a couple of ideas.
>> 
>> Firstly, the new default "pairwise" algorithm:
>> * There is no optimisation for sparse/empty messages, compared to the
>> old basic "linear" algorithm.
>> * The attached "pairwise-nop" patch adds this optimisation, and on the
>> test case I first described in this thread (1000s of small, sparse,
>> all-to-all exchanges) it cuts runtime by approximately 30% (an
>> illustrative sketch of the idea follows this list).
>> * I think the upper bound on the loop counter for the pairwise exchange
>> is off by one. As the comment notes, "starting from 1 since local
>> exhange [sic] is done"; but when step = size, sendto/recvfrom both
>> reduce to rank (self-exchange is already handled in earlier code).
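>>
>> A minimal sketch of the idea (illustrative only; not the attached
>> patch; it assumes a single datatype for both sides and folds the
>> self-copy into step 0):
>>
>> #include <mpi.h>
>>
>> /* Pairwise alltoallv-style exchange that skips steps where both the
>>  * send and the receive are empty.  sendcounts on one rank mirror
>>  * recvcounts on its partner, so the two sides of each pair skip the
>>  * same steps and no message goes unmatched. */
>> static int pairwise_nop(const char *sbuf, const int *scounts,
>>                         const int *sdispls, char *rbuf,
>>                         const int *rcounts, const int *rdispls,
>>                         MPI_Datatype dtype, MPI_Comm comm)
>> {
>>     int rank, size, step;
>>     MPI_Aint lb, ext;
>>
>>     MPI_Comm_rank(comm, &rank);
>>     MPI_Comm_size(comm, &size);
>>     MPI_Type_get_extent(dtype, &lb, &ext);
>>
>>     /* exactly size steps (0 .. size-1); the off-by-one bound noted
>>      * above would run one extra, redundant self-exchange step */
>>     for (step = 0; step < size; step++) {
>>         int sendto   = (rank + step) % size;
>>         int recvfrom = (rank + size - step) % size;
>>         if (scounts[sendto] == 0 && rcounts[recvfrom] == 0)
>>             continue;        /* the "nop" case: nothing to exchange */
>>         MPI_Sendrecv(sbuf + (MPI_Aint)sdispls[sendto] * ext,
>>                      scounts[sendto], dtype, sendto, 0,
>>                      rbuf + (MPI_Aint)rdispls[recvfrom] * ext,
>>                      rcounts[recvfrom], dtype, recvfrom, 0,
>>                      comm, MPI_STATUS_IGNORE);
>>     }
>>     return MPI_SUCCESS;
>> }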
>> 
>> The pairwise algorithm still kills performance on my gigabit ethernet
>> network. My message transmission time must be small compared to the
>> latency, and the comm_size forced synchronisation steps introduce a
>> minimum delay of (single_link_latency * comm_size), i.e. the latency
>> scales linearly with comm_size. The linear algorithm doesn't wait for
>> each exchange, so its minimum latency is just a single transmit/receive.
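>>
>> For illustration (assumed numbers, not measurements from this thread):
>> with a single-link latency of roughly 50 us, plausible for TCP over
>> gigabit ethernet, and 256 ranks, the pairwise floor is about
>> 50 us * 256 = 12.8 ms per MPI_Alltoallv call no matter how small the
>> messages are, whereas the linear algorithm's floor stays near one
>> 50 us exchange.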
>> 
>> Which brings me to the second idea. The problem with the existing
>> implementation of the linear algorithm is that the irecv/isend pattern
>> is identical on all processes, meaning that every process starts by
>> waiting for process 0 to send to everyone and finishes by waiting for
>> rank (size-1) to send to everyone.
>> 
>> It seems preferable to at least post the send/recv requests in the same
>> order as the pairwise algorithm does. The attached "linear-alltoallv"
>> patch implements this and, on my test case, shows a modest 5%
>> improvement; a sketch of the reordering follows below. I was wondering
>> if it would address the concerns that led to the switch of default
>> algorithm.
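>>
>> A sketch of the reordering (illustrative only; not the attached patch,
>> and with the same simplifying assumptions as the sketch above):
>>
>> #include <mpi.h>
>> #include <stdlib.h>
>>
>> /* Linear alltoallv-style exchange, with requests posted in the
>>  * rank-shifted order the pairwise algorithm uses, so that no process
>>  * serialises behind rank 0. */
>> static int linear_shifted(const char *sbuf, const int *scounts,
>>                           const int *sdispls, char *rbuf,
>>                           const int *rcounts, const int *rdispls,
>>                           MPI_Datatype dtype, MPI_Comm comm)
>> {
>>     int rank, size, i, n = 0;
>>     MPI_Aint lb, ext;
>>     MPI_Request *reqs;
>>
>>     MPI_Comm_rank(comm, &rank);
>>     MPI_Comm_size(comm, &size);
>>     MPI_Type_get_extent(dtype, &lb, &ext);
>>     reqs = malloc(2 * (size_t)size * sizeof(*reqs));
>>
>>     for (i = 0; i < size; i++) {              /* post all receives first */
>>         int peer = (rank + size - i) % size;  /* pairwise recv order */
>>         if (rcounts[peer] > 0)
>>             MPI_Irecv(rbuf + (MPI_Aint)rdispls[peer] * ext,
>>                       rcounts[peer], dtype, peer, 0, comm, &reqs[n++]);
>>     }
>>     for (i = 0; i < size; i++) {
>>         int peer = (rank + i) % size;         /* pairwise send order */
>>         if (scounts[peer] > 0)
>>             MPI_Isend(sbuf + (MPI_Aint)sdispls[peer] * ext,
>>                       scounts[peer], dtype, peer, 0, comm, &reqs[n++]);
>>     }
>>     MPI_Waitall(n, reqs, MPI_STATUSES_IGNORE);
>>     free(reqs);
>>     return MPI_SUCCESS;
>> }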
>> 
>> Simon
> 
> --
> Hristo Iliev, Ph.D. -- High Performance Computing
> RWTH Aachen University, Center for Computing and Communication
> Rechen- und Kommunikationszentrum der RWTH Aachen
> Seffenter Weg 23,  D 52074  Aachen (Germany)
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] qsub error

2013-02-16 Thread Jeff Squyres (jsquyres)
Glad you got it working!

On Feb 15, 2013, at 6:53 PM, Erik Nelson  wrote:

> I may have deleted any responses to this message. In any case, we appear
> to have fixed the problem by installing a more recent version of Open MPI.
> 
> 
> On Thu, Feb 14, 2013 at 2:27 PM, Erik Nelson  wrote:
> 
> I'm encountering an error using qsub that none of us can figure out. MPI
> C++ programs seem to run fine when executed from the command line, but
> for some reason when I submit them through the queue I get a strange
> error message:
> 
> 
> [compute-3-12.local][[58672,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
> connect() to 2002:8170:6c2f:b:21d:9ff:fefd:7d94 failed: Permission denied (13)
> 
> 
> The compute node 3-12 doesn't matter (the error can come from any of the
> nodes, and I'm guessing that 3-12 is the parent node here).
> 
> To check if there was some problem with my own code, I created a simple 
> 'hello world' program 
> (see attached files).
> 
> Again, the program runs fine from the command line but fails in qsub with the 
> same sort of error 
> message.
> 
> I have included (i) the code, (ii) the job script for qsub, and (iii) the
> ".o" file from qsub for the "hello world" program.
> 
> These don't look like MPI errors, but rather some conflict with, maybe,
> secure communication across nodes.
> 
> Is there something simple I can do to fix this?
> 
> Thanks, Erik 
> 
> -- 
> Erik Nelson
> 
> Howard Hughes Medical Institute
> 6001 Forest Park Blvd., Room ND10.124
> Dallas, Texas 75235-9050
> 
> p : 214 645 5981
> f : 214 645 5948
> 
> 
> 
> -- 
> Erik Nelson
> 
> Howard Hughes Medical Institute
> 6001 Forest Park Blvd., Room ND10.124
> Dallas, Texas 75235-9050
> 
> p : 214 645 5981
> f : 214 645 5948
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/