Re: [OMPI devel] Nearly unlimited growth of pml free list

2013-10-01 Thread Max Staufer

George,

 well the code itself runs fine; it's just that the Open MPI send list
keeps allocating memory, and I pinpointed it to this single call.
Probably the root problem is elsewhere, but it appears to me that the
entries in the send list are not released for reuse after the
operation has completed.

The size of the operation is 3 doubles.

Max

On 01.10.2013 01:40, George Bosilca wrote:

Max,

The recursive call should not be an issue: as MPI_Allreduce is a blocking 
operation, you can't recurse before the previous call completes.

What is the size of the data exchanged in the MPI_Alltoall?

George.


On Sep 30, 2013, at 17:09 , Max Staufer  wrote:


Well, I haven't tried 1.7.2 yet, but to elaborate on the problem a little more:

 the growth happens when we use an MPI_ALLREDUCE in a recursive subroutine 
call; that means, in Fortran 90 terms, the subroutine calls itself again and 
is specially marked (RECURSIVE) in order to work properly. Apart from that, 
nothing is special about this routine. Is it possible that the F77 interface 
in Open MPI is not able to cope with recursion?
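
For reference, the pattern in question is roughly the following, as a minimal 
C sketch (the real code is a Fortran 90 RECURSIVE subroutine; the names, 
counts, and depths here are purely illustrative):

#include <mpi.h>

/* Minimal C sketch of the pattern described above; not the actual
 * application code, which is a Fortran 90 RECURSIVE subroutine. */
static void recurse(MPI_Comm comm, int depth)
{
    double local[3] = {1.0, 2.0, 3.0};   /* the 3 doubles mentioned above */
    double global[3];

    /* Blocking collective: it returns only once the reduction is
     * complete, so the calls cannot overlap down the recursion. */
    MPI_Allreduce(local, global, 3, MPI_DOUBLE, MPI_SUM, comm);

    if (depth > 0)
        recurse(comm, depth - 1);        /* the subroutine calls itself */
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    for (int i = 0; i < 10000; ++i)      /* many such calls over a run */
        recurse(MPI_COMM_WORLD, 100);
    MPI_Finalize();
    return 0;
}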

MAX



On 13.09.13 17:18, Rolf vandeVaart wrote:

Yes, it appears the send_requests list is the one that is growing.  This list 
holds the send request structures that are in use.  After a send completes, 
the send request is supposed to be returned to this list and then re-used.

With 7 processes, it had reached a size of 16,324 send requests in use.  With 
8 processes, it had reached 16,708.  Each send request is 720 bytes (872 in a 
debug build), so doing the math (16,708 x 720 bytes), we have consumed about 
12 Mbytes.
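
For anyone unfamiliar with the mechanism, here is a conceptual sketch of that 
free-list behaviour (not the actual Open MPI API; types and names are 
illustrative):

#include <stdlib.h>

/* Conceptual free-list sketch, not the actual Open MPI code.  A request
 * is taken from the list when a send starts and is supposed to be
 * returned on completion; if returns are missed, num_alloc grows
 * without bound, which matches the instrumentation output below. */
typedef struct req { struct req *next; /* ... request payload ... */ } req_t;

static req_t *free_list = NULL;
static int    num_alloc = 0;          /* mirrors "numAlloc" in the logs */

static req_t *get_request(void)
{
    if (free_list != NULL) {          /* reuse a returned request */
        req_t *r = free_list;
        free_list = r->next;
        return r;
    }
    num_alloc++;                      /* nothing to reuse: grow the list */
    return malloc(sizeof(req_t));
}

static void return_request(req_t *r)
{
    r->next = free_list;              /* make it available for reuse */
    free_list = r;
}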

Setting some type of bound will not fix this issue; there is something else 
going on here that is causing the problem.  I know you described the problem 
earlier on, but maybe you can explain it again?  How many processes?  What 
type of cluster?  One other thought is to try Open MPI 1.7.2 and see if you 
still see the problem.  Maybe someone else has suggestions too.

Rolf

PS: For those who missed a private email, I had Max add some instrumentation so 
we could see which list was growing.  We now know it is the 
mca_pml_base_send_requests list.


-Original Message-
From: Max Staufer [mailto:max.stau...@gmx.net]
Sent: Friday, September 13, 2013 7:06 AM
To: Rolf vandeVaart;de...@open-mpi.org
Subject: Re: [OMPI devel] Nearly unlimited growth of pml free list

Hi Rolf,

I applied your patch. The full output is rather big (> 10 MB even gzipped),
which is not good for the mailing list, but the head and tail are below for a
7- and an 8-processor run.
It seems that the send requests are growing fast: 4000 times in just 10
minutes.

Do you know of a method to bound the list so that it does not grow
excessively?

thanks

Max

7 Processor run
--
[gpu207.dev-env.lan:11236] Iteration = 0 sleeping
[gpu207.dev-env.lan:11236] Freelist=rdma_frags, numAlloc=4, maxAlloc=-1
[gpu207.dev-env.lan:11236] Freelist=recv_frags, numAlloc=4, maxAlloc=-1
[gpu207.dev-env.lan:11236] Freelist=pending_pckts, numAlloc=4, maxAlloc=-1
[gpu207.dev-env.lan:11236] Freelist=send_ranges_pckts, numAlloc=4, maxAlloc=-1
[gpu207.dev-env.lan:11236] Freelist=send_requests, numAlloc=4, maxAlloc=-1
[gpu207.dev-env.lan:11236] Freelist=recv_requests, numAlloc=4, maxAlloc=-1
[gpu207.dev-env.lan:11236] rdma_pending=0, pckt_pending=0, recv_pending=0, send_pending=0, comm_pending=0
[gpu207.dev-env.lan:11236]
[gpu207.dev-env.lan:11236] Iteration = 0 sleeping
[gpu207.dev-env.lan:11236] Freelist=rdma_frags, numAlloc=4, maxAlloc=-1
[gpu207.dev-env.lan:11236] Freelist=recv_frags, numAlloc=4, maxAlloc=-1
[gpu207.dev-env.lan:11236] Freelist=pending_pckts, numAlloc=4, maxAlloc=-1
[gpu207.dev-env.lan:11236] Freelist=send_ranges_pckts, numAlloc=4, maxAlloc=-1
[gpu207.dev-env.lan:11236] Freelist=send_requests, numAlloc=4, maxAlloc=-1
[gpu207.dev-env.lan:11236] Freelist=recv_requests, numAlloc=4, maxAlloc=-1
[gpu207.dev-env.lan:11236] rdma_pending=0, pckt_pending=0, recv_pending=0, send_pending=0, comm_pending=0
[gpu207.dev-env.lan:11236]
[gpu207.dev-env.lan:11236] Iteration = 0 sleeping
[gpu207.dev-env.lan:11236] Freelist=rdma_frags, numAlloc=4, maxAlloc=-1
[gpu207.dev-env.lan:11236] Freelist=recv_frags, numAlloc=4, maxAlloc=-1
[gpu207.dev-env.lan:11236] Freelist=pending_pckts, numAlloc=4, maxAlloc=-1
[gpu207.dev-env.lan:11236] Freelist=send_ranges_pckts, numAlloc=4, maxAlloc=-1
[gpu207.dev-env.lan:11236] Freelist=send_requests, numAlloc=4, maxAlloc=-1
[gpu207.dev-env.lan:11236] Freelist=recv_requests, numAlloc=4, maxAlloc=-1
[gpu207.dev-env.lan:11236] rdma_pending=0, pckt_pending=0, recv_pending=0, send_pending=0, comm_pending=0
[gpu207.dev-env.lan:11236]
[gpu207.dev-env.lan:11236] Iteration = 0 sleeping
[gpu207.dev-env.lan:11236] Freelist=rdma_frags, numAlloc=4, maxAlloc=-1
[gpu207.dev-env.lan:11236] Freelist=recv_frags, numAlloc=4, maxAlloc=-1
[gpu207.dev-env.lan:11236] Freelist=pending_pckts, numAlloc=4, maxAlloc=-1 [gpu207.dev-env.lan:1123

Re: [OMPI devel] Nearly unlimited growth of pml free list

2013-10-01 Thread George Bosilca
With a size of 3 doubles, all requests go out in eager mode. The data is 
copied into our internal buffers, and the MPI request is marked as complete 
(this is deep MPI voodoo; I'm just trying to explain the next sentence). Thus 
all sends look asynchronous from the user's perspective: the transfer happens 
after the request has been returned as complete, but before we release it 
internally. Now, if you have millions of such calls, I can imagine the driver 
becoming overloaded and requests piling up, in a way that looks like the 
request list is growing without limit.
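
In other words, the eager path looks conceptually like this (a sketch only, 
not the actual PML/BTL code; names are illustrative):

#include <string.h>

/* Conceptual sketch of the eager path, not the actual PML/BTL code.
 * No bounds checking; sketch only. */
typedef struct { int complete; } request_t;

static char internal_buffer[64];          /* library-owned eager buffer */

static void eager_send(request_t *req, const void *data, size_t len)
{
    memcpy(internal_buffer, data, len);   /* user data is copied away... */
    req->complete = 1;                    /* ...and the request is marked
                                             complete immediately, even
                                             though the buffer is pushed
                                             onto the wire (and the request
                                             object released) only later. */
}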

Let's try to see if this is indeed the case:

1. Set the eager limit for your network to 0 (this will force all messages to 
go via the rendezvous protocol). For this, find out which network you are 
using (maybe via the --mca btl parameter you provided), and set its eager 
limit to 0. For example, for TCP you can use "--mca btl_tcp_eager_limit 0".

2. Alter your code to add a barrier every K recursions (K should be a large 
value, like a few hundred). This will provide a means for the network to be 
drained; a sketch follows after this list.

3. Are you sure you have no MPI_Isend of a similar size in your code that is 
not correctly completed?
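
Here is a sketch of suggestion 2 (K, the counter, and all names are 
illustrative; it assumes every rank recurses in lockstep, otherwise the 
barrier would deadlock):

#include <mpi.h>

/* Sketch of suggestion 2 above: a barrier every K recursions to let the
 * network drain.  (For suggestion 1, the run itself would then look
 * like: mpirun --mca btl_tcp_eager_limit 0 ./a.out, per item 1.) */
#define K 500                       /* "a large value like a few hundred" */
static long call_count = 0;

static void recurse_with_drain(MPI_Comm comm, int depth)
{
    double local[3] = {1.0, 2.0, 3.0}, global[3];

    MPI_Allreduce(local, global, 3, MPI_DOUBLE, MPI_SUM, comm);

    if (++call_count % K == 0)
        MPI_Barrier(comm);          /* give pending eager sends a chance
                                       to finish and return their requests */
    if (depth > 0)
        recurse_with_drain(comm, depth - 1);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    recurse_with_drain(MPI_COMM_WORLD, 1000);
    MPI_Finalize();
    return 0;
}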

  George.

On Oct 1, 2013, at 09:41 , Max Staufer  wrote:

> George,
> 
> well the code itself runs fine; it's just that the Open MPI send list keeps 
> allocating memory, and I pinpointed it to this single call.
> Probably the root problem is elsewhere, but it appears to me that the 
> entries in the send list are not released for reuse after the
> operation has completed.
> 
> The size of the operation is 3 doubles.
> 
> Max
> 
> [earlier quoted messages and instrumentation output snipped]

[OMPI devel] OMPI dev meeting: December?

2013-10-01 Thread Jeff Squyres (jsquyres)
Ralph brought up the idea of a face-to-face developers meeting yesterday.  How 
about piggybacking on the December Chicago MPI Forum meeting?

1. The Forum starts 1pm Mon, Dec 9, and runs through ?11am? Thu, Dec 12 (it 
might be later -- like noon or something; I don't remember).  It's at the 
Microsoft facility in the Aon Center in downtown Chicago.

2. Since the Forum will be there on Thursday morning, Fab/Microsoft probably 
wouldn't mind booking us a room for the rest of the afternoon.

3. If we want to meet at all on Friday, there's a Cisco office in Rosemont; we 
have booked a meeting room for all day in case this is useful (we can cancel at 
any time).  The Cisco office is quite close to ORD (it is not downtown).  One 
option might be to meet on Friday until, say, noon or so, and people can easily 
fly out on Friday afternoon.

Thoughts?

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] OMPI dev meeting: December?

2013-10-01 Thread Nathan Hjelm

On Tue, Oct 01, 2013 at 03:42:15PM +, Jeff Squyres (jsquyres) wrote:
> Ralph brought up the idea of a face-to-face developers meeting yesterday.  
> How about piggybacking on the December Chicago MPI Forum meeting?
> 
> 1. The Forum starts 1pm Mon, Dec 9, and runs through ?11am? Thu, Dec 12 (it 
> might be later -- like noon or something; I don't remember).  It's at the 
> Microsoft facility in the Aon Center in downtown Chicago.
> 
> 2. Since the Forum will be there on Thursday morning, Fab/Microsoft 
> probably wouldn't mind booking us a room for the rest of the afternoon.
> 
> 3. If we want to meet at all on Friday, there's a Cisco office in Rosemont; 
> we have booked a meeting room for all day in case this is useful (we can 
> cancel at any time).  The Cisco office is quite close to ORD (it is not 
> downtown).  One option might be to meet on Friday until, say, noon or so, and 
> people can easily fly out on Friday afternoon.

A little out of the way for me (I prefer MDW), but it sounds good. Let's try 
to nail this down by the end of the week so I can adjust my lodging and get 
my travel approval started.

-Nathan


Re: [OMPI devel] [EXTERNAL] Re: OMPI dev meeting: December?

2013-10-01 Thread Barrett, Brian W
On 10/1/13 9:46 AM, "Nathan Hjelm"  wrote:

>
>On Tue, Oct 01, 2013 at 03:42:15PM +, Jeff Squyres (jsquyres) wrote:
>> [Jeff's meeting proposal snipped]
>
>A little out of the way for me (I prefer MDW), but it sounds good. Let's try
>to nail this down by the end of the week so I can adjust my lodging and
>get my travel approval started.

I could catch a flight out of ORD instead of MDW if required, but would
prefer not to have to get a rental car; is there a taxiable option to the
Rosemont facility?

Brian

--
  Brian W. Barrett
  Scalable System Software Group
  Sandia National Laboratories







Re: [OMPI devel] [EXTERNAL] Re: OMPI dev meeting: December?

2013-10-01 Thread Jeff Squyres (jsquyres)
On Oct 1, 2013, at 11:56 AM, "Barrett, Brian W"  wrote:

> I could catch a flight out of ORD instead of MDW if required, but would
> prefer not to have to get a rental car; is there a taxiable option to the
> Rosemont facility?


Yes.  You can taxi or train to the Cisco office from ORD.

We have had OMPI dev meetings there before -- do you remember the facility?  
It's a walkable distance from where we used to have the MPI Forum meetings in 
Rosemont.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] [EXTERNAL] Re: OMPI dev meeting: December?

2013-10-01 Thread David Goodell (dgoodell)
On Oct 1, 2013, at 11:08 AM, "Jeff Squyres (jsquyres)" 
 wrote:

> On Oct 1, 2013, at 11:56 AM, "Barrett, Brian W"  wrote:
> 
>> I could catch a flight out of ORD instead of MDW if required, but would
>> prefer not to have to get a rental car; is there a taxiable option to the
>> Rosemont facility?
> 
> Yes.  You can taxi or train to the Cisco office from ORD.
> 
> We have had OMPI dev meetings there before -- do you remember the facility?  
> It's a walkable distance from where we used to have the MPI Forum meetings in 
> Rosemont.

This might help: 

http://goo.gl/maps/RyBqV

-Dave



Re: [OMPI devel] [EXTERNAL] Re: OMPI dev meeting: December?

2013-10-01 Thread Nathan Hjelm
OK, that should be fine. It's a long way to MDW on the L (> 1 hr), but it 
can't be helped.

-Nathan

On Tue, Oct 01, 2013 at 04:09:43PM +, David Goodell (dgoodell) wrote:
> [quoted context snipped]
> 
> This might help: 
> 
> http://goo.gl/maps/RyBqV
> 
> -Dave
> 


Re: [OMPI devel] [EXTERNAL] Re: OMPI dev meeting: December?

2013-10-01 Thread David Goodell (dgoodell)
A taxi or shuttle would be faster and is probably not unreasonably expensive:

http://www.flychicago.com/OHare/EN/GettingToFrom/Transport_Between_Airports/Transport-Between-Airports.aspx

-Dave

On Oct 1, 2013, at 11:30 AM, Nathan Hjelm  wrote:

> OK, that should be fine. It's a long way to MDW on the L (> 1 hr), but it 
> can't be helped.
> 
> -Nathan
> 
> [quoted context snipped]



Re: [OMPI devel] [EXTERNAL] Re: OMPI dev meeting: December?

2013-10-01 Thread Ralph Castain
My only concern is that we've tried tagging on the end of the Forum before, and 
I'm not convinced there is adequate time. This schedule essentially leaves us 
some time Thurs afternoon and then Fri morning, which is hardly worth the trip 
for us non-Forum types.

If we want to really tackle thread safety, expanded MPI rank definitions, 
enhanced connection security (CORAL requires that we only allow messages 
between "authorized" jobs, but support full MPI-2 dynamics), etc., we'll need 
more than a couple of hours.


On Oct 1, 2013, at 9:50 AM, David Goodell (dgoodell)  wrote:

> A taxi or shuttle would be faster and is probably not unreasonably expensive:
> 
> http://www.flychicago.com/OHare/EN/GettingToFrom/Transport_Between_Airports/Transport-Between-Airports.aspx
> 
> -Dave
> 
> [quoted thread snipped]



Re: [OMPI devel] [EXTERNAL] Re: OMPI dev meeting: December?

2013-10-01 Thread Jeff Squyres (jsquyres)
Fair enough.  Got a counter proposal?  :-)

On Oct 1, 2013, at 6:07 PM, Ralph Castain  wrote:

> My only concern is that we've tried tagging on the end of the Forum before, 
> and I'm not convinced there is adequate time. This schedule essentially 
> leaves us some time Thurs afternoon and then Fri morning, which is hardly 
> worth the trip for us non-Forum types.
> 
> If we want to really tackle thread safety, expanded MPI rank definitions, 
> enhanced connection security (CORAL requires that we only allow messages 
> between "authorized" jobs, but support full MPI-2 dynamics), etc., we'll need 
> more than a couple of hours.
> 
> 
> [quoted thread snipped]


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] [EXTERNAL] Re: OMPI dev meeting: December?

2013-10-01 Thread Ralph Castain
Depends on people's travel ability - are people able to do a dedicated trip? 
If not, then we may just have to focus the meeting tightly on one or two 
topics, I suppose.


On Oct 1, 2013, at 3:18 PM, "Jeff Squyres (jsquyres)"  
wrote:

> Fair enough.  Got a counter proposal?  :-)
> 
> [quoted thread snipped]



Re: [OMPI devel] [EXTERNAL] Re: OMPI dev meeting: December?

2013-10-01 Thread Barrett, Brian W
Scheduling's going to be difficult with SC, Thanksgiving, MPI Forum, and
Christmas, but I could do a separate trip.  Once I get this stupid
one-sided code finished, I'm supposed to work on MPI_THREAD_MULTIPLE
support for the MTLs for a Sandia project, so it's pretty easy to justify.

Sandia would be willing to host, but ABQ probably wouldn't be the most
convenient place for most people to get to (also, not only don't we have
beer in the fridge, you have to pay for the pop).

Brian

On 10/1/13 4:26 PM, "Ralph Castain"  wrote:

>Depends on people's travel ability - are people able to do a dedicated
>trip? If not, then we may just have to focus the meeting tightly on one or
>two topics, I suppose.
>
>
>[quoted thread snipped]


--
  Brian W. Barrett
  Scalable System Software Group
  Sandia National Laboratories







[OMPI devel] 1.7.3rc2 is out

2013-10-01 Thread Jeff Squyres (jsquyres)
In the usual place:

http://www.open-mpi.org/software/ompi/v1.7/

Please test.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] 1.7.3rc2 is out

2013-10-01 Thread Joshua Ladd
I am getting the following failure on two hosts, 1 proc per host. We are 
running SLURM 2.6.2.

joshual@mir13 ~/ompi_1.7/openmpi-1.7.3rc1/examples 
$mpirun -np 2 -bynode hostname
*** buffer overflow detected ***: mpirun terminated
======= Backtrace: =========
/lib64/libc.so.6(__fortify_fail+0x37)[0x3e564ff3f7]
/lib64/libc.so.6[0x3e564fd2e0]
/lib64/libc.so.6[0x3e564fc9db]
/lib64/libc.so.6(__snprintf_chk+0x7a)[0x3e564fc8aa]
/usr/lib64/libpmi2.so.0(+0x18ff)[0x7ff671cda8ff]
/usr/lib64/libpmi2.so.0(+0x1e5b)[0x7ff671cdae5b]
/usr/lib64/libpmi2.so.0(PMI2_Job_GetId+0x98)[0x7ff671cddc58]
/hpc/home/USERS/joshual/ompi_1.7/openmpi-1.7.3rc1/ompi-1.7_install/lib/openmpi/mca_db_pmi.so(+0x177d)[0x7ff671ef477d]
/hpc/home/USERS/joshual/ompi_1.7/openmpi-1.7.3rc1/ompi-1.7_install/lib/libopen-pal.so.5(opal_db_base_select+0xdd)[0x7ff6737eb33d]
/hpc/home/USERS/joshual/ompi_1.7/openmpi-1.7.3rc1/ompi-1.7_install/lib/openmpi/mca_ess_hnp.so(+0x3266)[0x7ff673123266]
/hpc/home/USERS/joshual/ompi_1.7/openmpi-1.7.3rc1/ompi-1.7_install/lib/libopen-rte.so.5(orte_init+0x17f)[0x7ff673a7251f]
mpirun(orterun+0x6b5)[0x403e25]
mpirun(main+0x20)[0x4035c4]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x3e5641ecdd]
mpirun[0x4034e9]
======= Memory map: ========
0040-0040d000 r-xp  00:18 29289341   
/hpc/home/USERS/joshual/ompi_1.7/openmpi-1.7.3rc1/ompi-1.7_install/bin/orterun
0060c000-0060e000 rw-p c000 00:18 29289341   
/hpc/home/USERS/joshual/ompi_1.7/openmpi-1.7.3rc1/ompi-1.7_install/bin/orterun
0060e000-0060f000 rw-p  00:00 0 
00b8b000-00ccd000 rw-p  00:00 0  [heap]
3e55c0-3e55c2 r-xp  08:01 620577 
/lib64/ld-2.12.so
3e55e1f000-3e55e2 r--p 0001f000 08:01 620577 
/lib64/ld-2.12.so
3e55e2-3e55e21000 rw-p 0002 08:01 620577 
/lib64/ld-2.12.so
3e55e21000-3e55e22000 rw-p  00:00 0 
3e5600-3e56002000 r-xp  08:01 620581 
/lib64/libdl-2.12.so
3e56002000-3e56202000 ---p 2000 08:01 620581 
/lib64/libdl-2.12.so
3e56202000-3e56203000 r--p 2000 08:01 620581 
/lib64/libdl-2.12.so
3e56203000-3e56204000 rw-p 3000 08:01 620581 
/lib64/libdl-2.12.so
3e5640-3e56597000 r-xp  08:01 620578 
/lib64/libc-2.12.so
3e56597000-3e56797000 ---p 00197000 08:01 620578 
/lib64/libc-2.12.so
3e56797000-3e5679b000 r--p 00197000 08:01 620578 
/lib64/libc-2.12.so
3e5679b000-3e5679c000 rw-p 0019b000 08:01 620578 
/lib64/libc-2.12.so
3e5679c000-3e567a1000 rw-p  00:00 0 
3e5680-3e56883000 r-xp  08:01 620585 
/lib64/libm-2.12.so
3e56883000-3e56a82000 ---p 00083000 08:01 620585 
/lib64/libm-2.12.so
3e56a82000-3e56a83000 r--p 00082000 08:01 620585 
/lib64/libm-2.12.so
3e56a83000-3e56a84000 rw-p 00083000 08:01 620585 
/lib64/libm-2.12.so
3e56c0-3e56c17000 r-xp  08:01 620580 
/lib64/libpthread-2.12.so
3e56c17000-3e56e16000 ---p 00017000 08:01 620580 
/lib64/libpthread-2.12.so
3e56e16000-3e56e17000 r--p 00016000 08:01 620580 
/lib64/libpthread-2.12.so
3e56e17000-3e56e18000 rw-p 00017000 08:01 620580 
/lib64/libpthread-2.12.so
3e56e18000-3e56e1c000 rw-p  00:00 0 
3e5740-3e57407000 r-xp  08:01 620582 
/lib64/librt-2.12.so
3e57407000-3e57606000 ---p 7000 08:01 620582 
/lib64/librt-2.12.so
3e57606000-3e57607000 r--p 6000 08:01 620582 
/lib64/librt-2.12.so
3e57607000-3e57608000 rw-p 7000 08:01 620582 
/lib64/librt-2.12.so
3e57c0-3e57c08000 r-xp  08:01 228599 
/usr/lib64/libpciaccess.so.0.10.8
3e57c08000-3e57e07000 ---p 8000 08:01 228599 
/usr/lib64/libpciaccess.so.0.10.8
3e57e07000-3e57e08000 rw-p 7000 08:01 228599 
/usr/lib64/libpciaccess.so.0.10.8
3e5a40-3e5a416000 r-xp  08:01 620586 
/lib64/libgcc_s-4.4.6-20110824.so.1
3e5a416000-3e5a615000 ---p 00016000 08:01 620586 
/lib64/libgcc_s-4.4.6-20110824.so.1
3e5a615000-3e5a616000 rw-p 00015000 08:01 620586 
/lib64/libgcc_s-4.4.6-20110824.so.1
3e6600-3e66016000 r-xp  08:01 620617 
/lib64/libnsl-2.12.so
3e66016000-3e66215000 ---p 00016000 08:01 620617 
/lib64/libnsl-2.12.so
3e66215000-3e66216000 r--p 00015000 08:01 620617 
/lib64/libnsl-2.12.so
3e66216000-3e66217000 rw-p 00016000 08:01 620617 

[OMPI devel] Trunk is broken

2013-10-01 Thread Joshua Ladd
Also getting a compile failure in the trunk:

./autogen.pl && ./configure 
--prefix=/hpc/home/USERS/joshual/ompi_trunk/really-the-trunk/ompi-install  
--with-mxm=/hpc/local/src/mxm2_release --with-fca=/opt/mellanox/fca --with-pmi 
&& make -j 9 && make install

  CC   ess_slurm_module.lo
  CCLD mca_ess_slurm.la
make[2]: Leaving directory 
`/hpc/home/USERS/joshual/ompi_trunk/really-shmem-trunk/orte/mca/ess/slurm'
Making all in mca/ess/pmi
make[2]: Entering directory 
`/hpc/home/USERS/joshual/ompi_trunk/really-shmem-trunk/orte/mca/ess/pmi'
  CC   ess_pmi_component.lo
  CC   ess_pmi_module.lo
ess_pmi_module.c: In function 'rte_init':
ess_pmi_module.c:285: warning: comparison between signed and unsigned integer 
expressions
ess_pmi_module.c:321: error: 'procs' undeclared (first use in this function)
ess_pmi_module.c:321: error: (Each undeclared identifier is reported only once
ess_pmi_module.c:321: error: for each function it appears in.)
make[2]: *** [ess_pmi_module.lo] Error 1
make[2]: Leaving directory 
`/hpc/home/USERS/joshual/ompi_trunk/really-shmem-trunk/orte/mca/ess/pmi'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory 
`/hpc/home/USERS/joshual/ompi_trunk/really-shmem-trunk/orte'
make: *** [all-recursive] Error 1

Joshua S. Ladd, PhD
HPC Algorithms Engineer
Mellanox Technologies

Email: josh...@mellanox.com
Cell: +1 (865) 258 - 8898




Re: [OMPI devel] Trunk is broken

2013-10-01 Thread Ralph Castain
Ya, I'm fixing it now.

On Oct 1, 2013, at 5:55 PM, Joshua Ladd  wrote:

> Also getting a compile failure in the trunk:
>  
> [configure command and build log snipped]



Re: [OMPI devel] 1.7.3rc2 is out

2013-10-01 Thread Ralph Castain
Hmmm... working for me, though with an earlier version of Slurm.

It looks like you are seeing a failure in PMI2_Job_GetId in the libpmi2 
support. I wonder if you have a problem in that library? Can you check the 
arguments going into it? Perhaps the max value length is too big or something?

You might also try without the pmi2 lib and see if that fixes it.
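
For reference, the "*** buffer overflow detected ***" abort comes from 
glibc's fortify checks. A minimal, hypothetical illustration of that failure 
class (not the actual libpmi2 code) is:

#include <stdio.h>

/* Hypothetical illustration, not the actual libpmi2 code.  Built with
 * -O2 -D_FORTIFY_SOURCE=2, snprintf is rewritten to __snprintf_chk,
 * which aborts with "*** buffer overflow detected ***" as soon as the
 * length passed in (64) exceeds the real size of the destination
 * buffer (16), the same symptom as in the backtrace above. */
int main(void)
{
    char jobid[16];
    snprintf(jobid, 64, "%s", "a-jobid-string-longer-than-16-bytes");
    return 0;
}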


On Oct 1, 2013, at 5:21 PM, Joshua Ladd  wrote:

> I am getting the following failure on two hosts, 1 proc per host. We are 
> running SLURM 2.6.2.
> 
> joshual@mir13 ~/ompi_1.7/openmpi-1.7.3rc1/examples 
> $mpirun -np 2 -bynode hostname
> *** buffer overflow detected ***: mpirun terminated
> [backtrace and memory map snipped]