Re: [OMPI devel] OMPI v1.8.x git tags?

2014-05-13 Thread Mike Dubman
should be fixed.
thanks


On Tue, May 13, 2014 at 2:53 AM, Joshua Ladd  wrote:

> Yes. Will look into it.
>
> Josh
>
>
> On Mon, May 12, 2014 at 6:01 PM, Jeff Squyres (jsquyres) <
> jsquy...@cisco.com> wrote:
>
>> Ah; I guess the tags aren't getting pulled over.
>>
>> Mellanox -- can you check into this?
>>
>>
>>
>> On May 12, 2014, at 5:52 PM, "Friedley, Andrew" <
>> andrew.fried...@intel.com> wrote:
>>
>> > Hi,
>> >
>> > I'm looking at the OMPI svn mirror on github and there don't appear to
>> be any release tags for v1.8.x; the most recent appears to be v1.7.2.  Are
>> there any plans to add tags for any more releases?
>> >
>> > Thanks,
>> >
>> > Andrew


Re: [OMPI devel] OMPI v1.8.x git tags?

2014-05-13 Thread Jeff Squyres (jsquyres)
Hmm.  The last tag I see on github is still 1.7.2.


On May 13, 2014, at 2:11 AM, Mike Dubman  wrote:

> should be fixed.
> thanks

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] OMPI v1.8.x git tags?

2014-05-13 Thread Friedley, Andrew
I see a v1.8.1, but no v1.8.0, is that correct?

Andrew

> -Original Message-
> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Jeff
> Squyres (jsquyres)
> Sent: Tuesday, May 13, 2014 3:15 AM
> To: Open MPI Developers
> Subject: Re: [OMPI devel] OMPI v1.8.x git tags?
> 
> Hmm.  The last tag I see on github is still 1.7.2.


Re: [OMPI devel] OMPI v1.8.x git tags?

2014-05-13 Thread Jeff Squyres (jsquyres)
I think Mellanox is still working on it.


On May 13, 2014, at 10:57 AM, "Friedley, Andrew"  
wrote:

> I see a v1.8.1, but no v1.8.0, is that correct?
> 
> Andrew


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



[OMPI devel] Minutes of Open MPI ConCall Meeting - Tuesday, May 13, 2014

2014-05-13 Thread Rolf vandeVaart
Open MPI 1.6:

-  Release was waiting on https://svn.open-mpi.org/trac/ompi/ticket/3079,
but during the meeting we decided it was not necessary.  Therefore, Jeff
will go ahead and roll Open MPI 1.6.6 RC1.
Open MPI 1.8:

-  Several tickets have been applied.  There was some discussion about other
tickets, but the details are too numerous to capture here.

-  Still having issues with 0-sized messages and MPI_Alltoallw.  I think this
is being tracked in ticket https://svn.open-mpi.org/trac/ompi/ticket/4506.
Jeff will poke a few folks to get a fix for that issue moving.

-  Still a memory leak in some component.  Nathan is looking at that issue
and hopes to have a fix soon.

-  Everyone is encouraged to review their CMRs and change the owner once
review is done.

Other:

RFC: autogen.sh removal is approved.  Bye, bye autogen.sh

Round Table





[OMPI devel] Non-uniform BTL problems in: openib, tcp, sctp, portals4, vader, scif

2014-05-13 Thread Jeff Squyres (jsquyres)
I notice that BTLs are not checking the return value from ompi_modex_recv() for
OPAL_ERR_DATA_VALUE_NOT_FOUND (indicating that the peer process didn't put that
modex key).  In the BTL context, NOT_FOUND means that the peer process doesn't
have this BTL, so the local process should probably mark that peer as
unreachable in add_procs().

This is on both trunk and the v1.8 branch.

The BTLs listed above are not checking/handling ompi_modex_recv() returning 
OPAL_ERR_DATA_VALUE_NOT_FOUND properly.  Most of these BTLs do something like 
this:

-
module_add_procs() {
  loop over the peers {
    proc = proc_create(...)
    if (NULL == proc)
      error!   /* ...but NULL here may just mean "peer has no such BTL" */
  }
}

proc_create(...) {
  /* NOT_FOUND and genuine errors are collapsed into the same NULL */
  if (ompi_modex_recv() != OMPI_SUCCESS)
     return NULL;
  ...
}
-

The fix is to make proc_create() return something a bit more expressive so that 
add_procs() can tell the difference between "error!" and "you can't reach this 
peer".

I fixed this in the usnic BTL back in late March, but forgot to bring this to 
everyone's attention -- oops.  See 
https://svn.open-mpi.org/trac/ompi/ticket/4442

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



[OMPI devel] 1.6.6rc1 tarball posted

2014-05-13 Thread Jeff Squyres (jsquyres)
Now that the 1.8 series is out, we're going to do one final release in the 
1.6.x series, just so that the few bug fixes that came in after 1.6.5 can get 
out into the world (for those who are unable to upgrade to the v1.8 series).

1.6.6rc1 has been posted:

http://www.open-mpi.org/software/ompi/v1.6/

There will be no MS Windows binary version posted for 1.6.6.

Please test!

Here's the list of changes since 1.6.5:

- Prevent integer overflow in datatype creation.  Thanks to Gilles
  Gouaillardet for identifying the problem and providing a preliminary
  version of the patch.
- Ensure help-opal-hwloc-base.txt is included in distribution
  tarballs.  Thanks to Gilles Gouaillardet for supplying the patch.
- Correctly handle the invalid status for NULL and inactive requests.
  Thanks to KAWASHIMA Takahiro for submitting the initial patch.
- Fixed MPI_STATUS_SIZE Fortran issue when used with 8-byte Fortran
  INTEGERs and 4-byte C ints.  Since this issue affects ABI, it is
  only enabled if Open MPI is configured with
  --enable-abi-breaking-fortran-status-i8-fix.  Thanks to Jim Parker
  for supplying the initial patch.
- Fix datatype issue for sending from the middle of non-contiguous
  data.
- Fixed a failure with pty support.  Thanks to Michal Pecio for
  the patch.
- Fixed debugger support for direct-launched jobs.
- Fix MPI_IS_THREAD_MAIN to return the correct value.  Thanks to
  Lisandro Dalcin for pointing out the issue.
- Update VT to 5.14.4.4:
  - Fix C++-11 issue.
  - Fix support for building RPMs on Fedora with CUDA libraries.
- Add openib part number for ConnectX3-Pro HCA.
- Ensure we check that all resolved IP addresses are local.
- Fix MPI_COMM_SPAWN via rsh when mpirun is on a different server.
- Add Gentoo "sandbox" memory hooks override.


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



[OMPI devel] opal_free_list_t annoyance

2014-05-13 Thread Nathan Hjelm
While tracking down memory leaks in components I ran into an interesting
issue. osc/rdma uses an opal_free_list_t (not an ompi_free_list_t) for
buffer fragments. The fragment class allocates a buffer as part of the
constructor and frees the buffer in the destructor. The problem is that
the item constructor is called but the destructor never is.

I looked into the issue and I see what is happening. When growing the free
list we call the constructor for each item we allocate (see
opal_free_list.c:113), but the free list destructor does not invoke the
item destructor. This differs from ompi_free_list_t, which does invoke
the destructor on each constructed item.

The question is: is this difference intentional? It seems a little odd
that the free list does not call the item destructor given that it
calls the constructor. If it is intentional, is there a reason for this
behavior? If not, I plan on "fixing" the opal_free_list_t destructor to
call the item destructor.
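
For reference, the "fix" I have in mind looks roughly like this (sketch
only -- I'm assuming the LIFO super and an fl_allocations list of slabs,
so the member names may not match the tree exactly):

-
static void opal_free_list_destruct(opal_free_list_t *fl)
{
    opal_list_item_t *item;

    /* Run the item destructor on every constructed item, balancing the
     * constructor calls made when the list grew (opal_free_list.c:113).
     * This mirrors what ompi_free_list_t already does. */
    while (NULL != (item = opal_atomic_lifo_pop(&fl->super))) {
        OBJ_DESTRUCT(item);
    }

    /* The items themselves live inside larger slab allocations, so only
     * the slabs are freed here, as before. */
    while (NULL != (item = opal_list_remove_first(&fl->fl_allocations))) {
        free(item);
    }
}
-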

-Nathan





[OMPI devel] about btl/scif thread cancellation (#4616 / r31738)

2014-05-13 Thread Gilles Gouaillardet
Folks,

I would like to comment on r31738:

> There is no reason to cancel the listening thread. It should die
> automatically when the file descriptor is closed.
I could not agree more.
> It is sufficient to just wait for the thread to exit with pthread join.
Unfortunately, at least in my test environment (an outdated MPSS 2.1), it
is *not* sufficient :-(

This is what I described in #4615
(https://svn.open-mpi.org/trac/ompi/ticket/4615), to which I attached
scif_hang.c. It shows that (at least in my environment) scif_poll(...)
does *not* return after scif_close(...) is called, and hence the scif
pthread never ends.
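
In a nutshell, scif_hang.c does something like this (a from-memory
sketch, not the attached file itself):

-
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>
#include <scif.h>

static scif_epd_t epd;

static void *poller(void *arg)
{
    struct scif_pollepd pe = { .epd = epd, .events = SCIF_POLLIN };
    /* Infinite timeout: scif_poll() should fail once the endpoint is
     * closed, but on my MPSS 2.1 it simply never returns. */
    int rc = scif_poll(&pe, 1, -1);
    printf("scif_poll returned %d\n", rc);
    return NULL;
}

int main(void)
{
    pthread_t t;
    epd = scif_open();
    scif_bind(epd, 0);       /* any port */
    scif_listen(epd, 5);
    pthread_create(&t, NULL, poller, NULL);
    sleep(1);
    scif_close(epd);         /* should wake up the poller ... */
    pthread_join(t, NULL);   /* ... but hangs here forever */
    return 0;
}
-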

This is likely a bug in MPSS, and it may have been fixed in a later
release.

Nathan, could you try scif_hang in your environment and report the MPSS
version you are running?


Bottom line, and once again: in my test environment, pthread_join(...)
without pthread_cancel(...) may cause a hang when the btl/scif module is
released.


Assuming the bug is in old MPSS and has been fixed in recent releases,
what is the OpenMPI policy?
a) test the MPSS version, and call pthread_cancel() or do *not* call
   pthread_join() if a buggy MPSS is detected?
b) display an error/warning if a buggy MPSS is detected?
c) do not call pthread_join() at all? /* a SIGSEGV might occur with older
   MPSS, but it is in MPI_Finalize() so the impact is limited */
d) do nothing and let the btl/scif module hang; this is *not* an OpenMPI
   problem after all?
e) something else?

Gilles


Re: [OMPI devel] about btl/scif thread cancellation (#4616 / r31738)

2014-05-13 Thread Ralph Castain
It could be a bug in the software stack, though I wouldn't count on it. 
Unfortunately, pthread_cancel is known to have bad side effects, and so we 
avoid its use.

The key here is that the thread must detect that the file descriptor has closed 
and exit, or use some other method for detecting that it should terminate. We 
do this in multiple other places in the code, without using pthread_cancel and 
without hanging. So it is certainly doable.
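
For example (a generic poll()-based sketch, not the btl/scif code itself),
the usual approach is to poll a private pipe alongside the real descriptor
and write a byte to it at shutdown:

-
#include <poll.h>
#include <pthread.h>
#include <unistd.h>

static int stop_pipe[2];           /* created with pipe() at init time */

static void *listener(void *arg)
{
    int fd = *(int *) arg;
    struct pollfd fds[2] = {
        { .fd = fd,           .events = POLLIN },
        { .fd = stop_pipe[0], .events = POLLIN },
    };

    for (;;) {
        if (poll(fds, 2, -1) < 0) {
            continue;                    /* e.g. EINTR: retry */
        }
        if (fds[1].revents & POLLIN) {
            break;                       /* shutdown byte arrived */
        }
        if (fds[0].revents & POLLIN) {
            /* ... handle incoming traffic ... */
        }
    }
    return NULL;
}

static void stop_listener(pthread_t thread)
{
    char byte = 1;
    (void) write(stop_pipe[1], &byte, 1);   /* wake the poll() ... */
    pthread_join(thread, NULL);             /* ... then join safely */
}
-

Whether scif_poll() can watch an ordinary pipe fd is a separate question;
if it cannot, a finite poll timeout plus a shutdown flag achieves the same
effect.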

I don't know the specifics of why Nathan's code is having trouble exiting, but 
I suspect that a simple solution - not involving pthread_cancel - can be 
readily developed.


On May 13, 2014, at 7:18 PM, Gilles Gouaillardet 
 wrote:

> bottom line, and once again, in my test environment, pthread_join(...)
> without pthread_cancel(...) might cause a hang when the btl/scif module
> is released.



Re: [OMPI devel] about btl/scif thread cancellation (#4616 / r31738)

2014-05-13 Thread George Bosilca
I heard multiple references to pthread_cancel being known to have bad
side effects. Can somebody educate me on this topic please?

  Thanks,
George.



On Tue, May 13, 2014 at 10:25 PM, Ralph Castain  wrote:
> It could be a bug in the software stack, though I wouldn't count on it. 
> Unfortunately, pthread_cancel is known to have bad side effects, and so we 
> avoid its use.


Re: [OMPI devel] about btl/scif thread cancellation (#4616 / r31738)

2014-05-13 Thread Gilles Gouaillardet
Ralph,

scif_poll(...) is called with an infinite timeout.

A quick fix would be to use a finite timeout (1s? 10s? more?). The obvious
drawback is that the thread has to wake up every xxx seconds, and 99.9% of
the time it would wake up for nothing.

My analysis (see #4615) is that the crash occurs when btl/scif is unloaded
from memory (e.g. via dlclose()) while the scif thread is still running.
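
For concreteness, the workaround would look something like this (sketch
only; the flag, the endpoint variable, and the 1-second period are
illustrative choices, not the actual btl/scif names):

-
static scif_epd_t listen_epd;               /* assumed: set up elsewhere */
static volatile int btl_scif_exiting = 0;   /* set before pthread_join() */

static void *scif_listen_thread(void *arg)
{
    struct scif_pollepd pe = {
        .epd    = listen_epd,
        .events = SCIF_POLLIN,
    };

    while (!btl_scif_exiting) {
        /* 1000 ms instead of -1: the thread wakes up once per second
         * for nothing 99.9% of the time, but it can notice shutdown
         * even when scif_poll() never errors out after scif_close(). */
        int rc = scif_poll(&pe, 1, 1000);
        if (rc > 0 && (pe.revents & SCIF_POLLIN)) {
            /* ... scif_accept() and handle the connection ... */
        }
    }
    return NULL;
}
-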

Gilles

On 2014/05/14 11:25, Ralph Castain wrote:
> It could be a bug in the software stack, though I wouldn't count on it. 
> Unfortunately, pthread_cancel is known to have bad side effects, and so we 
> avoid its use.



Re: [OMPI devel] about btl/scif thread cancellation (#4616 / r31738)

2014-05-13 Thread Paul Hargrove
George,

Just my USD0.02:

With pthreads, many system calls (mostly those that might block) become
"cancellation points" where the implementation checks whether the calling
thread has been cancelled.
This means that a thread making any of those calls may simply never return
(the implementation calls pthread_exit() internally), unless extra work has
been done to prevent this default behavior.
This makes it very hard to write code that properly cleans up its
resources, including (but not limited to) file descriptors and malloc()ed
memory.
Even if Open MPI is written very carefully, one cannot assume that all the
libraries it calls (and their dependencies, etc.) are written to properly
deal with cancellation.
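
A minimal illustration of both the hazard and the extra work required to
contain it -- read() is one of the standard cancellation points:

-
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

static void cleanup(void *arg)
{
    /* Runs if the thread is cancelled while blocked in read(); without
     * this handler the buffer simply leaks, because the thread exits
     * inside read() and the free() below is never reached. */
    free(arg);
}

static void *worker(void *arg)
{
    int fd = *(int *) arg;
    char *buf = malloc(4096);

    pthread_cleanup_push(cleanup, buf);
    (void) read(fd, buf, 4096);  /* cancellation point: may never return */
    pthread_cleanup_pop(1);      /* normal path: run cleanup(buf) too */
    return NULL;
}
-

Multiply that by every library in the call stack and the difficulty
becomes clear.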

-Paul


On Tue, May 13, 2014 at 7:32 PM, George Bosilca  wrote:

> I heard multiple references to pthread_cancel being known to have bad
> side effects. Can somebody educate me on this topic please?
>
>   Thanks,
> George.
>



-- 
Paul H. Hargrove  phhargr...@lbl.gov
Future Technologies Group
Computer and Data Sciences Department Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900


Re: [OMPI devel] about btl/scif thread cancellation (#4616 / r31738)

2014-05-13 Thread Ralph Castain
As I said, this isn't the only thread that faces this issue, and we have 
resolved it elsewhere - surely we can resolve it here as well in an acceptable 
manner.

Nathan?


On May 13, 2014, at 7:33 PM, Gilles Gouaillardet 
 wrote:

> Ralph,
> 
> scif_poll(...) is called with an infinite timeout.



Re: [OMPI devel] about btl/scif thread cancellation (#4616 / r31738)

2014-05-13 Thread Ralph Castain
+1 - seen it before, and you'll find warnings across many software sites about 
this problem. Easy to have the main program segfault by touching the wrong 
thing after a cancel unless all the stars are properly aligned in the various 
libraries.



On May 13, 2014, at 7:56 PM, Paul Hargrove  wrote:

> George,
> 
> Just my USD0.02:
> 
> With pthreads many system calls (mostly those that might block) become 
> "cancellation points" where the implementation checks if the callinf thread 
> has been cancelled.
> This means that a thread making any of those calls may simply never return 
> (calling pthread_exit() internally), unless extra work has been done to 
> prevent this default behavior.
> This makes it very hard to write code that properly cleans up its resources, 
> including (but not limited to) file descriptors and malloc()ed memory.
> Even if Open MPI is written very carefully, one cannot assume that all the 
> libraries it calls (and their dependencies, etc.) are written to properly 
> deal with cancellation.
> 
> -Paul