Re: [OMPI devel] 1.8.2rc5 released

2014-08-22 Thread Ralph Castain
Looks okay - good to go


On Aug 22, 2014, at 12:09 PM, Jeff Squyres (jsquyres)  
wrote:

> No -- most of these were not user-visible, or they were fixes from fixes 
> post-1.8.1.
> 
> I think the relevant ones were put in NEWS already.  I'm recording a podcast 
> right now -- can you double check?
> 
> 
> 
> On Aug 22, 2014, at 2:42 PM, Ralph Castain  wrote:
> 
>> Did you update the NEWS with these?
>> 
>> On Aug 22, 2014, at 11:33 AM, Jeff Squyres (jsquyres)  
>> wrote:
>> 
>>> In the usual location:
>>> 
>>>  http://www.open-mpi.org/software/ompi/v1.8/
>>> 
>>> Changes since rc4:
>>> 
>>> - Add missing atomics stuff into the tarball
>>> - fortran: add missing bindings for WIN_SYNC, WIN_LOCK_ALL, WIN_UNLOCK_ALL
>>> - README updates
>>> - usnic: ensure to have a safe destruction of an opal_list_item_t
>>> - remove deprecation warnings for pernode, npernode, and npersocket
>>> - OOB updates: if an address fails, just try the next one
>>> - usnic: Fix connectivity checker pointer mismatch
>>> - btl/scif: use safe syntax
>>> - openib/btl: better detect max reg memory.
>>> 
>>> -- 
>>> Jeff Squyres
>>> jsquy...@cisco.com
>>> For corporate legal information go to: 
>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>> 
>>> ___
>>> devel mailing list
>>> de...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>> Link to this post: 
>>> http://www.open-mpi.org/community/lists/devel/2014/08/15694.php
>> 
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post: 
>> http://www.open-mpi.org/community/lists/devel/2014/08/15695.php
> 
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/08/15696.php



Re: [OMPI devel] OMPI devel] MPI_Abort does not make mpirun return with the right exit code

2014-08-22 Thread Ralph Castain
I think these are fixed now - at least, your test cases all pass for me


On Aug 22, 2014, at 9:12 AM, Ralph Castain  wrote:

> 
> On Aug 22, 2014, at 9:06 AM, Gilles Gouaillardet 
>  wrote:
> 
>> Ralph,
>> 
>> Will do on Monday
>> 
>> About the first test, in my case echo $? returns 0
> 
> My "showcode" is just an alias for the echo
> 
>> I noticed this confusing message in your output :
>> mpirun noticed that process rank 0 with PID 24382 on node bend002 exited on 
>> signal 0 (Unknown signal 0).
> 
> I'll take a look at why that happened
> 
>> 
>> About the second test, please note my test program return 3;
>> whereas your mpi_no_op.c return 0;
> 
> I didn't see that little cuteness - sigh
> 
>> 
>> Cheers,
>> 
>> Gilles
>> 
>> Ralph Castain  wrote:
>> You might want to try again with current head of trunk as something seems 
>> off in what you are seeing - more below
>> 
>> 
>> On Aug 22, 2014, at 3:12 AM, Gilles Gouaillardet 
>>  wrote:
>> 
>>> Ralph,
>>> 
>>> i tried again after the merge and found the same behaviour, though the
>>> internals are very different.
>>> 
>>> i run without any batch manager
>>> 
>>> from node0:
>>> mpirun -np 1 --mca btl tcp,self -host node1 ./abort
>>> 
>>> exit with exit code zero :-(
>> 
>> Hmmm...it works fine for me, without your patch:
>> 
>> 07:35:41  $ mpirun -n 1 -mca btl tcp,self -host bend002 ./abort
>> Hello, World, I am 0 of 1
>> --
>> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD 
>> with errorcode 2.
>> 
>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>> You may or may not see output from other processes, depending on
>> exactly when Open MPI kills them.
>> --
>> --
>> mpirun noticed that process rank 0 with PID 24382 on node bend002 exited on 
>> signal 0 (Unknown signal 0).
>> --
>> 07:35:56  $ showcode
>> 130
>> 
>>> 
>>> short story : i applied pmix.2.patch and that fixed my problem
>>> could you please review this ?
>>> 
>>> long story :
>>> i initially applied pmix.1.patch and it solved my problem
>>> then i ran
>>> mpirun -np 1 --mca btl openib,self -host node1 ./abort
>>> and i came back to square one : exit code is zero
>>> so i used the debugger and was unable to reproduce the issue
>>> (one more race condition, yeah !)
>>> finally, i wrote pmix.2.patch, fixed my issue and realized that
>>> pmix.1.patch was no longer needed.
>>> currently, and assuming pmix.2.patch is correct, i cannot tell whether
>>> pmix.1.patch is needed or not
>>> since this part of the code is no longer executed.
>>> 
>>> i also found one hang with the following trivial program within one node :
>>> 
>>> int main (int argc, char *argv[]) {
>>>     MPI_Init(&argc, &argv);
>>>     MPI_Finalize();
>>>     return 3;
>>> }
>>> 
>>> from node0 :
>>> $ mpirun -np 1 ./test
>>> ---
>>> Primary job  terminated normally, but 1 process returned
>>> a non-zero exit code.. Per user-direction, the job has been aborted.
>>> ---
>>> 
>>> AND THE PROGRAM HANGS
>> 
>> This also works fine for me:
>> 
>> 07:37:27  $ mpirun -n 1 ./mpi_no_op
>> 07:37:36  $ cat mpi_no_op.c
>> /* -*- C -*-
>>  *
>>  * $HEADER$
>>  *
>>  * The most basic of MPI applications
>>  */
>> 
>> #include <stdio.h>
>> #include "mpi.h"
>> 
>> int main(int argc, char* argv[])
>> {
>> MPI_Init(&argc, &argv);
>> 
>> MPI_Finalize();
>> return 0;
>> }
>> 
>> 
>>> 
>>> *but*
>>> $ mpirun -np 1 -host node1 ./test
>>> ---
>>> Primary job  terminated normally, but 1 process returned
>>> a non-zero exit code.. Per user-direction, the job has been aborted.
>>> ---
>>> --
>>> mpirun detected that one or more processes exited with non-zero status,
>>> thus causing
>>> the job to be terminated. The first process to do so was:
>>> 
>>>  Process name: [[22080,1],0]
>>>  Exit code:3
>>> --
>>> 
>>> return with exit code 3.
>> 
>> Likewise here - works just fine for me
>> 
>> 
>>> 
>>> then i found a strange behaviour with helloworld if only the self btl is
>>> used :
>>> $ mpirun -np 1 --mca btl self ./hw
>>> [helios91:23319] OPAL dss:unpack: got type 12 when expecting type 3
>>> [helios91:23319] [[22303,0],0] ORTE_ERROR_LOG: Pack data mismatch in
>>> file ../../../src/ompi-trunk/orte/orted/pmix/pmix_server_sendrecv.c at
>>> line 722
>>> 
>>> the program returns with exit code zero, but display an error message.

Re: [OMPI devel] 1.8.2rc5 released

2014-08-22 Thread Jeff Squyres (jsquyres)
No -- most of these were not user-visible, or they were fixes from fixes 
post-1.8.1.

I think the relevant ones were put in NEWS already.  I'm recording a podcast 
right now -- can you double check?



On Aug 22, 2014, at 2:42 PM, Ralph Castain  wrote:

> Did you update the NEWS with these?
> 
> On Aug 22, 2014, at 11:33 AM, Jeff Squyres (jsquyres)  
> wrote:
> 
>> In the usual location:
>> 
>>   http://www.open-mpi.org/software/ompi/v1.8/
>> 
>> Changes since rc4:
>> 
>> - Add missing atomics stuff into the tarball
>> - fortran: add missing bindings for WIN_SYNC, WIN_LOCK_ALL, WIN_UNLOCK_ALL
>> - README updates
>> - usnic: ensure to have a safe destruction of an opal_list_item_t
>> - remove deprecation warnings for pernode, npernode, and npersocket
>> - OOB updates: if an address fails, just try the next one
>> - usnic: Fix connectivity checker pointer mismatch
>> - btl/scif: use safe syntax
>> - openib/btl: better detect max reg memory.
>> 
>> -- 
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to: 
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>> 
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post: 
>> http://www.open-mpi.org/community/lists/devel/2014/08/15694.php
> 
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/08/15695.php


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] 1.8.2rc5 released

2014-08-22 Thread Ralph Castain
Did you update the NEWS with these?

On Aug 22, 2014, at 11:33 AM, Jeff Squyres (jsquyres)  
wrote:

> In the usual location:
> 
>http://www.open-mpi.org/software/ompi/v1.8/
> 
> Changes since rc4:
> 
> - Add missing atomics stuff into the tarball
> - fortran: add missing bindings for WIN_SYNC, WIN_LOCK_ALL, WIN_UNLOCK_ALL
> - README updates
> - usnic: ensure to have a safe destruction of an opal_list_item_t
> - remove deprecation warnings for pernode, npernode, and npersocket
> - OOB updates: if an address fails, just try the next one
> - usnic: Fix connectivity checker pointer mismatch
> - btl/scif: use safe syntax
> - openib/btl: better detect max reg memory.
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/08/15694.php



[OMPI devel] 1.8.2rc5 released

2014-08-22 Thread Jeff Squyres (jsquyres)
In the usual location:

http://www.open-mpi.org/software/ompi/v1.8/

Changes since rc4:

- Add missing atomics stuff into the tarball
- fortran: add missing bindings for WIN_SYNC, WIN_LOCK_ALL, WIN_UNLOCK_ALL
- README updates
- usnic: ensure to have a safe destruction of an opal_list_item_t
- remove deprecation warnings for pernode, npernode, and npersocket
- OOB updates: if an address fails, just try the next one
- usnic: Fix connectivity checker pointer mismatch
- btl/scif: use safe syntax
- openib/btl: better detect max reg memory.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] OMPI devel] MPI_Abort does not make mpirun return with the right exit code

2014-08-22 Thread Ralph Castain

On Aug 22, 2014, at 9:06 AM, Gilles Gouaillardet 
 wrote:

> Ralph,
> 
> Will do on Monday
> 
> About the first test, in my case echo $? returns 0

My "showcode" is just an alias for the echo

> I noticed this confusing message in your output :
> mpirun noticed that process rank 0 with PID 24382 on node bend002 exited on 
> signal 0 (Unknown signal 0).

I'll take a look at why that happened

> 
> About the second test, please note my test program return 3;
> whereas your mpi_no_op.c return 0;

I didn't see that little cuteness - sigh

> 
> Cheers,
> 
> Gilles
> 
> Ralph Castain  wrote:
> You might want to try again with current head of trunk as something seems off 
> in what you are seeing - more below
> 
> 
> On Aug 22, 2014, at 3:12 AM, Gilles Gouaillardet 
>  wrote:
> 
>> Ralph,
>> 
>> i tried again after the merge and found the same behaviour, though the
>> internals are very different.
>> 
>> i run without any batch manager
>> 
>> from node0:
>> mpirun -np 1 --mca btl tcp,self -host node1 ./abort
>> 
>> exit with exit code zero :-(
> 
> Hmmm...it works fine for me, without your patch:
> 
> 07:35:41  $ mpirun -n 1 -mca btl tcp,self -host bend002 ./abort
> Hello, World, I am 0 of 1
> --
> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD 
> with errorcode 2.
> 
> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
> You may or may not see output from other processes, depending on
> exactly when Open MPI kills them.
> --
> --
> mpirun noticed that process rank 0 with PID 24382 on node bend002 exited on 
> signal 0 (Unknown signal 0).
> --
> 07:35:56  $ showcode
> 130
> 
>> 
>> short story : i applied pmix.2.patch and that fixed my problem
>> could you please review this ?
>> 
>> long story :
>> i initially applied pmix.1.patch and it solved my problem
>> then i ran
>> mpirun -np 1 --mca btl openib,self -host node1 ./abort
>> and i came back to square one : exit code is zero
>> so i used the debugger and was unable to reproduce the issue
>> (one more race condition, yeah !)
>> finally, i wrote pmix.2.patch, fixed my issue and realized that
>> pmix.1.patch was no longer needed.
>> currently, and assuming pmix.2.patch is correct, i cannot tell whether
>> pmix.1.patch is needed or not
>> since this part of the code is no longer executed.
>> 
>> i also found one hang with the following trivial program within one node :
>> 
>> int main (int argc, char *argv[]) {
>>     MPI_Init(&argc, &argv);
>>     MPI_Finalize();
>>     return 3;
>> }
>> 
>> from node0 :
>> $ mpirun -np 1 ./test
>> ---
>> Primary job  terminated normally, but 1 process returned
>> a non-zero exit code.. Per user-direction, the job has been aborted.
>> ---
>> 
>> AND THE PROGRAM HANGS
> 
> This also works fine for me:
> 
> 07:37:27  $ mpirun -n 1 ./mpi_no_op
> 07:37:36  $ cat mpi_no_op.c
> /* -*- C -*-
>  *
>  * $HEADER$
>  *
>  * The most basic of MPI applications
>  */
> 
> #include <stdio.h>
> #include "mpi.h"
> 
> int main(int argc, char* argv[])
> {
> MPI_Init(&argc, &argv);
> 
> MPI_Finalize();
> return 0;
> }
> 
> 
>> 
>> *but*
>> $ mpirun -np 1 -host node1 ./test
>> ---
>> Primary job  terminated normally, but 1 process returned
>> a non-zero exit code.. Per user-direction, the job has been aborted.
>> ---
>> --
>> mpirun detected that one or more processes exited with non-zero status,
>> thus causing
>> the job to be terminated. The first process to do so was:
>> 
>>  Process name: [[22080,1],0]
>>  Exit code:3
>> --
>> 
>> return with exit code 3.
> 
> Likewise here - works just fine for me
> 
> 
>> 
>> then i found a strange behaviour with helloworld if only the self btl is
>> used :
>> $ mpirun -np 1 --mca btl self ./hw
>> [helios91:23319] OPAL dss:unpack: got type 12 when expecting type 3
>> [helios91:23319] [[22303,0],0] ORTE_ERROR_LOG: Pack data mismatch in
>> file ../../../src/ompi-trunk/orte/orted/pmix/pmix_server_sendrecv.c at
>> line 722
>> 
>> the program returns with exit code zero, but display an error message.
>> 
>> Cheers,
>> 
>> Gilles
>> 
>> On 2014/08/21 6:21, Ralph Castain wrote:
>>> I'm aware of the problem, but it will be fixed when the PMIx branch is 
>>> merged later this week.
>>> 
>>> On Aug 19, 2014, at 10:00 PM, Gilles Gouaillardet 
>>>  wrote:
>>> 
 

Re: [OMPI devel] OMPI devel] MPI_Abort does not make mpirun return with the right exit code

2014-08-22 Thread Gilles Gouaillardet
Ralph,

Will do on Monday

About the first test, in my case echo $? returns 0
I noticed this confusing message in your output :
mpirun noticed that process rank 0 with PID 24382 on node bend002 exited on 
signal 0 (Unknown signal 0).

About the second test, please note my test program return 3;
whereas your mpi_no_op.c return 0;
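
A quick way to see the difference from the shell (just a sketch, assuming the two
programs are built as ./mpi_no_op and ./test as above):

$ mpirun -np 1 ./mpi_no_op ; echo $?    # prints 0 either way, so it cannot show the problem
$ mpirun -np 1 ./test      ; echo $?    # should print 3 once the hang is fixed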

Cheers,

Gilles

Ralph Castain  wrote:
>You might want to try again with current head of trunk as something seems off 
>in what you are seeing - more below
>
>
>
>On Aug 22, 2014, at 3:12 AM, Gilles Gouaillardet 
> wrote:
>
>
>Ralph,
>
>i tried again after the merge and found the same behaviour, though the
>internals are very different.
>
>i run without any batch manager
>
>from node0:
>mpirun -np 1 --mca btl tcp,self -host node1 ./abort
>
>exit with exit code zero :-(
>
>
>Hmmm...it works fine for me, without your patch:
>
>
>07:35:41  $ mpirun -n 1 -mca btl tcp,self -host bend002 ./abort
>
>Hello, World, I am 0 of 1
>
>--
>
>MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD 
>
>with errorcode 2.
>
>
>NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>
>You may or may not see output from other processes, depending on
>
>exactly when Open MPI kills them.
>
>--
>
>--
>
>mpirun noticed that process rank 0 with PID 24382 on node bend002 exited on 
>signal 0 (Unknown signal 0).
>
>--
>
>07:35:56  $ showcode
>
>130
>
>
>
>short story : i applied pmix.2.patch and that fixed my problem
>could you please review this ?
>
>long story :
>i initially applied pmix.1.patch and it solved my problem
>then i ran
>mpirun -np 1 --mca btl openib,self -host node1 ./abort
>and i came back to square one : exit code is zero
>so i used the debugger and was unable to reproduce the issue
>(one more race condition, yeah !)
>finally, i wrote pmix.2.patch, fixed my issue and realized that
>pmix.1.patch was no longer needed.
>currently, and assuming pmix.2.patch is correct, i cannot tell whether
>pmix.1.patch is needed or not
>since this part of the code is no longer executed.
>
>i also found one hang with the following trivial program within one node :
>
>int main (int argc, char *argv[]) {
>    MPI_Init(&argc, &argv);
>    MPI_Finalize();
>    return 3;
>}
>
>from node0 :
>$ mpirun -np 1 ./test
>---
>Primary job  terminated normally, but 1 process returned
>a non-zero exit code.. Per user-direction, the job has been aborted.
>---
>
>AND THE PROGRAM HANGS
>
>
>This also works fine for me:
>
>
>07:37:27  $ mpirun -n 1 ./mpi_no_op
>
>07:37:36  $ cat mpi_no_op.c
>
>/* -*- C -*-
>
> *
>
> * $HEADER$
>
> *
>
> * The most basic of MPI applications
>
> */
>
>
>#include <stdio.h>
>
>#include "mpi.h"
>
>
>int main(int argc, char* argv[])
>
>{
>
>    MPI_Init(&argc, &argv);
>
>
>    MPI_Finalize();
>
>    return 0;
>
>}
>
>
>
>
>*but*
>$ mpirun -np 1 -host node1 ./test
>---
>Primary job  terminated normally, but 1 process returned
>a non-zero exit code.. Per user-direction, the job has been aborted.
>---
>--
>mpirun detected that one or more processes exited with non-zero status,
>thus causing
>the job to be terminated. The first process to do so was:
>
> Process name: [[22080,1],0]
> Exit code:    3
>--
>
>return with exit code 3.
>
>
>Likewise here - works just fine for me
>
>
>
>
>then i found a strange behaviour with helloworld if only the self btl is
>used :
>$ mpirun -np 1 --mca btl self ./hw
>[helios91:23319] OPAL dss:unpack: got type 12 when expecting type 3
>[helios91:23319] [[22303,0],0] ORTE_ERROR_LOG: Pack data mismatch in
>file ../../../src/ompi-trunk/orte/orted/pmix/pmix_server_sendrecv.c at
>line 722
>
>the program returns with exit code zero, but display an error message.
>
>Cheers,
>
>Gilles
>
>On 2014/08/21 6:21, Ralph Castain wrote:
>
>I'm aware of the problem, but it will be fixed when the PMIx branch is merged 
>later this week.
>
>On Aug 19, 2014, at 10:00 PM, Gilles Gouaillardet 
> wrote:
>
>Folks,
>
>let's look at the following trivial test program :
>
>#include <stdio.h>
>#include <mpi.h>
>
>int main (int argc, char * argv[]) {
>  int rank, size;
>  MPI_Init(&argc, &argv);
>  MPI_Comm_size(MPI_COMM_WORLD, &size);
>  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>  printf ("I am %d/%d and i abort\n", rank, size);
>  MPI_Abort(MPI_COMM_WORLD, 2);
>  printf ("%d/%d aborted !\n", rank, size);
>  return 3;
>}
>
>and let's run mpirun (trunk) 

Re: [OMPI devel] MPI_Abort does not make mpirun return with the right exit code

2014-08-22 Thread Ralph Castain
You might want to try again with current head of trunk as something seems off 
in what you are seeing - more below


On Aug 22, 2014, at 3:12 AM, Gilles Gouaillardet 
 wrote:

> Ralph,
> 
> i tried again after the merge and found the same behaviour, though the
> internals are very different.
> 
> i run without any batch manager
> 
> from node0:
> mpirun -np 1 --mca btl tcp,self -host node1 ./abort
> 
> exit with exit code zero :-(

Hmmm...it works fine for me, without your patch:

07:35:41  $ mpirun -n 1 -mca btl tcp,self -host bend002 ./abort
Hello, World, I am 0 of 1
--
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD 
with errorcode 2.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--
--
mpirun noticed that process rank 0 with PID 24382 on node bend002 exited on 
signal 0 (Unknown signal 0).
--
07:35:56  $ showcode
130

> 
> short story : i applied pmix.2.patch and that fixed my problem
> could you please review this ?
> 
> long story :
> i initially applied pmix.1.patch and it solved my problem
> then i ran
> mpirun -np 1 --mca btl openib,self -host node1 ./abort
> and i came back to square one : exit code is zero
> so i used the debugger and was unable to reproduce the issue
> (one more race condition, yeah !)
> finally, i wrote pmix.2.patch, fixed my issue and realized that
> pmix.1.patch was no longer needed.
> currently, and assuming pmix.2.patch is correct, i cannot tell whether
> pmix.1.patch is needed or not
> since this part of the code is no longer executed.
> 
> i also found one hang with the following trivial program within one node :
> 
> int main (int argc, char *argv[]) {
>     MPI_Init(&argc, &argv);
>     MPI_Finalize();
>     return 3;
> }
> 
> from node0 :
> $ mpirun -np 1 ./test
> ---
> Primary job  terminated normally, but 1 process returned
> a non-zero exit code.. Per user-direction, the job has been aborted.
> ---
> 
> AND THE PROGRAM HANGS

This also works fine for me:

07:37:27  $ mpirun -n 1 ./mpi_no_op
07:37:36  $ cat mpi_no_op.c
/* -*- C -*-
 *
 * $HEADER$
 *
 * The most basic of MPI applications
 */

#include <stdio.h>
#include "mpi.h"

int main(int argc, char* argv[])
{
MPI_Init(&argc, &argv);

MPI_Finalize();
return 0;
}


> 
> *but*
> $ mpirun -np 1 -host node1 ./test
> ---
> Primary job  terminated normally, but 1 process returned
> a non-zero exit code.. Per user-direction, the job has been aborted.
> ---
> --
> mpirun detected that one or more processes exited with non-zero status,
> thus causing
> the job to be terminated. The first process to do so was:
> 
>  Process name: [[22080,1],0]
>  Exit code:3
> --
> 
> return with exit code 3.

Likewise here - works just fine for me


> 
> then i found a strange behaviour with helloworld if only the self btl is
> used :
> $ mpirun -np 1 --mca btl self ./hw
> [helios91:23319] OPAL dss:unpack: got type 12 when expecting type 3
> [helios91:23319] [[22303,0],0] ORTE_ERROR_LOG: Pack data mismatch in
> file ../../../src/ompi-trunk/orte/orted/pmix/pmix_server_sendrecv.c at
> line 722
> 
> the program returns with exit code zero, but display an error message.
> 
> Cheers,
> 
> Gilles
> 
> On 2014/08/21 6:21, Ralph Castain wrote:
>> I'm aware of the problem, but it will be fixed when the PMIx branch is 
>> merged later this week.
>> 
>> On Aug 19, 2014, at 10:00 PM, Gilles Gouaillardet 
>>  wrote:
>> 
>>> Folks,
>>> 
>>> let's look at the following trivial test program :
>>> 
>>> #include <stdio.h>
>>> #include <mpi.h>
>>> 
>>> int main (int argc, char * argv[]) {
>>>   int rank, size;
>>>   MPI_Init(&argc, &argv);
>>>   MPI_Comm_size(MPI_COMM_WORLD, &size);
>>>   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>   printf ("I am %d/%d and i abort\n", rank, size);
>>>   MPI_Abort(MPI_COMM_WORLD, 2);
>>>   printf ("%d/%d aborted !\n", rank, size);
>>>   return 3;
>>> }
>>> 
>>> and let's run mpirun (trunk) on node0 and ask the mpi task to run on
>>> task 1 :
>>> with two tasks or more :
>>> 
>>> node0 $ mpirun --mca btl tcp,self -host node1 -np 2 ./abort
>>> --
>>> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
>>> with errorcode 2.
>>> 
>>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.

Re: [OMPI devel] 1.8.2rc4 problem: only 32 out of 48 cores are working

2014-08-22 Thread Andrej Prsa
Hi again,

I generated a video that demonstrates the problem; for brevity I did
not run a full process, but I'm providing the timing below. If you'd
like me to record a full process, just let me know -- but as I said in
my previous email, 32 procs drop to 1 after about a minute and the
computation then rests on a single processor to complete the job.
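
(A quick way to double-check what htop shows, with "myjob" standing in for the
actual binary name, would be something like:

$ watch -n 1 "ps -C myjob -o pcpu= | awk '\$1 > 50' | wc -l"

i.e. counting the ranks that are actually burning CPU.)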

With openmpi-1.6.5:

real    1m13.186s
user    0m0.044s
sys     0m0.059s

With openmpi-1.8.2rc4:

real    13m42.998s
user    0m0.070s
sys     0m0.066s

Exact invocation both times, exact same job submitted. Here's a link to
the video:

http://clusty.ast.villanova.edu/aprsa/files/test.ogv

Please let me know if I can provide you with anything further.

Thanks,
Andrej

> Ah, that sheds some light. There is indeed a significant change
> between earlier releases and the 1.8.1 and above that might explain
> what he is seeing. Specifically, we no longer hammer the cpu while in
> MPI_Finalize. So if 16 of the procs are finishing early (which the
> output would suggest), then they will go into a "lazy" finalize state
> while they wait for the rest of the procs to complete their work.
> 
> In contrast, prior releases would continue at 100% cpu while they
> polled to see if the other procs were done.
> 
> We did this to help save power/energy, and because users had asked
> why the cpu utilization remained at 100% even though procs were
> waiting in finalize
> 
> HTH
> Ralph
> 
> On Aug 21, 2014, at 5:55 PM, Christopher Samuel
>  wrote:
> 
> > On 22/08/14 10:43, Ralph Castain wrote:
> > 
> >> From your earlier concerns, I would have expected only to find 32
> >> of them running. Was that not the case in this run?
> > 
> > As I understand it in his original email he mentioned that with
> > 1.6.5 all 48 processes were running at 100% CPU and was wondering
> > if the buggy BIOS that caused hwloc the issues he reported on the
> > hwloc-users list might be the cause for this regression in
> > performance.
> > 
> > All the best,
> > Chris
> > -- 
> > Christopher Samuel    Senior Systems Administrator
> > VLSCI - Victorian Life Sciences Computation Initiative
> > Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
> > http://www.vlsci.org.au/  http://twitter.com/vlsci
> > 
> > ___
> > devel mailing list
> > de...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > Link to this post:
> > http://www.open-mpi.org/community/lists/devel/2014/08/15686.php
> 
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/08/15687.php


Re: [OMPI devel] 1.8.2rc4 problem: only 32 out of 48 cores are working

2014-08-22 Thread Andrej Prsa
Hi Ralph, Chris,

You guys are both correct:

(1) The output that I passed along /is/ exemplary of only 32 processors
running (provided htop reports things correctly). The job I
submitted is the exact same process called 48 times (well, np
times), so all procs should take about the same time, ~1 minute.
The execution is notably slower than with 1.6.5 (I will time it
shortly, but offhand I'd say it's ~5x slower), and it seems that,
for the fraction of the time, 32 processors do all the work, and
then 1 processor finishes the remaining work -- i.e. htop shows 32
procs working, 16 idling, then 32 goes down to 1 and stays that way
for a while, then it drops to 0 and the job finishes. This behavior
is apparent in /all/ mpi jobs, not just this particular test case.

(2) I suspected that hwloc might be a culprit; before I posted here, I
reported it on hwloc mailing list, where I was told that it seems
to be a cache reporting problem and that I should be fine ignoring
it, or that I should load the topology from XML. I figured I'd
mention the buggy bios in my first post just in case it rang any
bells.

Is there a way to add timestamps to the debug output? That may
demonstrate better what I'm trying to say in (1) above.
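
If there isn't a built-in option, I could just prepend wall-clock times to the
output myself, e.g. (rough sketch; assumes GNU awk, and the mpirun line below is
only a placeholder for my actual invocation):

$ mpirun -np 48 ./myjob 2>&1 | awk '{ print strftime("%H:%M:%S"), $0; fflush() }'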

If it helps, I'd be more than happy to provide access to the affected
machine so that you can see what's going on first hand, or capture a
small movie of htop while the process is running.

Thanks,
Andrej

> Ah, that sheds some light. There is indeed a significant change
> between earlier releases and the 1.8.1 and above that might explain
> what he is seeing. Specifically, we no longer hammer the cpu while in
> MPI_Finalize. So if 16 of the procs are finishing early (which the
> output would suggest), then they will go into a "lazy" finalize state
> while they wait for the rest of the procs to complete their work.
> 
> In contrast, prior releases would continue at 100% cpu while they
> polled to see if the other procs were done.
> 
> We did this to help save power/energy, and because users had asked
> why the cpu utilization remained at 100% even though procs were
> waiting in finalize
> 
> HTH
> Ralph
> 
> On Aug 21, 2014, at 5:55 PM, Christopher Samuel
>  wrote:
> 
> > On 22/08/14 10:43, Ralph Castain wrote:
> > 
> >> From your earlier concerns, I would have expected only to find 32
> >> of them running. Was that not the case in this run?
> > 
> > As I understand it in his original email he mentioned that with
> > 1.6.5 all 48 processes were running at 100% CPU and was wondering
> > if the buggy BIOS that caused hwloc the issues he reported on the
> > hwloc-users list might be the cause for this regression in
> > performance.
> > 
> > All the best,
> > Chris
> > -- 
> > Christopher Samuel    Senior Systems Administrator
> > VLSCI - Victorian Life Sciences Computation Initiative
> > Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
> > http://www.vlsci.org.au/  http://twitter.com/vlsci
> > 
> > ___
> > devel mailing list
> > de...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > Link to this post:
> > http://www.open-mpi.org/community/lists/devel/2014/08/15686.php
> 
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/08/15687.php