Re: [OMPI devel] InfiniBand BTL structure

2016-10-06 Thread Jeff Squyres (jsquyres)
The answer is: ...it's complicated.  :-)

There's unfortunately no real documentation available on the internal code 
structure.  IIRC (and it's been literally *years* since I've looked in the 
openib BTL) it uses both send-receive and RDMA, depending on the situation.

Your best bet would be to dig into the code a bit and then come back here and 
ask some specific questions.
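
To make the send-receive vs. RDMA distinction concrete, here is a minimal
libibverbs sketch (not taken from the openib BTL; the helper name and its
parameters are purely illustrative) showing the two kinds of work requests
the BTL mixes, assuming the QP is already connected and the buffers already
registered:

#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* Illustrative only: post either a two-sided send or a one-sided RDMA write
 * on an already-connected QP.  The real openib BTL logic is far more
 * involved (credits, eager vs. rendezvous, registration caches, ...). */
static int post_message(struct ibv_qp *qp, struct ibv_mr *mr,
                        void *buf, size_t len,
                        int use_rdma,           /* 0 = send/recv, 1 = RDMA write */
                        uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge;
    struct ibv_send_wr wr, *bad_wr = NULL;

    memset(&sge, 0, sizeof(sge));
    sge.addr   = (uintptr_t) buf;
    sge.length = (uint32_t) len;
    sge.lkey   = mr->lkey;

    memset(&wr, 0, sizeof(wr));
    wr.wr_id      = 1;
    wr.sg_list    = &sge;
    wr.num_sge    = 1;
    wr.send_flags = IBV_SEND_SIGNALED;

    if (use_rdma) {
        /* One-sided: data lands directly in the peer's registered buffer. */
        wr.opcode              = IBV_WR_RDMA_WRITE;
        wr.wr.rdma.remote_addr = remote_addr;
        wr.wr.rdma.rkey        = rkey;
    } else {
        /* Two-sided: the peer must have pre-posted a matching receive. */
        wr.opcode = IBV_WR_SEND;
    }

    return ibv_post_send(qp, &wr, &bad_wr);
}

Which path the BTL actually takes depends on factors such as message size and
protocol, so treat this only as a pointer to the verbs involved, not as a
description of the BTL's logic.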


> On Oct 5, 2016, at 9:10 AM, Gianmario Pozzi  wrote:
> 
> Hi everybody.
> 
> My friend Federico Reghenzani and I implemented a framework (MIG) to manage 
> MPI process migration from one node to another, provided that the BTL only uses 
> TCP to communicate. The work was presented at EuroMPI in Edinburgh this 
> September.
> 
> Now, for my master's thesis, I need to make MIG capable of also working with 
> the InfiniBand-based BTL. 
> Is there anyone who could help me to better understand how it works? It looks 
> way more complex than TCP.
> Does it use only RDMA? If not, what else?
> 
> Any hint is appreciated.
> 
> Thank you guys, have a nice day.
> 
> -- 
> Gianmario Pozzi
> M.Sc. @ Politecnico di Milano


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...

2016-10-06 Thread Eric Chamberland

Hi Gilles,

just to mention that since PR 2091 has been merged into 2.0.x, I 
haven't had any failures!


Since 2.0.0 and 2.0.1 aren't usable for us, the next version should be a 
good one... So will there be a 2.0.2 release or will it go to 2.1.0 
directly?


Thanks,

Eric

On 16/09/16 10:01 AM, Gilles Gouaillardet wrote:

Eric,

I expect the PR will fix this bug.
The crash occurs after the unexpected process identifier error, and that
error should not happen in the first place. So at this stage, I would
not worry too much about that crash (to me, it is undefined behavior anyway).

Cheers,

Gilles

On Friday, September 16, 2016, Eric Chamberland
<eric.chamberl...@giref.ulaval.ca> wrote:

Hi,

I know the pull request has not (yet) been merged, but here is a
somewhat "different" output from a single sequential test
(automatically) launched without mpirun last night:

[lorien:172229] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
[lorien:172229] plm:base:set_hnp_name: initial bias 172229 nodename hash 1366255883
[lorien:172229] plm:base:set_hnp_name: final jobfam 39075
[lorien:172229] [[39075,0],0] plm:rsh_setup on agent ssh : rsh path NULL
[lorien:172229] [[39075,0],0] plm:base:receive start comm
[lorien:172229] [[39075,0],0] plm:base:launch [39075,1] registered
[lorien:172229] [[39075,0],0] plm:base:launch job [39075,1] is not a dynamic spawn
[lorien:172218] [[41545,589],0] usock_peer_recv_connect_ack: received unexpected process identifier [[41545,0],0] from [[39075,0],0]
[lorien:172218] *** Process received signal ***
[lorien:172218] Signal: Segmentation fault (11)
[lorien:172218] Signal code: Invalid permissions (2)
[lorien:172218] Failing at address: 0x2d07e00
[lorien:172218] [ 0] [lorien:172229] [[39075,0],0] plm:base:receive stop comm


unfortunately, I didn't get any core dump (???)  The line:

[lorien:172218] Signal code: Invalid permissions (2)

is that curious or not?

as usual, here are the build logs:


http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.16.01h16m01s_config.log

http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.16.01h16m01s_ompi_info_all.txt

Will PR #1376 prevent or fix this too?

Thanks again!

Eric



On 15/09/16 09:32 AM, Eric Chamberland wrote:

Hi Gilles,

On 15/09/16 03:38 AM, Gilles Gouaillardet wrote:

Eric,


a bug has been identified, and a patch is available at
https://patch-diff.githubusercontent.com/raw/open-mpi/ompi-release/pull/1376.patch

The bug is specific to singleton mode (e.g. ./a.out vs mpirun -np 1 ./a.out),
so if applying the patch does not fit your test workflow, it might be easier
for you to update it and run mpirun -np 1 ./a.out instead of ./a.out.

Basically, increasing verbosity runs some extra code, which includes sprintf.
So yes, it is possible to crash an app by increasing verbosity and thereby
running into a bug that is hidden under normal operation. My intuition
suggests this is quite unlikely ... if you can get a core file and a
backtrace, we will soon find out.
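
For reference, singleton mode simply means starting the MPI process without
any launcher; a minimal program that can be run either way looks like the
sketch below (the file name is hypothetical):

/* singleton_test.c - can be started as ./a.out (singleton) or as
 * mpirun -np 1 ./a.out; the bug discussed here only affects the former. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}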

Damn! I did get one, but it got erased last night when the automatic
process started again... (which erases all directories before
starting) :/

I would like to put core files in a user-specific directory, but it
seems that has to be a system-wide configuration... :/  I will work
around this by changing the "pwd" to a path outside the erased
directory...

So as of tonight I should be able to retrieve core files even after I
relaunch the process...
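
If it helps, the same trick can also be done from inside the test process
itself; a minimal sketch (the directory name is hypothetical, and this
assumes kernel.core_pattern is a relative pattern so cores land in the cwd):

#include <sys/resource.h>
#include <unistd.h>
#include <stdio.h>

int main(void)
{
    /* Raise the core-file size limit as far as the hard limit allows... */
    struct rlimit rl;
    if (getrlimit(RLIMIT_CORE, &rl) == 0) {
        rl.rlim_cur = rl.rlim_max;
        if (setrlimit(RLIMIT_CORE, &rl) != 0)
            perror("setrlimit");
    }

    /* ...and change to a directory that the nightly process does not erase,
     * so any core dump survives the next run. */
    if (chdir("/home/user/cores") != 0)
        perror("chdir");

    /* run the real test from here */
    return 0;
}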

Thanks for all the support!

Eric


Cheers,

Gilles



On 9/15/2016 2:58 AM, Eric Chamberland wrote:

Ok,

one test segfaulted *but* I can't tell if it is the *same* bug, because
there has been a segfault:

stderr:

http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s.faultyCerr.Triangle.h_cte_1.txt





[lorien:190552] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
[lorien:190552] plm:base:set_hnp_name: initial bias 190552

Re: [OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...

2016-10-06 Thread Gilles Gouaillardet

Eric,


2.0.2 is scheduled to happen.

2.1.0 will bring some new features, whereas v2.0.2 is a bug-fix release.

My guess is v2.0.2 will come first, but this is just a guess.

(Even if v2.1.0 comes first, v2.0.2 will be released anyway.)


Cheers,


Gilles


On 10/7/2016 2:34 AM, Eric Chamberland wrote:

Hi Gilles,

just to mention that since PR 2091 has been merged into 2.0.x, I 
haven't had any failures!


Since 2.0.0 and 2.0.1 aren't usable for us, the next version should be 
a good one... So will there be a 2.0.2 release or will it go to 2.1.0 
directly?


Thanks,

Eric
