[OMPI devel] PSM2 Intel folks question

2016-04-19 Thread Howard Pritchard
Hi Folks,

I'm making progress with issue #1559 (patches on the mail list didn't help),
and I'll open a PR to help the PSM2 MTL work on a single node, but I'm
noticing something more troublesome.

If I run on just one node, and I use more than one process, process zero
consistently hangs in psm2_ep_connect.

I've tried using the psm2 code on GitHub, at sha e951cf31, but I still see
the same behavior.

The PSM2-related RPMs installed on our system are:

infinipath-psm-devel-3.3-0.g6f42cdb1bb8.2.el7.x86_64
hfi1-psm-0.7-221.ch6.x86_64
hfi1-psm-devel-0.7-221.ch6.x86_64
infinipath-psm-3.3-0.g6f42cdb1bb8.2.el7.x86_64

Should we get newer RPMs installed?

Is there a way to disable the AMSHM path?  I'm wondering if that
would help, since multi-node jobs seem to run fine.

Thanks for any help,

Howard
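
A package list like the one in this message can be gathered with a short query. The sketch below assumes an RPM-based system; `psm_filter` is a hypothetical helper name introduced only for illustration, not something from the thread.

```shell
# Hypothetical helper: keep only package names that mention "psm", sorted.
psm_filter() {
    grep -i psm | sort
}

# rpm -qa lists every installed package; the guard keeps the script
# harmless on systems without rpm.
if command -v rpm >/dev/null 2>&1; then
    rpm -qa | psm_filter
else
    echo "rpm not available on this system"
fi
```

Including this output in a report makes it easier to tell which PSM/PSM2 library builds are present on the node.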


Re: [OMPI devel] PSM2 Intel folks question

2016-04-19 Thread Cabral, Matias A
Hi Howard,

A couple more questions to better understand the context:

-  What type of job are you running?

-  Is this also under srun?

For PSM2 you may find more details in the programmer’s guide:
http://www.intel.com/content/dam/support/us/en/documents/network/omni-adptr/sb/Intel_PSM2_PG_H76473_v1_0.pdf

To disable shared memory:
Section 2.7.1:
PSM2_DEVICES="self,fi"

Thanks,
_MAC


Re: [OMPI devel] PSM2 Intel folks question

2016-04-19 Thread Cabral, Matias A
Errata:
PSM2_DEVICES="self,hfi"


_MAC
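
As a rough illustration of the corrected setting, the sketch below launches two ranks with the shared-memory device left out of the PSM2 device list. The binary name `./hello_c` and the rank count are assumptions, and `launch_cmd` is just an illustrative variable. As reported later in this thread, ranks on the same node are then expected to fail to connect, so this is mainly useful as a diagnostic.

```shell
# Device list from the errata: self and hfi only, shm omitted.
export PSM2_DEVICES="self,hfi"

# -x exports the variable to the launched ranks;
# "--mca pml cm --mca mtl psm2" selects the PSM2 MTL.
# ./hello_c and -np 2 are assumptions for this sketch.
launch_cmd="mpirun -np 2 -x PSM2_DEVICES --mca pml cm --mca mtl psm2 ./hello_c"

# Only launch when mpirun and the test binary are actually present.
if command -v mpirun >/dev/null 2>&1 && [ -x ./hello_c ]; then
    $launch_cmd
else
    echo "would run: $launch_cmd"
fi
```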


Re: [OMPI devel] PSM2 Intel folks question

2016-04-19 Thread Howard Pritchard
Hi Matias,

My usual favorites: ompi/examples/hello_c.c and ompi/examples/ring_c.c.
If I disable the shared memory device using the PSM2_DEVICES option,
it looks like psm2 is unhappy:


[kit001.localdomain:08222] PSM2 EP connect error (Endpoint could not be reached):
[kit001.localdomain:08222]  kit001
[kit001.localdomain:08222] PSM2 EP connect error (unknown connect error):
[kit001.localdomain:08222]  kit001
 psm2_ep_connect returned 41
[kit001.localdomain:08221] PSM2 EP connect error (unknown connect error):
[kit001.localdomain:08221]  kit001
[kit001.localdomain:08221] PSM2 EP connect error (Endpoint could not be reached):
[kit001.localdomain:08221]  kit001
leaving ompi_mtl_psm2_add_procs nprocs 2


I went back and tried again with the OFI MTL (without PSM2_DEVICES set),
and that works correctly on a single node.

I get this same psm2_ep_connect timeout using mpirun, so it's not a
SLURM-specific problem.
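
The comparison described here, the same single-node job run once through the PSM2 MTL and once through the OFI MTL, could be scripted roughly as follows. This is a sketch under assumptions: `mtl_cmd` is a hypothetical helper, the `./ring_c` binary, the two-rank count, and the 60-second limit are illustrative, and `timeout` is the GNU coreutils utility.

```shell
# Hypothetical helper: build the mpirun command line for a given MTL component.
# $1 = MTL name (psm2 or ofi), $2 = program to launch.
mtl_cmd() {
    echo "mpirun -np 2 --mca pml cm --mca mtl $1 $2"
}

for mtl in psm2 ofi; do
    cmd=$(mtl_cmd "$mtl" ./ring_c)
    if command -v mpirun >/dev/null 2>&1 && [ -x ./ring_c ]; then
        # timeout exits with status 124 if the job is still running
        # after 60 seconds, for example when it hangs in psm2_ep_connect.
        timeout 60 $cmd
        echo "mtl=$mtl exit status: $?"
    else
        echo "would run: timeout 60 $cmd"
    fi
done
```

Wrapping the launch in `timeout` turns a hang into a distinguishable exit status, which makes the two runs easy to compare in a log.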

> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2016/04/18783.php
>


[OMPI devel] seg fault when using yalla, XRC, and yalla

2016-04-19 Thread David Shrader

Hello,

I have been investigating using XRC on a cluster with a Mellanox 
interconnect. I have found that in a certain situation I get a seg 
fault. I am using 1.10.2 compiled with gcc 5.3.0, and the simplest 
configure line that I have found that still results in the seg fault is 
as follows:


$> ./configure --with-hcoll --with-mxm --prefix=...

I do have mxm 3.4.3065 and hcoll 3.3.768 installed into system space 
(/usr/lib64). If I use '--without-hcoll --without-mxm', the seg fault 
does not happen.


The seg fault happens even when using examples/hello_c.c, so here is an 
example of the seg fault using it:


$> mpicc hello_c.c -o hello_c.x
$> mpirun -n 1 ./hello_c.x
Hello, world, I am 0 of 1, (Open MPI v1.10.2, package: Open MPI 
dshra...@mu-fey.lanl.gov Distribution, ident: 1.10.2, repo rev: 
v1.10.1-145-g799148f, Jan 21, 2016, 135)
$> mpirun -n 1 -mca btl_openib_receive_queues X,4096,1024:X,12288,512:X,65536,512 ./hello_c.x
Hello, world, I am 0 of 1, (Open MPI v1.10.2, package: Open MPI 
dshra...@mu-fey.lanl.gov Distribution, ident: 1.10.2, repo rev: 
v1.10.1-145-g799148f, Jan 21, 2016, 135)

--
mpirun noticed that process rank 0 with PID 22819 on node mu0001 exited 
on signal 11 (Segmentation fault).

--

The seg fault happens no matter the number of ranks. I have tried the 
above command with '-mca pml_base_verbose,' and it shows that I am using 
the yalla pml:


$> mpirun -n 1 -mca btl_openib_receive_queues X,4096,1024:X,12288,512:X,65536,512 -mca pml_base_verbose 100 ./hello_c.x

...output snipped...
[mu0001.localdomain:22825] select: component cm not selected / finalized
[mu0001.localdomain:22825] select: component ob1 not selected / finalized
[mu0001.localdomain:22825] select: component yalla selected
...output snipped...
--
mpirun noticed that process rank 0 with PID 22825 on node mu0001 exited 
on signal 11 (Segmentation fault).

--

Interestingly enough, if I tell mpirun what pml to use, the seg fault 
goes away. The following command does not get the seg fault:


$> mpirun -n 1 -mca btl_openib_receive_queues X,4096,1024:X,12288,512:X,65536,512 -mca pml yalla ./hello_c.x


Passing either ob1 or cm to '-mca pml' also works. So it seems that the 
seg fault comes about when the yalla pml is chosen by default, when 
mxm/hcoll is involved, and using XRC. I'm not sure if mxm is to blame, 
however, as using '-mca pml cm -mca mtl mxm' with the XRC parameters 
doesn't throw the seg fault.
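
The cases described above (default PML selection versus an explicit '-mca pml' choice, all with the XRC receive queues) can be swept in one loop. This is only a sketch: `pml_cmd` and `XRC_QUEUES` are hypothetical names, and the loop simply reports each exit status rather than asserting any particular outcome.

```shell
# XRC receive-queue specification quoted from the commands above.
XRC_QUEUES="X,4096,1024:X,12288,512:X,65536,512"

# Hypothetical helper: build the command for a given PML.
# "default" means no '-mca pml' option, so Open MPI picks the PML itself.
pml_cmd() {
    if [ "$1" = "default" ]; then
        echo "mpirun -n 1 -mca btl_openib_receive_queues $XRC_QUEUES ./hello_c.x"
    else
        echo "mpirun -n 1 -mca btl_openib_receive_queues $XRC_QUEUES -mca pml $1 ./hello_c.x"
    fi
}

for pml in default yalla ob1 cm; do
    cmd=$(pml_cmd "$pml")
    if command -v mpirun >/dev/null 2>&1 && [ -x ./hello_c.x ]; then
        $cmd >/dev/null 2>&1
        echo "pml=$pml exit status: $?"
    else
        echo "would run: $cmd"
    fi
done
```

On the system described in this report, only the "default" case would be expected to end with the segmentation fault.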


Other information...
OS: RHEL 6.7-based (TOSS)
OpenFabrics: RedHat provided
Kernel: 2.6.32-573.8.1.2chaos.ch5.4.x86_64
Config.log and 'ompi_info --all' are in the tarball ompi.tar.bz2 which 
is attached.


Is there something else I should be doing with the yalla pml when using 
XRC? Regardless, I hope reporting the seg fault is useful.


Thanks,
David

--
David Shrader
HPC-ENV High Performance Computer Systems
Los Alamos National Lab
Email: dshrader  lanl.gov



ompi.tar.bz2
Description: application/bzip


Re: [OMPI devel] PSM2 Intel folks question

2016-04-19 Thread Cabral, Matias A
Howard,

Regarding PSM2_DEVICES: I went back to the roots and found that shm is the 
only device supporting communication between ranks on the same node. Therefore, 
the “Endpoint could not be reached” error you saw would be expected.

Back to the psm2_ep_connect() hang: I cloned the same psm2 as you have from 
GitHub and have hello_c and ring_c running with 80 ranks on a local node using 
the PSM2 MTL. I do not have SLURM set up on my system, so I will set it up 
to see if I can reproduce the issue with it. In the meantime, please share 
any extra details you find relevant.

Thanks,

_MAC
