Re: [OMPI devel] 1.8.3 and PSM errors

2014-11-11 Thread Howard Pritchard
Hi Folks,

I remember that in the psm provider for libfabric there is a check in the
av_insert method for endpoints that have previously been inserted into the av.
In the libfabric psm provider, a mask array is created and fed into the
psm_ep_connect call to handle endpoints that were already "connected".  I
notice that for the psm mtl in ompi, a mask array is not provided, just NULL.
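
Roughly what I mean is something like the sketch below -- not the actual
libfabric code, just an illustration.  already_connected() is a made-up
placeholder for the provider's av lookup, and the psm_ep_connect() signature
is the one I remember from psm.h, so please double-check it:

#include <stdlib.h>
#include <psm.h>

/* Hypothetical lookup: returns nonzero if this epid is already in our av. */
extern int already_connected(psm_epid_t epid);

/* Connect only the endpoints that were not connected before.  Entries with
 * mask[i] == 0 are skipped by psm_ep_connect(); passing a NULL mask (as the
 * ompi psm mtl seems to do) asks PSM to connect every entry again. */
static psm_error_t connect_new_endpoints(psm_ep_t ep, int n,
                                         const psm_epid_t *epids,
                                         psm_epaddr_t *epaddrs)
{
    int *mask = malloc(n * sizeof(int));
    psm_error_t *errs = malloc(n * sizeof(psm_error_t));
    psm_error_t rc;

    for (int i = 0; i < n; i++) {
        mask[i] = already_connected(epids[i]) ? 0 : 1;
    }

    rc = psm_ep_connect(ep, n, epids, mask, errs, epaddrs,
                        30LL * 1000000000LL /* 30s timeout, in ns */);

    free(errs);
    free(mask);
    return rc;
}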

Howard



2014-11-11 16:00 GMT-07:00 George Bosilca :

>
> > On Nov 11, 2014, at 17:13 , Jeff Squyres (jsquyres) 
> wrote:
> >
> >> More particularly, it looks like add_procs is being called a second
> time during MPI_Intercomm_create and being passed a process that is already
> connected (passed into the first add_procs call).  Is that right?  Should
> the MTL handle multiple add_procs calls with the same proc provided?
> >
> > I'm afraid I don't know much about the MTL interface.
> >
> > George / Nathan?
>
> Intercomm_create is a funny function, as it can join together two
> groups of processes that didn’t know each other before. Thus, we have to be
> conservative in the upper level of the function and provide the entire list
> of [potentially] new processes to the PML/MTL to add to their known
> processes. In the case of the PML, this list is then forwarded down to the
> BTL, where only the new processes are added. Thus, the BTLs support adding
> the same process multiple times.
>
> I think a similar mechanism should be added to the MTL. If the process is
> already known, just mark it as reachable and be done.
>
>   George.
>


Re: [OMPI devel] 1.8.3 and PSM errors

2014-11-11 Thread George Bosilca

> On Nov 11, 2014, at 17:13 , Jeff Squyres (jsquyres)  
> wrote:
> 
>> More particularly, it looks like add_procs is being called a second time 
>> during MPI_Intercomm_create and being passed a process that is already 
>> connected (passed into the first add_procs call).  Is that right?  Should 
>> the MTL handle multiple add_procs calls with the same proc provided?
> 
> I'm afraid I don't know much about the MTL interface.
> 
> George / Nathan?

Intercomm_create is a funny function, as it can join together two groups of 
processes that didn’t know each other before. Thus, we have to be conservative 
in the upper level of the function and provide the entire list of [potentially] 
new processes to the PML/MTL to add to their known processes. In the case of 
the PML, this list is then forwarded down to the BTL, where only the new 
processes are added. Thus, the BTLs support adding the same process multiple 
times.

I think a similar mechanism should be added to the MTL. If the process is 
already known, just mark it as reachable and be done.
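
Something along these lines is what I have in mind -- only a sketch, with
made-up type and field names rather than the real ompi_proc_t / MTL
structures:

#include <stddef.h>

/* Illustrative stand-ins, not the real OMPI structures. */
struct sketch_proc {
    void *endpoint;              /* NULL until the proc has been connected */
};

extern void *sketch_connect(struct sketch_proc *proc);  /* hypothetical */

/* An idempotent add_procs: a proc that already has an endpoint was handled
 * by an earlier call (e.g. at MPI_Init time), so a second add_procs during
 * MPI_Intercomm_create just treats it as reachable and moves on. */
static int sketch_add_procs(struct sketch_proc **procs, size_t nprocs)
{
    for (size_t i = 0; i < nprocs; i++) {
        if (procs[i]->endpoint != NULL) {
            continue;            /* already known: nothing to do */
        }
        procs[i]->endpoint = sketch_connect(procs[i]);
        if (procs[i]->endpoint == NULL) {
            return -1;           /* stand-in for an OMPI error code */
        }
    }
    return 0;                    /* stand-in for OMPI_SUCCESS */
}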

  George.



Re: [OMPI devel] 1.8.3 and PSM errors

2014-11-11 Thread Friedley, Andrew
Ralph,

You're right that PSM wouldn't support dynamically connecting jobs.  I don't 
think intercomm_create implies that, though.  For example, you could split 
COMM_WORLD's group into two groups, then create an intercommunicator across 
those two groups.  I'm guessing that's what this test is doing; I'd have to go 
read the code to be sure, though.
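
Something like the sketch below (not necessarily what the test actually does,
just an illustration) exercises MPI_Intercomm_create without any dynamic
process management:

#include <mpi.h>

/* Split COMM_WORLD into two halves, then build an intercommunicator
 * between them.  Assumes at least 2 ranks. */
int main(int argc, char **argv)
{
    int rank, size;
    MPI_Comm local, inter;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int color = (rank < size / 2) ? 0 : 1;
    MPI_Comm_split(MPI_COMM_WORLD, color, rank, &local);

    /* Rank 0 of the other half, expressed as a rank in COMM_WORLD. */
    int remote_leader = (color == 0) ? size / 2 : 0;
    MPI_Intercomm_create(local, 0, MPI_COMM_WORLD, remote_leader,
                         42 /* tag */, &inter);

    MPI_Comm_free(&inter);
    MPI_Comm_free(&local);
    MPI_Finalize();
    return 0;
}

Running it with something like "mpirun -np 2 --map-by ppr:1:node ./a.out"
should mimic the two-node case from Adrian's report.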

I verified that this test works over PSM with OMPI 1.6.5; it fails on 1.8.1 and 
1.8.3.

Andrew

> -Original Message-
> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Ralph
> Castain
> Sent: Tuesday, November 11, 2014 2:23 PM
> To: Open MPI Developers
> Subject: Re: [OMPI devel] 1.8.3 and PSM errors
> 
> I thought PSM didn’t support dynamic operations such as Intercomm_create
> - yes? The PSM security key wouldn’t match between the two jobs, and so
> there is no way for them to communicate.
> 
> Which is why I thought PSM can’t be used for dynamic operations at all,
> including comm_spawn and connect/accept
> 
> 
> > On Nov 11, 2014, at 2:13 PM, Jeff Squyres (jsquyres) 
> wrote:
> >
> > On Nov 11, 2014, at 4:56 PM, Friedley, Andrew
>  wrote:
> >
> >> OK, I'm able to reproduce this now, not sure why I couldn't before.  I took
> a look at the diff of the PSM MTL from 1.6.5 to 1.8.1, and nothing is standing
> out to me.
> >>
> >> Question more for the general group:  Did anything related to the
> behavior/usage of MTL add_procs() change in this time window?
> >
> > The time between the 1.6.x series and the 1.8.x series is measured in terms
> of a year or two, so, ya, something might have changed...
> >
> >> More particularly, it looks like add_procs is being called a second time
> during MPI_Intercomm_create and being passed a process that is already
> connected (passed into the first add_procs call).  Is that right?  Should the
> MTL handle multiple add_procs calls with the same proc provided?
> >
> > I'm afraid I don't know much about the MTL interface.
> >
> > George / Nathan?
> >
> > --
> > Jeff Squyres
> > jsquy...@cisco.com
> > For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
> >


Re: [OMPI devel] 1.8.3 and PSM errors

2014-11-11 Thread Ralph Castain
I thought PSM didn’t support dynamic operations such as Intercomm_create - yes? 
The PSM security key wouldn’t match between the two jobs, and so there is no 
way for them to communicate.

Which is why I thought PSM can’t be used for dynamic operations at all, 
including comm_spawn and connect/accept


> On Nov 11, 2014, at 2:13 PM, Jeff Squyres (jsquyres)  
> wrote:
> 
> On Nov 11, 2014, at 4:56 PM, Friedley, Andrew  
> wrote:
> 
>> OK, I'm able to reproduce this now, not sure why I couldn't before.  I took 
>> a look at the diff of the PSM MTL from 1.6.5 to 1.8.1, and nothing is 
>> standing out to me.
>> 
>> Question more for the general group:  Did anything related to the 
>> behavior/usage of MTL add_procs() change in this time window?
> 
> The time between the 1.6.x series and the 1.8.x series is measured in terms of 
> a year or two, so, ya, something might have changed...
> 
>> More particularly, it looks like add_procs is being called a second time 
>> during MPI_Intercomm_create and being passed a process that is already 
>> connected (passed into the first add_procs call).  Is that right?  Should 
>> the MTL handle multiple add_procs calls with the same proc provided?
> 
> I'm afraid I don't know much about the MTL interface.
> 
> George / Nathan?
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 



Re: [OMPI devel] 1.8.3 and PSM errors

2014-11-11 Thread Jeff Squyres (jsquyres)
On Nov 11, 2014, at 4:56 PM, Friedley, Andrew  wrote:

> OK, I'm able to reproduce this now, not sure why I couldn't before.  I took a 
> look at the diff of the PSM MTL from 1.6.5 to 1.8.1, and nothing is standing 
> out to me.
> 
> Question more for the general group:  Did anything related to the 
> behavior/usage of MTL add_procs() change in this time window?

The time between the 1.6.x series and the 1.8.x series is measured in terms of a 
year or two, so, ya, something might have changed...

> More particularly, it looks like add_procs is being called a second time 
> during MPI_Intercomm_create and being passed a process that is already 
> connected (passed into the first add_procs call).  Is that right?  Should the 
> MTL handle multiple add_procs calls with the same proc provided?

I'm afraid I don't know much about the MTL interface.

George / Nathan?

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] 1.8.3 and PSM errors

2014-11-11 Thread Friedley, Andrew
OK, I'm able to reproduce this now, not sure why I couldn't before.  I took a 
look at the diff of the PSM MTL from 1.6.5 to 1.8.1, and nothing is standing 
out to me.

Question more for the general group:  Did anything related to the 
behavior/usage of MTL add_procs() change in this time window?

More particularly, it looks like add_procs is being called a second time during 
MPI_Intercomm_create and being passed a process that is already connected 
(passed into the first add_procs call).  Is that right?  Should the MTL handle 
multiple add_procs calls with the same proc provided?

Andrew

> -Original Message-
> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Adrian
> Reber
> Sent: Monday, October 27, 2014 1:41 AM
> To: de...@open-mpi.org
> Subject: [OMPI devel] 1.8.3 and PSM errors
> 
> Running Open MPI 1.8.3 with PSM does not seem to work right now at all.
> I am getting the same errors also on trunk from my newly set up MTT.
> Before trying to debug this I just wanted to make sure this is not a
> configuration error. I have the following PSM packages installed:
> 
> infinipath-devel-3.1.1-363.1140_rhel6_qlc.noarch
> infinipath-libs-3.1.1-363.1140_rhel6_qlc.x86_64
> infinipath-3.1.1-363.1140_rhel6_qlc.x86_64
> 
> with 1.6.5 I do not see PSM errors and the test suite fails much later:
> 
> P2P tests Many-to-one with MPI_Iprobe (MPI_ANY_SOURCE) (21/48), comm
> Intracomm merged of the Halved Intercomm (13/13), type
> MPI_TYPE_MIX_ARRAY (28/29) P2P tests Many-to-one with MPI_Iprobe
> (MPI_ANY_SOURCE) (21/48), comm Intracomm merged of the Halved
> Intercomm (13/13), type MPI_TYPE_MIX_LB_UB (29/29) n050304:5.0.Cannot
> cancel send requests (req=0x2ad8ba881f80) P2P tests Many-to-one with
> Isend and Cancellation (22/48), comm MPI_COMM_WORLD (1/13), type
> MPI_CHAR (1/29) n050304:2.0.Cannot cancel send requests
> (req=0x2b25143fbd88) n050302:7.0.Cannot cancel send requests
> (req=0x2b4d95eb0f80) n050301:4.0.Cannot cancel send requests
> (req=0x2adf03e14f80) n050304:4.0.Cannot cancel send requests
> (req=0x2ad877257ed8) n050301:6.0.Cannot cancel send requests
> (req=0x2ba47634af80) n050304:8.0.Cannot cancel send requests
> (req=0x2ae8ac16cf80) n050302:3.0.Cannot cancel send requests
> (req=0x2ab81dcb4d88) n050303:4.0.Cannot cancel send requests
> (req=0x2b9ef4ef8f80) n050303:2.0.Cannot cancel send requests
> (req=0x2ab0f03f9f80) n050302:9.0.Cannot cancel send requests
> (req=0x2b214f9ebed8) n050301:2.0.Cannot cancel send requests
> (req=0x2b31302d4f80) n050302:4.0.Cannot cancel send requests
> (req=0x2b0581bd3f80) n050301:8.0.Cannot cancel send requests
> (req=0x2ae53776bf80) n050303:6.0.Cannot cancel send requests
> (req=0x2b13eeb78f80) n050304:7.0.Cannot cancel send requests
> (req=0x2b4e99715f80) n050304:9.0.Cannot cancel send requests
> (req=0x2b10429c2f80) n050304:3.0.Cannot cancel send requests
> (req=0x2b9196f5fe30) n050304:6.0.Cannot cancel send requests
> (req=0x2b30d6c69ed8) n050301:9.0.Cannot cancel send requests
> (req=0x2b93c9e04f80) n050303:9.0.Cannot cancel send requests
> (req=0x2ab4d6ce0f80) n050301:5.0.Cannot cancel send requests
> (req=0x2b6ad851ef80) n050303:3.0.Cannot cancel send requests
> (req=0x2b8ef52a0f80) n050301:3.0.Cannot cancel send requests
> (req=0x2b277a4aff80) n050303:7.0.Cannot cancel send requests
> (req=0x2ba570fa9f80) n050301:7.0.Cannot cancel send requests
> (req=0x2ba707dfbf80) n050302:2.0.Cannot cancel send requests
> (req=0x2b90f2e51e30) n050303:5.0.Cannot cancel send requests
> (req=0x2b1250ba8f80) n050302:8.0.Cannot cancel send requests
> (req=0x2b22e0129ed8) n050303:8.0.Cannot cancel send requests
> (req=0x2b6609792f80) n050302:6.0.Cannot cancel send requests
> (req=0x2b2b6081af80) n050302:5.0.Cannot cancel send requests
> (req=0x2ab24f6f1f80)
> --
> mpirun has exited due to process rank 14 with PID 4496 on node n050303
> exiting improperly. There are two reasons this could occur:
> 
> 1. this process did not call "init" before exiting, but others in the job 
> did. This
> can cause a job to hang indefinitely while it waits for all processes to call
> "init". By rule, if one process calls "init", then ALL processes must call 
> "init"
> prior to termination.
> 
> 2. this process called "init", but exited without calling "finalize".
> By rule, all processes that call "init" MUST call "finalize" prior to exiting 
> or it
> will be considered an "abnormal termination"
> 
> This may have caused other processes in the application to be terminated by
> signals sent by mpirun (as reported here).
> --
> [adrian@n050304 mpi_test_suite]$
> 
> and these are my PSM errors with 1.8.3:
> 
> [adrian@n050304 mpi_test_suite]$ mpirun  -np 32  mpi_test_suite -t
> "All,^io,^one-sided"
> 
> mpi_test_suite:8904 terminated with signal 11 at PC=2b08466239a4
> SP=703c6e30.  Backtrace:
> 
> mpi_test_suite:16905 terminated with signa

[OMPI devel] Open MPI face-to-face devel meeting

2014-11-11 Thread Jeff Squyres (jsquyres)
On the call today, we decided the final dates/location of the OMPI face-to-face 
developers meeting:

- Dates: Jan 27-29, 2015 (Tue-Thu)
- Location: Cisco Richardson, Texas facility (outside Dallas)

See the wiki page for more details:

1. Put your name on the wiki if you plan to attend
2. Start putting topics on there to discuss

https://github.com/open-mpi/ompi/wiki/Meeting-2015-01

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] 1.8.3 and PSM errors

2014-11-11 Thread Adrian Reber
Using the intel test suite I can reproduce it for example with:

$ mpirun --np 2 --map-by ppr:1:node   `pwd`/src/MPI_Allgatherv_c
MPITEST info  (0): Starting MPI_Allgatherv() test
MPITEST info  (0): Node spec MPITEST_comm_sizes[6]=2 too large, using 1
MPITEST info  (0): Node spec MPITEST_comm_sizes[22]=2 too large, using 1
MPITEST info  (0): Node spec MPITEST_comm_sizes[32]=2 too large, using 1

MPI_Allgatherv_c:9230 terminated with signal 11 at PC=7fc4ced4b150 
SP=7fff45aa2fb0.  Backtrace:
/lib64/libpsm_infinipath.so.1(ips_proto_connect+0x630)[0x7fc4ced4b150]
/lib64/libpsm_infinipath.so.1(ips_ptl_connect+0x3a)[0x7fc4ced4219a]
/lib64/libpsm_infinipath.so.1(__psm_ep_connect+0x3e7)[0x7fc4ced3a727]
/opt/bwhpc/common/mpi/openmpi/1.8.4-gnu-4.8/lib/libmpi.so.1(ompi_mtl_psm_add_procs+0x1f3)[0x7fc4cf902303]
/opt/bwhpc/common/mpi/openmpi/1.8.4-gnu-4.8/lib/libmpi.so.1(ompi_comm_get_rprocs+0x49a)[0x7fc4cf7cbc2a]
/opt/bwhpc/common/mpi/openmpi/1.8.4-gnu-4.8/lib/libmpi.so.1(PMPI_Intercomm_create+0x2f2)[0x7fc4cf7fb602]
/lustre/lxfs/work/ws/es_test01-open_mpi-0/ompi-tests/intel_tests/src/MPI_Allgatherv_c[0x40f5bf]
/lustre/lxfs/work/ws/es_test01-open_mpi-0/ompi-tests/intel_tests/src/MPI_Allgatherv_c[0x40edf4]
/lustre/lxfs/work/ws/es_test01-open_mpi-0/ompi-tests/intel_tests/src/MPI_Allgatherv_c[0x401c80]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7fc4cf1a8af5]
/lustre/lxfs/work/ws/es_test01-open_mpi-0/ompi-tests/intel_tests/src/MPI_Allgatherv_c[0x401a89]
---
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
---


On Tue, Nov 11, 2014 at 10:26:52AM -0800, Ralph Castain wrote:
> I think it would help understand this if you isolated it down to a single 
> test that is failing, rather than just citing an entire test suite. For 
> example, we know that the many-to-one test is never going to pass, regardless 
> of transport. We also know that the dynamic tests will fail with PSM as they 
> are not supported by that transport.
> 
> So could you find one test that doesn’t pass, and give us some info on that 
> one?
> 
> 
> > On Nov 11, 2014, at 10:04 AM, Adrian Reber  wrote:
> > 
> > Some more information about our PSM troubles.
> > 
> > Using 1.6.5 the test suite still works. It fails with 1.8.3 and
> > 1.8.4rc1. As long as all processes are running on one node it also
> > works. As soon as one process is running on a second node it fails with
> > the previously described errors. I also tried the 1.8 release and it has
> > the same error. Another way to trigger it with only two processes is:
> > 
> > mpirun --np 2 --map-by ppr:1:node   mpi_test_suite -t "environment"
> > 
> > Some change introduced between 1.6.5 and 1.8 broke this test case with
> > PSM. I have not yet been able to upgrade PSM to 3.3 but it seems more
> > Open MPI related than PSM.
> > 
> > Intel MPI (4.1.1) also has no trouble running the test cases.
> > 
> > Adrian
> > 
> > On Mon, Nov 10, 2014 at 09:12:41PM +, Friedley, Andrew wrote:
> >> Hi Adrian,
> >> 
> >> Yes, I suggest trying either RH support or Intel's support at  
> >> ibsupp...@intel.com.  They might have seen this problem before.  Since 
> >> you're running the RHEL versions of PSM and related software, one thing 
> >> you could try is IFS.  I think I was running IFS 7.3.0, so that's a 
> >> difference between your setup and mine.  At the least, it may help support 
> >> nail down the issue.
> >> 
> >> Andrew
> >> 
> >>> -Original Message-
> >>> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Adrian
> >>> Reber
> >>> Sent: Monday, November 10, 2014 12:39 PM
> >>> To: Open MPI Developers
> >>> Subject: Re: [OMPI devel] 1.8.3 and PSM errors
> >>> 
> >>> Andrew,
> >>> 
> >>> thanks for looking into this. I was able to reproduce this error on RHEL 
> >>> 7 with
> >>> PSM provided by RHEL:
> >>> 
> >>> infinipath-psm-3.2-2_ga8c3e3e_open.2.el7.x86_64
> >>> infinipath-psm-devel-3.2-2_ga8c3e3e_open.2.el7.x86_64
> >>> 
> >>> $ mpirun -np 32 mpi_test_suite -t "environment"
> >>> 
> >>> mpi_test_suite:4877 terminated with signal 11 at PC=7f5a2f4a2150
> >>> SP=7fff9e0ce770.  Backtrace:
> >>> /lib64/libpsm_infinipath.so.1(ips_proto_connect+0x630)[0x7f5a2f4a2150]
> >>> /lib64/libpsm_infinipath.so.1(ips_ptl_connect+0x3a)[0x7f5a2f49919a]
> >>> /lib64/libpsm_infinipath.so.1(__psm_ep_connect+0x3e7)[0x7f5a2f491727]
> >>> /opt/bwhpc/common/mpi/openmpi/1.8.3-gnu-
> >>> 4.8/lib/libmpi.so.1(ompi_mtl_psm_add_procs+0x1f3)[0x7f5a30054cf3]
> >>> /opt/bwhpc/common/mpi/openmpi/1.8.3-gnu-
> >>> 4.8/lib/libmpi.so.1(ompi_comm_get_rprocs+0x49a)[0x7f5a2ff221da]
> >>> /opt/bwhpc/common/mpi/openmpi/1.8.3-gnu-
> >>> 4.8/lib/libmpi.so.1(PMPI_Intercomm_create+0x2f2)[0x7f5a2ff51832]
> >>> mpi_test_suite[0x469420]
> >>> mpi_test_suite[0x441d8e]
> >>> /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f5a2f8ffaf5]
> >>> mpi_test_suite

Re: [OMPI devel] 1.8.3 and PSM errors

2014-11-11 Thread Ralph Castain
I think it would help understand this if you isolated it down to a single test 
that is failing, rather than just citing an entire test suite. For example, we 
know that the many-to-one test is never going to pass, regardless of transport. 
We also know that the dynamic tests will fail with PSM as they are not 
supported by that transport.

So could you find one test that doesn’t pass, and give us some info on that one?


> On Nov 11, 2014, at 10:04 AM, Adrian Reber  wrote:
> 
> Some more information about our PSM troubles.
> 
> Using 1.6.5 the test suite still works. It fails with 1.8.3 and
> 1.8.4rc1. As long as all processes are running on one node it also
> works. As soon as one process is running on a second node it fails with
> the previously described errors. I also tried the 1.8 release and it has
> the same error. Another way to trigger it with only two processes is:
> 
> mpirun --np 2 --map-by ppr:1:node   mpi_test_suite -t "environment"
> 
> Some change introduced between 1.6.5 and 1.8 broke this test case with
> PSM. I have not yet been able to upgrade PSM to 3.3 but it seems more
> Open MPI related than PSM.
> 
> Intel MPI (4.1.1) also has no trouble running the test cases.
> 
>   Adrian
> 
> On Mon, Nov 10, 2014 at 09:12:41PM +, Friedley, Andrew wrote:
>> Hi Adrian,
>> 
>> Yes, I suggest trying either RH support or Intel's support at  
>> ibsupp...@intel.com.  They might have seen this problem before.  Since 
>> you're running the RHEL versions of PSM and related software, one thing you 
>> could try is IFS.  I think I was running IFS 7.3.0, so that's a difference 
>> between your setup and mine.  At the least, it may help support nail down 
>> the issue.
>> 
>> Andrew
>> 
>>> -Original Message-
>>> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Adrian
>>> Reber
>>> Sent: Monday, November 10, 2014 12:39 PM
>>> To: Open MPI Developers
>>> Subject: Re: [OMPI devel] 1.8.3 and PSM errors
>>> 
>>> Andrew,
>>> 
>>> thanks for looking into this. I was able to reproduce this error on RHEL 7 
>>> with
>>> PSM provided by RHEL:
>>> 
>>> infinipath-psm-3.2-2_ga8c3e3e_open.2.el7.x86_64
>>> infinipath-psm-devel-3.2-2_ga8c3e3e_open.2.el7.x86_64
>>> 
>>> $ mpirun -np 32 mpi_test_suite -t "environment"
>>> 
>>> mpi_test_suite:4877 terminated with signal 11 at PC=7f5a2f4a2150
>>> SP=7fff9e0ce770.  Backtrace:
>>> /lib64/libpsm_infinipath.so.1(ips_proto_connect+0x630)[0x7f5a2f4a2150]
>>> /lib64/libpsm_infinipath.so.1(ips_ptl_connect+0x3a)[0x7f5a2f49919a]
>>> /lib64/libpsm_infinipath.so.1(__psm_ep_connect+0x3e7)[0x7f5a2f491727]
>>> /opt/bwhpc/common/mpi/openmpi/1.8.3-gnu-
>>> 4.8/lib/libmpi.so.1(ompi_mtl_psm_add_procs+0x1f3)[0x7f5a30054cf3]
>>> /opt/bwhpc/common/mpi/openmpi/1.8.3-gnu-
>>> 4.8/lib/libmpi.so.1(ompi_comm_get_rprocs+0x49a)[0x7f5a2ff221da]
>>> /opt/bwhpc/common/mpi/openmpi/1.8.3-gnu-
>>> 4.8/lib/libmpi.so.1(PMPI_Intercomm_create+0x2f2)[0x7f5a2ff51832]
>>> mpi_test_suite[0x469420]
>>> mpi_test_suite[0x441d8e]
>>> /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f5a2f8ffaf5]
>>> mpi_test_suite[0x405349]
>>> 
>>> Source RPM  : infinipath-psm-3.2-2_ga8c3e3e_open.2.el7.src.rpm
>>> Build Date  : Tue 04 Mar 2014 02:45:41 AM CET Build Host  : x86-
>>> 025.build.eng.bos.redhat.com Relocations : /usr
>>> Packager: Red Hat, Inc. 
>>> Vendor  : Red Hat, Inc.
>>> URL : 
>>> http://www.openfabrics.org/downloads/infinipath-psm/infinipath-
>>> psm-3.2-2_ga8c3e3e_open.tar.gz
>>> Summary : QLogic PSM Libraries
>>> 
>>> Is this supposed to work? Or is this something Red Hat has to fix?
>>> 
>>> Adrian
>>> 
>>> On Mon, Oct 27, 2014 at 10:22:08PM +, Friedley, Andrew wrote:
 Hi Adrian,
 
 I'm unable to reproduce here with OMPI v1.8.3 (I assume you're doing this
>>> with one 8-core node):
 
 $ mpirun -np 32 -mca pml cm -mca mtl psm ./mpi_test_suite -t
>>> "environment"
 (Rank:0) tst_test_array[0]:Status
 (Rank:0) tst_test_array[1]:Request_Null
 (Rank:0) tst_test_array[2]:Type_dup
 (Rank:0) tst_test_array[3]:Get_version Number of failed tests:0
 
 Works with various np from 8 to 32.  Your original case:
 
 $ mpirun -np 32 ./mpi_test_suite -t "All,^io,^one-sided"
 
 Runs for a while and eventually hits send cancellation errors.
 
 Any chance you could try updating your infinipath libraries?
 
 Andrew
 
> -Original Message-
> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Adrian
> Reber
> Sent: Monday, October 27, 2014 9:11 AM
> To: Open MPI Developers
> Subject: Re: [OMPI devel] 1.8.3 and PSM errors
> 
> This is a simpler test setup:
> 
> On 8 core machines this works:
> 
> $ mpirun  -np 8  mpi_test_suite -t "environment"
> [...]
> Number of failed tests:0
> 
> Using 9 or more cores it fails:
> 
> $ mpirun  -np 9  mpi_test_suite -t "environment"
>>>

Re: [OMPI devel] 1.8.3 and PSM errors

2014-11-11 Thread Adrian Reber
Some more information about our PSM troubles.

Using 1.6.5 the test suite still works. It fails with 1.8.3 and
1.8.4rc1. As long as all processes are running on one node it also
works. As soon as one process is running on a second node it fails with
the previously described errors. I also tried the 1.8 release and it has
the same error. Another way to trigger it with only two processes is:

mpirun --np 2 --map-by ppr:1:node   mpi_test_suite -t "environment"

Some change introduced between 1.6.5 and 1.8 broke this test case with
PSM. I have not yet been able to upgrade PSM to 3.3 but it seems more
Open MPI related than PSM.

Intel MPI (4.1.1) also has no trouble running the test cases.

Adrian

On Mon, Nov 10, 2014 at 09:12:41PM +, Friedley, Andrew wrote:
> Hi Adrian,
> 
> Yes, I suggest trying either RH support or Intel's support at  
> ibsupp...@intel.com.  They might have seen this problem before.  Since you're 
> running the RHEL versions of PSM and related software, one thing you could 
> try is IFS.  I think I was running IFS 7.3.0, so that's a difference between 
> your setup and mine.  At the least, it may help support nail down the issue.
> 
> Andrew
> 
> > -Original Message-
> > From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Adrian
> > Reber
> > Sent: Monday, November 10, 2014 12:39 PM
> > To: Open MPI Developers
> > Subject: Re: [OMPI devel] 1.8.3 and PSM errors
> > 
> > Andrew,
> > 
> > thanks for looking into this. I was able to reproduce this error on RHEL 7 
> > with
> > PSM provided by RHEL:
> > 
> > infinipath-psm-3.2-2_ga8c3e3e_open.2.el7.x86_64
> > infinipath-psm-devel-3.2-2_ga8c3e3e_open.2.el7.x86_64
> > 
> > $ mpirun -np 32 mpi_test_suite -t "environment"
> > 
> > mpi_test_suite:4877 terminated with signal 11 at PC=7f5a2f4a2150
> > SP=7fff9e0ce770.  Backtrace:
> > /lib64/libpsm_infinipath.so.1(ips_proto_connect+0x630)[0x7f5a2f4a2150]
> > /lib64/libpsm_infinipath.so.1(ips_ptl_connect+0x3a)[0x7f5a2f49919a]
> > /lib64/libpsm_infinipath.so.1(__psm_ep_connect+0x3e7)[0x7f5a2f491727]
> > /opt/bwhpc/common/mpi/openmpi/1.8.3-gnu-
> > 4.8/lib/libmpi.so.1(ompi_mtl_psm_add_procs+0x1f3)[0x7f5a30054cf3]
> > /opt/bwhpc/common/mpi/openmpi/1.8.3-gnu-
> > 4.8/lib/libmpi.so.1(ompi_comm_get_rprocs+0x49a)[0x7f5a2ff221da]
> > /opt/bwhpc/common/mpi/openmpi/1.8.3-gnu-
> > 4.8/lib/libmpi.so.1(PMPI_Intercomm_create+0x2f2)[0x7f5a2ff51832]
> > mpi_test_suite[0x469420]
> > mpi_test_suite[0x441d8e]
> > /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f5a2f8ffaf5]
> > mpi_test_suite[0x405349]
> > 
> > Source RPM  : infinipath-psm-3.2-2_ga8c3e3e_open.2.el7.src.rpm
> > Build Date  : Tue 04 Mar 2014 02:45:41 AM CET Build Host  : x86-
> > 025.build.eng.bos.redhat.com Relocations : /usr
> > Packager: Red Hat, Inc. 
> > Vendor  : Red Hat, Inc.
> > URL : 
> > http://www.openfabrics.org/downloads/infinipath-psm/infinipath-
> > psm-3.2-2_ga8c3e3e_open.tar.gz
> > Summary : QLogic PSM Libraries
> > 
> > Is this supposed to work? Or is this something Red Hat has to fix?
> > 
> > Adrian
> > 
> > On Mon, Oct 27, 2014 at 10:22:08PM +, Friedley, Andrew wrote:
> > > Hi Adrian,
> > >
> > > I'm unable to reproduce here with OMPI v1.8.3 (I assume you're doing this
> > with one 8-core node):
> > >
> > > $ mpirun -np 32 -mca pml cm -mca mtl psm ./mpi_test_suite -t
> > "environment"
> > > (Rank:0) tst_test_array[0]:Status
> > > (Rank:0) tst_test_array[1]:Request_Null
> > > (Rank:0) tst_test_array[2]:Type_dup
> > > (Rank:0) tst_test_array[3]:Get_version Number of failed tests:0
> > >
> > > Works with various np from 8 to 32.  Your original case:
> > >
> > > $ mpirun -np 32 ./mpi_test_suite -t "All,^io,^one-sided"
> > >
> > > Runs for a while and eventually hits send cancellation errors.
> > >
> > > Any chance you could try updating your infinipath libraries?
> > >
> > > Andrew
> > >
> > > > -Original Message-
> > > > From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Adrian
> > > > Reber
> > > > Sent: Monday, October 27, 2014 9:11 AM
> > > > To: Open MPI Developers
> > > > Subject: Re: [OMPI devel] 1.8.3 and PSM errors
> > > >
> > > > This is a simpler test setup:
> > > >
> > > > On 8 core machines this works:
> > > >
> > > > $ mpirun  -np 8  mpi_test_suite -t "environment"
> > > > [...]
> > > > Number of failed tests:0
> > > >
> > > > Using 9 or more cores it fails:
> > > >
> > > > $ mpirun  -np 9  mpi_test_suite -t "environment"
> > > >
> > > > mpi_test_suite:20293 terminated with signal 11 at PC=2b6d107fa9a4
> > > > SP=7fff06431a70.  Backtrace:
> > > > /usr/lib64/libpsm_infinipath.so.1(ips_proto_connect+0x334)[0x2b6d107
> > > > fa9a
> > > > 4]
> > > >
> > /usr/lib64/libpsm_infinipath.so.1(__psm_ep_connect+0x692)[0x2b6d107e
> > > > b1
> > > > 72]
> > > > /opt/bwhpc/common/mpi/openmpi/1.8.3-gnu-
> > > > 4.9/lib/libmpi.so.1(ompi_mtl_psm_add_procs+0x1a4)[0x2b6d0fa6e384]
> > > > /opt/bwhpc/common/mpi/openmpi/1.8.3-g

Re: [OMPI devel] OMPI devel] Jenkins vs master (and v1.8)

2014-11-11 Thread Mike Dubman
rhel6.4
we can provide ssh access to interested parties.

On Tue, Nov 11, 2014 at 2:01 PM, Gilles Gouaillardet <
gilles.gouaillar...@gmail.com> wrote:

> Thanks Mike,
>
> BTW what is the distro running on your test cluster ?
>
> Mike Dubman  wrote:
> ok, I disabled vader tests in SHMEM and it passes.
> it can be requested from jenkins by specifying "vader" in PR comment line.
>
> On Tue, Nov 11, 2014 at 11:04 AM, Gilles Gouaillardet <
> gilles.gouaillar...@iferc.org> wrote:
>
>>  Mike,
>>
>> that will remove the false positive, but also remove an important piece
>> of information :
>> there is something wrong with the master.
>>
>> would you mind discussion this on the weekly call ?
>>
>> Cheers,
>>
>> Gilles
>>
>>
>> On 2014/11/11 17:38, Mike Dubman wrote:
>>
>> how about if I will disable the failing test(s) and make jenkins to pass?
>> It will help us to make sure we don`t break something that did work before?
>>
>> On Tue, Nov 11, 2014 at 7:02 AM, Gilles Gouaillardet 
>>  wrote:
>>
>>
>>  Mike,
>>
>> Jenkins runs automated tests on each pull request, and i think this is a
>> good thing.
>>
>> recently, it reported a bunch of failure but i could not find anything
>> to blame in the PR itself.
>>
>> so i created a dummy PR https://github.com/open-mpi/ompi/pull/264 with
>> git commit --allow-empty
>> and waited for Jenkins to do its job.
>>
>> the test failed, which means there is an issue in the master.
>> from the master point of view, it is good to know there is an issue.
>> from the PR point of view, this is a false positive since the PR does
>> nothing wrong.
>>
>> i was unable to find anything on github that indicates the master does
>> not pass the automated tests.
>> is such automated test running vs the master ? if yes, where can we find
>> the results ?
>> in order to avoid dealing with false positive, is there any possibility
>> to disable automated tests on the PR
>> if the master does not pass the tests ?
>>
>> Cheers,
>>
>> Gilles
>>
>
>
>
> --
>
> Kind Regards,
>
> M.
>
>



-- 

Kind Regards,

M.


Re: [OMPI devel] OMPI devel] Jenkins vs master (and v1.8)

2014-11-11 Thread Gilles Gouaillardet
Thanks Mike,

BTW, what is the distro running on your test cluster?

Mike Dubman  wrote:
>ok, I disabled vader tests in SHMEM and it passes.
>
>it can be requested from jenkins by specifying "vader" in PR comment line.
>
>
>On Tue, Nov 11, 2014 at 11:04 AM, Gilles Gouaillardet 
> wrote:
>
>Mike,
>
>that will remove the false positive, but also remove an important piece of 
>information :
>there is something wrong with the master.
>
>would you mind discussion this on the weekly call ?
>
>Cheers,
>
>Gilles
>
>
>
>On 2014/11/11 17:38, Mike Dubman wrote:
>
>how about if I will disable the failing test(s) and make jenkins to pass? It
>will help us to make sure we don`t break something that did work before?
>
>On Tue, Nov 11, 2014 at 7:02 AM, Gilles Gouaillardet <
>gilles.gouaillar...@iferc.org> wrote:
>
>Mike,
>
>Jenkins runs automated tests on each pull request, and i think this is a
>good thing.
>
>recently, it reported a bunch of failure but i could not find anything
>to blame in the PR itself.
>
>so i created a dummy PR https://github.com/open-mpi/ompi/pull/264 with
>git commit --allow-empty
>and waited for Jenkins to do its job.
>
>the test failed, which means there is an issue in the master.
>from the master point of view, it is good to know there is an issue.
>from the PR point of view, this is a false positive since the PR does
>nothing wrong.
>
>i was unable to find anything on github that indicates the master does
>not pass the automated tests.
>is such automated test running vs the master ? if yes, where can we find
>the results ?
>in order to avoid dealing with false positive, is there any possibility
>to disable automated tests on the PR
>if the master does not pass the tests ?
>
>Cheers,
>
>Gilles
>
>
>
>
>
>
>
>-- 
>
>
>Kind Regards,
>
>
>M.
>


Re: [OMPI devel] Jenkins vs master (and v1.8)

2014-11-11 Thread Mike Dubman
OK, I disabled the vader tests in SHMEM and it passes now.
They can still be requested from jenkins by specifying "vader" in the PR comment line.

On Tue, Nov 11, 2014 at 11:04 AM, Gilles Gouaillardet <
gilles.gouaillar...@iferc.org> wrote:

>  Mike,
>
> that will remove the false positive, but also remove an important piece of
> information :
> there is something wrong with the master.
>
> would you mind discussion this on the weekly call ?
>
> Cheers,
>
> Gilles
>
>
> On 2014/11/11 17:38, Mike Dubman wrote:
>
> how about if I will disable the failing test(s) and make jenkins to pass?
> It will help us to make sure we don`t break something that did work before?
>
> On Tue, Nov 11, 2014 at 7:02 AM, Gilles Gouaillardet 
>  wrote:
>
>
>  Mike,
>
> Jenkins runs automated tests on each pull request, and i think this is a
> good thing.
>
> recently, it reported a bunch of failure but i could not find anything
> to blame in the PR itself.
>
> so i created a dummy PR https://github.com/open-mpi/ompi/pull/264 with
> git commit --allow-empty
> and waited for Jenkins to do its job.
>
> the test failed, which means there is an issue in the master.
> from the master point of view, it is good to know there is an issue.
> from the PR point of view, this is a false positive since the PR does
> nothing wrong.
>
> i was unable to find anything on github that indicates the master does
> not pass the automated tests.
> is such automated test running vs the master ? if yes, where can we find
> the results ?
> in order to avoid dealing with false positive, is there any possibility
> to disable automated tests on the PR
> if the master does not pass the tests ?
>
> Cheers,
>
> Gilles
>



-- 

Kind Regards,

M.


Re: [OMPI devel] Jenkins vs master (and v1.8)

2014-11-11 Thread Gilles Gouaillardet
Mike,

that will remove the false positive, but it also removes an important piece
of information:
there is something wrong with the master.

would you mind discussing this on the weekly call?

Cheers,

Gilles

On 2014/11/11 17:38, Mike Dubman wrote:
> how about if I will disable the failing test(s) and make jenkins to pass?
> It will help us to make sure we don`t break something that did work before?
>
> On Tue, Nov 11, 2014 at 7:02 AM, Gilles Gouaillardet <
> gilles.gouaillar...@iferc.org> wrote:
>
>> Mike,
>>
>> Jenkins runs automated tests on each pull request, and i think this is a
>> good thing.
>>
>> recently, it reported a bunch of failure but i could not find anything
>> to blame in the PR itself.
>>
>> so i created a dummy PR https://github.com/open-mpi/ompi/pull/264 with
>> git commit --allow-empty
>> and waited for Jenkins to do its job.
>>
>> the test failed, which means there is an issue in the master.
>> from the master point of view, it is good to know there is an issue.
>> from the PR point of view, this is a false positive since the PR does
>> nothing wrong.
>>
>> i was unable to find anything on github that indicates the master does
>> not pass the automated tests.
>> is such automated test running vs the master ? if yes, where can we find
>> the results ?
>> in order to avoid dealing with false positive, is there any possibility
>> to disable automated tests on the PR
>> if the master does not pass the tests ?
>>
>> Cheers,
>>
>> Gilles



Re: [OMPI devel] Jenkins vs master (and v1.8)

2014-11-11 Thread Mike Dubman
How about if I disable the failing test(s) and make jenkins pass?
It will help us make sure we don't break something that did work before.

On Tue, Nov 11, 2014 at 7:02 AM, Gilles Gouaillardet <
gilles.gouaillar...@iferc.org> wrote:

> Mike,
>
> Jenkins runs automated tests on each pull request, and i think this is a
> good thing.
>
> recently, it reported a bunch of failure but i could not find anything
> to blame in the PR itself.
>
> so i created a dummy PR https://github.com/open-mpi/ompi/pull/264 with
> git commit --allow-empty
> and waited for Jenkins to do its job.
>
> the test failed, which means there is an issue in the master.
> from the master point of view, it is good to know there is an issue.
> from the PR point of view, this is a false positive since the PR does
> nothing wrong.
>
> i was unable to find anything on github that indicates the master does
> not pass the automated tests.
> is such automated test running vs the master ? if yes, where can we find
> the results ?
> in order to avoid dealing with false positive, is there any possibility
> to disable automated tests on the PR
> if the master does not pass the tests ?
>
> Cheers,
>
> Gilles
>



-- 

Kind Regards,

M.


[OMPI devel] Jenkins vs master (and v1.8)

2014-11-11 Thread Gilles Gouaillardet
Mike,

Jenkins runs automated tests on each pull request, and I think this is a
good thing.

Recently, it reported a bunch of failures, but I could not find anything
to blame in the PR itself.

So I created a dummy PR https://github.com/open-mpi/ompi/pull/264 with
git commit --allow-empty
and waited for Jenkins to do its job.

The test failed, which means there is an issue in the master.
From the master's point of view, it is good to know there is an issue.
From the PR's point of view, this is a false positive, since the PR does
nothing wrong.

I was unable to find anything on github that indicates the master does
not pass the automated tests.
Is such an automated test run against the master? If yes, where can we find
the results?
In order to avoid dealing with false positives, is there any possibility
to disable the automated tests on a PR
if the master does not pass the tests?

Cheers,

Gilles