[OMPI devel] NEWS bullets

2014-01-27 Thread Jeff Squyres (jsquyres)
Please review the NEWS bullets for 1.7.4 on the trunk, and add any missing 
items, make corrections, etc.

Ralph: in particular, I'm sure I missed some ORTE-related bullets.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] 1.7.4rc2 is out

2014-01-27 Thread Ralph Castain
Just as an inducement: I believe 1.7.4 is complete at this time. The MTT runs 
look exceptionally clean, and we thoroughly beat this version up in the time 
since rc1.

So this is going to be a quick "smoke test" period prior to final release. 
Please give it a once-over to confirm nothing was inadvertently broken.

Barring any problems, release is scheduled for Fri 1/31

Thanks
Ralph

On Jan 27, 2014, at 7:54 PM, Jeff Squyres (jsquyres)  wrote:

> In the usual location:
> 
>http://www.open-mpi.org/software/ompi/v1.7/
> 
> Lots of changes since 1.7.4rc1, but we didn't keep a good NEWS file between 
> the two, so I can't list them all here.
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel



[OMPI devel] 1.7.4rc2 is out

2014-01-27 Thread Jeff Squyres (jsquyres)
In the usual location:

http://www.open-mpi.org/software/ompi/v1.7/

Lots of changes since 1.7.4rc1, but we didn't keep a good NEWS file between the 
two, so I can't list them all here.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] 1.7.4 (and trunk) breakages on 32bits architectures

2014-01-27 Thread Ralph Castain

On Jan 27, 2014, at 1:24 PM, Nathan Hjelm  wrote:

> On Mon, Jan 27, 2014 at 01:10:43PM -0800, Ralph Castain wrote:
>> Nathan, I have no idea what you are talking about. What I do know is that 
>> you had me commit a patch to the trunk and v1.7.4 that caused multiple 
>> warnings about 32-bit issues to appear in both cases. George is reporting 
>> issues that look very much like the ones I'd expect based on those warnings.
> 
> That patch for coll/ml was correct for 1.7.4 and no further action is
> required. The two crashes George reported are in 1) the new vader (not
> in 1.7.4), and 2) the coll/ml updates (also not in 1.7.4). Neither code
> path exists in 1.7.4.

My bad - I rechecked 1.7.4 and found the warnings indeed do not exist there.

> 
>> Release of 1.7.4 is still waiting for the patch you promised me last week to 
>> fix these problems. I don't give a rat's a$$ about SGI at this stage - I just 
>> want to get your patch that fixes 1.7.4 so we can release the stupid thing!!!
> 
> That patch did not cause any warnings on 1.7.4, only on the trunk, because it
> conflicted with the update to coll/ml. That update is not scheduled for 1.7.4,
> so I am waiting on ORNL to make sure we do the right thing.
> 
> As far as I am concerned, 1.7.4 should be ready to go, unless some new
> 32-bit issue has cropped up with coll/ml. Since MTT suggests we are fine
> with 32-bit, I doubt that is the case.

Brian is doing one last 32-bit smoke check to be sure, as none of the MTT tests
are running in that mode so far as I'm aware.

Thanks
Ralph

> 
> -Nathan
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel



Re: [OMPI devel] 1.7.4 (and trunk) breakages on 32bits architectures

2014-01-27 Thread Nathan Hjelm
On Mon, Jan 27, 2014 at 01:10:43PM -0800, Ralph Castain wrote:
> Nathan, I have no idea what you are talking about. What I do know is that you 
> had me commit a patch to the trunk and v1.7.4 that caused multiple warnings 
> about 32-bit issues to appear in both cases. George is reporting issues that 
> look very much like the ones I'd expect based on those warnings.

That patch for coll/ml was correct for 1.7.4 and no further action is
required. The two crashes George reported are in 1) the new vader (not
in 1.7.4), and 2) the coll/ml updates (also not in 1.7.4). Neither code
path exists in 1.7.4.

> Release of 1.7.4 is still waiting for the patch you promised me last week to 
> fix these problems. I don't give a rat's a$$ about SGI at this stage - I just 
> want to get your patch that fixes 1.7.4 so we can release the stupid thing!!!

That patch did not cause any warnings on 1.7.4, only on the trunk, because it
conflicted with the update to coll/ml. That update is not scheduled for 1.7.4,
so I am waiting on ORNL to make sure we do the right thing.

As far as I am concerned, 1.7.4 should be ready to go, unless some new
32-bit issue has cropped up with coll/ml. Since MTT suggests we are fine
with 32-bit, I doubt that is the case.

-Nathan




Re: [OMPI devel] 1.7.4 (and trunk) breakages on 32bits architectures

2014-01-27 Thread Ralph Castain
Nathan, I have no idea what you are talking about. What I do know is that you 
had me commit a patch to the trunk and v1.7.4 that caused multiple warnings 
about 32-bit issues to appear in both cases. George is reporting issues that 
look very much like the ones I'd expect based on those warnings.

Release of 1.7.4 is still waiting for the patch you promised me last week to 
fix these problems. I don't give a rat's a$$ about SGI at this stage - I just 
want to get your patch that fixes 1.7.4 so we can release the stupid thing!!!

Thanks
Ralph

On Jan 27, 2014, at 1:04 PM, Nathan Hjelm  wrote:

> Nope. Vader will not work on non-xpmem systems in 1.7.4. The CMR is
> still open for 1.7.5 (#4053). Issues like the one George reported are
> why I chose to hold off on the new vader until 1.7.5.
> 
> The fix is complete. At this point I am waiting on some feedback on
> changes to OMPI_CHECK_PACKAGE before committing.
> 
> -Nathan
> 
> On Mon, Jan 27, 2014 at 12:55:27PM -0800, Ralph Castain wrote:
>> Just FWIW: I believe that problem did indeed make it over to 1.7.4, and that 
>> release is on "hold" pending your fix. So while I'm happy to hear about 
>> xpmem on SGI, please do let us release 1.7.4!
>> 
>> 
>> On Jan 27, 2014, at 8:19 AM, Nathan Hjelm  wrote:
>> 
>>> Yup. Has to do with not having 64-bit atomic math. The fix is complete
>>> but I am working on another update to enable using xpmem on SGI
>>> systems. I will push the changes once that is complete.
>>> 
>>> -Nathan
>>> 
>>> On Mon, Jan 27, 2014 at 04:00:08PM +, Jeff Squyres (jsquyres) wrote:
 Is this the same issue Absoft is seeing in 32-bit builds on the trunk?  
 (i.e., 100% failure rate)
 
   http://mtt.open-mpi.org/index.php?do_redir=2142
 
 
 On Jan 27, 2014, at 10:38 AM, Nathan Hjelm  wrote:
 
> This shouldn't be affecting 1.7.4, since neither the vader nor the coll/ml
> updates have been moved yet. As for the trunk, I am working on a 32-bit fix
> for vader and it should be in later today. I will have to track down
> what is going wrong in the basesmuma initialization.
> 
> -Nathan
> 
> On Sun, Jan 26, 2014 at 04:10:29PM +0100, George Bosilca wrote:
>> I noticed two major issues on 32-bit machines. The first is with 
>> the vader BTL and the second with the selection logic in basesmuma 
>> (bcol_basesmuma_bank_init_opti). Both versions are impacted: trunk and 
>> 1.7.
>> 
>> If I turn off vader and bcol via the MCA parameters, everything runs just 
>> fine.
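The workaround George describes, excluding the suspect components via MCA parameters, might look like the following sketch. The exact framework and component names here are assumptions; disabling coll/ml is one plausible way to keep the bcol/basesmuma path out of play.

```shell
# Hypothetical sketch: exclude the suspect components via MCA
# environment variables (equivalent to mpirun --mca ... flags).
export OMPI_MCA_btl="^vader"   # drop the vader shared-memory BTL
export OMPI_MCA_coll="^ml"     # drop coll/ml, which drives bcol/basesmuma
echo "btl=${OMPI_MCA_btl} coll=${OMPI_MCA_coll}"
# mpirun -np 2 ./app           # would then avoid both crash sites
```

The same effect can be had per-run with `mpirun --mca btl ^vader --mca coll ^ml`.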
>> 
>> George.
>> 
>> ../trunk/configure --enable-debug --disable-mpi-cxx 
>> --disable-mpi-fortran --disable-io-romio 
>> --enable-contrib-no-build=vt,libtrace --enable-mpirun-prefix-by-default
>> 
>> 
>> - Vader generates a segfault for any application even with only 2 
>> processes, so this should be pretty easy to track.
>> 
>> Program received signal SIGSEGV, Segmentation fault.
>> (gdb) bt
>> #0  0x in ?? ()
>> #1  0x00ae43b3 in mca_btl_vader_poll_fifo ()
>>  at ../../../../../trunk/ompi/mca/btl/vader/btl_vader_component.c:394
>> #2  0x00ae444a in mca_btl_vader_component_progress ()
>>  at ../../../../../trunk/ompi/mca/btl/vader/btl_vader_component.c:421
>> #3  0x008fdb95 in opal_progress ()
>>  at ../../trunk/opal/runtime/opal_progress.c:186
>> #4  0x001961bc in ompi_request_default_test_some (count=13, 
>>  requests=0xb1f01d48, outcount=0xb2afb2d0, indices=0xb1f01f60, 
>>  statuses=0xb1f02178) at ../../trunk/ompi/request/req_test.c:316
>> #5  0x001def9a in PMPI_Testsome (incount=13, requests=0xb1f01d48, 
>>  outcount=0xb2afb2d0, indices=0xb1f01f60, statuses=0xb1f02178)
>>  at ptestsome.c:81
>> 
>> 
>> 
>> 
>> - basesmuma overwrites memory. The results_array can’t be released as 
>> the memory is corrupted. I did not have time to investigate too much, but 
>> it looks like pload_mgmt->data_bffs is either too small or data is somehow 
>> stored outside its boundaries.
>> 
>> *** glibc detected *** 
>> /home/bosilca/unstable/dplasma/trunk/build/debug/dplasma/testing/testing_dpotrf:
>>  free(): invalid next size (fast): 0x081f0798 ***
>> 
>> (gdb) bt
>> #0  0x00130424 in __kernel_vsyscall ()
>> #1  0x006bfb11 in raise () from /lib/libc.so.6
>> #2  0x006c13ea in abort () from /lib/libc.so.6
>> #3  0x006ff9d5 in __libc_message () from /lib/libc.so.6
>> #4  0x00705e31 in malloc_printerr () from /lib/libc.so.6
>> #5  0x00708571 in _int_free () from /lib/libc.so.6
>> #6  0x00c02d0e in bcol_basesmuma_bank_init_opti (ml_module=0x81dfe60, 
>>  bcol_module=0xb30b3008, reg_data=0x81e6698)
>>  at 
>> ../../../../../trunk/ompi/mca/bcol/basesmuma/bcol_basesmuma_buf_mgmt.c:472
>> #7  0x00b7035f in mca_coll_ml_register_bcols (ml_module=0x81dfe60)
>>  at ../../../../../trunk/ompi/mca/coll/ml/coll_ml_module.c:

Re: [OMPI devel] 1.7.4 (and trunk) breakages on 32bits architectures

2014-01-27 Thread Nathan Hjelm
Nope. Vader will not work on non-xpmem systems in 1.7.4. The CMR is
still open for 1.7.5 (#4053). Issues like the one George reported are
why I chose to hold off on the new vader until 1.7.5.

The fix is complete. At this point I am waiting on some feedback on
changes to OMPI_CHECK_PACKAGE before committing.

-Nathan

On Mon, Jan 27, 2014 at 12:55:27PM -0800, Ralph Castain wrote:
> Just FWIW: I believe that problem did indeed make it over to 1.7.4, and that 
> release is on "hold" pending your fix. So while I'm happy to hear about xpmem 
> on SGI, please do let us release 1.7.4!
> 
> 
> On Jan 27, 2014, at 8:19 AM, Nathan Hjelm  wrote:
> 
> > Yup. Has to do with not having 64-bit atomic math. The fix is complete
> > but I am working on another update to enable using xpmem on SGI
> > systems. I will push the changes once that is complete.
> > 
> > -Nathan
> > 
> > On Mon, Jan 27, 2014 at 04:00:08PM +, Jeff Squyres (jsquyres) wrote:
> >> Is this the same issue Absoft is seeing in 32-bit builds on the trunk?  
> >> (i.e., 100% failure rate)
> >> 
> >>http://mtt.open-mpi.org/index.php?do_redir=2142
> >> 
> >> 
> >> On Jan 27, 2014, at 10:38 AM, Nathan Hjelm  wrote:
> >> 
> >>> This shouldn't be affecting 1.7.4, since neither the vader nor the coll/ml
> >>> updates have been moved yet. As for the trunk, I am working on a 32-bit fix
> >>> for vader and it should be in later today. I will have to track down
> >>> what is going wrong in the basesmuma initialization.
> >>> 
> >>> -Nathan
> >>> 
> >>> On Sun, Jan 26, 2014 at 04:10:29PM +0100, George Bosilca wrote:
>  I noticed two major issues on 32-bit machines. The first is with 
>  the vader BTL and the second with the selection logic in basesmuma 
>  (bcol_basesmuma_bank_init_opti). Both versions are impacted: trunk and 
>  1.7.
>  
>  If I turn off vader and bcol via the MCA parameters, everything runs just 
>  fine.
>  
>  George.
>  
>  ../trunk/configure --enable-debug --disable-mpi-cxx 
>  --disable-mpi-fortran --disable-io-romio 
>  --enable-contrib-no-build=vt,libtrace --enable-mpirun-prefix-by-default
>  
>  
>  - Vader generates a segfault for any application even with only 2 
>  processes, so this should be pretty easy to track.
>  
>  Program received signal SIGSEGV, Segmentation fault.
>  (gdb) bt
>  #0  0x in ?? ()
>  #1  0x00ae43b3 in mca_btl_vader_poll_fifo ()
>    at ../../../../../trunk/ompi/mca/btl/vader/btl_vader_component.c:394
>  #2  0x00ae444a in mca_btl_vader_component_progress ()
>    at ../../../../../trunk/ompi/mca/btl/vader/btl_vader_component.c:421
>  #3  0x008fdb95 in opal_progress ()
>    at ../../trunk/opal/runtime/opal_progress.c:186
>  #4  0x001961bc in ompi_request_default_test_some (count=13, 
>    requests=0xb1f01d48, outcount=0xb2afb2d0, indices=0xb1f01f60, 
>    statuses=0xb1f02178) at ../../trunk/ompi/request/req_test.c:316
>  #5  0x001def9a in PMPI_Testsome (incount=13, requests=0xb1f01d48, 
>    outcount=0xb2afb2d0, indices=0xb1f01f60, statuses=0xb1f02178)
>    at ptestsome.c:81
>  
>  
>  
>  
>  - basesmuma overwrites memory. The results_array can’t be released as 
>  the memory is corrupted. I did not have time to investigate too much, but 
>  it looks like pload_mgmt->data_bffs is either too small or data is somehow 
>  stored outside its boundaries.
>  
>  *** glibc detected *** 
>  /home/bosilca/unstable/dplasma/trunk/build/debug/dplasma/testing/testing_dpotrf:
>   free(): invalid next size (fast): 0x081f0798 ***
>  
>  (gdb) bt
>  #0  0x00130424 in __kernel_vsyscall ()
>  #1  0x006bfb11 in raise () from /lib/libc.so.6
>  #2  0x006c13ea in abort () from /lib/libc.so.6
>  #3  0x006ff9d5 in __libc_message () from /lib/libc.so.6
>  #4  0x00705e31 in malloc_printerr () from /lib/libc.so.6
>  #5  0x00708571 in _int_free () from /lib/libc.so.6
>  #6  0x00c02d0e in bcol_basesmuma_bank_init_opti (ml_module=0x81dfe60, 
>    bcol_module=0xb30b3008, reg_data=0x81e6698)
>    at 
>  ../../../../../trunk/ompi/mca/bcol/basesmuma/bcol_basesmuma_buf_mgmt.c:472
>  #7  0x00b7035f in mca_coll_ml_register_bcols (ml_module=0x81dfe60)
>    at ../../../../../trunk/ompi/mca/coll/ml/coll_ml_module.c:513
>  #8  0x00b70651 in ml_module_memory_initialization (ml_module=0x81dfe60)
>    at ../../../../../trunk/ompi/mca/coll/ml/coll_ml_module.c:560
>  #9  0x00b733fd in ml_discover_hierarchy (ml_module=0x81dfe60)
>    at ../../../../../trunk/ompi/mca/coll/ml/coll_ml_module.c:1585
>  #10 0x00b7786e in mca_coll_ml_comm_query (comm=0x8127da0, 
>  priority=0xbfffe558)
>    at ../../../../../trunk/ompi/mca/coll/ml/coll_ml_module.c:2998
>  #11 0x00202ea6 in query_2_0_0 (component=0xbc6500, comm=0x8127da0, 
>    priority=0xbfffe558, module=0xbfffe580)
>    at ../../../../trunk/o

Re: [OMPI devel] 1.7.4 (and trunk) breakages on 32bits architectures

2014-01-27 Thread Paul Hargrove
Nathan,

To encourage you to focus on 1.7.4, I will delay testing vader on the SGI
UV until I've tested the next 1.7.4 release candidate (or final).

-Paul


On Mon, Jan 27, 2014 at 12:55 PM, Ralph Castain  wrote:

> Just FWIW: I believe that problem did indeed make it over to 1.7.4, and
> that release is on "hold" pending your fix. So while I'm happy to hear
> about xpmem on SGI, please do let us release 1.7.4!
>
>
> On Jan 27, 2014, at 8:19 AM, Nathan Hjelm  wrote:
>
> > Yup. Has to do with not having 64-bit atomic math. The fix is complete
> > but I am working on another update to enable using xpmem on SGI
> > systems. I will push the changes once that is complete.
> >
> > -Nathan
> >
> > On Mon, Jan 27, 2014 at 04:00:08PM +, Jeff Squyres (jsquyres) wrote:
> >> Is this the same issue Absoft is seeing in 32-bit builds on the trunk?
>  (i.e., 100% failure rate)
> >>
> >>http://mtt.open-mpi.org/index.php?do_redir=2142
> >>
> >>
> >> On Jan 27, 2014, at 10:38 AM, Nathan Hjelm  wrote:
> >>
> >>> This shouldn't be affecting 1.7.4, since neither the vader nor the coll/ml
> >>> updates have been moved yet. As for the trunk, I am working on a 32-bit fix
> >>> for vader and it should be in later today. I will have to track down
> >>> what is going wrong in the basesmuma initialization.
> >>>
> >>> -Nathan
> >>>
> >>> On Sun, Jan 26, 2014 at 04:10:29PM +0100, George Bosilca wrote:
>  I noticed two major issues on 32-bit machines. The first is with
> the vader BTL and the second with the selection logic in basesmuma
> (bcol_basesmuma_bank_init_opti). Both versions are impacted: trunk and 1.7.
> 
>  If I turn off vader and bcol via the MCA parameters, everything runs
> just fine.
> 
>  George.
> 
>  ../trunk/configure --enable-debug --disable-mpi-cxx
> --disable-mpi-fortran --disable-io-romio
> --enable-contrib-no-build=vt,libtrace --enable-mpirun-prefix-by-default
> 
> 
>  - Vader generates a segfault for any application even with only 2
> processes, so this should be pretty easy to track.
> 
>  Program received signal SIGSEGV, Segmentation fault.
>  (gdb) bt
>  #0  0x in ?? ()
>  #1  0x00ae43b3 in mca_btl_vader_poll_fifo ()
>    at ../../../../../trunk/ompi/mca/btl/vader/btl_vader_component.c:394
>  #2  0x00ae444a in mca_btl_vader_component_progress ()
>    at ../../../../../trunk/ompi/mca/btl/vader/btl_vader_component.c:421
>  #3  0x008fdb95 in opal_progress ()
>    at ../../trunk/opal/runtime/opal_progress.c:186
>  #4  0x001961bc in ompi_request_default_test_some (count=13,
>    requests=0xb1f01d48, outcount=0xb2afb2d0, indices=0xb1f01f60,
>    statuses=0xb1f02178) at ../../trunk/ompi/request/req_test.c:316
>  #5  0x001def9a in PMPI_Testsome (incount=13, requests=0xb1f01d48,
>    outcount=0xb2afb2d0, indices=0xb1f01f60, statuses=0xb1f02178)
>    at ptestsome.c:81
> 
> 
> 
> 
>  - basesmuma overwrites memory. The results_array can’t be released
> as the memory is corrupted. I did not have time to investigate too much, but
> it looks like pload_mgmt->data_bffs is either too small or data is somehow
> stored outside its boundaries.
> 
>  *** glibc detected ***
> /home/bosilca/unstable/dplasma/trunk/build/debug/dplasma/testing/testing_dpotrf:
> free(): invalid next size (fast): 0x081f0798 ***
> 
>  (gdb) bt
>  #0  0x00130424 in __kernel_vsyscall ()
>  #1  0x006bfb11 in raise () from /lib/libc.so.6
>  #2  0x006c13ea in abort () from /lib/libc.so.6
>  #3  0x006ff9d5 in __libc_message () from /lib/libc.so.6
>  #4  0x00705e31 in malloc_printerr () from /lib/libc.so.6
>  #5  0x00708571 in _int_free () from /lib/libc.so.6
>  #6  0x00c02d0e in bcol_basesmuma_bank_init_opti (ml_module=0x81dfe60,
>    bcol_module=0xb30b3008, reg_data=0x81e6698)
>    at
> ../../../../../trunk/ompi/mca/bcol/basesmuma/bcol_basesmuma_buf_mgmt.c:472
>  #7  0x00b7035f in mca_coll_ml_register_bcols (ml_module=0x81dfe60)
>    at ../../../../../trunk/ompi/mca/coll/ml/coll_ml_module.c:513
>  #8  0x00b70651 in ml_module_memory_initialization
> (ml_module=0x81dfe60)
>    at ../../../../../trunk/ompi/mca/coll/ml/coll_ml_module.c:560
>  #9  0x00b733fd in ml_discover_hierarchy (ml_module=0x81dfe60)
>    at ../../../../../trunk/ompi/mca/coll/ml/coll_ml_module.c:1585
>  #10 0x00b7786e in mca_coll_ml_comm_query (comm=0x8127da0,
> priority=0xbfffe558)
>    at ../../../../../trunk/ompi/mca/coll/ml/coll_ml_module.c:2998
>  #11 0x00202ea6 in query_2_0_0 (component=0xbc6500, comm=0x8127da0,
>    priority=0xbfffe558, module=0xbfffe580)
>    at ../../../../trunk/ompi/mca/coll/base/coll_base_comm_select.c:375
>  #12 0x00202e7f in query (component=0xbc6500, comm=0x8127da0,
>    priority=0xbfffe558, module=0xbfffe580)
>  ---Type  to continue, or q  to quit---
>    at ../../../../trunk/ompi/mca/coll/base/coll_base_comm_select.c:35

Re: [OMPI devel] 1.7.4 (and trunk) breakages on 32bits architectures

2014-01-27 Thread Ralph Castain
Just FWIW: I believe that problem did indeed make it over to 1.7.4, and that 
release is on "hold" pending your fix. So while I'm happy to hear about xpmem 
on SGI, please do let us release 1.7.4!


On Jan 27, 2014, at 8:19 AM, Nathan Hjelm  wrote:

> Yup. Has to do with not having 64-bit atomic math. The fix is complete
> but I am working on another update to enable using xpmem on SGI
> systems. I will push the changes once that is complete.
> 
> -Nathan
> 
> On Mon, Jan 27, 2014 at 04:00:08PM +, Jeff Squyres (jsquyres) wrote:
>> Is this the same issue Absoft is seeing in 32-bit builds on the trunk?  
>> (i.e., 100% failure rate)
>> 
>>http://mtt.open-mpi.org/index.php?do_redir=2142
>> 
>> 
>> On Jan 27, 2014, at 10:38 AM, Nathan Hjelm  wrote:
>> 
>>> This shouldn't be affecting 1.7.4, since neither the vader nor the coll/ml
>>> updates have been moved yet. As for the trunk, I am working on a 32-bit fix
>>> for vader and it should be in later today. I will have to track down
>>> what is going wrong in the basesmuma initialization.
>>> 
>>> -Nathan
>>> 
>>> On Sun, Jan 26, 2014 at 04:10:29PM +0100, George Bosilca wrote:
 I noticed two major issues on 32-bit machines. The first is with the 
 vader BTL and the second with the selection logic in basesmuma 
 (bcol_basesmuma_bank_init_opti). Both versions are impacted: trunk and 1.7.
 
 If I turn off vader and bcol via the MCA parameters, everything runs just 
 fine.
 
 George.
 
 ../trunk/configure --enable-debug --disable-mpi-cxx --disable-mpi-fortran 
 --disable-io-romio --enable-contrib-no-build=vt,libtrace 
 --enable-mpirun-prefix-by-default
 
 
 - Vader generates a segfault for any application even with only 2 
 processes, so this should be pretty easy to track.
 
 Program received signal SIGSEGV, Segmentation fault.
 (gdb) bt
 #0  0x in ?? ()
 #1  0x00ae43b3 in mca_btl_vader_poll_fifo ()
   at ../../../../../trunk/ompi/mca/btl/vader/btl_vader_component.c:394
 #2  0x00ae444a in mca_btl_vader_component_progress ()
   at ../../../../../trunk/ompi/mca/btl/vader/btl_vader_component.c:421
 #3  0x008fdb95 in opal_progress ()
   at ../../trunk/opal/runtime/opal_progress.c:186
 #4  0x001961bc in ompi_request_default_test_some (count=13, 
   requests=0xb1f01d48, outcount=0xb2afb2d0, indices=0xb1f01f60, 
   statuses=0xb1f02178) at ../../trunk/ompi/request/req_test.c:316
 #5  0x001def9a in PMPI_Testsome (incount=13, requests=0xb1f01d48, 
   outcount=0xb2afb2d0, indices=0xb1f01f60, statuses=0xb1f02178)
   at ptestsome.c:81
 
 
 
 
 - basesmuma overwrites memory. The results_array can’t be released as 
 the memory is corrupted. I did not have time to investigate too much, but 
 it looks like pload_mgmt->data_bffs is either too small or data is somehow 
 stored outside its boundaries.
 
 *** glibc detected *** 
 /home/bosilca/unstable/dplasma/trunk/build/debug/dplasma/testing/testing_dpotrf:
  free(): invalid next size (fast): 0x081f0798 ***
 
 (gdb) bt
 #0  0x00130424 in __kernel_vsyscall ()
 #1  0x006bfb11 in raise () from /lib/libc.so.6
 #2  0x006c13ea in abort () from /lib/libc.so.6
 #3  0x006ff9d5 in __libc_message () from /lib/libc.so.6
 #4  0x00705e31 in malloc_printerr () from /lib/libc.so.6
 #5  0x00708571 in _int_free () from /lib/libc.so.6
 #6  0x00c02d0e in bcol_basesmuma_bank_init_opti (ml_module=0x81dfe60, 
   bcol_module=0xb30b3008, reg_data=0x81e6698)
   at 
 ../../../../../trunk/ompi/mca/bcol/basesmuma/bcol_basesmuma_buf_mgmt.c:472
 #7  0x00b7035f in mca_coll_ml_register_bcols (ml_module=0x81dfe60)
   at ../../../../../trunk/ompi/mca/coll/ml/coll_ml_module.c:513
 #8  0x00b70651 in ml_module_memory_initialization (ml_module=0x81dfe60)
   at ../../../../../trunk/ompi/mca/coll/ml/coll_ml_module.c:560
 #9  0x00b733fd in ml_discover_hierarchy (ml_module=0x81dfe60)
   at ../../../../../trunk/ompi/mca/coll/ml/coll_ml_module.c:1585
 #10 0x00b7786e in mca_coll_ml_comm_query (comm=0x8127da0, 
 priority=0xbfffe558)
   at ../../../../../trunk/ompi/mca/coll/ml/coll_ml_module.c:2998
 #11 0x00202ea6 in query_2_0_0 (component=0xbc6500, comm=0x8127da0, 
   priority=0xbfffe558, module=0xbfffe580)
   at ../../../../trunk/ompi/mca/coll/base/coll_base_comm_select.c:375
 #12 0x00202e7f in query (component=0xbc6500, comm=0x8127da0, 
   priority=0xbfffe558, module=0xbfffe580)
 ---Type  to continue, or q  to quit---
   at ../../../../trunk/ompi/mca/coll/base/coll_base_comm_select.c:358
 #13 0x00202d9e in check_one_component (comm=0x8127da0, component=0xbc6500, 
   module=0xbfffe580)
   at ../../../../trunk/ompi/mca/coll/base/coll_base_comm_select.c:320
 #14 0x00202bce in check_components (components=0x253d70, comm=0x8127da0)
   at ../../../../trunk/ompi/mca/coll/base

Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r30391 - in trunk: . config oshmem oshmem/shmem/fortran oshmem/tools/oshmem_info

2014-01-27 Thread Mike Dubman
Thanks Jeff,
we will check and address it.


On Fri, Jan 24, 2014 at 7:13 PM, Jeff Squyres (jsquyres)  wrote:

> Mellanox --
>
> Some comments on the commit below.
>
>
> On Jan 23, 2014, at 2:49 AM, svn-commit-mai...@open-mpi.org wrote:
>
> > Modified: trunk/config/oshmem_configure_options.m4
> >
> ==
> > --- trunk/config/oshmem_configure_options.m4  Thu Jan 23 02:29:23 2014
>  (r30390)
> > +++ trunk/config/oshmem_configure_options.m4  2014-01-23 02:49:13 EST
> (Thu, 23 Jan 2014)  (r30391)
> > @@ -95,19 +95,19 @@
> >[enable OSHMEM Fortran bindings (default: enabled if
> Fortran compiler found)]))
> > if test "$enable_oshmem_fortran" != "no" -a "$ompi_fortran_happy" = 1;
> then
> > # If no OMPI FORTRAN, bail
> > -   AS_IF([test $OMPI_WANT_FORTRAN_BINDINGS -eq 0],
> > +   AS_IF([test $OMPI_WANT_FORTRAN_BINDINGS -eq 0 -a "$enable_oshmem" !=
> "no"],
> >[AC_MSG_RESULT([bad value OMPI_WANT_FORTRAN_BINDINGS:
> ($OMPI_WANT_FORTRAN_BINDINGS)])
> > AC_MSG_WARN([Your explicit request to
> --enable-oshmem-fortran can only be satisfied if fortran support is enabled
> in OMPI. You are seeing this message for one of two reasons:
> > 1. OMPI fortran support has been explicitly disabled
> via --disable-mpi-fortran, in which case you cannot
> --enable-oshmem-fortran. Configure will abort because you, a human, have
> explicitly asked for something that cannot be provided.
> > 2. OMPI fortran support is implicitly not being
> built because no fortran compiler could be found on your system. Configure
> will abort because you, a human, have explicitly asked for something that
> cannot be provided.])
>
> As you noted in a comment below, we haven't searched for a Fortran
> compiler yet.  So the above message isn't correct.  Specifically: you seem
> to be entering this code path only for case #1.
>
> Please update the AC_MSG_WARN message.
>
> > AC_MSG_ERROR([Cannot continue])])
> > AC_MSG_RESULT([yes])
> > -OSHMEM_FORTRAN_BINDINGS=1
> > else
> > AC_MSG_RESULT([no])
> > -OSHMEM_FORTRAN_BINDINGS=0
> > fi
> > -AM_CONDITIONAL(OSHMEM_WANT_FORTRAN_BINDINGS,
> > -[test $OSHMEM_FORTRAN_BINDINGS -eq 1])
> > +
> > +#
> > +# We can't set am_conditional here since it's yet unknown if there is
> a valid Fortran compiler available
> > +#
> > ]) dnl
> >
> > Modified: trunk/configure.ac
> >
> ==
> > --- trunk/configure.acThu Jan 23 02:29:23 2014(r30390)
> > +++ trunk/configure.ac2014-01-23 02:49:13 EST (Thu, 23 Jan
> 2014)  (r30391)
> > @@ -1273,6 +1273,11 @@
> > # a C++ compiler.
> > AS_IF([test "$OMPI_WANT_FORTRAN_BINDINGS" != "1"],[F77=no FC=no])
> >
> > +AM_CONDITIONAL(OSHMEM_BUILD_FORTRAN_BINDINGS,
> > +[test "$ompi_fortran_happy" == "1" -a \
> > +  "$OMPI_WANT_FORTRAN_BINDINGS" == "1" -a \
> > +  "$oshmem_fortran_enable" != "no"])
> > +
> > LT_CONFIG_LTDL_DIR([opal/libltdl], [subproject])
> > LTDL_CONVENIENCE
> > LT_INIT([dlopen win32-dll])
>
> This seems like the wrong place in configure.ac to put this check -- you
> put this OSHMEM check right in the middle of the libtool setup code (the
> Fortran check here is part of the libtool setup; see the comment right
> before that in configure.ac that describes what's going on).
>
> Why not put the OSHMEM check way up near/after the call to setup the
> Fortran MPI stuff?
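A sketch of what that relocation might look like, using the names from the quoted commit (the placement is an assumption based on Jeff's suggestion; note also that portable test(1) comparisons use `=` rather than the bashism `==`):

```m4
# Hypothetical placement in configure.ac, after the Fortran MPI setup
# macros have run (so $ompi_fortran_happy is already decided), rather
# than in the middle of the libtool setup block.
AM_CONDITIONAL([OSHMEM_BUILD_FORTRAN_BINDINGS],
               [test "$ompi_fortran_happy" = "1" -a \
                     "$OMPI_WANT_FORTRAN_BINDINGS" = "1" -a \
                     "$oshmem_fortran_enable" != "no"])
```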
>
> > Modified: trunk/oshmem/Makefile.am
> >
> ==
> > --- trunk/oshmem/Makefile.am  Thu Jan 23 02:29:23 2014(r30390)
> > +++ trunk/oshmem/Makefile.am  2014-01-23 02:49:13 EST (Thu, 23 Jan 2014)
>  (r30391)
> > @@ -17,7 +17,7 @@
> > endif
> >
> > # Do we have the Fortran bindings?
> > -if OSHMEM_WANT_FORTRAN_BINDINGS
> > +if OSHMEM_BUILD_FORTRAN_BINDINGS
> > fortran_oshmem_lib = shmem/fortran/liboshmem_fortran.la
> > else
> > fortran_oshmem_lib =
> >
> > Modified: trunk/oshmem/shmem/fortran/Makefile.am
> >
> ==
> > --- trunk/oshmem/shmem/fortran/Makefile.amThu Jan 23 02:29:23 2014
>  (r30390)
> > +++ trunk/oshmem/shmem/fortran/Makefile.am2014-01-23 02:49:13 EST
> (Thu, 23 Jan 2014)  (r30391)
> > @@ -11,7 +11,7 @@
> >
> > AM_CPPFLAGS = -DOSHMEM_PROFILING_DEFINES=0
> -DOSHMEM_HAVE_WEAK_SYMBOLS=0
> >
> > -if OSHMEM_WANT_FORTRAN_BINDINGS
> > +if OSHMEM_BUILD_FORTRAN_BINDINGS
> > oshmem_fortran_lib   = liboshmem_fortran.la
> > else
> > oshmem_fortran_lib   =
> >
> > Modified: trunk/oshmem/tools/oshmem_info/Makefile.am
> >
> ==
> > --- trunk/oshmem/tools/oshmem_info/Makefile.amThu Jan 23
> 02:29:23 2014(r30390)
> > +++ trunk/oshmem/t

Re: [OMPI devel] 1.7.4 (and trunk) breakages on 32bits architectures

2014-01-27 Thread Nathan Hjelm
Yup. Has to do with not having 64-bit atomic math. The fix is complete
but I am working on another update to enable using xpmem on SGI
systems. I will push the changes once that is complete.

-Nathan

On Mon, Jan 27, 2014 at 04:00:08PM +, Jeff Squyres (jsquyres) wrote:
> Is this the same issue Absoft is seeing in 32-bit builds on the trunk?  
> (i.e., 100% failure rate)
> 
> http://mtt.open-mpi.org/index.php?do_redir=2142
> 
> 
> On Jan 27, 2014, at 10:38 AM, Nathan Hjelm  wrote:
> 
> > This shouldn't be affecting 1.7.4, since neither the vader nor the coll/ml
> > updates have been moved yet. As for the trunk, I am working on a 32-bit fix
> > for vader and it should be in later today. I will have to track down
> > what is going wrong in the basesmuma initialization.
> > 
> > -Nathan
> > 
> > On Sun, Jan 26, 2014 at 04:10:29PM +0100, George Bosilca wrote:
> >> I noticed two major issues on 32-bit machines. The first is with the 
> >> vader BTL and the second with the selection logic in basesmuma 
> >> (bcol_basesmuma_bank_init_opti). Both versions are impacted: trunk and 1.7.
> >> 
> >> If I turn off vader and bcol via the MCA parameters, everything runs just 
> >> fine.
> >> 
> >>  George.
> >> 
> >> ../trunk/configure --enable-debug --disable-mpi-cxx --disable-mpi-fortran 
> >> --disable-io-romio --enable-contrib-no-build=vt,libtrace 
> >> --enable-mpirun-prefix-by-default
> >> 
> >> 
> >> - Vader generates a segfault for any application even with only 2 
> >> processes, so this should be pretty easy to track.
> >> 
> >> Program received signal SIGSEGV, Segmentation fault.
> >> (gdb) bt
> >> #0  0x in ?? ()
> >> #1  0x00ae43b3 in mca_btl_vader_poll_fifo ()
> >>at ../../../../../trunk/ompi/mca/btl/vader/btl_vader_component.c:394
> >> #2  0x00ae444a in mca_btl_vader_component_progress ()
> >>at ../../../../../trunk/ompi/mca/btl/vader/btl_vader_component.c:421
> >> #3  0x008fdb95 in opal_progress ()
> >>at ../../trunk/opal/runtime/opal_progress.c:186
> >> #4  0x001961bc in ompi_request_default_test_some (count=13, 
> >>requests=0xb1f01d48, outcount=0xb2afb2d0, indices=0xb1f01f60, 
> >>statuses=0xb1f02178) at ../../trunk/ompi/request/req_test.c:316
> >> #5  0x001def9a in PMPI_Testsome (incount=13, requests=0xb1f01d48, 
> >>outcount=0xb2afb2d0, indices=0xb1f01f60, statuses=0xb1f02178)
> >>at ptestsome.c:81
> >> 
> >> 
> >> 
> >> 
> >> - basesmuma overwrites memory. The results_array can’t be released as 
> >> the memory is corrupted. I did not have time to investigate too much, but 
> >> it looks like pload_mgmt->data_bffs is either too small or data is somehow 
> >> stored outside its boundaries.
> >> 
> >> *** glibc detected *** 
> >> /home/bosilca/unstable/dplasma/trunk/build/debug/dplasma/testing/testing_dpotrf:
> >>  free(): invalid next size (fast): 0x081f0798 ***
> >> 
> >> (gdb) bt
> >> #0  0x00130424 in __kernel_vsyscall ()
> >> #1  0x006bfb11 in raise () from /lib/libc.so.6
> >> #2  0x006c13ea in abort () from /lib/libc.so.6
> >> #3  0x006ff9d5 in __libc_message () from /lib/libc.so.6
> >> #4  0x00705e31 in malloc_printerr () from /lib/libc.so.6
> >> #5  0x00708571 in _int_free () from /lib/libc.so.6
> >> #6  0x00c02d0e in bcol_basesmuma_bank_init_opti (ml_module=0x81dfe60, 
> >>bcol_module=0xb30b3008, reg_data=0x81e6698)
> >>at 
> >> ../../../../../trunk/ompi/mca/bcol/basesmuma/bcol_basesmuma_buf_mgmt.c:472
> >> #7  0x00b7035f in mca_coll_ml_register_bcols (ml_module=0x81dfe60)
> >>at ../../../../../trunk/ompi/mca/coll/ml/coll_ml_module.c:513
> >> #8  0x00b70651 in ml_module_memory_initialization (ml_module=0x81dfe60)
> >>at ../../../../../trunk/ompi/mca/coll/ml/coll_ml_module.c:560
> >> #9  0x00b733fd in ml_discover_hierarchy (ml_module=0x81dfe60)
> >>at ../../../../../trunk/ompi/mca/coll/ml/coll_ml_module.c:1585
> >> #10 0x00b7786e in mca_coll_ml_comm_query (comm=0x8127da0, 
> >> priority=0xbfffe558)
> >>at ../../../../../trunk/ompi/mca/coll/ml/coll_ml_module.c:2998
> >> #11 0x00202ea6 in query_2_0_0 (component=0xbc6500, comm=0x8127da0, 
> >>priority=0xbfffe558, module=0xbfffe580)
> >>at ../../../../trunk/ompi/mca/coll/base/coll_base_comm_select.c:375
> >> #12 0x00202e7f in query (component=0xbc6500, comm=0x8127da0, 
> >>priority=0xbfffe558, module=0xbfffe580)
> >> ---Type <return> to continue, or q <return> to quit---
> >>at ../../../../trunk/ompi/mca/coll/base/coll_base_comm_select.c:358
> >> #13 0x00202d9e in check_one_component (comm=0x8127da0, component=0xbc6500, 
> >>module=0xbfffe580)
> >>at ../../../../trunk/ompi/mca/coll/base/coll_base_comm_select.c:320
> >> #14 0x00202bce in check_components (components=0x253d70, comm=0x8127da0)
> >>at ../../../../trunk/ompi/mca/coll/base/coll_base_comm_select.c:284
> >> #15 0x001fbbe1 in mca_coll_base_comm_select (comm=0x8127da0)
> >>at ../../../../trunk/ompi/mca/coll/base/coll_base_comm_select.c:117
> >> #16 0x0019872f in ompi_mpi_init (argc=7, argv=0xbfffee74, requested=0, 
> >>provided=0xbfffe970) at ../../trunk/ompi/runtime/ompi_mpi_init.c:894
> >> #17 0x001c9509 in PMPI_Init (argc=0xbfffe9c0, argv=0xbfffe9c4) at pinit.c:84
> 

[OMPI devel] SNAPC: dynamic send buffers

2014-01-27 Thread Adrian Reber
I have the following patches which I would like to commit. All changes
are in the SNAPC component. The first patch replaces all statically
allocated buffers with dynamically allocate buffers. The second patch
removes compiler warnings and the last patch tries to re-introduce
functionality which I removed with my 'getting-it-compiled-again'
patches. Instead of blocking recv() calls, it now uses non-blocking
receives followed by ORTE_WAIT_FOR_COMPLETION(). I included gitweb links
to the patches.

Please have a look at the patches.

Adrian

commit 6f10b44499b59c84d9032378c7f8c6b3526a029b
Author: Adrian Reber 
Date:   Sun Jan 26 12:10:41 2014 +0100

SNAPC: use dynamic buffers for rml.send and rml.recv

The snapc component was still using static buffers
for send_buffer_nb(). This patch changes opal_buffer_t buffer;
to opal_buffer_t *buffer;

 orte/mca/snapc/full/snapc_full_app.c    | 119 +++---
 orte/mca/snapc/full/snapc_full_global.c |  73 
 orte/mca/snapc/full/snapc_full_local.c  |  33 +++--
 3 files changed, 114 insertions(+), 111 deletions(-)

  
https://lisas.de/git/?p=open-mpi.git;a=commitdiff;h=6f10b44499b59c84d9032378c7f8c6b3526a029b

commit 218d04ad663ad76ad23cd99b62e83c435ccfe418
Author: Adrian Reber 
Date:   Mon Jan 27 12:49:30 2014 +0100

SNAPC: remove compiler warnings

 orte/mca/snapc/full/snapc_full_global.c | 19 +--
 orte/mca/snapc/full/snapc_full_local.c  | 29 ++---
 2 files changed, 11 insertions(+), 37 deletions(-)

  
https://lisas.de/git/?p=open-mpi.git;a=commitdiff;h=218d04ad663ad76ad23cd99b62e83c435ccfe418

commit 67d435cbe5df5c59519d605ce25443880244d2d5
Author: Adrian Reber 
Date:   Mon Jan 27 14:31:36 2014 +0100

use ORTE_WAIT_FOR_COMPLETION with non-blocking receives

During the commits to make the C/R code compile again the
blocking receive calls in snapc_full_app.c were
replaced by non-blocking receive calls with a dummy callback
function. This commit adds ORTE_WAIT_FOR_COMPLETION()
after each non-blocking receive to wait for the data.

 orte/mca/snapc/full/snapc_full_app.c | 56 +---
 1 file changed, 17 insertions(+), 39 deletions(-)

  
https://lisas.de/git/?p=open-mpi.git;a=commitdiff;h=67d435cbe5df5c59519d605ce25443880244d2d5


Re: [OMPI devel] 1.7.4 (and trunk) breakages on 32bits architectures

2014-01-27 Thread Jeff Squyres (jsquyres)
Is this the same issue Absoft is seeing in 32-bit builds on the trunk?  (i.e., 
100% failure rate)

http://mtt.open-mpi.org/index.php?do_redir=2142


On Jan 27, 2014, at 10:38 AM, Nathan Hjelm  wrote:

> This shouldn't be affecting 1.7.4 since neither the vader nor the coll/ml
> updates have been moved yet. As for trunk, I am working on a 32-bit fix
> for vader and it should be in later today. I will have to track down
> what is going wrong in the basesmuma initialization.
> 
> -Nathan
> 
> On Sun, Jan 26, 2014 at 04:10:29PM +0100, George Bosilca wrote:
>> I noticed two major issues on 32-bit machines. The first one is with the 
>> vader BTL and the second with the selection logic in basesmuma 
>> (bcol_basesmuma_bank_init_opti). Both versions are impacted: trunk and 1.7.
>> 
>> If I turn off vader and bcol via the MCA parameters everything runs just 
>> fine.
>> 
>>  George.
>> 
>> ../trunk/configure --enable-debug --disable-mpi-cxx --disable-mpi-fortran 
>> --disable-io-romio --enable-contrib-no-build=vt,libtrace 
>> --enable-mpirun-prefix-by-default
>> 
>> 
>> - Vader generates a segfault for any application even with only 2 processes, 
>> so this should be pretty easy to track.
>> 
>> Program received signal SIGSEGV, Segmentation fault.
>> (gdb) bt
>> #0  0x in ?? ()
>> #1  0x00ae43b3 in mca_btl_vader_poll_fifo ()
>>at ../../../../../trunk/ompi/mca/btl/vader/btl_vader_component.c:394
>> #2  0x00ae444a in mca_btl_vader_component_progress ()
>>at ../../../../../trunk/ompi/mca/btl/vader/btl_vader_component.c:421
>> #3  0x008fdb95 in opal_progress ()
>>at ../../trunk/opal/runtime/opal_progress.c:186
>> #4  0x001961bc in ompi_request_default_test_some (count=13, 
>>requests=0xb1f01d48, outcount=0xb2afb2d0, indices=0xb1f01f60, 
>>statuses=0xb1f02178) at ../../trunk/ompi/request/req_test.c:316
>> #5  0x001def9a in PMPI_Testsome (incount=13, requests=0xb1f01d48, 
>>outcount=0xb2afb2d0, indices=0xb1f01f60, statuses=0xb1f02178)
>>at ptestsome.c:81
>> 
>> 
>> 
>> 
>> - basesmuma overwrites the memory. The results_array can’t be released as the 
>> memory is corrupted. I did not have time to investigate too much but it 
>> looks like the pload_mgmt->data_bffs is either too small or somehow data is 
>> stored outside its boundaries.
>> 
>> *** glibc detected *** 
>> /home/bosilca/unstable/dplasma/trunk/build/debug/dplasma/testing/testing_dpotrf:
>>  free(): invalid next size (fast): 0x081f0798 ***
>> 
>> (gdb) bt
>> #0  0x00130424 in __kernel_vsyscall ()
>> #1  0x006bfb11 in raise () from /lib/libc.so.6
>> #2  0x006c13ea in abort () from /lib/libc.so.6
>> #3  0x006ff9d5 in __libc_message () from /lib/libc.so.6
>> #4  0x00705e31 in malloc_printerr () from /lib/libc.so.6
>> #5  0x00708571 in _int_free () from /lib/libc.so.6
>> #6  0x00c02d0e in bcol_basesmuma_bank_init_opti (ml_module=0x81dfe60, 
>>bcol_module=0xb30b3008, reg_data=0x81e6698)
>>at 
>> ../../../../../trunk/ompi/mca/bcol/basesmuma/bcol_basesmuma_buf_mgmt.c:472
>> #7  0x00b7035f in mca_coll_ml_register_bcols (ml_module=0x81dfe60)
>>at ../../../../../trunk/ompi/mca/coll/ml/coll_ml_module.c:513
>> #8  0x00b70651 in ml_module_memory_initialization (ml_module=0x81dfe60)
>>at ../../../../../trunk/ompi/mca/coll/ml/coll_ml_module.c:560
>> #9  0x00b733fd in ml_discover_hierarchy (ml_module=0x81dfe60)
>>at ../../../../../trunk/ompi/mca/coll/ml/coll_ml_module.c:1585
>> #10 0x00b7786e in mca_coll_ml_comm_query (comm=0x8127da0, 
>> priority=0xbfffe558)
>>at ../../../../../trunk/ompi/mca/coll/ml/coll_ml_module.c:2998
>> #11 0x00202ea6 in query_2_0_0 (component=0xbc6500, comm=0x8127da0, 
>>priority=0xbfffe558, module=0xbfffe580)
>>at ../../../../trunk/ompi/mca/coll/base/coll_base_comm_select.c:375
>> #12 0x00202e7f in query (component=0xbc6500, comm=0x8127da0, 
>>priority=0xbfffe558, module=0xbfffe580)
>> ---Type <return> to continue, or q <return> to quit---
>>at ../../../../trunk/ompi/mca/coll/base/coll_base_comm_select.c:358
>> #13 0x00202d9e in check_one_component (comm=0x8127da0, component=0xbc6500, 
>>module=0xbfffe580)
>>at ../../../../trunk/ompi/mca/coll/base/coll_base_comm_select.c:320
>> #14 0x00202bce in check_components (components=0x253d70, comm=0x8127da0)
>>at ../../../../trunk/ompi/mca/coll/base/coll_base_comm_select.c:284
>> #15 0x001fbbe1 in mca_coll_base_comm_select (comm=0x8127da0)
>>at ../../../../trunk/ompi/mca/coll/base/coll_base_comm_select.c:117
>> #16 0x0019872f in ompi_mpi_init (argc=7, argv=0xbfffee74, requested=0, 
>>provided=0xbfffe970) at ../../trunk/ompi/runtime/ompi_mpi_init.c:894
>> #17 0x001c9509 in PMPI_Init (argc=0xbfffe9c0, argv=0xbfffe9c4) at pinit.c:84
>> 
>> 
>> 
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/

Re: [OMPI devel] 1.7.4 (and trunk) breakages on 32bits architectures

2014-01-27 Thread Nathan Hjelm
This shouldn't be affecting 1.7.4 since neither the vader nor the coll/ml
updates have been moved yet. As for trunk, I am working on a 32-bit fix
for vader and it should be in later today. I will have to track down
what is going wrong in the basesmuma initialization.

-Nathan

On Sun, Jan 26, 2014 at 04:10:29PM +0100, George Bosilca wrote:
> I noticed two major issues on 32-bit machines. The first one is with the 
> vader BTL and the second with the selection logic in basesmuma 
> (bcol_basesmuma_bank_init_opti). Both versions are impacted: trunk and 1.7.
> 
> If I turn off vader and bcol via the MCA parameters everything runs just fine.
> 
>   George.
> 
> ../trunk/configure --enable-debug --disable-mpi-cxx --disable-mpi-fortran 
> --disable-io-romio --enable-contrib-no-build=vt,libtrace 
> --enable-mpirun-prefix-by-default
> 
> 
> - Vader generates a segfault for any application even with only 2 processes, 
> so this should be pretty easy to track.
> 
> Program received signal SIGSEGV, Segmentation fault.
> (gdb) bt
> #0  0x in ?? ()
> #1  0x00ae43b3 in mca_btl_vader_poll_fifo ()
> at ../../../../../trunk/ompi/mca/btl/vader/btl_vader_component.c:394
> #2  0x00ae444a in mca_btl_vader_component_progress ()
> at ../../../../../trunk/ompi/mca/btl/vader/btl_vader_component.c:421
> #3  0x008fdb95 in opal_progress ()
> at ../../trunk/opal/runtime/opal_progress.c:186
> #4  0x001961bc in ompi_request_default_test_some (count=13, 
> requests=0xb1f01d48, outcount=0xb2afb2d0, indices=0xb1f01f60, 
> statuses=0xb1f02178) at ../../trunk/ompi/request/req_test.c:316
> #5  0x001def9a in PMPI_Testsome (incount=13, requests=0xb1f01d48, 
> outcount=0xb2afb2d0, indices=0xb1f01f60, statuses=0xb1f02178)
> at ptestsome.c:81
> 
> 
> 
> 
> - basesmuma overwrites the memory. The results_array can’t be released as the 
> memory is corrupted. I did not have time to investigate too much but it looks 
> like the pload_mgmt->data_bffs is either too small or somehow data is stored 
> outside its boundaries.
> 
> *** glibc detected *** 
> /home/bosilca/unstable/dplasma/trunk/build/debug/dplasma/testing/testing_dpotrf:
>  free(): invalid next size (fast): 0x081f0798 ***
> 
> (gdb) bt
> #0  0x00130424 in __kernel_vsyscall ()
> #1  0x006bfb11 in raise () from /lib/libc.so.6
> #2  0x006c13ea in abort () from /lib/libc.so.6
> #3  0x006ff9d5 in __libc_message () from /lib/libc.so.6
> #4  0x00705e31 in malloc_printerr () from /lib/libc.so.6
> #5  0x00708571 in _int_free () from /lib/libc.so.6
> #6  0x00c02d0e in bcol_basesmuma_bank_init_opti (ml_module=0x81dfe60, 
> bcol_module=0xb30b3008, reg_data=0x81e6698)
> at 
> ../../../../../trunk/ompi/mca/bcol/basesmuma/bcol_basesmuma_buf_mgmt.c:472
> #7  0x00b7035f in mca_coll_ml_register_bcols (ml_module=0x81dfe60)
> at ../../../../../trunk/ompi/mca/coll/ml/coll_ml_module.c:513
> #8  0x00b70651 in ml_module_memory_initialization (ml_module=0x81dfe60)
> at ../../../../../trunk/ompi/mca/coll/ml/coll_ml_module.c:560
> #9  0x00b733fd in ml_discover_hierarchy (ml_module=0x81dfe60)
> at ../../../../../trunk/ompi/mca/coll/ml/coll_ml_module.c:1585
> #10 0x00b7786e in mca_coll_ml_comm_query (comm=0x8127da0, priority=0xbfffe558)
> at ../../../../../trunk/ompi/mca/coll/ml/coll_ml_module.c:2998
> #11 0x00202ea6 in query_2_0_0 (component=0xbc6500, comm=0x8127da0, 
> priority=0xbfffe558, module=0xbfffe580)
> at ../../../../trunk/ompi/mca/coll/base/coll_base_comm_select.c:375
> #12 0x00202e7f in query (component=0xbc6500, comm=0x8127da0, 
> priority=0xbfffe558, module=0xbfffe580)
> ---Type <return> to continue, or q <return> to quit---
> at ../../../../trunk/ompi/mca/coll/base/coll_base_comm_select.c:358
> #13 0x00202d9e in check_one_component (comm=0x8127da0, component=0xbc6500, 
> module=0xbfffe580)
> at ../../../../trunk/ompi/mca/coll/base/coll_base_comm_select.c:320
> #14 0x00202bce in check_components (components=0x253d70, comm=0x8127da0)
> at ../../../../trunk/ompi/mca/coll/base/coll_base_comm_select.c:284
> #15 0x001fbbe1 in mca_coll_base_comm_select (comm=0x8127da0)
> at ../../../../trunk/ompi/mca/coll/base/coll_base_comm_select.c:117
> #16 0x0019872f in ompi_mpi_init (argc=7, argv=0xbfffee74, requested=0, 
> provided=0xbfffe970) at ../../trunk/ompi/runtime/ompi_mpi_init.c:894
> #17 0x001c9509 in PMPI_Init (argc=0xbfffe9c0, argv=0xbfffe9c4) at pinit.c:84
> 
> 
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel




Re: [OMPI devel] more f77 cruft

2014-01-27 Thread Jeff Squyres (jsquyres)
Good point.  Done.


On Jan 24, 2014, at 6:54 PM, Paul Hargrove  wrote:

> Noticed the following in autogen output:
> 
> === Patching configure for Sun Studio Fortran version strings (_F77)
> 
> While probably harmless, my own inclination would be to apply as few patches 
> to the generated configure script as possible.  If libtool's F77 tag is 
> actually unused then perhaps this can be removed?
> 
> This is NOT a suggestion for v1.7.
> 
> -Paul
> 
> -- 
> Paul H. Hargrove  phhargr...@lbl.gov
> Future Technologies Group
> Computer and Data Sciences Department Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/