Re: [OMPI devel] singleton appears to be broken

2014-02-07 Thread Ralph Castain
Think I see the code path that causes this - I'll have to play with it a little 
as the race condition is biased heavily towards success, so (as you noted) it 
won't happen very often.
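
Since the failure shows up so rarely (George reports one hit in 150 runs in the quoted text below), a small loop may help when chasing it. This is only a sketch: the helper name count_failures is made up for illustration, and ./hello stands for the singleton binary George built.

```shell
#!/bin/sh
# count_failures RUNS CMD [ARGS...]
# Hypothetical helper: run CMD the given number of times and print how many
# of those runs exited with a non-zero status.
count_failures() {
    runs=$1
    shift
    failures=0
    i=0
    while [ "$i" -lt "$runs" ]; do
        # A failed assert aborts the process, so the exit status is non-zero.
        if ! "$@" >/dev/null 2>&1; then
            failures=$((failures + 1))
        fi
        i=$((i + 1))
    done
    echo "$failures"
}

# Example (not run here): count_failures 150 ./hello
```

Output is discarded inside the loop to keep the count readable; drop the redirection to see the assert message and stack trace of a failing run.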

On Feb 6, 2014, at 6:38 PM, Ralph Castain  wrote:

> Interesting - does it happen in finalize, or in the middle of execution?
> 
> 
> On Feb 6, 2014, at 5:57 PM, George Bosilca  wrote:
> 
>> Out of 150 runs I could reproduce it once. When it failed I got exactly the 
>> same assert:
>> 
>> hello: ../../../../ompi/orte/mca/rml/base/rml_base_msg_handlers.c:75: 
>> orte_rml_base_post_recv: Assertion `((0xdeafbeedULL << 32) + 0xdeafbeedULL) 
>> == ((opal_object_t *) (recv))->obj_magic_id' failed.
>> 
>> A quick look at the code indicates it is in a rather obscure execution path, 
>> when one cancels a pending receive. The assert indicates that the receive 
>> object was already destroyed (somewhere else) when it got removed from the 
>> orte_rml_base.posted_recvs queue.
>> 
>> George.
>> 
>> 
>> On Feb 7, 2014, at 02:22 , George Bosilca  wrote:
>> 
>>> A rather long configure line:
>>> 
>>> ./configure --enable-picky --enable-debug --enable-coverage 
>>> --disable-heterogeneous --enable-visibility --enable-contrib-no-build=vt 
>>> --enable-mpirun-prefix-by-default --disable-mpi-cxx --with-cma 
>>> --enable-static 
>>> --enable-mca-no-build=plm-tm,ess-tm,ras-tm,plm-tm,ras-slurm,ess-slurm,plm-slurm,btl-sctp
>>> 
>>> And hello_world.c from ompi-tests compiled using: 
>>> mpicc -g --coverage hello.c -o hello
>>> 
>>> George.
>>> 
>>> 
>>> On Feb 7, 2014, at 01:11 , Ralph Castain  wrote:
>>> 
>>>> Oh, should have noted: that's on both trunk and 1.7.4
>>>> 
>>>> On Feb 6, 2014, at 4:10 PM, Ralph Castain  wrote:
>>>> 
>>>>> Works for me on Mac and Linux/Centos6.2 as well
>>>>> 
>>>>> 
>>>>> On Feb 6, 2014, at 4:00 PM, Jeff Squyres (jsquyres)  
>>>>> wrote:
>>>>> 
>>>>>> I'm unable to replicate on Linux/RHEL/64 bit with a trunk build.  How 
>>>>>> did you configure?  Here's my configure:
>>>>>> 
>>>>>> ./configure --prefix=/home/jsquyres/bogus --disable-vt 
>>>>>> --enable-mpirun-prefix-by-default --disable-mpi-fortran
>>>>>> 
>>>>>> Does this happen with every run?
>>>>>> 
>>>>>> 
>>>>>> On Feb 6, 2014, at 6:53 PM, George Bosilca  wrote:
>>>>>> 
>>>>>>> A singleton hello_world assert with the following output:
>>>>>>> 
>>>>>>> Warning :: opal_list_remove_item - the item 0x1211fc0 is not on the 
>>>>>>> list 0x7f2cd9161ae0
>>>>>>> hello: ../../../../ompi/orte/mca/rml/base/rml_base_msg_handlers.c:75: 
>>>>>>> orte_rml_base_post_recv: Assertion `((0xdeafbeedULL << 32) + 
>>>>>>> 0xdeafbeedULL) == ((opal_object_t *) (recv))->obj_magic_id' failed.
>>>>>>> [dancer:00698] *** Process received signal ***
>>>>>>> [dancer:00698] Signal: Aborted (6)
>>>>>>> [dancer:00698] Signal code:  (-6)
>>>>>>> [dancer:00698] [ 0] /lib64/libpthread.so.0[0x3d8480f710]
>>>>>>> [dancer:00698] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x3d83c32925]
>>>>>>> [dancer:00698] [ 2] /lib64/libc.so.6(abort+0x175)[0x3d83c34105]
>>>>>>> [dancer:00698] [ 3] /lib64/libc.so.6[0x3d83c2ba4e]
>>>>>>> [dancer:00698] [ 4] 
>>>>>>> /lib64/libc.so.6(__assert_perror_fail+0x0)[0x3d83c2bb10]
>>>>>>> [dancer:00698] [ 5] 
>>>>>>> /home/bosilca/opt/trunk/lib/libopen-rte.so.0(orte_rml_base_post_recv+0x252)[0x7f2cd8e76d55]
>>>>>>> [dancer:00698] [ 6] 
>>>>>>> /home/bosilca/opt/trunk/lib/libopen-pal.so.0(+0xcca5d)[0x7f2cd89e8a5d]
>>>>>>> [dancer:00698] [ 7] 
>>>>>>> /home/bosilca/opt/trunk/lib/libopen-pal.so.0(+0xcce53)[0x7f2cd89e8e53]
>>>>>>> [dancer:00698] [ 8] 
>>>>>>> /home/bosilca/opt/trunk/lib/libopen-pal.so.0(opal_libevent2021_event_base_loop+0x4eb)[0x7f2cd89e99ea]
>>>>>>> [dancer:00698] [ 9] 
>>>>>>> /home/bosilca/opt/trunk/lib/libopen-rte.so.0(+0x28725)[0x7f2cd8d1b725]
>>>>>>> [dancer:00698] [10] /lib64/libpthread.so.0[0x3d848079d1]
>>>>>>> [dancer:00698] [11] /lib64/libc.so.6(clone+0x6d)[0x3d83ce8b6d]
>>>>>>> [dancer:00698] *** End of error message ***
>>>>>>> 
>>>>>>> The same executable run via mpirun with a single process succeeds. This 
>>>>>>> is with trunk; I did not try with the release.
>>>>>>> 
>>>>>>> George.
>>>>>>> ___
>>>>>>> devel mailing list
>>>>>>> de...@open-mpi.org
>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>>> 
>>>>>> 
>>>>>> -- 
>>>>>> Jeff Squyres
>>>>>> jsquy...@cisco.com
>>>>>> For corporate legal information go to: 
>>>>>> http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] singleton appears to be broken

2014-02-07 Thread George Bosilca
It is difficult to see it from the stack trace, as it happens in the ORTE 
threads. But I do have all the output I expect, and as the application I was 
running is hello_world I’m almost certain it happens during MPI_Finalize.

  George.

On Feb 7, 2014, at 03:38 , Ralph Castain  wrote:

> Interesting - does it happen in finalize, or in the middle of execution?

Re: [OMPI devel] C/R and orte_oob

2014-02-07 Thread Josh Hursey
In the original implementation, the OOB ft_event did not do much of
anything on checkpoint preparation and continue. We did not even close the
sockets. However, during restart the OOB will need to renegotiate the
socket connections - usually by calling the finalization function (close
stale sockets) and then the initialization function (create new sockets) of
that component.

I'm not sure if that is still an acceptable approach or not. We will want
the OOB to be quieted across the checkpoint preparation and continue so
that we don't lose any message or have messages cross the checkpoint line.
So this may be something to return to in the next pass.




On Thu, Feb 6, 2014 at 4:45 PM, Ralph Castain  wrote:

>
> On Feb 6, 2014, at 2:16 PM, Adrian Reber  wrote:
>
> > Josh explained to me a few days ago that, after a checkpoint has been
> > received, TCP should no longer be used so that no messages are lost. The
> > communication happens over named pipes and therefore (I think) OOB
> > ft_event() is used to quiet anything besides the pipes. This all seems
> > to work but I was just confused as the functions for ft_event()
> > in oob/tcp and oob/ud do not seem to contain any functionality.
> >
> > So do I try to fix the ft_event() function in oob/base/ to call the
> > registered ft_event() function which does nothing or do I just remove
> > the call to orte oob ft_event()?
>
> Sounds like you'll need to tell the OOB components to stop processing
> messages, so that will require that you insert an event into the system.
> You have to account for two things:
>
> (a) the OOB base and OOB components are operating on the orte_event_base,
> but
>
> (b) each OOB component can have multiple active modules (one per NIC) that
> are operating on their own event base/thread.
>
> So you have to start by pushing an event that calls the OOB base, which
> then loops across the components calling their ft_event interface. Each
> component would then have to create an event for each active module,
> inserting that event into the module's event base/thread. When activated,
> each module would have to shutdown its message engine, and activate another
> event to notify its component that all is quiet.
>
> Once a component finds out that all its modules are quiet, it would then
> have to activate an event to the OOB base. Once the OOB base sees all
> components report quiet, then it would have to activate an event to take
> you to the next step in your process.
>
> In other words, you need to turn the quieting process into its own set of
> states and run it through the state machine. This is the only way to
> guarantee that you'll keep things orderly, and is the major change needed
> in the C/R procedure as it flows thru ORTE. You can't just progress thru a
> set of function calls as you'll inevitably run into a roadblock requiring
> that you wait for an event-driven process to complete.
>
> HTH
> Ralph
>
> >
> > On Thu, Feb 06, 2014 at 10:49:25AM -0800, Ralph Castain wrote:
> >> The only reason I can think of for an OOB ft-event would be to tell the
> >> OOB to stop sending any messages. You would need to push that into the
> >> event library and use a callback event to let you know when it was done.
> >>
> >> Of course, once you did that, the OOB would no longer be available to,
> >> for example, tell the local daemon that the app is ready for checkpoint :-)
> >>
> >> Afraid I'll have to defer to Josh H for any further guidance.
> >>
> >>
> >> On Feb 6, 2014, at 8:15 AM, Adrian Reber  wrote:
> >>
> >>> When I initially made the C/R code compile again I made following
> >>> change:
> >>>
> >>> diff --git a/orte/mca/rml/oob/rml_oob_component.c b/orte/mca/rml/oob/rml_oob_component.c
> >>> index f0b22fc..90ed086 100644
> >>> --- a/orte/mca/rml/oob/rml_oob_component.c
> >>> +++ b/orte/mca/rml/oob/rml_oob_component.c
> >>> @@ -185,8 +185,7 @@ orte_rml_oob_ft_event(int state) {
> >>>          ;
> >>>      }
> >>>  
> >>> -    if( ORTE_SUCCESS !=
> >>> -        (ret = orte_oob.ft_event(state)) ) {
> >>> +    if( ORTE_SUCCESS != (ret = orte_rml_oob_ft_event(state)) ) {
> >>>          ORTE_ERROR_LOG(ret);
> >>>          exit_status = ret;
> >>>          goto cleanup;
> >>>
> >>>
> >>>
> >>> This is, of course, wrong. Now the function calls itself in a loop
> >>> until it crashes. Looking at orte/mca/oob there is still a ft_event()
> >>> function, but it is disabled using "#if 0". Looking at other functions
> >>> it seems I would need to create something like
> >>>
> >>> #define ORTE_OOB_FT_EVENT(m)
> >>>
> >>> Looking at the modules in orte/mca/oob/ it seems ft_event is
> >>> implemented in some places but it never seems to have any real
> >>> functionality. Is ft_event() actually needed there?
> >>>
> >>> Adrian

Re: [OMPI devel] Bcol/mcol violations

2014-02-07 Thread Nathan Hjelm
How is this a problem in 1.7? We don't have a .ompi_ignore in
1.7.4. That is there to prevent mtt failures while I fix some
outstanding bcol issues.

I will clean this up on trunk and add it to the cmr.

-Nathan

On Thu, Feb 06, 2014 at 08:42:27PM -0800, Ralph Castain wrote:
> As many of you will have noticed, I have been struggling most of the evening 
> with breakage on the trunk. This was initiated by adding .ompi_ignore to the 
> coll/ml component, but the root cause of the problem is a blatant disregard 
> for OMPI design rules in the bcol framework. Component-level headers from the 
> coll/ml area have been included in multiple places throughout the bcol 
> framework, making it impossible to separate these two areas.
> 
> Unfortunately, this problem has now been propagated to the 1.7 branch. As 
> release manager, I'm afraid that places me in a difficult position, and I'm 
> going to have to insist that this either is fixed immediately (i.e., in next 
> 24 hours), or I have to rescind/delete that area from the 1.7 branch and 
> release an immediate 1.7.5 (with attendant apologies to the community for the 
> screwup). We will then proceed with our intended plan, minus the bcol code.
> 
> I'd appreciate someone letting me know if this problem (a) can even be fixed, 
> given the degree of cross-connection I see in the bcol code, and (b) if it 
> can, then by when.
> 
> Thanks
> Ralph




Re: [OMPI devel] Bcol/mcol violations

2014-02-07 Thread Ralph Castain
The issue in 1.7 is all the cross-integration, which means we violate our 
normal behavior when it comes to no-building and user-directed component 
selection. Jeff and I just discussed how this could be resolved using the 
PML-BTL model, but (a) that is not what we have in 1.7, and (b) it isn't clear 
to me how hard it will be to do, and when it might be ready.

However, we don't have the problem of incorrect results that we do in the 
trunk, so we do have a little more latitude.

So the situation with respect to 1.7 is pretty clear: if we can get a PML-BTL 
model in place within the next week, then we can let things continue as-is. If 
we can't, then we remove the coll/ml component and the bcol framework from 1.7, 
leaving the door open to reinstatement whenever the code is actually ready.


On Feb 7, 2014, at 7:37 AM, Nathan Hjelm  wrote:

> How is this a problem in 1.7? We don't have a .ompi_ignore in
> 1.7.4. That is there to prevent mtt failures while I fix some
> outstanding bcol issues.
> 
> I will clean this up on trunk and add it to the cmr.
> 
> -Nathan
> 
> On Thu, Feb 06, 2014 at 08:42:27PM -0800, Ralph Castain wrote:
>> As many of you will have noticed, I have been struggling most of the evening 
>> with breakage on the trunk. This was initiated by adding .ompi_ignore to the 
>> coll/ml component, but the root cause of the problem is a blatant disregard 
>> for OMPI design rules in the bcol framework. Component-level headers from 
>> the coll/ml area have been included in multiple places throughout the bcol 
>> framework, making it impossible to separate these two areas.
>> 
>> Unfortunately, this problem has now been propagated to the 1.7 branch. As 
>> release manager, I'm afraid that places me in a difficult position, and I'm 
>> going to have to insist that this either is fixed immediately (i.e., in next 
>> 24 hours), or I have to rescind/delete that area from the 1.7 branch and 
>> release an immediate 1.7.5 (with attendant apologies to the community for 
>> the screwup). We will then proceed with our intended plan, minus the bcol 
>> code.
>> 
>> I'd appreciate someone letting me know if this problem (a) can even be 
>> fixed, given the degree of cross-connection I see in the bcol code, and (b) 
>> if it can, then by when.
>> 
>> Thanks
>> Ralph
>> 



Re: [OMPI devel] Bcol/mcol violations

2014-02-07 Thread Nathan Hjelm
On Fri, Feb 07, 2014 at 07:46:03AM -0800, Ralph Castain wrote:
> The issue in 1.7 is all the cross-integration, which means we violate our 
> normal behavior when it comes to no-building and user-directed component 
> selection. Jeff and I just discussed how this could be resolved using the 
> PML-BTL model, but (a) that is not what we have in 1.7, and (b) it isn't 
> clear to me how hard it will be to do, and when it might be ready.
> 
> However, we don't have the problem of incorrect results that we do in the 
> trunk, so we do have a little more latitude.
> 
> So the situation with respect to 1.7 is pretty clear: if we can get a PML-BTL 
> model in place within the next week, then we can let things continue as-is. 
> If we can't, then we remove the coll/ml component and the bcol framework from 
> 1.7, leaving the door open to reinstatement whenever the code is actually 
> ready.

Should be ready today. The use of that coll/ml structure is unnecessary
at this time. I am removing it in bcol right now. In the future we will
put in a better fix but this should work for 1.7.x/1.8.x.

-Nathan




Re: [OMPI devel] Bcol/mcol violations

2014-02-07 Thread Ralph Castain

On Feb 7, 2014, at 7:52 AM, Nathan Hjelm  wrote:

> On Fri, Feb 07, 2014 at 07:46:03AM -0800, Ralph Castain wrote:
>> The issue in 1.7 is all the cross-integration, which means we violate our 
>> normal behavior when it comes to no-building and user-directed component 
>> selection. Jeff and I just discussed how this could be resolved using the 
>> PML-BTL model, but (a) that is not what we have in 1.7, and (b) it isn't 
>> clear to me how hard it will be to do, and when it might be ready.
>> 
>> However, we don't have the problem of incorrect results that we do in the 
>> trunk, so we do have a little more latitude.
>> 
>> So the situation with respect to 1.7 is pretty clear: if we can get a 
>> PML-BTL model in place within the next week, then we can let things continue 
>> as-is. If we can't, then we remove the coll/ml component and the bcol 
>> framework from 1.7, leaving the door open to reinstatement whenever the code 
>> is actually ready.
> 
> Should be ready today. The use of that coll/ml structure is unnecessary
> at this time. I am removing it in bcol right now. In the future we will
> put in a better fix but this should work for 1.7.x/1.8.x.

Kewl - thanks!

> 
> -Nathan



Re: [OMPI devel] Bcol/mcol violations

2014-02-07 Thread Jeff Squyres (jsquyres)
On Feb 7, 2014, at 10:52 AM, Nathan Hjelm  wrote:

> Should be ready today. The use of that coll/ml structure is unnecessary
> at this time. I am removing it in bcol right now. In the future we will
> put in a better fix but this should work for 1.7.x/1.8.x.


Sweet.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



[OMPI devel] openmpi installation

2014-02-07 Thread Talla
Hello sir
I downloaded openmpi 1.7 and followed the installation instructions:
cd openmpi
./configure --prefix="/home/$USER/.openmpi"

make
make install
export PATH="$PATH:/home/$USER/.openmpi/bin"
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/home/$USER/.openmpi/lib/"

echo export PATH="$PATH:/home/$USER/.openmpi/bin" >> /home/$USER/.bashrc
echo export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/home/$USER/.openmpi/lib/" >> /home/$USER/.bashrc

No error messages appear, except "nothing to do with all-em". However, when
I run mpicc or mpirun it says no command. So I am not sure whether MPI was
installed correctly on my Red Hat server or not. I don't know what I am
missing, so I would really appreciate your help, as I have been struggling
with this for a while.

Thank you in advance. Talla


-- 

##
Dr. Jamal A Talla
Assistant professor
Department of Physics, Rm 2139
College of Science,09
King Faisal University
P.O. Box 380, Al-Ahsaa - 31982
City Code: HOF
Kingdom of Saudi Arabia
Cell Phone: +966564542399


Re: [OMPI devel] openmpi installation

2014-02-07 Thread Ralph Castain
Well, it certainly looks okay - try doing "ls" in your prefix directory. Do you 
see the bin and lib directories there? Anything in them?
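
For what it's worth, that check could be scripted roughly as follows. This is a sketch only: the function name check_prefix is made up, and the example path assumes the prefix from the quoted configure line sits under $HOME.

```shell
#!/bin/sh
# check_prefix DIR
# Hypothetical helper: report whether DIR/bin and DIR/lib exist and contain
# at least one entry. Returns non-zero if either is missing or empty.
check_prefix() {
    status=0
    for sub in bin lib; do
        dir="$1/$sub"
        if [ -d "$dir" ] && [ -n "$(ls -A "$dir" 2>/dev/null)" ]; then
            echo "$dir: populated"
        else
            echo "$dir: missing or empty"
            status=1
        fi
    done
    return "$status"
}

# Example (not run here): check_prefix "$HOME/.openmpi"
```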

On Feb 7, 2014, at 8:37 AM, Talla  wrote:

> Hello sir
> I downloaded openmpi 1.7 and followed the installation instructions:
> cd openmpi
> ./configure --prefix="/home/$USER/.openmpi"
> 
> make
> make install
> export PATH="$PATH:/home/$USER/.openmpi/bin"
> export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/home/$USER/.openmpi/lib/"
> 
> echo export PATH="$PATH:/home/$USER/.openmpi/bin" >> /home/$USER/.bashrc
> echo export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/home/$USER/.openmpi/lib/"
> >> /home/$USER/.bashrc
> 
> No error messages appear, except "nothing to do with all-em". However when I 
> run mpicc or mpirun it says no command. So I am not sure if the mpi installed 
> correctly in my red hat server or not. I don't know what I am missing so I 
> would really appreciate it if you help me as I am struggling with this for a 
> while. 
>  
> Thank you in advance. Talla
> 
> 
> 
> -- 
> ##
> Dr. Jamal A Talla
> Assistant professor
> Department of Physics, Rm 2139
> College of Science,09 
> King Faisal University
> P.O. Box 380, Al-Ahsaa - 31982
> City Code: HOF
> Kingdom of Saudi Arabia
> Cell Phone: +966564542399



Re: [OMPI devel] openmpi installation

2014-02-07 Thread Talla
Thank you for considering my case seriously.
Yes sir, both directories, along with other directories, are there with files
in them. But I still feel I am missing something, and I am not sure what it
is. How can I check Open MPI? Neither mpirun nor mpicc is responding. Any
instructions on how to run parallel jobs, or examples with instructions,
would be highly appreciated.
Regards.


On Fri, Feb 7, 2014 at 7:42 PM, Ralph Castain  wrote:

> Well, it certainly looks okay - try doing "ls" in your prefix directory.
> Do you see the bin and lib directories there? Anything in them?
>
> On Feb 7, 2014, at 8:37 AM, Talla  wrote:
>
> Hello sir
> I downloaded openmpi 1.7 and followed the installation instructions:
> cd openmpi
> ./configure --prefix="/home/$USER/.openmpi"
>
> make
> make install
> export PATH="$PATH:/home/$USER/.openmpi/bin"
> export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/home/$USER/.openmpi/lib/"
>
> echo export PATH="$PATH:/home/$USER/.openmpi/bin" >> /home/$USER/.bashrc
> echo export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/home/$USER/.openmpi/lib/"
> >> /home/$USER/.bashrc
>
> No error messages appear, except "nothing to do with all-em". However when
> I run mpicc or mpirun it says no command. So I am not sure if the mpi
> installed correctly in my red hat server or not. I don't know what I am
> missing so I would really appreciate it if you help me as I am struggling
> with this for a while.
>
> Thank you in advance. Talla
>
>
> --
> ##
> Dr. Jamal A Talla
> Assistant professor
> Department of Physics, Rm 2139
> College of Science,09
> King Faisal University
> P.O. Box 380, Al-Ahsaa - 31982
> City Code: HOF
> Kingdom of Saudi Arabia
> Cell Phone: +966564542399



-- 

##
Dr. Jamal A Talla
Assistant professor
Department of Physics, Rm 2139
College of Science,09
King Faisal University
P.O. Box 380, Al-Ahsaa - 31982
City Code: HOF
Kingdom of Saudi Arabia
Cell Phone: +966564542399


Re: [OMPI devel] openmpi installation

2014-02-07 Thread Ralph Castain
If the directories are there and populated, then the problem is likely with 
your path. Do this:

1. "which mpirun" - if you don't see your <prefix>/bin, then you know your path 
is wrong

2. "printenv PATH" - is it what you expected?

We generally suggest that you put your <prefix>/bin and <prefix>/lib at the 
beginning of their respective envars, as most OS distributions come with their 
own versions, and you want to be sure to pick up your installed version first.
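
As a rough illustration of putting the install directories first, something like the following could go in a shell session or startup file. It is a sketch under assumptions: prepend_path is a made-up helper, and the prefix is taken to be $HOME/.openmpi, matching the configure line quoted below only if the home directory is /home/$USER.

```shell
#!/bin/sh
# prepend_path DIR CURRENT_VALUE
# Hypothetical helper: print DIR placed in front of CURRENT_VALUE, without
# leaving a dangling ":" when CURRENT_VALUE is empty.
prepend_path() {
    if [ -n "$2" ]; then
        printf '%s:%s\n' "$1" "$2"
    else
        printf '%s\n' "$1"
    fi
}

prefix="$HOME/.openmpi"   # assumed install prefix; adjust to your --prefix
PATH=$(prepend_path "$prefix/bin" "$PATH")
LD_LIBRARY_PATH=$(prepend_path "$prefix/lib" "${LD_LIBRARY_PATH:-}")
export PATH LD_LIBRARY_PATH

# Show which mpirun now wins the lookup, if any.
command -v mpirun || echo "mpirun not found in PATH"
```

If the printed path is not under the prefix, or nothing is found, the PATH setting is the first thing to revisit.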


On Feb 7, 2014, at 8:54 AM, Talla  wrote:

> Thank you for considering my case seriously. 
> yes sir both directories along with other directories are there with files in 
> them. But still I feel I am missing something not sure what it is. how I can 
> check Open Mpi? mpirun is not responding not even mpicc ? any instruction how 
> to run parallel jobs , examples with instruction any help is highly 
> appreciated.
> Regards. 
> 
> 
> On Fri, Feb 7, 2014 at 7:42 PM, Ralph Castain  wrote:
> Well, it certainly looks okay - try doing "ls" in your prefix directory. Do 
> you see the bin and lib directories there? Anything in them?
> 
> On Feb 7, 2014, at 8:37 AM, Talla  wrote:
> 
>> Hello sir
>> I downloaded openmpi 1.7 and followed the installation instructions:
>> cd openmpi
>> ./configure --prefix="/home/$USER/.openmpi"
>> 
>> make
>> make install
>> export PATH="$PATH:/home/$USER/.openmpi/bin"
>> export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/home/$USER/.openmpi/lib/"
>> 
>> echo export PATH="$PATH:/home/$USER/.openmpi/bin" >> /home/$USER/.bashrc
>> echo export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/home/$USER/.openmpi/lib/"
>> >> /home/$USER/.bashrc
>> 
>> No error messages appear, except "nothing to do with all-em". However when I 
>> run mpicc or mpirun it says no command. So I am not sure if the mpi 
>> installed correctly in my red hat server or not. I don't know what I am 
>> missing so I would really appreciate it if you help me as I am struggling 
>> with this for a while. 
>>  
>> Thank you in advance. Talla
>> 
>> 
>> 
>> -- 
>> ##
>> Dr. Jamal A Talla
>> Assistant professor
>> Department of Physics, Rm 2139
>> College of Science,09 
>> King Faisal University
>> P.O. Box 380, Al-Ahsaa - 31982
>> City Code: HOF
>> Kingdom of Saudi Arabia
>> Cell Phone: +966564542399
> 
> 
> 
> 
> 
> -- 
> ##
> Dr. Jamal A Talla
> Assistant professor
> Department of Physics, Rm 2139
> College of Science,09 
> King Faisal University
> P.O. Box 380, Al-Ahsaa - 31982
> City Code: HOF
> Kingdom of Saudi Arabia
> Cell Phone: +966564542399



Re: [OMPI devel] Bcol/mcol violations

2014-02-07 Thread Shamis, Pavel
Can you please give the attached hot-fix a try?
It unrolls most of the spaghetti, except the iboffload component (which is 
disabled anyway).
Sorry for the mess.

Best,
Pasha

On Feb 7, 2014, at 10:52 AM, Nathan Hjelm <hje...@lanl.gov> wrote:

> On Fri, Feb 07, 2014 at 07:46:03AM -0800, Ralph Castain wrote:
>> The issue in 1.7 is all the cross-integration, which means we violate our 
>> normal behavior when it comes to no-building and user-directed component 
>> selection. Jeff and I just discussed how this could be resolved using the 
>> PML-BTL model, but (a) that is not what we have in 1.7, and (b) it isn't clear 
>> to me how hard it will be to do, and when it might be ready.
>> 
>> However, we don't have the problem of incorrect results that we do in the 
>> trunk, so we do have a little more latitude.
>> 
>> So the situation with respect to 1.7 is pretty clear: if we can get a PML-BTL 
>> model in place within the next week, then we can let things continue as-is. If 
>> we can't, then we remove the coll/ml component and the bcol framework from 1.7, 
>> leaving the door open to reinstatement whenever the code is actually ready.
> 
> Should be ready today. The use of that coll/ml structure is unnecessary
> at this time. I am removing it in bcol right now. In the future we will
> put in a better fix but this should work for 1.7.x/1.8.x.
> 
> -Nathan


p4.patch
Description: p4.patch


Re: [OMPI devel] Bcol/mcol violations

2014-02-07 Thread Nathan Hjelm
Hah. You beat me to it. More or less identical to what I was doing. I
will give this a try. If this works we should push it and add it to the
coll/ml cmr.

-Nathan

On Fri, Feb 07, 2014 at 12:14:02PM -0500, Shamis, Pavel wrote:
> Can you please give a try to the attached hot-fix.
> It unrolls most of the spaghetti, except the iboffload component (which is 
> anyway disabled).
> Sorry for the mess.
> 
> Best,
> Pasha
> 
> On Feb 7, 2014, at 10:52 AM, Nathan Hjelm <hje...@lanl.gov> wrote:
> 
> On Fri, Feb 07, 2014 at 07:46:03AM -0800, Ralph Castain wrote:
> The issue in 1.7 is all the cross-integration, which means we violate our 
> normal behavior when it comes to no-building and user-directed component 
> selection. Jeff and I just discussed how this could be resolved using the 
> PML-BTL model, but (a) that is not what we have in 1.7, and (b) it isn't 
> clear to me how hard it will be to do, and when it might be ready.
> 
> However, we don't have the problem of incorrect results that we do in the 
> trunk, so we do have a little more latitude.
> 
> So the situation with respect to 1.7 is pretty clear: if we can get a PML-BTL 
> model in place within the next week, then we can let things continue as-is. 
> If we can't, then we remove the coll/ml component and the bcol framework from 
> 1.7, leaving the door open to reinstatement whenever the code is actually 
> ready.
> 
> Should be ready today. The use of that coll/ml structure is unnecessary
> at this time. I am removing it in bcol right now. In the future we will
> put in a better fix but this should work for 1.7.x/1.8.x.
> 
> -Nathan





Re: [OMPI devel] Bcol/mcol violations

2014-02-07 Thread Nathan Hjelm
Can you gzip the patch. The local exchange server has a habit of
converting LF to CRLF.

-Nathan

On Fri, Feb 07, 2014 at 12:14:02PM -0500, Shamis, Pavel wrote:
> Can you please give a try to the attached hot-fix.
> It unrolls most of the spaghetti, except the iboffload component (which is 
> anyway disabled).
> Sorry for the mess.
> 
> Best,
> Pasha
> 
> On Feb 7, 2014, at 10:52 AM, Nathan Hjelm 
> mailto:hje...@lanl.gov>> wrote:
> 
> On Fri, Feb 07, 2014 at 07:46:03AM -0800, Ralph Castain wrote:
> The issue in 1.7 is all the cross-integration, which means we violate our 
> normal behavior when it comes to no-building and user-directed component 
> selection. Jeff and I just discussed how this could be resolved using the 
> PML-BTL model, but (a) that is not what we have in 1.7, and (b) it isn't 
> clear to me how hard it will be to do, and when it might be ready.
> 
> However, we don't have the problem of incorrect results that we do in the 
> trunk, so we do have a little more latitude.
> 
> So the situation with respect to 1.7 is pretty clear: if we can get a PML-BTL 
> model in place within the next week, then we can let things continue as-is. 
> If we can't, then we remove the coll/ml component and the bcol framework from 
> 1.7, leaving the door open to reinstatement whenever the code is actually 
> ready.
> 
> Should be ready today. The use of that coll/ml structure is unnecessary
> at this time. I am removing it in bcol right now. In the future we will
> put in a better fix but this should work for 1.7.x/1.8.x.
> 
> -Nathan





Re: [OMPI devel] Bcol/mcol violations

2014-02-07 Thread Shamis, Pavel
Exchange is evil….
Attached.

Best,
P




p4.patch.gz
Description: p4.patch.gz


On Feb 7, 2014, at 12:41 PM, Nathan Hjelm wrote:

> Can you gzip the patch. The local exchange server has a habit of
> converting LF to CRLF.
>
> -Nathan
>
> On Fri, Feb 07, 2014 at 12:14:02PM -0500, Shamis, Pavel wrote:
>> Can you please give a try to the attached hot-fix.
>> It unrolls most of the spaghetti, except the iboffload component (which is
>> anyway disabled).
>> Sorry for the mess.
>>
>> Best,
>> Pasha
>>
>> On Feb 7, 2014, at 10:52 AM, Nathan Hjelm wrote:
>>
>> On Fri, Feb 07, 2014 at 07:46:03AM -0800, Ralph Castain wrote:
>> The issue in 1.7 is all the cross-integration, which means we violate our
>> normal behavior when it comes to no-building and user-directed component
>> selection. Jeff and I just discussed how this could be resolved using the
>> PML-BTL model, but (a) that is not what we have in 1.7, and (b) it isn't
>> clear to me how hard it will be to do, and when it might be ready.
>>
>> However, we don't have the problem of incorrect results that we do in the
>> trunk, so we do have a little more latitude.
>>
>> So the situation with respect to 1.7 is pretty clear: if we can get a
>> PML-BTL model in place within the next week, then we can let things
>> continue as-is. If we can't, then we remove the coll/ml component and the
>> bcol framework from 1.7, leaving the door open to reinstatement whenever
>> the code is actually ready.
>>
>> Should be ready today. The use of that coll/ml structure is unnecessary
>> at this time. I am removing it in bcol right now. In the future we will
>> put in a better fix but this should work for 1.7.x/1.8.x.
>>
>> -Nathan
>
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

Re: [OMPI devel] Bcol/mcol violations

2014-02-07 Thread Ralph Castain
Thank you guys - deeply appreciate your quick resolution to this problem


On Feb 7, 2014, at 10:10 AM, Shamis, Pavel  wrote:

> Exchange is evil….
> Attached.
> 
> Best,
> P
> 
> 
> 
> 
> 
> 
> On Feb 7, 2014, at 12:41 PM, Nathan Hjelm  wrote:
> 
>> Can you gzip the patch. The local exchange server has a habit of
>> converting LF to CRLF.
>> 
>> -Nathan
>> 
>> On Fri, Feb 07, 2014 at 12:14:02PM -0500, Shamis, Pavel wrote:
>>> Can you please give a try to the attached hot-fix.
>>> It unrolls most of the spaghetti, except the iboffload component (which is 
>>> anyway disabled).
>>> Sorry for the mess.
>>> 
>>> Best,
>>> Pasha
>>> 
>>> On Feb 7, 2014, at 10:52 AM, Nathan Hjelm 
>>> mailto:hje...@lanl.gov>> wrote:
>>> 
>>> On Fri, Feb 07, 2014 at 07:46:03AM -0800, Ralph Castain wrote:
>>> The issue in 1.7 is all the cross-integration, which means we violate our 
>>> normal behavior when it comes to no-building and user-directed component 
>>> selection. Jeff and I just discussed how this could be resolved using the 
>>> PML-BTL model, but (a) that is not what we have in 1.7, and (b) it isn't 
>>> clear to me how hard it will be to do, and when it might be ready.
>>> 
>>> However, we don't have the problem of incorrect results that we do in the 
>>> trunk, so we do have a little more latitude.
>>> 
>>> So the situation with respect to 1.7 is pretty clear: if we can get a 
>>> PML-BTL model in place within the next week, then we can let things 
>>> continue as-is. If we can't, then we remove the coll/ml component and the 
>>> bcol framework from 1.7, leaving the door open to reinstatement whenever 
>>> the code is actually ready.
>>> 
>>> Should be ready today. The use of that coll/ml structure is unnecessary
>>> at this time. I am removing it in bcol right now. In the future we will
>>> put in a better fix but this should work for 1.7.x/1.8.x.
>>> 
>>> -Nathan



[OMPI devel] new CRS component added (criu)

2014-02-07 Thread Adrian Reber
I have created a new CRS component using criu (criu.org) to support
checkpoint/restart in Open MPI. My current patch only provides the
framework and necessary configure scripts to detect and link against
criu. With this patch orte-checkpoint can request a checkpoint and the
new CRIU CRS component is used:

[dcbz:13766] orte_cr: init: orte_cr_init()
[dcbz:13766] crs:criu: opal_crs_criu_prelaunch
[dcbz:13766] crs:criu: opal_crs_criu_prelaunch
[dcbz:13771] opal_cr: init: Verbose Level: 30
[dcbz:13771] opal_cr: init: FT Enabled: true
[dcbz:13771] opal_cr: init: Is a tool program: false
[dcbz:13771] opal_cr: init: Debug SIGPIPE: 30 (False)
[dcbz:13771] opal_cr: init: Checkpoint Signal: 10
[dcbz:13771] opal_cr: init: FT Use thread: true
[dcbz:13771] opal_cr: init: FT thread sleep: check = 0, wait = 100
[dcbz:13771] opal_cr: init: C/R Debugging Enabled [False]
[dcbz:13771] opal_cr: init: Checkpoint Signal (Debug): 20
[dcbz:13771] opal_cr: init: Temp Directory: /tmp
...
[dcbz:13772] orte_cr: coord: orte_cr_coord(Checkpoint)
[dcbz:13772] orte_cr: coord_pre_ckpt: orte_cr_coord_pre_ckpt()
[dcbz:13772] orte_cr: coord_post_ckpt: orte_cr_coord_post_ckpt()
[dcbz:13772] ompi_cr: coord_post_ckpt: ompi_cr_coord_post_ckpt()
[dcbz:13772] opal_cr: opal_cr_inc_core_ckpt: Take the checkpoint.
[dcbz:13772] crs:criu: checkpoint(13772, ---)
[dcbz:13772] crs:criu: criu_init_opts() returned 0
[dcbz:13771] orte_cr: coord_post_ckpt: orte_cr_coord_post_ckpt()
[dcbz:13771] ompi_cr: coord_post_ckpt: ompi_cr_coord_post_ckpt()
[dcbz:13771] opal_cr: opal_cr_inc_core_ckpt: Take the checkpoint.
[dcbz:13771] crs:criu: checkpoint(13771, ---)
[dcbz:13771] crs:criu: criu_init_opts() returned 0
...
[dcbz:13766] 13766: Checkpoint established for process [55729,0].
[dcbz:13771] ompi_cr: coord: ompi_cr_coord(Running)
[dcbz:13771] orte_cr: coord: orte_cr_coord(Running)
[dcbz:13766] 13766: Successfully restarted process [55729,0].
[dcbz:13772] ompi_cr: coord: ompi_cr_coord(Running)
[dcbz:13772] orte_cr: coord: orte_cr_coord(Running)

It seems the C/R code basically works again and now needs to be filled
with the actual code to take checkpoints using criu.

The patch I want to check in is available at:

https://lisas.de/git/?p=open-mpi.git;a=commitdiff;h=7e0c7c940705cc572242097ff53f9e0ee6db11ea

The patch only creates files in opal/mca/crs/criu and does not touch any
other code.

Adrian


[OMPI devel] Update on 1.7.5

2014-02-07 Thread Ralph Castain
Hi folks

As you may have noticed, I've been working my way thru the CMR backlog on 
1.7.5. A large percentage of them were minor fixes (valgrind warning 
suppressions, error message typos, etc.), so those went in the first round. 
Today's round contains more "meaty" things, but I still consider them fairly 
low risk as the code coverage impacted is contained.

I'm going to let this run thru tonight's MTT - if things look okay tomorrow, I 
will roll the OSHMEM cmr into 1.7.5 over the weekend. This is quite likely to 
destabilize the branch, so I expect to see breakage in the resulting MTT 
reports. We'll deal with it as we go.

Beyond that, there are still about a dozen CMRs in the system awaiting review. 
Jeff has the majority, followed by Nathan. If folks could please review them 
early next week, I would appreciate it.

Thanks
Ralph



[OMPI devel] RFC: optimize probe in ob1

2014-02-07 Thread Nathan Hjelm
What: The current probe algorithm in ob1 is linear with respect to the
number of processes in the job. I wish to change the algorithm to be
linear in the number of processes with unexpected messages. To do this I
added an additional opal_list_t to the ob1 communicator and made the ob1
process a list_item_t. When an unexpected message comes in on a proc it
is added to that proc's unexpected message queue and the proc is added
to the communicator's list of procs with unexpected messages
(unexpected_procs) if it isn't already on that list. When matching a
probe request this list is used to determine which procs to look at to
find an unexpected message. The new list is protected by the matching
lock so no extra locking is needed.
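
As a rough illustration of the bookkeeping described above (this is not the
attached patch), the sketch below keeps a list of only those procs that have
unexpected messages, so a probe walks that list instead of every proc in the
job. All names here (proc_t, comm_t, unexpected_arrived, probe, MAX_PENDING)
are hypothetical stand-ins, a plain array of tags replaces the real
unexpected_frags queue, and removal from the list and locking are left out.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PENDING 8   /* hypothetical cap, only to keep the sketch short */
#define ANY_TAG (-1)    /* hypothetical wildcard tag */

typedef struct proc {
    struct proc *prev, *next;   /* links in the communicator's list */
    int rank;
    int tags[MAX_PENDING];      /* stand-in for the unexpected_frags queue */
    int n_unexpected;
} proc_t;

typedef struct comm {
    proc_t unexpected_procs;    /* sentinel of a circular doubly linked list */
} comm_t;

static void comm_init(comm_t *c)
{
    c->unexpected_procs.prev = &c->unexpected_procs;
    c->unexpected_procs.next = &c->unexpected_procs;
}

static void proc_init(proc_t *p, int rank)
{
    p->prev = p->next = NULL;
    p->rank = rank;
    p->n_unexpected = 0;
}

/* Record an unexpected message with the given tag from proc p. */
static int unexpected_arrived(comm_t *c, proc_t *p, int tag)
{
    if (p->n_unexpected == MAX_PENDING) {
        return -1;
    }
    if (p->n_unexpected == 0) {
        /* first pending message: append the proc to the communicator's list */
        proc_t *tail = c->unexpected_procs.prev;
        p->prev = tail;
        p->next = &c->unexpected_procs;
        tail->next = p;
        c->unexpected_procs.prev = p;
    }
    p->tags[p->n_unexpected++] = tag;
    return 0;
}

/* Probe for a tag, visiting only procs that have unexpected messages.
 * Returns the source rank or -1; *visited counts the procs inspected. */
static int probe(const comm_t *c, int tag, int *visited)
{
    const proc_t *p;
    *visited = 0;
    for (p = c->unexpected_procs.next; p != &c->unexpected_procs; p = p->next) {
        int i;
        ++*visited;
        for (i = 0; i < p->n_unexpected; ++i) {
            if (tag == ANY_TAG || p->tags[i] == tag) {
                return p->rank;
            }
        }
    }
    return -1;
}
```

With 1000 procs of which two have pending messages, a probe that finds
nothing inspects two procs rather than 1000, which is the point of the
change.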

Why: I have a benchmark that makes heavy use of MPI_Iprobe in one of its
phases. I discovered that the primary reason this benchmark was running
slow with Open MPI is the probe algorithm.

When: This is another simple optimization. It only affects the
unexpected message path and will speed up probe requests. This is
intended to go into 1.7.5. Setting the timeout to next Tuesday (which
gives me time to verify the improvement at scale -- 131,000 PEs).

See the attached patch.

-Nathan

diff --git a/ompi/mca/pml/ob1/pml_ob1.c b/ompi/mca/pml/ob1/pml_ob1.c
index bfb975a..4480067 100644
--- a/ompi/mca/pml/ob1/pml_ob1.c
+++ b/ompi/mca/pml/ob1/pml_ob1.c
@@ -192,8 +192,7 @@ int mca_pml_ob1_add_comm(ompi_communicator_t* comm)
 {
 /* allocate pml specific comm data */
 mca_pml_ob1_comm_t* pml_comm = OBJ_NEW(mca_pml_ob1_comm_t);
-opal_list_item_t *item, *next_item;
-mca_pml_ob1_recv_frag_t* frag;
+mca_pml_ob1_recv_frag_t* frag, *next_frag;
 mca_pml_ob1_comm_proc_t* pml_proc;
 mca_pml_ob1_match_hdr_t* hdr;
 int i;
@@ -215,12 +214,9 @@ int mca_pml_ob1_add_comm(ompi_communicator_t* comm)
 pml_comm->procs[i].ompi_proc = 
ompi_group_peer_lookup(comm->c_remote_group,i);
 OBJ_RETAIN(pml_comm->procs[i].ompi_proc);
 }
+
 /* Grab all related messages from the non_existing_communicator pending 
queue */
-for( item = 
opal_list_get_first(&mca_pml_ob1.non_existing_communicator_pending);
- item != 
opal_list_get_end(&mca_pml_ob1.non_existing_communicator_pending);
- item = next_item ) {
-frag = (mca_pml_ob1_recv_frag_t*)item;
-next_item = opal_list_get_next(item);
+OPAL_LIST_FOREACH_SAFE(frag, next_frag, 
&mca_pml_ob1.non_existing_communicator_pending, mca_pml_ob1_recv_frag_t) {
 hdr = &frag->hdr.hdr_match;
 
 /* Is this fragment for the current communicator ? */
@@ -231,9 +227,7 @@ int mca_pml_ob1_add_comm(ompi_communicator_t* comm)
  * we should remove it from the
  * non_existing_communicator_pending list. */
 opal_list_remove_item( &mca_pml_ob1.non_existing_communicator_pending, 
-   item );
-
-  add_fragment_to_unexpected:
+   (opal_list_item_t *) frag);
 
 /* We generate the MSG_ARRIVED event as soon as the PML is aware
  * of a matching fragment arrival. Independing if it is received
@@ -252,9 +246,15 @@ int mca_pml_ob1_add_comm(ompi_communicator_t* comm)
  */
 pml_proc = &(pml_comm->procs[hdr->hdr_src]);
 
+  add_fragment_to_unexpected:
+
 if( ((uint16_t)hdr->hdr_seq) == 
((uint16_t)pml_proc->expected_sequence) ) {
 /* We're now expecting the next sequence number. */
 pml_proc->expected_sequence++;
+/* add this proc to the list of procs with unexpected messages */
+if (0 == opal_list_get_size (&pml_proc->unexpected_frags)) {
+opal_list_append (&pml_comm->unexpected_procs, 
&pml_proc->super);
+}
 opal_list_append( &pml_proc->unexpected_frags, 
(opal_list_item_t*)frag );
 PERUSE_TRACE_MSG_EVENT(PERUSE_COMM_MSG_INSERT_IN_UNEX_Q, comm,
hdr->hdr_src, hdr->hdr_tag, PERUSE_RECV);
@@ -264,9 +264,7 @@ int mca_pml_ob1_add_comm(ompi_communicator_t* comm)
  * situation as the cant_match is only checked when a new fragment 
is received from
  * the network.
  */
-   for(frag = (mca_pml_ob1_recv_frag_t 
*)opal_list_get_first(&pml_proc->frags_cant_match);
-   frag != (mca_pml_ob1_recv_frag_t 
*)opal_list_get_end(&pml_proc->frags_cant_match);
-   frag = (mca_pml_ob1_recv_frag_t *)opal_list_get_next(frag)) {
+OPAL_LIST_FOREACH(frag, &pml_proc->frags_cant_match, 
mca_pml_ob1_recv_frag_t) {
hdr = &frag->hdr.hdr_match;
/* If the message has the next expected seq from that proc...  
*/
if(hdr->hdr_seq != pml_proc->expected_sequence)
diff --git a/ompi/mca/pml/ob1/pml_ob1.h b/ompi/mca/pml/ob1/pml_ob1.h
index 5c66580..1a9dd78 100644
--- a/ompi/mca/pml/ob1/pml_ob1.h
+++ b/ompi/mca/pml/ob1/pml_ob1.h
@@ -12,7 +12,7 @@
  *  

Re: [OMPI devel] new CRS component added (criu)

2014-02-07 Thread Jeff Squyres (jsquyres)
Sweet -- +1 for CRIU support!

FWIW, I see you modeled your configure.m4 off the blcr configure.m4, but I'd 
actually go with making it a bit simpler.  For example, I typically structure 
my configure.m4's like this (typed in mail client -- forgive mistakes...):

-
   AS_IF([...some test], [crs_criu_happy=1], [crs_criu_happy=0])
   # Only bother doing the next test if the previous one passed
   AS_IF([test $crs_criu_happy -eq 1 && ...next test], 
 [crs_criu_happy=1], [crs_criu_happy=0])
   # Only bother doing the next test if the previous one passed
   AS_IF([test $crs_criu_happy -eq 1 && ...next test], 
 [crs_criu_happy=1], [crs_criu_happy=0])

   ...etc...

   # Put a single execution of $2 and $3 at the end, depending on how the 
   # above tests go.  If a human asked for criu (e.g., --with-criu) and
   # we can't find criu support, that's a fatal error.
   AS_IF([test $crs_criu_happy -eq 1],
         [$2],
         [AS_IF([test "x$with_criu" != "x" && test "x$with_criu" != "xno"],
                [AC_MSG_WARN([You asked for CRIU support, but I can't find it.])
                 AC_MSG_ERROR([Cannot continue])],
                [$1])
         ])
-

I note you have a stray $3 at the end of your configure.m4, too (it might be 
supposed to be $2?).

Finally, I note you're looking for libcriu.  Last time I checked with the CRIU 
guys -- which was quite a while ago -- that didn't exist (but I put in my $0.02 
that OMPI would like to see such a userspace library).  I take it that libcriu 
now exists?





On Feb 7, 2014, at 4:46 PM, Adrian Reber  wrote:

> I have created a new CRS component using criu (criu.org) to support
> checkpoint/restart in Open MPI. My current patch only provides the
> framework and necessary configure scripts to detect and link against
> criu. With this patch orte-checkpoint can request a checkpoint and the
> new CRIU CRS component is used:
> 
> [dcbz:13766] orte_cr: init: orte_cr_init()
> [dcbz:13766] crs:criu: opal_crs_criu_prelaunch
> [dcbz:13766] crs:criu: opal_crs_criu_prelaunch
> [dcbz:13771] opal_cr: init: Verbose Level: 30
> [dcbz:13771] opal_cr: init: FT Enabled: true
> [dcbz:13771] opal_cr: init: Is a tool program: false
> [dcbz:13771] opal_cr: init: Debug SIGPIPE: 30 (False)
> [dcbz:13771] opal_cr: init: Checkpoint Signal: 10
> [dcbz:13771] opal_cr: init: FT Use thread: true
> [dcbz:13771] opal_cr: init: FT thread sleep: check = 0, wait = 100
> [dcbz:13771] opal_cr: init: C/R Debugging Enabled [False]
> [dcbz:13771] opal_cr: init: Checkpoint Signal (Debug): 20
> [dcbz:13771] opal_cr: init: Temp Directory: /tmp
> ...
> [dcbz:13772] orte_cr: coord: orte_cr_coord(Checkpoint)
> [dcbz:13772] orte_cr: coord_pre_ckpt: orte_cr_coord_pre_ckpt()
> [dcbz:13772] orte_cr: coord_post_ckpt: orte_cr_coord_post_ckpt()
> [dcbz:13772] ompi_cr: coord_post_ckpt: ompi_cr_coord_post_ckpt()
> [dcbz:13772] opal_cr: opal_cr_inc_core_ckpt: Take the checkpoint.
> [dcbz:13772] crs:criu: checkpoint(13772, ---)
> [dcbz:13772] crs:criu: criu_init_opts() returned 0
> [dcbz:13771] orte_cr: coord_post_ckpt: orte_cr_coord_post_ckpt()
> [dcbz:13771] ompi_cr: coord_post_ckpt: ompi_cr_coord_post_ckpt()
> [dcbz:13771] opal_cr: opal_cr_inc_core_ckpt: Take the checkpoint.
> [dcbz:13771] crs:criu: checkpoint(13771, ---)
> [dcbz:13771] crs:criu: criu_init_opts() returned 0
> ...
> [dcbz:13766] 13766: Checkpoint established for process [55729,0].
> [dcbz:13771] ompi_cr: coord: ompi_cr_coord(Running)
> [dcbz:13771] orte_cr: coord: orte_cr_coord(Running)
> [dcbz:13766] 13766: Successfully restarted process [55729,0].
> [dcbz:13772] ompi_cr: coord: ompi_cr_coord(Running)
> [dcbz:13772] orte_cr: coord: orte_cr_coord(Running)
> 
> It seems the C/R code basically works again and now needs to be filled
> with the actual code to take checkpoints using criu.
> 
> The patch I want to check in is available at:
> 
> https://lisas.de/git/?p=open-mpi.git;a=commitdiff;h=7e0c7c940705cc572242097ff53f9e0ee6db11ea
> 
> The patch only creates files in opal/mca/crs/criu and does not touch any
> other code.
> 
>   Adrian


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] Update on 1.7.5

2014-02-07 Thread Paul Hargrove
Ralph,

I'll try to test tonight's v1.7 tarball for:
+ ia64 atomics (#4174)
+ bad getpwuid (#4164)
+ opalpath_nfs/EPERM (#4125)
+ torque smp (#4227)

All but torque are fully-automated tests and I need only check my email for
the results.
The torque one will require manual job submission.

-Paul


On Fri, Feb 7, 2014 at 1:55 PM, Ralph Castain  wrote:

> Hi folks
>
> As you may have noticed, I've been working my way thru the CMR backlog on
> 1.7.5. A large percentage of them were minor fixes (valgrind warning
> suppressions, error message typos, etc.), so those went in the first round.
> Today's round contains more "meaty" things, but I still consider them
> fairly low risk as the code coverage impacted is contained.
>
> I'm going to let this run thru tonight's MTT - if things look okay
> tomorrow, I will roll the OSHMEM cmr into 1.7.5 over the weekend. This is
> quite likely to destabilize the branch, so I expect to see breakage in the
> resulting MTT reports. We'll deal with it as we go.
>
> Beyond that, there are still about a dozen CMRs in the system awaiting
> review. Jeff has the majority, followed by Nathan. If folks could please
> review them early next week, I would appreciate it.
>
> Thanks
> Ralph
>
>



-- 
Paul H. Hargrove  phhargr...@lbl.gov
Future Technologies Group
Computer and Data Sciences Department Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900


[OMPI devel] RFC: Add an OPAL rand and srand

2014-02-07 Thread Joshua Ladd
What: Add an internal random number generator to OPAL.

Why: OMPI uses rand and srand all over the place. Because the middleware is 
mucking with the RNG's global state, applications that use these library 
routines will not achieve reproducible results with the same seed.

How: I plan to put in an additive lagged Fibonacci generator seeded with a 
Tausworthe generator that itself is seeded by the user's seed. The short story 
here is that the ALFG has a toroidal state space, i.e. it can be decomposed 
into non-overlapping cycles with maximal period. It's well understood how to 
fully enumerate these cycles when, for a length k register composed of m-bit 
words, we view this as a k X m binary matrix. It was proven by Marsaglia et al. 
that this matrix has a canonical form that is uniquely determined by the values 
k and l (the two numbers that (almost) completely characterize an ALFG.) So, 
distinct seeds are guaranteed to map to distinct, non-overlapping, long period 
streams that have measurably very, very low inter- and intra-stream 
correlations.  We used this for large scale Monte Carlo simulations back in my 
PhD days.

Will define a new type:

struct opal_rng_buffer_t {

uint32_t  buff[127]; /* if people are going to pitch a fit over the size, we 
can go smaller, down to 7, but, obviously, this affects the quality of the 
streams */
int tap1;
int tap2;

};

and two functions:

/* User is responsible for defining his/her own opal_rng_buffer_t
  * or malloc-ing and managing the resources themselves.
 */
int opal_srand(opal_rng_buffer_t *buff, uint32_t seed);

/* Returns a 32-bit pseudo random integer */
uint32_t opal_rand(opal_rng_buffer_t *buff);
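
For readers who want to see the shape of such a generator, here is a minimal
sketch of an additive lagged Fibonacci generator behind the two prototypes
above. It is not the code referred to in this RFC. Assumptions made for
illustration: the short lag of 30 (paired with the 127-word register), a
32-bit Galois LFSR with mask 0x80200003 standing in for the Tausworthe seeding
generator, the meaning given to tap1/tap2, and a typedef added so that the
prototypes compile as plain C.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RNG_LEN 127            /* register length, as in the RFC */
#define RNG_SHORT_LAG 30       /* second lag: an assumption, not from the RFC */
#define LFSR_MASK 0x80200003u  /* LFSR feedback mask: an assumption */

typedef struct opal_rng_buffer_t {
    uint32_t buff[RNG_LEN];
    int tap1;   /* index of x[n-127], the oldest word in the register */
    int tap2;   /* index of x[n-30] */
} opal_rng_buffer_t;

/* One step of a Galois LFSR; returns the bit that was shifted out. */
static uint32_t lfsr_bit(uint32_t *state)
{
    uint32_t lsb = *state & 1u;
    *state >>= 1;
    if (lsb) {
        *state ^= LFSR_MASK;
    }
    return lsb;
}

/* Fill the register from the seed; returns 0 on success, -1 on bad input. */
int opal_srand(opal_rng_buffer_t *rng, uint32_t seed)
{
    uint32_t state = seed ? seed : 1u;   /* an LFSR must not start at zero */
    int i, b;

    if (NULL == rng) {
        return -1;
    }
    for (i = 0; i < RNG_LEN; ++i) {
        uint32_t word = 0;
        for (b = 0; b < 32; ++b) {
            word = (word << 1) | lfsr_bit(&state);
        }
        rng->buff[i] = word;
    }
    rng->buff[0] |= 1u;   /* if every word were even, every output would be even */
    rng->tap1 = 0;
    rng->tap2 = RNG_LEN - RNG_SHORT_LAG;
    return 0;
}

/* x[n] = x[n-127] + x[n-30] mod 2^32, stored over the oldest word. */
uint32_t opal_rand(opal_rng_buffer_t *rng)
{
    uint32_t next;

    assert(rng != NULL);
    next = rng->buff[rng->tap1] + rng->buff[rng->tap2];   /* wraps mod 2^32 */
    rng->buff[rng->tap1] = next;
    rng->tap1 = (rng->tap1 + 1) % RNG_LEN;
    rng->tap2 = (rng->tap2 + 1) % RNG_LEN;
    return next;
}
```

Because all state lives in the caller's opal_rng_buffer_t, two buffers seeded
with the same value produce the same stream, and nothing here touches the
state behind rand()/srand().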


When: Should be in by the end of February.  Code is written, but integration 
and testing always takes some time.



Joshua S. Ladd, PhD
HPC Algorithms Engineer
Mellanox Technologies

Email: josh...@mellanox.com
Cell: +1 (865) 258 - 8898




Re: [OMPI devel] RFC: Add an OPAL rand and srand

2014-02-07 Thread Paul Hargrove
Joshua,

This is for ticket #2928, right?

-Paul


On Fri, Feb 7, 2014 at 2:23 PM, Joshua Ladd  wrote:

>  What: Add an internal random number generator to OPAL.
>
>
>
> Why: OMPI uses rand and srand all over the place. Because the middleware
> is mucking with the RNG's global state, applications that use these library
> routines will not achieve reproducible results with the same seed.
>
>
>
> How: I plan to put in an additive lagged Fibonacci generator seeded with a
> Tausworthe generator that itself is seeded by the user's seed. The short
> story here is that the ALFG has a toroidal state space, i.e. it can be
> decomposed into non-overlapping cycles with maximal period. It's well
> understood how to fully enumerate these cycles when, for a length k
> register composed of m-bit words, we view this as a k X m binary matrix. It
> was proven by Marsaglia et al. that this matrix has a canonical form that
> is uniquely determined by the values k and l (the two numbers that (almost)
> completely characterize an ALFG.) So, distinct seeds are guaranteed to map
> to distinct, non-overlapping, long period streams that have measurably
> very, very low inter- and intra-stream correlations.  We used this for
> large scale Monte Carlo simulations back in my PhD days.
>
>
>
> Will define a new type:
>
>
>
> struct opal_rng_buffer_t {
>
>
>
> uint32_t  buff[127]; /* if people are going to pitch a fit over the size,
> we can go smaller, down to 7, but, obviously, this affects the quality of
> the streams */
>
> int tap1;
>
> int tap2;
>
>
>
> };
>
>
>
> and two functions:
>
>
>
> /* User is responsible for defining his/her own opal_rng_buffer_t
>
>   * or malloc-ing and managing the resources themselves.
>
>  */
>
> int opal_srand(opal_rng_buffer_t *buff, uint32_t seed);
>
>
>
> /* Returns a 32-bit pseudo random integer */
>
> uint32_t opal_rand(opal_rng_buffer_t *buff);
>
>
>
>
>
> When: Should be in by the end of February.  Code is written, but
> integration and testing always takes some time.
>
>
>
>
>
>
>
> Joshua S. Ladd, PhD
>
> HPC Algorithms Engineer
>
> Mellanox Technologies
>
>
>
> Email: josh...@mellanox.com
>
> Cell: +1 (865) 258 - 8898
>
>
>
>
>
>



-- 
Paul H. Hargrove  phhargr...@lbl.gov
Future Technologies Group
Computer and Data Sciences Department Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900


Re: [OMPI devel] RFC: Add an OPAL rand and srand

2014-02-07 Thread Joshua Ladd
Yes. After batting this around a bit with Jeff and Mike, we came to the 
consensus that the interface should be more "rand_r", so that state is locally 
managed by the consumer. The ALFG offers a powerful yet simple way to do it. We 
may even expose it to users since it offers a very scalable and high quality 
parallel RNG.
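
To make the "rand_r"-style point concrete, here is a small sketch using the
POSIX rand_r() call. The two helper functions are hypothetical middleware
routines, not anything in OMPI. The first draws from the global generator and
so disturbs an application's rand() sequence; the second keeps its state in a
variable owned by the caller and, on common libc implementations, leaves the
application's sequence alone.

```c
#define _POSIX_C_SOURCE 200112L   /* exposes rand_r() in <stdlib.h> */
#include <assert.h>
#include <stdlib.h>

/* Hypothetical middleware helper that draws from the global generator. */
static int pick_with_global_state(int n)
{
    assert(n > 0);
    return rand() % n;
}

/* Same helper, but the caller owns the generator state. */
static int pick_with_local_state(unsigned int *state, int n)
{
    assert(state != NULL && n > 0);
    return rand_r(state) % n;
}
```

An application that calls srand(42) and then rand() twice should see the same
two values whether or not pick_with_local_state() runs in between; with
pick_with_global_state() in between, the second value it sees is shifted by
one draw.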

Josh

From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Paul Hargrove
Sent: Friday, February 07, 2014 5:31 PM
To: Open MPI Developers
Subject: Re: [OMPI devel] RFC: Add an OPAL rand and srand

Joshua,

This is for ticket #2928, right?

-Paul

On Fri, Feb 7, 2014 at 2:23 PM, Joshua Ladd 
mailto:josh...@mellanox.com>> wrote:
What: Add an internal random number generator to OPAL.

Why: OMPI uses rand and srand all over the place. Because the middleware is 
mucking with the RNG's global state, applications that use these library 
routines will not achieve reproducible results with the same seed.

How: I plan to put in an additive lagged Fibonacci generator seeded with a 
Tausworthe generator that itself is seeded by the user's seed. The short story 
here is that the ALFG has a toroidal state space, i.e. it can be decomposed 
into non-overlapping cycles with maximal period. It's well understood how to 
fully enumerate these cycles when, for a length k register composed of m-bit 
words, we view this as a k X m binary matrix. It was proven by Marsaglia et al. 
that this matrix has a canonical form that is uniquely determined by the values 
k and l (the two numbers that (almost) completely characterize an ALFG.) So, 
distinct seeds are guaranteed to map to distinct, non-overlapping, long period 
streams that have measurably very, very low inter- and intra-stream 
correlations.  We used this for large scale Monte Carlo simulations back in my 
PhD days.

Will define a new type:

struct opal_rng_buffer_t {

uint32_t  buff[127]; /* if people are going to pitch a fit over the size, we 
can go smaller, down to 7, but, obviously, this affects the quality of the 
streams */
int tap1;
int tap2;

};

and two functions:

/* User is responsible for defining his/her own opal_rng_buffer_t
  * or malloc-ing and managing the resources themselves.
 */
int opal_srand(opal_rng_buffer_t *buff, uint32_t seed);

/* Returns a 32-bit pseudo random integer */
uint32_t opal_rand(opal_rng_buffer_t *buff);


When: Should be in by the end of February.  Code is written, but integration 
and testing always takes some time.



Joshua S. Ladd, PhD
HPC Algorithms Engineer
Mellanox Technologies

Email: josh...@mellanox.com
Cell: +1 (865) 258 - 8898






--
Paul H. Hargrove  
phhargr...@lbl.gov
Future Technologies Group
Computer and Data Sciences Department Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900


Re: [OMPI devel] new CRS component added (criu)

2014-02-07 Thread Josh Hursey
That is fantastic! Thanks for the hard work so far getting the C/R
infrastructure back in place.


On Fri, Feb 7, 2014 at 3:46 PM, Adrian Reber  wrote:

> I have created a new CRS component using criu (criu.org) to support
> checkpoint/restart in Open MPI. My current patch only provides the
> framework and necessary configure scripts to detect and link against
> criu. With this patch orte-checkpoint can request a checkpoint and the
> new CRIU CRS component is used:
>
> [dcbz:13766] orte_cr: init: orte_cr_init()
> [dcbz:13766] crs:criu: opal_crs_criu_prelaunch
> [dcbz:13766] crs:criu: opal_crs_criu_prelaunch
> [dcbz:13771] opal_cr: init: Verbose Level: 30
> [dcbz:13771] opal_cr: init: FT Enabled: true
> [dcbz:13771] opal_cr: init: Is a tool program: false
> [dcbz:13771] opal_cr: init: Debug SIGPIPE: 30 (False)
> [dcbz:13771] opal_cr: init: Checkpoint Signal: 10
> [dcbz:13771] opal_cr: init: FT Use thread: true
> [dcbz:13771] opal_cr: init: FT thread sleep: check = 0, wait = 100
> [dcbz:13771] opal_cr: init: C/R Debugging Enabled [False]
> [dcbz:13771] opal_cr: init: Checkpoint Signal (Debug): 20
> [dcbz:13771] opal_cr: init: Temp Directory: /tmp
> ...
> [dcbz:13772] orte_cr: coord: orte_cr_coord(Checkpoint)
> [dcbz:13772] orte_cr: coord_pre_ckpt: orte_cr_coord_pre_ckpt()
> [dcbz:13772] orte_cr: coord_post_ckpt: orte_cr_coord_post_ckpt()
> [dcbz:13772] ompi_cr: coord_post_ckpt: ompi_cr_coord_post_ckpt()
> [dcbz:13772] opal_cr: opal_cr_inc_core_ckpt: Take the checkpoint.
> [dcbz:13772] crs:criu: checkpoint(13772, ---)
> [dcbz:13772] crs:criu: criu_init_opts() returned 0
> [dcbz:13771] orte_cr: coord_post_ckpt: orte_cr_coord_post_ckpt()
> [dcbz:13771] ompi_cr: coord_post_ckpt: ompi_cr_coord_post_ckpt()
> [dcbz:13771] opal_cr: opal_cr_inc_core_ckpt: Take the checkpoint.
> [dcbz:13771] crs:criu: checkpoint(13771, ---)
> [dcbz:13771] crs:criu: criu_init_opts() returned 0
> ...
> [dcbz:13766] 13766: Checkpoint established for process [55729,0].
> [dcbz:13771] ompi_cr: coord: ompi_cr_coord(Running)
> [dcbz:13771] orte_cr: coord: orte_cr_coord(Running)
> [dcbz:13766] 13766: Successfully restarted process [55729,0].
> [dcbz:13772] ompi_cr: coord: ompi_cr_coord(Running)
> [dcbz:13772] orte_cr: coord: orte_cr_coord(Running)
>
> It seems the C/R code basically works again and now needs to be filled
> with the actual code to take checkpoints using criu.
>
> The patch I want to check in is available at:
>
>
> https://lisas.de/git/?p=open-mpi.git;a=commitdiff;h=7e0c7c940705cc572242097ff53f9e0ee6db11ea
>
> The patch only creates files in opal/mca/crs/criu and does not touch any
> other code.
>
> Adrian
>



-- 
Joshua Hursey
Assistant Professor of Computer Science
University of Wisconsin-La Crosse
http://cs.uwlax.edu/~jjhursey


Re: [OMPI devel] RFC: Add an OPAL rand and srand

2014-02-07 Thread Nathan Hjelm
+1.

On Fri, Feb 07, 2014 at 10:23:41PM +, Joshua Ladd wrote:
> What: Add an internal random number generator to OPAL.
>
> Why: OMPI uses rand and srand all over the place. Because the middleware
> is mucking with the RNG's global state, applications that use these
> library routines will not achieve reproducible results with the same seed.
>
> How: I plan to put in an additive lagged Fibonacci generator seeded with a
> Tausworthe generator that itself is seeded by the user's seed. The short
> story here is that the ALFG has a toroidal state space, i.e. it can be
> decomposed into non-overlapping cycles with maximal period. It's well
> understood how to fully enumerate these cycles when, for a length k
> register composed of m-bit words, we view this as a k X m binary matrix.
> It was proven by Marsaglia et al. that this matrix has a canonical form
> that is uniquely determined by the values k and l (the two numbers that
> (almost) completely characterize an ALFG.) So, distinct seeds are
> guaranteed to map to distinct, non-overlapping, long period streams that
> have measurably very, very low inter- and intra-stream correlations.  We
> used this for large scale Monte Carlo simulations back in my PhD days.
>
> Will define a new type:
>
> struct opal_rng_buffer_t {
>     uint32_t  buff[127]; /* if people are going to pitch a fit over the
>                             size, we can go smaller, down to 7, but,
>                             obviously, this affects the quality of the
>                             streams */
>     int tap1;
>     int tap2;
> };
>
> and two functions:
>
> /* User is responsible for defining his/her own opal_rng_buffer_t
>  * or malloc-ing and managing the resources themselves.
>  */
> int opal_srand(opal_rng_buffer_t *buff, uint32_t seed);
>
> /* Returns a 32-bit pseudo random integer */
> uint32_t opal_rand(opal_rng_buffer_t *buff);
>
> When: Should be in by the end of February.  Code is written, but
> integration and testing always takes some time.
>
> Joshua S. Ladd, PhD
> HPC Algorithms Engineer
> Mellanox Technologies
>
> Email: josh...@mellanox.com
> Cell: +1 (865) 258 - 8898
>
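
For readers following the RFC above, here is a rough sketch of how an additive lagged Fibonacci generator with a caller-owned state buffer could look. It is not the code from the RFC. The names sketch_rng_t, sketch_srand and sketch_rand are hypothetical, and the short lag of 30 is an assumption (the RFC does not say which lags it uses). The seeder is a plain xorshift32 stand-in rather than the Tausworthe generator the RFC describes. The prototypes in the RFC use opal_rng_buffer_t as a bare type name, so the sketch assumes a typedef.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_LEN 127  /* register length k, matching buff[127] in the RFC */
#define SKETCH_LAG 30   /* short lag: an assumption, not stated in the RFC */

/* Hypothetical state type; the typedef lets it be used as a bare type name. */
typedef struct {
    uint32_t buff[SKETCH_LEN];
    int tap1;  /* index of the oldest word, x[n - SKETCH_LEN] */
    int tap2;  /* index of x[n - SKETCH_LAG] */
} sketch_rng_t;

/* xorshift32 step (shifts 13, 17, 5). Stand-in seeder only; this is NOT
 * the Tausworthe generator mentioned in the RFC. */
static uint32_t sketch_xorshift32(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Fill the caller-owned register from the seed. Returns 0 on success,
 * -1 if rng is NULL. */
int sketch_srand(sketch_rng_t *rng, uint32_t seed)
{
    uint32_t s = seed ? seed : 0x9E3779B9u;  /* xorshift needs a nonzero state */
    int i;

    if (NULL == rng) {
        return -1;
    }
    for (i = 0; i < SKETCH_LEN; i++) {
        rng->buff[i] = sketch_xorshift32(&s);
    }
    /* Force one odd word: if every word were even, every sum would be
     * even as well. */
    rng->buff[0] |= 1u;
    rng->tap1 = 0;
    rng->tap2 = SKETCH_LEN - SKETCH_LAG;
    return 0;
}

/* x[n] = x[n - SKETCH_LEN] + x[n - SKETCH_LAG] (mod 2^32). The new value
 * overwrites the oldest word, then both taps advance around the ring. */
uint32_t sketch_rand(sketch_rng_t *rng)
{
    uint32_t v;

    assert(rng != NULL);
    v = rng->buff[rng->tap1] + rng->buff[rng->tap2];
    rng->buff[rng->tap1] = v;
    rng->tap1 = (rng->tap1 + 1) % SKETCH_LEN;
    rng->tap2 = (rng->tap2 + 1) % SKETCH_LEN;
    return v;
}
```

Because all state lives in the caller's buffer, two buffers seeded with the same value produce the same stream no matter what else in the process calls rand() or srand().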






Re: [OMPI devel] RFC: Add an OPAL rand and srand

2014-02-07 Thread Jeff Squyres (jsquyres)
+1

On Feb 7, 2014, at 5:23 PM, Joshua Ladd wrote:

> What: Add an internal random number generator to OPAL.
>  
> Why: OMPI uses rand and srand all over the place. Because the middleware is 
> mucking with the RNG’s global state, applications that use these library 
> routines will not achieve reproducible results with the same seed.
>  
> How: I plan to put in an additive lagged Fibonacci generator seeded with a 
> Tausworthe generator that itself is seeded by the user’s seed. The short 
> story here is that the ALFG has a toroidal state space, i.e. it can be 
> decomposed into non-overlapping cycles with maximal period. It’s well 
> understood how to fully enumerate these cycles when, for a length k register 
> composed of m-bit words, we view this as a k X m binary matrix. It was proven 
> by Marsaglia et al. that this matrix has a canonical form that is uniquely 
> determined by the values k and l (the two numbers that (almost) completely 
> characterize an ALFG.) So, distinct seeds are guaranteed to map to distinct, 
> non-overlapping, long period streams that have measurably very, very low 
> inter- and intra-stream correlations.  We used this for large scale Monte 
> Carlo simulations back in my PhD days.
>  
> Will define a new type:
>  
> struct opal_rng_buffer_t {
>  
> uint32_t  buff[127]; /* if people are going to pitch a fit over the size, we 
> can go smaller, down to 7, but, obviously, this affects the quality of the 
> streams */
> int tap1;
> int tap2;
>  
> };
>  
> and two functions:
>  
> /* User is responsible for defining his/her own opal_rng_buffer_t
>   * or malloc-ing and managing the resources themselves.   
>  */
> int opal_srand(opal_rng_buffer_t *buff, uint32_t seed);
>  
> /* Returns a 32-bit pseudo random integer */
> uint32_t opal_rand(opal_rng_buffer_t *buff);
>  
>  
> When: Should be in by the end of February.  Code is written, but integration 
> and testing always takes some time.
>  
>  
>  
> Joshua S. Ladd, PhD
> HPC Algorithms Engineer
> Mellanox Technologies
>  
> Email: josh...@mellanox.com
> Cell: +1 (865) 258 - 8898
>  
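
To make the "Why" in the RFC concrete, here is a small sketch of the interference it describes. The function hypothetical_library_call stands in for any middleware routine that draws from the global generator; it is not an actual Open MPI function, and app_nth_draw is likewise made up for illustration.

```c
#include <stdlib.h>

/* Hypothetical stand-in for a library routine that consumes one value
 * from the process-wide rand() state. */
static void hypothetical_library_call(void)
{
    (void)rand();
}

/* Seed the global generator, then return the application's n-th draw
 * (n >= 1). If library_after >= 1, the library draws once right after
 * the application's draw number library_after; pass 0 for no
 * interference. */
static int app_nth_draw(unsigned seed, int n, int library_after)
{
    int i;
    int v = 0;

    srand(seed);
    for (i = 1; i <= n; i++) {
        v = rand();
        if (i == library_after) {
            hypothetical_library_call();
        }
    }
    return v;
}
```

With no interference, app_nth_draw(7, 3, 0) returns the same value on every call. With one library draw after the first application draw, app_nth_draw(7, 3, 1) instead returns what would have been the fourth value of the undisturbed stream, so the application's results change even though its seed did not.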
>  


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] new CRS component added (criu)

2014-02-07 Thread Jeff Squyres (jsquyres)
On Feb 7, 2014, at 5:08 PM, Jeff Squyres (jsquyres)  wrote:

>   AS_IF([test $crs_criu_happy -eq 1],
>         [$2],
>         [AS_IF([test "x$with_criu" != "x" && test "x$with_criu" != "xno"],
>                [AC_MSG_WARN([You asked for CRIU support, but I can't find it.])
>                 AC_MSG_ERROR([Cannot continue])],
>                [$1])
>         ])

Heh.  I got $1 and $2 backwards.  But you get the idea.  :-)

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/