Re: [OMPI devel] RFC: libevent update

2008-03-19 Thread Jeff Squyres
I re-merged down to the libevent-merge branch (to include r17872) and a new tarball has been uploaded to http://www.open-mpi.org/~jsquyres/unofficial/ On Mar 18, 2008, at 10:11 PM, George Bosilca wrote: Commit 17872 is the one you're looking for. https://svn.open-mpi.org/trac/ompi/changese

Re: [OMPI devel] RFC: libevent update

2008-03-18 Thread George Bosilca
Commit 17872 is the one you're looking for. https://svn.open-mpi.org/trac/ompi/changeset/17872 george. On Mar 18, 2008, at 9:12 PM, Jeff Squyres wrote: When did you fix it? I merged the trunk down to the libevent-merge branch late this afternoon (r17869). On Mar 18, 2008, at 7:29 PM, Georg

Re: [OMPI devel] RFC: libevent update

2008-03-18 Thread Jeff Squyres
When did you fix it? I merged the trunk down to the libevent-merge branch late this afternoon (r17869). On Mar 18, 2008, at 7:29 PM, George Bosilca wrote: This has been fixed in the trunk, but not yet merged in the branch. george. On Mar 18, 2008, at 7:17 PM, Josh Hursey wrote: I found

Re: [OMPI devel] RFC: libevent update

2008-03-18 Thread Paul H. Hargrove
After taking a look at how epoll is implemented in the Linux kernel, I can say with 100% certainty that BLCR will not restore the epoll fd correctly. I hope to fix that eventually, but have too many other things on my plate to address it now. Since I cannot promise how soon BLCR may be able to r

Re: [OMPI devel] RFC: libevent update

2008-03-18 Thread George Bosilca
This has been fixed in the trunk, but not yet merged in the branch. george. On Mar 18, 2008, at 7:17 PM, Josh Hursey wrote: I found another problem with the libevent branch. If I set "-mca btl tcp,self" on the command line then I get a segfault when sending messages > 16 KB. I can try to mak

Re: [OMPI devel] RFC: libevent update

2008-03-18 Thread Josh Hursey
I found another problem with the libevent branch. If I set "-mca btl tcp,self" on the command line then I get a segfault when sending messages > 16 KB. I can try to make a smaller repeater, but if you use the "progress" or "simple" tests in ompi-tests below: https://svn.open-mpi.org/svn/omp

Re: [OMPI devel] RFC: libevent update

2008-03-18 Thread Josh Hursey
I have some more data from the field. Leaving "opal_event_include" unset (Default) BLCR would give me the following error when trying to restart a 2 process 'noop' MPI application: shell$ ompi-restart ompi_global_snapshot_8587.ckpt Restart failed: Bad file descri

Re: [OMPI devel] RFC: libevent update

2008-03-18 Thread George Bosilca
It's like rewriting libevent from scratch. I guess it can be done, but it will be a long and painful process. How about the following solution: - the daemons are aware that the checkpointing is enabled. They can set the environment variable which will force the opal_event_include to be set t

Re: [OMPI devel] RFC: libevent update

2008-03-18 Thread Jeff Squyres
George added an MCA parameter for it (opal_event_include is a string that can be set to "select" or "poll"), but it has to be set before opal_init(). Josh: could you try running with the MCA parameter opal_event_include set to "select"? This would confirm Brian's hypothesis... Given that

Re: [OMPI devel] RFC: libevent update

2008-03-18 Thread Paul H. Hargrove
If avoiding epoll() makes Josh's problems go away, PLEASE let me know because that might indicate a deficiency in BLCR that I would want to address. -Paul Brian W. Barrett wrote: > Jeff / George - > > Did you add a way to specify which event modules are used? Because epoll > pushes the socket l

Re: [OMPI devel] RFC: libevent update

2008-03-18 Thread Brian W. Barrett
Jeff / George - Did you add a way to specify which event modules are used? Because epoll pushes the socket list into the kernel, I can see how it would screw up BLCR. I bet everything would work if we forced the use of poll / select. Brian On Tue, 18 Mar 2008, Jeff Squyres wrote: Crud, ok
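
[Editor's illustration] The distinction raised above (epoll keeps its list of watched sockets inside the kernel, while select and poll are handed the list from user space on every call) can be sketched roughly as follows. This is not Open MPI or libevent code: the function name make_waiter and its backend argument are hypothetical, and the backend names only loosely mirror the "select" / "poll" values mentioned for opal_event_include in this thread. It uses Python's standard select module and assumes a Linux host for the epoll case.

```python
import select
import socket


def make_waiter(backend, sock):
    """Hypothetical helper: build a 'wait until readable' function.

    Returns (wait_fn, kernel_fd). kernel_fd is None for select/poll,
    because the watched-fd list lives in user space and is passed to
    the kernel on each call. For epoll it is the descriptor of the
    kernel object that holds the interest list.
    """
    fd = sock.fileno()

    if backend == "select":
        def wait(timeout):
            # The fd list is passed in on every call.
            readable, _, _ = select.select([fd], [], [], timeout)
            return bool(readable)
        return wait, None

    if backend == "poll":
        poller = select.poll()
        poller.register(fd, select.POLLIN)

        def wait(timeout):
            # poll() takes its timeout in milliseconds.
            return bool(poller.poll(int(timeout * 1000)))
        return wait, None

    if backend == "epoll":
        ep = select.epoll()  # Linux only; creates a kernel object
        ep.register(fd, select.EPOLLIN)

        def wait(timeout):
            # epoll.poll() takes its timeout in seconds.
            return bool(ep.poll(timeout))
        return wait, ep.fileno()

    raise ValueError("unknown backend: %r" % (backend,))


if __name__ == "__main__":
    a, b = socket.socketpair()
    b.send(b"x")  # make 'a' readable
    for name in ("select", "poll", "epoll"):
        if not hasattr(select, name):
            continue  # backend not available on this platform
        wait, kernel_fd = make_waiter(name, a)
        print(name, "readable:", wait(0.5), "kernel fd:", kernel_fd)
    a.close()
    b.close()
```

Only the epoll backend yields an extra descriptor whose state lives in the kernel, which is the kind of state a checkpoint/restart tool would additionally have to save and restore.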

Re: [OMPI devel] RFC: libevent update

2008-03-18 Thread Jeff Squyres
Crud, ok. Keep us posted. On Mar 18, 2008, at 4:16 PM, Josh Hursey wrote: I'm testing with checkpoint/restart and the new libevent seems to be messing up the checkpoints generated by BLCR. I'll be taking a look at it over the next couple of days, but just thought I'd let people know. Unfortuna

Re: [OMPI devel] RFC: libevent update

2008-03-18 Thread Josh Hursey
I'm testing with checkpoint/restart and the new libevent seems to be messing up the checkpoints generated by BLCR. I'll be taking a look at it over the next couple of days, but just thought I'd let people know. Unfortunately I don't have any more details at the moment. -- Josh On Mar 17, 2