Re: [OMPI devel] Open MPI error

2011-04-25 Thread Ralph Castain
It's a fairly big job. I did it for the trunk some time ago - the devel trunk type for app index is now uint32 - and set it up there to be easy to change in the future. Doing it for the 1.5 series is non-trivial, as the type is hard-coded in quite a few places. Best advice I can give: Try using the trunk,

Re: [OMPI devel] Open MPI error

2011-04-25 Thread Joshua Hursey
Yeah, this sounds like the limit on the number of app contexts that we can use in ORTE. Since ompi-restart uses N app contexts to restart a job (one for each process in the original job), it is possible to hit this limit. I suspect that it should not be too difficult to c

Re: [OMPI devel] Open MPI error

2011-04-25 Thread Jeff Squyres
On Apr 25, 2011, at 3:28 PM, Ralph Castain wrote:
>> Can someone provide a solution to this.
>
> Probably won't happen for a while - this is something peculiar to the restart mechanism. I'll make a note to look at it, but it would be a low priority.

Kishor -- is this something you could work o

Re: [OMPI devel] Open MPI error

2011-04-25 Thread Ralph Castain
On Apr 25, 2011, at 11:21 AM, Kishor Kharbas wrote:
> Hello Developers,
>
> I am using Open MPI-1.5.3 for performing experiments with checkpoint and restart.
> However when the number of nodes is more than 128, restart fails with a segmentation fault.
>
> After debugging the code, I foun

[OMPI devel] Open MPI error

2011-04-25 Thread Kishor Kharbas
Hello Developers, I am using Open MPI-1.5.3 for performing experiments with checkpoint and restart. However, when the number of nodes is more than 128, restart fails with a segmentation fault. After debugging the code, I found that the cause of this error is that variables of type int_8 are used
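
To illustrate the kind of limit described in this report: a signed 8-bit integer holds values up to 127, so it can index at most 128 distinct items (0 through 127). The sketch below is not Open MPI code. The helper names `indices_fit_int8`, `narrow_to_int8`, and `check_app_index_limits` are hypothetical, and the example assumes that the "int_8" in the report refers to a signed 8-bit type such as C's `int8_t`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper (not an Open MPI function): can `count` zero-based
 * indices (0 .. count-1) all be stored in a signed 8-bit integer? */
static bool indices_fit_int8(uint32_t count)
{
    /* The largest index is count - 1; int8_t tops out at INT8_MAX (127). */
    return count == 0 || (count - 1) <= (uint32_t)INT8_MAX;
}

/* Hypothetical helper: narrow a 32-bit index to int8_t only when the value
 * is representable, instead of converting blindly. Returns false and leaves
 * *out untouched when idx is too large. */
static bool narrow_to_int8(uint32_t idx, int8_t *out)
{
    if (idx > (uint32_t)INT8_MAX) {
        return false;
    }
    *out = (int8_t)idx;
    return true;
}

/* Usage example: 128 indices fit in int8_t, 129 do not. */
static void check_app_index_limits(void)
{
    int8_t narrow = 0;

    assert(indices_fit_int8(128));          /* indices 0..127 fit      */
    assert(!indices_fit_int8(129));         /* index 128 does not fit  */

    assert(narrow_to_int8(127, &narrow));   /* last representable index */
    assert(narrow == 127);
    assert(!narrow_to_int8(128, &narrow));  /* rejected, not wrapped    */
    assert(narrow == 127);                  /* output left unchanged    */
}
```

Calling `check_app_index_limits()` from a test program exercises both boundaries. A wider unsigned type such as `uint32_t` for the index removes the 128-item ceiling, at the cost of touching every place the narrower type is hard-coded.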

Re: [OMPI devel] Adaptive or fault-tolerant MPI

2011-04-25 Thread George Bosilca
Ken, At UTK we focus on developing two generic frameworks for scalable fault-tolerant approaches. One is based on uncoordinated checkpoint/restart, while the other is application level. 1) Uncoordinated C/R based on message logging. Such approaches are fully automatic, rely on an external check

Re: [OMPI devel] Adaptive or fault-tolerant MPI

2011-04-25 Thread Ken Lloyd
Thanks. I've read your (Joshua Hursey's) Ph.D. thesis on fault tolerance using checkpointing with much interest. It would be of further interest to get the range of possible user requirements for defining the behaviors in response to various faults. Ken Lloyd On Fri, 2011-04-22 at 15:03 -0400, J