It's a fairly big job. I did it for the trunk some time ago - the devel trunk
type for app index is now uint32 - and set it up there to be easy to change in
the future. Doing it for the 1.5 series is non-trivial, as the type is hard-coded in quite
a few places.
Best advice I can give: Try using the trunk,
Yeah, this sounds like the limit on the number of app contexts that we can
use in ORTE. Since ompi-restart uses N app contexts to restart a job (one for
each process in the original job), it is possible to hit this limit.
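
To illustrate the kind of limit being discussed, here is a minimal sketch, assuming (as the original report in this thread suggests) that the app index is held in a signed 8-bit integer. It uses Python's ctypes only to mimic C integer storage; the helper names as_int8 and first_bad_index are hypothetical and are not ORTE functions.

```python
import ctypes

INT8_MAX = 127  # largest value a signed 8-bit integer can hold


def as_int8(index: int) -> int:
    """Store index in a signed 8-bit C integer and read it back.

    ctypes does not range-check, so out-of-range values wrap around,
    mimicking what a narrow C field would hold on common platforms.
    """
    return ctypes.c_int8(index).value


def first_bad_index(num_app_contexts: int):
    """Return the first zero-based app index that does not survive the
    round trip through a signed 8-bit integer, or None if all of them fit."""
    for idx in range(num_app_contexts):
        if as_int8(idx) != idx:
            return idx
    return None


if __name__ == "__main__":
    # One app context per process, as ompi-restart does when restarting a job.
    for n in (128, 129):
        print(f"{n} app contexts -> first bad index: {first_bad_index(n)}")
```

With 128 app contexts every index (0 to 127) still fits; with 129, index 128 wraps to a negative value, which would be consistent with a failure that only appears above 128. Widening the index type, as was done on the trunk with uint32, removes the wrap at this scale.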
I suspect that it should not be too difficult to c
On Apr 25, 2011, at 3:28 PM, Ralph Castain wrote:
>> Can someone provide a solution to this.
>
> Probably won't happen for a while - this is something peculiar to the restart
> mechanism. I'll make a note to look at it, but it would be a low priority.
Kishor -- is this something you could work o
On Apr 25, 2011, at 11:21 AM, Kishor Kharbas wrote:
> Hello Developers,
>
> I am using Open MPI-1.5.3 for performing experiments with checkpoint and
> restart.
> However, when the number of nodes is more than 128, restart fails with a
> segmentation fault.
>
> After debugging the code, I foun
Hello Developers,
I am using Open MPI-1.5.3 for performing experiments with checkpoint and
restart.
However, when the number of nodes is more than 128, restart fails with a
segmentation fault.
After debugging the code, I found that the cause of this error is that
variables of type int_8 are used
Ken,
At UTK we focus on developing two generic frameworks for scalable fault
tolerant approaches. One is based on uncoordinated checkpoint/restart while the
other is application level.
1) uncoordinated C/R based on message logging. Such approaches are fully
automatic, rely on an external check
Thanks. I've read your (Joshua Hursey's) Ph.D. thesis on fault
tolerance using checkpointing with much interest. It would be of further
interest to get the range of possible user requirements for defining the
behaviors in response to various faults.
Ken Lloyd
On Fri, 2011-04-22 at 15:03 -0400, J