If you don't need to support multiple devices simultaneously, checksumming, or
some level of fault tolerance, then an MTL will be
enough.
george.
On Jun 16, 2011, at 09:47, Peter Kjellström wrote:
> On Tuesday, June 14, 2011 06:25:52 PM Jeff Squyres wrote:
>> Thank
On Jun 16, 2011, at 3:47 AM, Peter Kjellström wrote:
>> I should say that if anyone is contemplating writing a new BTL, I'm happy
>> to get on the phone / webex with you for an intro to the OMPI code base,
>> point you in the right direction, etc. Ping me on/off list and we can
>> setup a time.
>
So the HNP/mpirun knows when the job is fully restarted. The code for
that is at:
orte/mca/snapc/full/snapc_full_global.c:1758
This should prevent ompi-checkpoint from starting a checkpoint before
the restart is complete. I suspect those are the errors that you are
talking about.
Since you are
Hello.
Thanks for your answers.
It's as you said, Josh: I'm trying to do something uncoordinated and on
demand. What I'm doing now is putting some code in btl_tcp_endpoint.c and
other files that lets me change the communication attempts on the
sockets when a failure occurs. At the mom
On Tuesday, June 14, 2011 06:25:52 PM Jeff Squyres wrote:
> Thanks Tim!
>
> I should say that if anyone is contemplating writing a new BTL, I'm happy
> to get on the phone / webex with you for an intro to the OMPI code base,
> point you in the right direction, etc. Ping me on/off list and we can