Hi

I have been testing DMTCP with Univa Grid Engine 8.1.7 and I was
wondering if you could shed some light on the results I have been
getting. I have based this on Orion Poplawski's Grid Engine integration
scripts from https://github.com/opoplawski/gridengine_dmtcp
with some minor, fairly cosmetic changes. I have attached the
integration scripts to this email.

Basically, I can run a simple sleeper test, checkpoint it, kill the
process, and restart it manually with no issues:
 

> [aneil2@argo-2 dmtcp-test]$ dmtcp_launch ./my_sleeper
> dmtcp_coordinator starting...
>     Host: argo-2 (172.16.1.3)
>     Port: 7779
>     Checkpoint Interval: disabled (checkpoint manually instead)
>     Exit on last client: 1
> Backgrounding...
> start
> count = 0 |04/29/14 - 11:37:19 am|
> count = 1 |04/29/14 - 11:37:59 am|
> count = 2 |04/29/14 - 11:38:12 am|
> count = 3 |04/29/14 - 11:38:52 am|
> ^C
> [aneil2@argo-2 dmtcp-test]$ ./dmtcp_restart_script.sh
> dmtcp_coordinator starting...
>     Host: argo-2 (172.16.1.3)
>     Port: 7779
>     Checkpoint Interval: disabled (checkpoint manually instead)
>     Exit on last client: 1
> Backgrounding...
> count = 2 |04/29/14 - 11:39:11 am|
> count = 3 |04/29/14 - 11:39:51 am|

However, under the scheduler, the job starts on node005, gets a
suspend signal, is checkpointed, and is restarted on node001:
 
> [aneil2@argo-2 dmtcp-test]$ cat Sleeper_test.o39775
> dmtcp_coordinator starting...
>     Host: node005 (172.16.1.15)
>     Port: 46770
>     Checkpoint Interval: disabled (checkpoint manually instead)
>     Exit on last client: 1
> Backgrounding...
> start
> count = 0 |04/27/14 - 01:36:50 pm|
> count = 1 |04/27/14 - 01:37:30 pm|
> count = 2 |04/27/14 - 01:37:59 pm|
> dmtcp_coordinator starting...
>     Host: node001 (172.16.1.11)
>     Port: 46770
>     Checkpoint Interval: disabled (checkpoint manually instead)
>     Exit on last client: 1
> Backgrounding...
> [110201] WARNING at dmtcp_restart.cpp:238 in createProcess;
> REASON='JWARNING(setsid() != -1) failed'
>      getsid(0) = 110201
>      (strerror((*__errno_location ()))) = Operation not permitted
> Message: Failed to restore this process as session leader.
> count = 2 |04/27/14 - 01:38:06 pm|
> count = 3 |04/27/14 - 01:38:46 pm|
> count = 4 |04/27/14 - 01:39:26 pm|

The restart seems to succeed; my concern is about the warnings. Univa
support suggested that this was because of process ownership:

> Alastair,
>
>   It looks like the actual job is the session leader but it does have
> the shepherd as its parent process.  The shepherd is also run as the
> 'admin' user and not the job submission user.  Perhaps the different
> processor ownership is causing a problem.  I would think it would make
> sense to talk to the dmtcp folks at this time to see if they have any
> suggestions.
>
>

So I am wondering if you have any suggestions or insight. As a
quick test, I created a test queue that has only one node so I could
suspend the job and have it restart on the same node, and I see the same
warning:

> cat Sleeper_test.o39946
> dmtcp_coordinator starting...
>     Host: node001 (172.16.1.11)
>     Port: 48908
>     Checkpoint Interval: disabled (checkpoint manually instead)
>     Exit on last client: 1
> Backgrounding...
> start
> count = 0 |04/29/14 - 11:58:34 am|
> count = 1 |04/29/14 - 11:59:14 am|
> count = 2 |04/29/14 - 11:59:54 am|
> count = 3 |04/29/14 - 12:00:05 pm|
> dmtcp_coordinator starting...
>     Host: node001 (172.16.1.11)
>     Port: 48908
>     Checkpoint Interval: disabled (checkpoint manually instead)
>     Exit on last client: 1
> Backgrounding...
> [124034] WARNING at dmtcp_restart.cpp:238 in createProcess;
> REASON='JWARNING(setsid() != -1) failed'
>      getsid(0) = 124034
>      (strerror((*__errno_location ()))) = Operation not permitted
> Message: Failed to restore this process as session leader.
> count = 3 |04/29/14 - 12:01:00 pm|
> count = 4 |04/29/14 - 12:01:40 pm|
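
For reference, here is a minimal sketch of what I suspect is behind the
warning. It assumes the failure is the standard POSIX behaviour where
setsid() returns EPERM if the caller is already a process group leader,
which would be consistent with getsid(0) matching the number in brackets
(which I take to be the PID) in the output above. The function name
try_setsid_as_leader is my own, and the sketch does not involve DMTCP or
Grid Engine at all:

```python
import os


def try_setsid_as_leader():
    """Fork a child, make it a session leader, then call setsid() again.

    Returns (was_session_leader, result), where result is "ok" or "EPERM".
    Hypothetical helper for illustration only; not part of DMTCP.
    """
    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: a freshly forked process is not a process group leader,
        # so this first setsid() is expected to succeed.
        os.close(r)
        os.setsid()
        is_leader = os.getsid(0) == os.getpid()
        try:
            # Second call: the process now leads its own session and
            # process group, so POSIX says this fails with EPERM.
            os.setsid()
            result = "ok"
        except PermissionError:
            result = "EPERM"
        os.write(w, f"{is_leader}:{result}".encode())
        os._exit(0)
    # Parent: collect the child's report and reap it.
    os.close(w)
    data = os.read(r, 64).decode()
    os.close(r)
    os.waitpid(pid, 0)
    leader, result = data.split(":")
    return leader == "True", result


if __name__ == "__main__":
    print(try_setsid_as_leader())
```

If that is indeed the cause, the restarted process would already be a
session leader when dmtcp_restart calls setsid(), and the warning may be
harmless, but I would appreciate confirmation.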



Many thanks, Alastair Neil



Attachment: dmtcp-ckpt-scripts.tgz
Description: GNU Zip compressed data
