Hi,

I have been testing DMTCP with Univa Grid Engine 8.1.7 and was wondering if you could cast some light on the results I have been getting. I have been basing this on the Grid Engine integration scripts by Orion Poplawski from https://github.com/opoplawski/gridengine_dmtcp, with some minor, fairly cosmetic changes. I'll attach the integration scripts to the email.
Basically, I can run a simple sleeper test, checkpoint it, kill the process, and restart it manually with no issue:

> [aneil2@argo-2 dmtcp-test]$ dmtcp_launch ./my_sleeper
> dmtcp_coordinator starting...
> Host: argo-2 (172.16.1.3)
> Port: 7779
> Checkpoint Interval: disabled (checkpoint manually instead)
> Exit on last client: 1
> Backgrounding...
> start
> count = 0 |04/29/14 - 11:37:19 am|
> count = 1 |04/29/14 - 11:37:59 am|
> count = 2 |04/29/14 - 11:38:12 am|
> count = 3 |04/29/14 - 11:38:52 am|
> ^C
> [aneil2@argo-2 dmtcp-test]$ ./dmtcp_restart_script.sh
> dmtcp_coordinator starting...
> Host: argo-2 (172.16.1.3)
> Port: 7779
> Checkpoint Interval: disabled (checkpoint manually instead)
> Exit on last client: 1
> Backgrounding...
> count = 2 |04/29/14 - 11:39:11 am|
> count = 3 |04/29/14 - 11:39:51 am|

Under the scheduler, however, I get a warning on restart. Here the job starts on node005, gets a suspend signal, is checkpointed, and is restarted on node001:

> [aneil2@argo-2 dmtcp-test]$ cat Sleeper_test.o39775
> dmtcp_coordinator starting...
> Host: node005 (172.16.1.15)
> Port: 46770
> Checkpoint Interval: disabled (checkpoint manually instead)
> Exit on last client: 1
> Backgrounding...
> start
> count = 0 |04/27/14 - 01:36:50 pm|
> count = 1 |04/27/14 - 01:37:30 pm|
> count = 2 |04/27/14 - 01:37:59 pm|
> dmtcp_coordinator starting...
> Host: node001 (172.16.1.11)
> Port: 46770
> Checkpoint Interval: disabled (checkpoint manually instead)
> Exit on last client: 1
> Backgrounding...
> [110201] WARNING at dmtcp_restart.cpp:238 in createProcess;
> REASON='JWARNING(setsid() != -1) failed'
> getsid(0) = 110201
> (strerror((*__errno_location ()))) = Operation not permitted
> Message: Failed to restore this process as session leader.
> count = 2 |04/27/14 - 01:38:06 pm|
> count = 3 |04/27/14 - 01:38:46 pm|
> count = 4 |04/27/14 - 01:39:26 pm|

The restart seems to succeed; my concern is about the warnings.
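One detail in the warning may be relevant: the PID in brackets and the value of getsid(0) are the same, which means the restarting process is already a session leader when setsid() is called. POSIX specifies that setsid() fails with EPERM ("Operation not permitted") when the caller is already a process group leader. The sketch below reproduces that error outside DMTCP and Grid Engine. It is only an illustration of the system call's behaviour, not a confirmed diagnosis of what dmtcp_restart does internally, and the function name try_setsid_twice is made up for this example.

```python
import errno
import os


def try_setsid_twice():
    """Fork a child that calls setsid() twice.

    Returns (first_call_succeeded, errno_of_second_call).
    Hypothetical helper for illustration only; not part of DMTCP.
    """
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: a freshly forked process is not a process group leader,
        # so the first setsid() is expected to succeed.
        os.close(read_fd)
        first_ok = True
        second_errno = 0
        try:
            os.setsid()
        except OSError:
            first_ok = False
        # The child is now session leader and group leader (sid == pid),
        # the same state the warning reports via "getsid(0) = <pid>".
        try:
            os.setsid()
        except OSError as exc:
            second_errno = exc.errno
        os.write(write_fd, f"{int(first_ok)} {second_errno}".encode())
        os._exit(0)
    # Parent: collect the child's report.
    os.close(write_fd)
    data = os.read(read_fd, 64).decode()
    os.close(read_fd)
    os.waitpid(pid, 0)
    first, second = data.split()
    return bool(int(first)), int(second)


if __name__ == "__main__":
    ok, err = try_setsid_twice()
    print("first setsid succeeded:", ok)
    print("second setsid errno:", errno.errorcode.get(err, err))
```

If the job started by the shepherd is already the session leader, as Univa support describes, a second setsid() would fail in this way regardless of which user owns the parent process.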
Univa support suggested that this was because of processor ownership:

> Alastair,
>
> It looks like the actual job is the session leader but it does have
> the shepherd as its parent process. The shepherd is also run as the
> 'admin' user and not the job submission user. Perhaps the different
> processor ownership is causing a problem. I would think it would make
> sense to talk to the dmtcp folks at this time to see if they have any
> suggestions.

So I am wondering if you have any suggestions or insight. As a quick test, I created a test queue that has only one node so that I could suspend the job and have it restart on the same node, and I see the same warning:

> cat Sleeper_test.o39946
> dmtcp_coordinator starting...
> Host: node001 (172.16.1.11)
> Port: 48908
> Checkpoint Interval: disabled (checkpoint manually instead)
> Exit on last client: 1
> Backgrounding...
> start
> count = 0 |04/29/14 - 11:58:34 am|
> count = 1 |04/29/14 - 11:59:14 am|
> count = 2 |04/29/14 - 11:59:54 am|
> count = 3 |04/29/14 - 12:00:05 pm|
> dmtcp_coordinator starting...
> Host: node001 (172.16.1.11)
> Port: 48908
> Checkpoint Interval: disabled (checkpoint manually instead)
> Exit on last client: 1
> Backgrounding...
> [124034] WARNING at dmtcp_restart.cpp:238 in createProcess;
> REASON='JWARNING(setsid() != -1) failed'
> getsid(0) = 124034
> (strerror((*__errno_location ()))) = Operation not permitted
> Message: Failed to restore this process as session leader.
> count = 3 |04/29/14 - 12:01:00 pm|
> count = 4 |04/29/14 - 12:01:40 pm|

Many thanks,
Alastair Neil
dmtcp-ckpt-scripts.tgz
Description: GNU Zip compressed data
