Orion,

Thanks for all your excellent work.
On 04/30/2014 02:11 PM, Orion Poplawski wrote:
> On 04/29/2014 10:04 AM, Alastair Neil wrote:
>> Hi
>>
>> I have been testing dmtcp with Univa Grid Engine 8.1.7 and I was
>> wondering if you could cast some light on the results I have been
>> getting. I have been basing this on the grid engine integration scripts
>> Orion Poplawski from: https://github.com/opoplawski/gridengine_dmtcp
>> with some minor fairly cosmetic changes. I'll attach the integration
>> scripts to the email.
>
> My ears are burning :)
>
> Taking a look at your changes to dmtcp_starter, I'm curious about:
>
> - You are not starting a new dmtcp_coordinator when you are restarting a
> job. Isn't it necessary?

I saw in the job output that it was starting the coordinator twice on restart, so I figured the restart script was starting the coordinator. It seems to work. This is from the original:

> cat Sleeper_test.o37153
> dmtcp_coordinator starting...
> Host: node005 (172.16.1.15)
> Port: 37876
> Checkpoint Interval: disabled (checkpoint manually instead)
> Exit on last client: 1
> Backgrounding...
> start
> count = 0 |04/21/14 - 05:15:02 pm|
> count = 1 |04/21/14 - 05:15:42 pm|
> count = 2 |04/21/14 - 05:15:55 pm|
> dmtcp_coordinator starting...
> Host: node006 (172.16.1.16)
> Port: 48498
> Checkpoint Interval: disabled (checkpoint manually instead)
> Exit on last client: 1
> Backgrounding...
> dmtcp_coordinator starting...
> Host: node006 (172.16.1.16)
> Port: 37876
> Checkpoint Interval: disabled (checkpoint manually instead)
> Exit on last client: 1
> Backgrounding...
> [40197] WARNING at dmtcp_restart.cpp:238 in createProcess;
> REASON='JWARNING(setsid() != -1) failed'
> getsid(0) = 40197
> (strerror((*__errno_location ()))) = Operation not permitted
> Message: Failed to restore this process as session leader.
> count = 2 |04/21/14 - 05:16:03 pm|
> count = 3 |04/21/14 - 05:16:43 pm|

You see it starts the coordinator with a new port on node006, then starts it again with the port of the original coordinator on node005.

>
> - What trouble did using $HOSTNAME instead of `hostname` cause?
>

The issue was that $HOSTNAME had the fully qualified domain name, while `hostname` returned just the node name. For whatever reason dmtcp barfed on being given the FQDN.

> - Thanks for the handling of SGE_STARTER_SHELL_START_MODE, I'll check
> that in.

I'm still not one hundred percent sure this is right; I'll keep testing it.

-Alastair
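
For anyone hitting the same $HOSTNAME issue, here is a minimal sketch of one way to get the short node name whichever form the variable holds. The function name short_hostname and the variable coord_host are made up for illustration and are not taken from the gridengine_dmtcp scripts; the sketch also assumes the short name is what dmtcp should be given, as described above.

```shell
#!/bin/sh
# Hypothetical helper (not part of gridengine_dmtcp): reduce a host name
# to its short form by dropping everything from the first dot onward.
short_hostname() {
    printf '%s\n' "${1%%.*}"
}

# $HOSTNAME may hold the FQDN on some systems; fall back to `hostname`
# when the variable is unset or empty.
coord_host=$(short_hostname "${HOSTNAME:-$(hostname)}")
echo "coordinator host: $coord_host"
```

The parameter expansion is plain POSIX, so it should behave the same under sh and bash. It is only meant for host names; an IP address passed in would be cut at its first dot as well.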
_______________________________________________
Dmtcp-forum mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
