Orion

Thanks for all your excellent work.

On 04/30/2014 02:11 PM, Orion Poplawski wrote:
> On 04/29/2014 10:04 AM, Alastair Neil wrote:
>> Hi
>>
>> I have been testing dmtcp with Univa Grid Engine 8.1.7 and I was
>> wondering if you could cast some light on the results I have been
>> getting. I have been basing this on the grid engine integration scripts
>> by Orion Poplawski, from https://github.com/opoplawski/gridengine_dmtcp
>> with some minor fairly cosmetic changes.  I'll attach the integration
>> scripts to the email.
> 
> My ears are burning :)
> 
> Taking a look at your changes to dmtcp_starter, I'm curious about:
> 
> - You are not starting a new dmtcp_coordinator when you are restarting a
> job.  Isn't it necessary?

I saw in the job output that the coordinator was being started twice on
restart, so I figured the restart script was already starting it itself.
Not starting it in dmtcp_starter seems to work.

This is the job output from the original:

> cat Sleeper_test.o37153
> dmtcp_coordinator starting...
>     Host: node005 (172.16.1.15)
>     Port: 37876
>     Checkpoint Interval: disabled (checkpoint manually instead)
>     Exit on last client: 1
> Backgrounding...
> start
> count = 0 |04/21/14 - 05:15:02 pm|
> count = 1 |04/21/14 - 05:15:42 pm|
> count = 2 |04/21/14 - 05:15:55 pm|
> dmtcp_coordinator starting...
>     Host: node006 (172.16.1.16)
>     Port: 48498
>     Checkpoint Interval: disabled (checkpoint manually instead)
>     Exit on last client: 1
> Backgrounding...
> dmtcp_coordinator starting...
>     Host: node006 (172.16.1.16)
>     Port: 37876
>     Checkpoint Interval: disabled (checkpoint manually instead)
>     Exit on last client: 1
> Backgrounding...
> [40197] WARNING at dmtcp_restart.cpp:238 in createProcess; 
> REASON='JWARNING(setsid() != -1) failed'
>      getsid(0) = 40197
>      (strerror((*__errno_location ()))) = Operation not permitted
> Message: Failed to restore this process as session leader.
> count = 2 |04/21/14 - 05:16:03 pm|
> count = 3 |04/21/14 - 05:16:43 pm|

You can see that it starts a coordinator on a new port on node006, then
starts it again on node006 with the port of the original coordinator from
node005.
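
As a rough sketch of the logic I mean (this is not the actual dmtcp_starter
code): the starter only launches a coordinator on a fresh start and leaves it
to the restart script otherwise. The helper name needs_coordinator is
hypothetical, and I am assuming Grid Engine's RESTARTED variable is 1 for a
restarted job and 0 otherwise.

```shell
# Hypothetical helper, not taken from the integration scripts.
# Succeeds (exit status 0) when the starter itself should launch a coordinator.
# Assumption: Grid Engine exports RESTARTED=1 for a restarted job, 0 otherwise.
needs_coordinator() {
    [ "${1:-0}" != "1" ]
}

if needs_coordinator "${RESTARTED:-0}"; then
    echo "fresh start: the starter launches dmtcp_coordinator"
else
    echo "restart: coordinator startup is left to the restart script"
fi
```

The echo lines only mark where the real commands would go.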


> 
> - What trouble did using $HOSTNAME instead of `hostname` cause?
> 

The issue was that $HOSTNAME held the fully qualified domain name, while
`hostname` returns just the node name. For whatever reason, dmtcp barfed on
being given the FQDN.
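
One way to make the script indifferent to which form it gets is to strip the
domain part before using the name. A minimal sketch, assuming a POSIX shell;
the helper name short_host and the variable node are made up for illustration
and are not part of the integration scripts:

```shell
# Hypothetical helper: reduce a host name to its short (node) form.
# ${1%%.*} removes the longest suffix that starts at the first dot,
# so a name without a dot is returned unchanged.
short_host() {
    printf '%s\n' "${1%%.*}"
}

# Use $HOSTNAME when the shell sets it, otherwise ask the system.
node=$(short_host "${HOSTNAME:-$(hostname 2>/dev/null || uname -n)}")
echo "short host name: $node"
```

With this, both an FQDN in $HOSTNAME and a bare node name from `hostname` end
up as the same short name.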

> - Thanks for the handling of SGE_STARTER_SHELL_START_MODE, I'll check
> that in.

I'm still not one hundred percent sure this is right; I'll keep testing it.

-Alastair


_______________________________________________
Dmtcp-forum mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
