I've been working on integrating DMTCP into Grid Engine.  I've gotten fairly 
far, but I get the following warning when trying to restart a job that was 
checkpointed on a different host:

[16690] WARNING at connection.cpp:1237 in openFile; REASON='JWARNING(false) failed'
      _path = /var/spool/gridengine/pollux/job_scripts/27709
Message: Still waiting for the file to be created/restored by some other process

This is the shell script being executed by the job, e.g.:

bash    17202 orion  255r   REG     8,2       64   529009 /var/spool/gridengine/castor/job_scripts/27710

I can think of a couple ways to handle this:

- Before starting the original job, copy the job script to a location on a 
shared network filesystem, so that its path is the same on every machine.

- Perhaps a DMTCP plugin that would transform the path on restart?  Is that 
possible?
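
To make the two options above concrete, here is a rough Python sketch of the 
logic only.  It is not DMTCP or Grid Engine code: the function names are 
hypothetical, and the spool layout 
(/var/spool/gridengine/<host>/job_scripts/<job_id>) is assumed from the paths 
shown above.  Whether DMTCP offers a hook where the second function could be 
applied at restart is exactly the open question.

```python
import re
import shutil
from pathlib import Path

# Assumed layout, inferred from the paths in the warning and file listing:
#   /var/spool/gridengine/<host>/job_scripts/<job_id>
SPOOL_RE = re.compile(r"^(/var/spool/gridengine/)([^/]+)(/job_scripts/[^/]+)$")


def stage_job_script(script, shared_dir):
    """Option 1 (hypothetical helper): copy the job script into a directory
    that is mounted at the same path on every host, and return the new path."""
    dest_dir = Path(shared_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / Path(script).name
    shutil.copy2(script, dest)  # copy2 also preserves the mode bits
    return str(dest)


def remap_spool_path(path, local_host):
    """Option 2 (hypothetical helper): replace the host component of a
    job script path with the name of the host doing the restart."""
    m = SPOOL_RE.match(path)
    if m is None:
        return path  # not a job script path; leave it unchanged
    return f"{m.group(1)}{local_host}{m.group(3)}"
```

For example, remap_spool_path("/var/spool/gridengine/pollux/job_scripts/27709", 
"castor") returns "/var/spool/gridengine/castor/job_scripts/27709".  The 
script would of course still have to exist at that path on the restart host.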

Any other ideas?

-- 
Orion Poplawski
Technical Manager                     303-415-9701 x222
NWRA, Boulder Office                  FAX: 303-415-9702
3380 Mitchell Lane                       [email protected]
Boulder, CO 80301                   http://www.nwra.com
