I've been working on integrating DMTCP into Grid Engine. I've gotten fairly
far, but I get the following warning when trying to restart a job that was
checkpointed on a different host:
[16690] WARNING at connection.cpp:1237 in openFile; REASON='JWARNING(false) failed'
_path = /var/spool/gridengine/pollux/job_scripts/27709
Message: Still waiting for the file to be created/restored by some other process
This is the shell script being executed by the job, e.g.:
bash 17202 orion 255r REG 8,2 64 529009 /var/spool/gridengine/castor/job_scripts/27710
I can think of a couple of ways to handle this:
- Copy the job script to a location on a shared network filesystem before
starting the original job, so that its path is the same on every machine.
- Perhaps a DMTCP plugin that would transform the file name? Is that possible?
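
For the first option, here is a minimal sketch of what I have in mind, in
Python. The function name stage_job_script and the shared directory are
hypothetical, not part of Grid Engine or DMTCP. It assumes the shared
directory is mounted at the same path on every execution host, and that the
job is then started from the staged copy rather than the spool copy.

```python
import os
import shutil


def stage_job_script(script_path, shared_dir):
    """Copy a host-local job script into shared_dir and return the new path.

    Hypothetical helper: shared_dir is assumed to be on a network filesystem
    that is mounted at the same location on every execution host.
    """
    os.makedirs(shared_dir, exist_ok=True)
    # Keep the original file name (the job id) so staged copies do not collide.
    staged = os.path.join(shared_dir, os.path.basename(script_path))
    # copy2 copies the contents plus permission bits and timestamps.
    shutil.copy2(script_path, staged)
    return staged
```

The job wrapper would then run the returned path under DMTCP, so the file
the checkpoint records as open is one that also exists on the restart host.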
Any other ideas?
--
Orion Poplawski
Technical Manager 303-415-9701 x222
NWRA, Boulder Office FAX: 303-415-9702
3380 Mitchell Lane [email protected]
Boulder, CO 80301 http://www.nwra.com
_______________________________________________
Dmtcp-forum mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dmtcp-forum