[galaxy-dev] SLURM timeouts

Sytchev, Ilya Fri, 24 Oct 2014 12:13:19 -0700

Hi,

I'm running a fork of galaxy-central latest_2014.08.11. The instance is
configured to run jobs on a SLURM cluster. The problem is that the SLURM
controller sometimes becomes too busy, which results in errors like:


galaxy.jobs.runners.drmaa INFO 2014-10-23 21:10:47,768 (1813/22896754) job
left DRM queue with following message: code 1: slurm_load_jobs error:
Socket timed out on send/recv operation,job_id: 22896754


This causes Galaxy to assume that the job has failed:

galaxy.jobs.runners ERROR 2014-10-23 21:10:47,881 (1813/22896754) Job
output not returned from cluster: [Errno 2] No such file or directory:
'/n/regal/stemcellcommons/galaxy-stage/job_working_directory/001/1813/galaxy_1813.o'


This happens with both galaxy.jobs.runners.drmaa:DRMAAJobRunner and
galaxy.jobs.runners.slurm:SlurmJobRunner. Is there any way to handle this
condition in Galaxy?
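
To illustrate the kind of handling I have in mind: the status check could treat this particular error as transient and retry a few times before concluding that the job has left the queue. Below is a minimal sketch in Python. It is not Galaxy's actual runner code: check_state, is_transient and check_state_with_retry are hypothetical names, and I am assuming the failure surfaces as an exception whose message contains the text from the log above.

```python
import time

# Assumption: the transient failure can be recognized by this substring,
# copied from the slurm_load_jobs message in the log above.
TRANSIENT_MARKERS = ("Socket timed out on send/recv operation",)


def is_transient(error_message):
    """Return True if the message looks like a busy-controller timeout."""
    return any(marker in error_message for marker in TRANSIENT_MARKERS)


def check_state_with_retry(check_state, job_id, retries=3, delay=5.0,
                           sleep=time.sleep):
    """Call check_state(job_id), retrying only on transient errors.

    check_state is a hypothetical callable standing in for whatever the
    runner uses to query the DRM; it is not a real Galaxy or DRMAA function.
    """
    last_error = None
    for attempt in range(retries + 1):
        try:
            return check_state(job_id)
        except RuntimeError as exc:
            # Anything that is not a known transient error is re-raised
            # immediately so real failures are still reported.
            if not is_transient(str(exc)):
                raise
            last_error = exc
            if attempt < retries:
                sleep(delay)
    # Every attempt hit the transient error; give up and report it.
    raise last_error
```

With something like this, the job would only be marked as failed after the controller has been unreachable for several consecutive checks, rather than on the first timeout.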

Thanks,
Ilya


