Hi, I'm running a fork of galaxy-central latest_2014.08.11. The instance is configured to run jobs on a SLURM cluster. The problem is that the SLURM controller sometimes becomes too busy, which results in errors like:
galaxy.jobs.runners.drmaa INFO 2014-10-23 21:10:47,768 (1813/22896754) job left DRM queue with following message: code 1: slurm_load_jobs error: Socket timed out on send/recv operation,job_id: 22896754

This causes Galaxy to assume that the job has failed:

galaxy.jobs.runners ERROR 2014-10-23 21:10:47,881 (1813/22896754) Job output not returned from cluster: [Errno 2] No such file or directory: '/n/regal/stemcellcommons/galaxy-stage/job_working_directory/001/1813/galaxy_1813.o'

This happens with both galaxy.jobs.runners.drmaa:DRMAAJobRunner and galaxy.jobs.runners.slurm:SlurmJobRunner. Is there any way to handle this condition in Galaxy?

Thanks,
Ilya
