Thanks, Remy. I went through the cluster documentation and our Rocks environment seems to be configured properly, after all.
It appears that my issue may be related to the UCSC Main table browser. The jobs that Galaxy reports as failed are leaving the job_working_directory behind, with galaxy_#.e error files that contain "The remote data source application has not sent back a URL parameter in the request."

[root@campusrocks2 7045]# pwd
/campusdata/galaxy/galaxy/database/job_working_directory/007/7045
[root@campusrocks2 7045]# cat galaxy_7045.e
The remote data source application has not sent back a URL parameter in the request.

These errors correspond with empty dataset_#.dat files in /campusdata/galaxy/galaxy/database/files/011/:

[root@campusrocks2 7045]# ll /campusdata/galaxy/galaxy/database/files/011/
-rw-rw-r-- 1 galaxy galaxy 0 Jan 20 11:54 dataset_11387.dat

The job failures are intermittent. Sometimes, a job requesting the exact same dataset will succeed moments before or after a failed job. Is there perhaps a way to tell the table browser to retry when it fails to get the dataset it is requesting? Is that even what's going on?

On Wed, Jan 20, 2016 at 7:05 AM, Rémy Dernat <remy...@gmail.com> wrote:

> I forgot to point out the needs of sharing folders and checking the
> UID/GID of the galaxy user between your systems (and his access to SGE).
>
> Remy
>
> 2016-01-20 16:00 GMT+01:00 Rémy Dernat <remy...@gmail.com>:
>
>> Hi Eric,
>>
>> Here we use both solutions: Galaxy and RocksCluster. In Galaxy, you have
>> to define your jobs in "config/job_conf.xml" and you should probably source
>> a file (search for "environment" in your galaxy.ini) before the submit
>> process. In fact, you could have to set a DRMAA_LIBRARY_PATH to load your
>> drmaa library; see
>> https://wiki.galaxyproject.org/Admin/Config/Performance/Cluster#DRMAA
>>
>> Best,
>> Remy
>>
>> 2016-01-19 20:28 GMT+01:00 Eric Shell <esh...@soe.ucsc.edu>:
>>
>>> I am trying to get a Galaxy instance running on a Rocks cluster.
>>> I am able to run jobs with the local runner at this point, but I am having an
>>> issue with the drmaa runner that I haven't been able to fix. When I submit
>>> a job in Galaxy it is successfully submitted to the cluster and runs to
>>> completion according to qacct, but Galaxy just reports "failure running
>>> job".
>>>
>>> Here's what is written to paster.log when I submit a job:
>>>
>>> 69.181.235.240 - - [19/Jan/2016:11:24:31 -0700] "GET
>>>> /api/histories/fb86c918c0d3d33b/contents?dataset_details=bae154fe2294752e%2C6fe732485990d2ac%2C604c4e6e60e997bc%2Cf015f1cb819ec50e%2C9f6f4b3cb6cf43eb%2C3d13d598882b6eb8%2C551006fddcb290ae%2C10b9bbc646c48387%2C7670dfdf35146bc5%2Ce0ec2cf59f1fc79e%2Cee30922e5e4854db%2C9e7a0ba216194210
>>>> HTTP/1.1" 200 - "https://galaxy.soe.ucsc.edu/" "Mozilla/5.0 (Windows
>>>> NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)
>>>> Chrome/47.0.2526.111 Safari/537.36"
>>>> 69.181.235.240 - - [19/Jan/2016:11:24:38 -0700] "GET
>>>> /tool_runner/data_source_redirect?tool_id=ucsc_table_direct1 HTTP/1.1" 302
>>>> - "https://galaxy.soe.ucsc.edu/" "Mozilla/5.0 (Windows NT 10.0; Win64;
>>>> x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.111
>>>> Safari/537.36"
>>>> galaxy.tools.actions.__init__ INFO 2016-01-19 11:24:42,801 Handled
>>>> output (327.778 ms)
>>>> galaxy.tools.actions.__init__ INFO 2016-01-19 11:24:43,236 Verified
>>>> access to datasets (0.023 ms)
>>>> galaxy.tools.execute DEBUG 2016-01-19 11:24:43,343 Tool
>>>> [ucsc_table_direct1] created job [7019] (919.481 ms)
>>>> 69.181.235.240 - - [19/Jan/2016:11:24:42 -0700] "POST /tool_runner
>>>> HTTP/1.1" 200 - "https://genome.ucsc.edu/cgi-bin/hgTables"
>>>> "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like
>>>> Gecko) Chrome/47.0.2526.111 Safari/537.36"
>>>> galaxy.jobs DEBUG 2016-01-19 11:24:44,056 (7019) Working directory for
>>>> job is: /campusdata/galaxy/galaxy/database/job_working_directory/007/7019
>>>> galaxy.jobs.handler DEBUG 2016-01-19 11:24:44,070 (7019) Dispatching to
>>>> sge runner
>>>> galaxy.jobs DEBUG 2016-01-19 11:24:44,378 (7019) Persisting job
>>>> destination (destination id: sge_default)
>>>> galaxy.jobs.runners DEBUG 2016-01-19 11:24:44,403 Job [7019] queued
>>>> (332.423 ms)
>>>> galaxy.jobs.handler INFO 2016-01-19 11:24:44,444 (7019) Job dispatched
>>>> 69.181.235.240 - - [19/Jan/2016:11:24:44 -0700] "GET /api/genomes
>>>> HTTP/1.1" 200 - "https://galaxy.soe.ucsc.edu/" "Mozilla/5.0 (Windows
>>>> NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)
>>>> Chrome/47.0.2526.111 Safari/537.36"
>>>> 69.181.235.240 - - [19/Jan/2016:11:24:44 -0700] "GET
>>>> /api/datatypes?extension_only=False& HTTP/1.1" 200 - "
>>>> https://galaxy.soe.ucsc.edu/" "Mozilla/5.0 (Windows NT 10.0; Win64;
>>>> x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.111
>>>> Safari/537.36"
>>>> 69.181.235.240 - - [19/Jan/2016:11:24:44 -0700] "GET
>>>> /history/current_history_json HTTP/1.1" 200 - "
>>>> https://galaxy.soe.ucsc.edu/" "Mozilla/5.0 (Windows NT 10.0; Win64;
>>>> x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.111
>>>> Safari/537.36"
>>>> galaxy.jobs.runners DEBUG 2016-01-19 11:24:46,399 (7019) command is:
>>>> python /campusdata/galaxy/galaxy/tools/data_source/data_source.py
>>>> /campusdata/galaxy/galaxy/database/files/011/dataset_11361.dat 0;
>>>> return_code=$?; python
>>>> "/campusdata/galaxy/galaxy/database/job_working_directory/007/7019/set_metadata_IaPURP.py"
>>>> "/campusdata/galaxy/galaxy/database/tmp/tmp9Qt0cv"
>>>> "/campusdata/galaxy/galaxy/database/job_working_directory/007/7019/galaxy.json"
>>>> "/campusdata/galaxy/galaxy/database/job_working_directory/007/7019/metadata_in_HistoryDatasetAssociation_13512_oucw5s,/campusdata/galaxy/galaxy/database/job_working_directory/007/7019/metadata_kwds_HistoryDatasetAssociation_13512_ZrUbrF,/campusdata/galaxy/galaxy/database/job_working_directory/007/7019/metadata_out_HistoryDatasetAssociation_13512_twCvq7,/campusdata/galaxy/galaxy/database/job_working_directory/007/7019/metadata_results_HistoryDatasetAssociation_13512_FO1cy9,/campusdata/galaxy/galaxy/database/files/011/dataset_11361.dat,/campusdata/galaxy/galaxy/database/job_working_directory/007/7019/metadata_override_HistoryDatasetAssociation_13512_Z_cUTF"
>>>> 5242880; sh -c "exit $return_code"
>>>> galaxy.jobs.runners.drmaa DEBUG 2016-01-19 11:24:46,787 (7019)
>>>> submitting file
>>>> /campusdata/galaxy/galaxy/database/job_working_directory/007/7019/galaxy_7019.sh
>>>> galaxy.jobs.runners.drmaa DEBUG 2016-01-19 11:24:46,808 (7019) native
>>>> specification is: -R y -pe mpi 8 -q small.q
>>>> galaxy.jobs DEBUG 2016-01-19 11:24:46,828 (7019) Changing ownership of
>>>> working directory with: /usr/bin/sudo -E scripts/external_chown_script.py
>>>> /campusdata/galaxy/galaxy/database/job_working_directory/007/7019 eshell
>>>> 100000
>>>> galaxy.jobs.runners.drmaa DEBUG 2016-01-19 11:24:47,020 (7019)
>>>> submitting with credentials: eshell [uid: 38559]
>>>> galaxy.jobs.runners.drmaa DEBUG 2016-01-19 11:24:47,129 (7019) Job
>>>> script for external submission is:
>>>> /campusdata/galaxy/galaxy/database/pbs/7019.jt_json
>>>> galaxy.jobs.runners.drmaa INFO 2016-01-19 11:24:47,130 Running command
>>>> ['/usr/bin/sudo', '-E', 'scripts/drmaa_external_runner.py', '38559',
>>>> '/campusdata/galaxy/galaxy/database/pbs/7019.jt_json']
>>>> galaxy.jobs.runners.drmaa INFO 2016-01-19 11:24:47,981 (7019) queued as
>>>> 116563
>>>> galaxy.jobs DEBUG 2016-01-19 11:24:48,198 (7019) Persisting job
>>>> destination (destination id: sge_default)
>>>> galaxy.jobs.runners.drmaa DEBUG 2016-01-19 11:24:48,823 (7019/116563)
>>>> state change: job is queued and active
>>>> 69.181.235.240 - - [19/Jan/2016:11:24:45 -0700] "GET
>>>> /api/histories/fb86c918c0d3d33b/contents?dataset_details=bae154fe2294752e%2C6fe732485990d2ac%2C604c4e6e60e997bc%2Cf015f1cb819ec50e%2C9f6f4b3cb6cf43eb%2C3d13d598882b6eb8%2C551006fddcb290ae%2C10b9bbc646c48387%2C7670dfdf35146bc5%2Ce0ec2cf59f1fc79e%2Cee30922e5e4854db%2C9e7a0ba216194210
>>>> HTTP/1.1" 200 - "https://galaxy.soe.ucsc.edu/" "Mozilla/5.0 (Windows
>>>> NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)
>>>> Chrome/47.0.2526.111 Safari/537.36"
>>>> galaxy.jobs.runners.drmaa DEBUG 2016-01-19 11:24:53,532 (7019/116563)
>>>> state change: job is running
>>>> galaxy.jobs WARNING 2016-01-19 11:24:53,922 (7019) Ignoring state
>>>> change from 'error' to 'running' for job that is already terminal
>>>> 69.181.235.240 - - [19/Jan/2016:11:24:54 -0700] "GET
>>>> /api/histories/fb86c918c0d3d33b/contents HTTP/1.1" 200 - "
>>>> https://galaxy.soe.ucsc.edu/" "Mozilla/5.0 (Windows NT 10.0; Win64;
>>>> x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.111
>>>> Safari/537.36"
>>>> 69.181.235.240 - - [19/Jan/2016:11:24:55 -0700] "GET
>>>> /api/histories/fb86c918c0d3d33b HTTP/1.1" 200 - "
>>>> https://galaxy.soe.ucsc.edu/" "Mozilla/5.0 (Windows NT 10.0; Win64;
>>>> x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.111
>>>> Safari/537.36"
>>>> galaxy.jobs.runners.drmaa INFO 2016-01-19 11:25:06,240 (7019/116563)
>>>> job left DRM queue with following message: code 18: The job specified by
>>>> the 'jobid' does not exist.
>>>> galaxy.jobs DEBUG 2016-01-19 11:25:06,412 (7019) Changing ownership of
>>>> working directory with: /usr/bin/sudo -E scripts/external_chown_script.py
>>>> /campusdata/galaxy/galaxy/database/job_working_directory/007/7019 galaxy
>>>> 59997
>>>> galaxy.jobs DEBUG 2016-01-19 11:25:06,622 (7019) Changing ownership of
>>>> working directory with: /usr/bin/sudo -E scripts/external_chown_script.py
>>>> /campusdata/galaxy/galaxy/database/job_working_directory/007/7019 galaxy
>>>> 59997
>>>
>>>
>>> 'qacct -j -o eshell' shows that job 7019 completed, though:
>>>
>>> qname all.q
>>>> hostname campusrocks2-0-4.local
>>>> group users
>>>> owner eshell
>>>> project NONE
>>>> department defaultdepartment
>>>> jobname g7019_ucsc_table_direct1_eshell_ucsc_edu
>>>> jobnumber 116563
>>>> taskid undefined
>>>> account sge
>>>> priority 0
>>>> qsub_time Tue Jan 19 11:24:47 2016
>>>> start_time Tue Jan 19 11:24:53 2016
>>>> end_time Tue Jan 19 11:25:05 2016
>>>> granted_pe mpi
>>>> slots 8
>>>> failed 0
>>>> exit_status 0
>>>> ru_wallclock 12
>>>> ru_utime 9.285
>>>> ru_stime 0.908
>>>> ru_maxrss 98384
>>>> ru_ixrss 0
>>>> ru_ismrss 0
>>>> ru_idrss 0
>>>> ru_isrss 0
>>>> ru_minflt 81778
>>>> ru_majflt 2
>>>> ru_nswap 0
>>>> ru_inblock 6728
>>>> ru_oublock 184
>>>> ru_msgsnd 0
>>>> ru_msgrcv 0
>>>> ru_nsignals 0
>>>> ru_nvcsw 13952
>>>> ru_nivcsw 301
>>>> cpu 10.192
>>>> mem 2.326
>>>> io 0.150
>>>> iow 0.000
>>>> maxvmem 448.820M
>>>> arid undefined
>>>
>>>
>>> Why does Galaxy not see the job after it has been submitted to the
>>> cluster?
>>>
>>> Thanks in advance for your help!
>>>
>>> --
>>> Eric Shell
>>> UNIX Software & Google Apps Administrator
>>> Baskin School of Engineering
>>> UC Santa Cruz
>>> 831 459 4919
>>>
>>> ___________________________________________________________
>>> Please keep all replies on the list by using "reply all"
>>> in your mail client. To manage your subscriptions to this
>>> and other Galaxy lists, please use the interface at:
>>> https://lists.galaxyproject.org/
>>>
>>> To search Galaxy mailing lists use the unified search at:
>>> http://galaxyproject.org/search/mailinglists/
>>>
>>
>>
>

--
Eric Shell
UNIX Software & Google Apps Administrator
Baskin School of Engineering
UC Santa Cruz
831 459 4919
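
P.S. In case it helps anyone look for the same pattern, here is a rough sketch of one way to pick out the affected jobs. It walks the job working directories and lists the job IDs whose galaxy_#.e file contains the message quoted at the top of this mail. The names find_failed_jobs and ERROR_TEXT are made up for illustration and are not part of Galaxy; the sketch assumes Python 3 and the galaxy_<job id>.e naming seen in the directories above.

```python
import os
import re

# The message found in the galaxy_#.e files of the failed jobs.
ERROR_TEXT = ("The remote data source application has not sent back "
              "a URL parameter in the request.")


def find_failed_jobs(root, error_text=ERROR_TEXT):
    """Return sorted job IDs whose galaxy_<id>.e file contains error_text.

    Hypothetical helper: it only reads files below `root` and assumes the
    error files are named galaxy_<job id>.e.
    """
    pattern = re.compile(r"^galaxy_(\d+)\.e$")
    failed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            match = pattern.match(name)
            if match is None:
                continue
            path = os.path.join(dirpath, name)
            # Tolerate odd bytes in the error file rather than crash.
            with open(path, errors="replace") as handle:
                if error_text in handle.read():
                    failed.append(int(match.group(1)))
    return sorted(failed)
```

Pointing it at /campusdata/galaxy/galaxy/database/job_working_directory should list 7045 as long as that directory is still in place. It says nothing about why the URL parameter went missing; it only makes the intermittent failures easier to count.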