Re: [galaxy-dev] Galaxy Hang after DrmCommunicationException

2012-01-20 Thread Nate Coraor
On Jan 11, 2012, at 12:18 PM, Ann Black wrote:

 Good Morning galaxy group!
 
 I was hoping that someone might have some ideas on a problem we have 
 experienced a handful of times running galaxy on our local cluster.  
 
 Occasionally we experience some communication timeouts between our cluster 
 head node and a compute node which will self heal. However, this in turn will 
 hang galaxy.  Below you will see output from our galaxy log file.  When the 
 ERROR happens (which is not often) it consistently seems to hang galaxy.  We 
 have to kill it off and restart it. We are running galaxy as a single PID at 
 this time (we are still just testing it out, etc) and it is running on our 
 head node (which we plan to move off of in the future). 
 
 galaxy.jobs.runners.drmaa DEBUG 2012-01-10 19:19:58,800 (1654/698075) state 
 change: job is running
 galaxy.jobs.runners.drmaa ERROR 2012-01-10 20:57:47,021 (1654/698075) Unable 
 to check job status
 Traceback (most recent call last):
   File "/data/galaxy-dist/lib/galaxy/jobs/runners/drmaa.py", line 236, in check_watched_items
     state = self.ds.jobStatus( job_id )
   File "/data/galaxy-dist/eggs/drmaa-0.4b3-py2.7.egg/drmaa/__init__.py", line 522, in jobStatus
     _h.c(_w.drmaa_job_ps, jobName, _ct.byref(status))
   File "/data/galaxy-dist/eggs/drmaa-0.4b3-py2.7.egg/drmaa/helpers.py", line 213, in c
     return f(*(args + (error_buffer, sizeof(error_buffer))))
   File "/data/galaxy-dist/eggs/drmaa-0.4b3-py2.7.egg/drmaa/errors.py", line 90, in error_check
     raise _ERRORS[code-1]("code %s: %s" % (code, error_buffer.value))
 DrmCommunicationException: code 2: failed receiving gdi request response for mid=24442 (got syncron message receive timeout error).
 galaxy.jobs.runners.drmaa WARNING 2012-01-10 20:58:05,090 (1654/698075) job 
 will now be errored
 galaxy.jobs.runners.drmaa DEBUG 2012-01-10 20:59:06,396 (1654/698075) User 
 killed running job, but error encountered removing from DRM queue: code 2: 
 failed receiving gdi request response for mid=2 (got syncron message 
 receive timeout error).
 galaxy.datatypes.metadata DEBUG 2012-01-10 20:59:06,896 Cleaning up external 
 metadata files
 galaxy.datatypes.metadata DEBUG 2012-01-10 20:59:06,947 Failed to cleanup 
 MetadataTempFile temp files from 
 database/tmp/metadata_out_HistoryDatasetAssociation_2913_ZUTgBy: No JSON 
 object could be decoded: line 1 column 0 (char 0)
 galaxy.datatypes.metadata DEBUG 2012-01-10 20:59:09,640 Cleaning up external 
 metadata files
 galaxy.jobs INFO 2012-01-10 20:59:09,697 job 1656 unable to run: one or more 
 inputs in error state
 galaxy.datatypes.metadata DEBUG 2012-01-10 20:59:10,121 Cleaning up external 
 metadata files
 galaxy.jobs INFO 2012-01-10 20:59:10,159 job 1655 unable to run: one or more 
 inputs in error state
 galaxy.datatypes.metadata DEBUG 2012-01-10 20:59:12,076 Cleaning up external 
 metadata files
 galaxy.jobs INFO 2012-01-10 20:59:12,126 job 1657 unable to run: one or more 
 inputs in error state
 galaxy.datatypes.metadata DEBUG 2012-01-10 20:59:13,601 Cleaning up external 
 metadata files
 galaxy.jobs INFO 2012-01-10 20:59:13,650 job 1658 unable to run: one or more 
 inputs in error state
 
 
 Has anyone else experienced this or have some ideas on how we can further 
 debug to figure out why galaxy hangs?

Hi Ann,

The cause of the exception aside, this should be caught by the except block 
below it in drmaa.py (in check_watched_items()):

except Exception, e:
    # so we don't kill the monitor thread
    log.exception( "(%s/%s) Unable to check job status" % ( galaxy_job_id, job_id ) )

What changeset are you running?

--nate

 
 Thanks much!
 
 Ann Black-Ziegelbein




Re: [galaxy-dev] Galaxy Hang after DrmCommunicationException

2012-01-20 Thread Edward Kirton
yes, nate, but that fails the job even though it is, in fact, still running,
and the error should be ignored

except Exception, e:
    # so we don't kill the monitor thread
    log.exception( "(%s/%s) Unable to check job status" % ( galaxy_job_id, job_id ) )
    log.warning( "(%s/%s) job will now be errored" % ( galaxy_job_id, job_id ) )
    drm_job_state.fail_message = "Cluster could not complete job"
    self.work_queue.put( ( 'fail', drm_job_state ) )
    continue
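
for reference, a condensed sketch combining that with the fix discussed further down this digest (Shantanu's pull request): catch the communication timeout first and keep the job on the watch list, and let only other exceptions take the failure path. names follow the surrounding drmaa.py excerpt; this is a sketch, not the committed code.

{{{
try:
    state = self.ds.jobStatus( job_id )
except ( drmaa.DrmCommunicationException ), e:
    # scheduler was too busy to answer; the job is most likely still
    # running, so keep watching it instead of failing it
    log.warning( "(%s/%s) Couldn't check job status: %s" % ( galaxy_job_id, job_id, e ) )
    new_watched.append( drm_job_state )
    continue
except Exception, e:
    # anything else still takes the failure path (and doesn't kill the monitor thread)
    log.exception( "(%s/%s) Unable to check job status" % ( galaxy_job_id, job_id ) )
    drm_job_state.fail_message = "Cluster could not complete job"
    self.work_queue.put( ( 'fail', drm_job_state ) )
    continue
}}}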

On Fri, Jan 20, 2012 at 9:40 AM, Nate Coraor n...@bx.psu.edu wrote:

 Hi Ann,

 The cause of the exception aside, this should be caught by the except block 
 below it in drmaa.py (in check_watched_items()):

            except Exception, e:
                # so we don't kill the monitor thread
                log.exception( "(%s/%s) Unable to check job status" % ( galaxy_job_id, job_id ) )

 What changeset are you running?

 --nate



Re: [galaxy-dev] Galaxy Hang after DrmCommunicationException

2012-01-19 Thread Shantanu Pavgi

I haven't added any such delay after DrmCommunicationException; however, I have 
thought about increasing the polling delay in the job monitor function itself (a 
site-specific change, not something I'm suggesting for the blessed galaxy repo).

https://bitbucket.org/galaxy/galaxy-central/src/cb8d6a29ea22/lib/galaxy/jobs/runners/drmaa.py#cl-259
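
For illustration only, a standalone sketch of that knob (hypothetical names; Galaxy's actual monitor method differs by changeset): poll the DRM less often, so a busy scheduler fields fewer status queries per minute.

{{{
import time

MONITOR_SLEEP = 15  # seconds between polling passes; the stock interval is much shorter

def monitor_loop( check_watched_items, keep_running ):
    while keep_running():
        check_watched_items()        # one pass over the watched jobs
        time.sleep( MONITOR_SLEEP )  # a longer pause is gentler on the scheduler
}}}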

--
Shantanu

On Jan 18, 2012, at 5:25 PM, Edward Kirton wrote:

perhaps a 15sec sleep should also be added, since the scheduler is overwhelmed

{{{
except ( drmaa.DrmCommunicationException ), e:
    # Catch drmaa communication exception, log a warning message and
    # continue to monitor jobs.
    log.warning( "(%s/%s) Couldn't communicate with the cluster scheduler to check job status." % ( galaxy_job_id, job_id ) )
    # give scheduler time to catch up
    time.sleep( 15 )
    # keep old job state
    new_watched.append( drm_job_state )
    continue
}}}
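
One caveat with that exact placement (my note, not something raised above): the sleep sits inside the per-job loop, so a pass in which many watched jobs hit the timeout stalls the monitor thread 15 sec per job. A variant that sleeps at most once per pass could look like this (the flag name is hypothetical):

{{{
# at the top of check_watched_items(), before the for-loop:
slept_this_pass = False

# then inside the loop's try/except:
except ( drmaa.DrmCommunicationException ), e:
    log.warning( "(%s/%s) Couldn't communicate with the cluster scheduler to check job status." % ( galaxy_job_id, job_id ) )
    if not slept_this_pass:
        # give scheduler time to catch up, but only once per pass
        time.sleep( 15 )
        slept_this_pass = True
    # keep old job state
    new_watched.append( drm_job_state )
    continue
}}}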


On Wed, Jan 18, 2012 at 1:07 PM, Edward Kirton eskir...@lbl.gov wrote:
excellent, thanks for the correction!


On Wed, Jan 18, 2012 at 8:59 AM, Shantanu Pavgi pa...@uab.edu wrote:

Ed,

I think you may want to add the job that hit the DrmCommunicationException error 
back into the watched jobs list.
{{{
except ( drmaa.DrmCommunicationException ), e:
    # Catch drmaa communication exception, log a warning message and
    # continue to monitor jobs.
    log.warning( "(%s/%s) Couldn't communicate with the cluster scheduler to check job status." % ( galaxy_job_id, job_id ) )
    # keep old job state
    new_watched.append( drm_job_state )
    continue
}}}

Here is a patch and related discussion on it: 
https://bitbucket.org/galaxy/galaxy-central/pull-request/23/catch-drmcommunication-exception

--
Shantanu


On Jan 13, 2012, at 2:11 PM, Edward Kirton wrote:

i had seen the job process die with this error:

if state != old_state:
UnboundLocalError: local variable 'state' referenced before assignment

since the DRM timeout is an intermittent error, i'm not absolutely positive i 
have nailed it (a bit more time will tell -- i'm keeping an eye on it with 
debug log messages), but it seems so.  i intended to share this as a patched 
clone when i became sure but when i read your email i just sent what i had 
immediately.  let us know if that seems to solve the problem for you, so we'll 
have some confirmation.

glad to help,
ed

p.s. i have another patch for "output not returned from cluster" errors that 
i'm also still validating; it's due to an NFS race, since we don't have inode 
metadata caching turned off as the galaxy developers suggest.

On Fri, Jan 13, 2012 at 8:06 AM, Ann Black annbl...@eng.uiowa.edu wrote:
Thanks so much! I have applied the fix to our env.

In looking over the logic, there was an existing exception block that would 
have caught the communication exception generically – but it moved the job 
(erroneously, in this scenario) into a failure workflow.  I would like to 
understand what ended up hanging galaxy – so it must be related to a valid job 
being moved into a failure state? Did you follow it down the rabbit hole by any 
chance to see what caused the hang in your env?

Thanks again,

Ann

From: Edward Kirton eskir...@lbl.gov
Date: Thu, 12 Jan 2012 13:00:27 -0800
To: Ann Black annbl...@eng.uiowa.edu
Cc: galaxy-dev@lists.bx.psu.edu
Subject: Re: [galaxy-dev] Galaxy Hang after DrmCommunicationException

sometimes the scheduler can't keep up with all the work in its 15sec cycle, so 
it doesn't respond to some messages.  here's a fix i've been trying that seems 
to work.

in lib/galaxy/jobs/runners/drmaa.py:

def check_watched_items( self ):
    """
    Called by the monitor thread to look at each watched job and deal
    with state changes.
    """
    new_watched = []
    for drm_job_state in self.watched:
        job_id = drm_job_state.job_id
        galaxy_job_id = drm_job_state.job_wrapper.job_id
        old_state = drm_job_state.old_state
        try:
            state = self.ds.jobStatus( job_id )
        # InternalException was reported to be necessary on some DRMs, but
        # this could cause failures to be detected as completion!  Please
        # report if you experience problems with this.
        except ( drmaa.InvalidJobException, drmaa.InternalException ), e:
            # we should only get here if an orphaned job was put into the queue at app startup
            log.debug( "(%s/%s) job left DRM queue with following message: %s" % ( galaxy_job_id, job_id, e ) )
            self.work_queue.put( ( 'finish', drm_job_state ) )
            continue
        # BEGIN DRM TIMEOUT: Don't fail on scheduler communication error (probably just too busy)
        except ( drmaa.DrmCommunicationException ), e:
            log.warning( "(%s/%s) DRM Communication Exception" % ( galaxy_job_id, job_id ) )
            continue
        # END DRM TIMEOUT
Re: [galaxy-dev] Galaxy Hang after DrmCommunicationException

2012-01-18 Thread Edward Kirton
excellent, thanks for the correction!

On Wed, Jan 18, 2012 at 8:59 AM, Shantanu Pavgi pa...@uab.edu wrote:


  Ed,

  I think you may want to add the job that hit the DrmCommunicationException error
 back into the watched jobs list.
 {{{
 except ( drmaa.DrmCommunicationException ), e:
     # Catch drmaa communication exception, log a warning message and
     # continue to monitor jobs.
     log.warning( "(%s/%s) Couldn't communicate with the cluster scheduler to check job status." % ( galaxy_job_id, job_id ) )
     # keep old job state
     new_watched.append( drm_job_state )
     continue
 }}}

  Here is a patch and related discussion on it:
 https://bitbucket.org/galaxy/galaxy-central/pull-request/23/catch-drmcommunication-exception


  --
 Shantanu


  On Jan 13, 2012, at 2:11 PM, Edward Kirton wrote:

  i had seen the job process die with this error:

  if state != old_state:
 UnboundLocalError: local variable 'state' referenced before assignment

  since the DRM timeout is an intermittent error, i'm not absolutely
 positive i have nailed it (a bit more time will tell -- i'm keeping an eye
 on it with debug log messages), but it seems so.  i intended to share this
 as a patched clone when i became sure but when i read your email i just
 sent what i had immediately.  let us know if that seems to solve the
 problem for you, so we'll have some confirmation.

  glad to help,
 ed

  p.s. i have another patch for "output not returned from cluster" errors
 that i'm also still validating; it's due to an NFS race, since we don't have
 inode metadata caching turned off as the galaxy developers suggest.

 On Fri, Jan 13, 2012 at 8:06 AM, Ann Black annbl...@eng.uiowa.edu wrote:

  Thanks so much! I have applied the fix to our env.

  In looking over the logic, there was an existing exception block that
 would have caught the communication exception generically – but it moved
 the job (erroneously, in this scenario) into a failure workflow.  I
 would like to understand what ended up hanging galaxy – so it must be
 related to a valid job being moved into a failure state? Did you follow it
 down the rabbit hole by any chance to see what caused the hang in your
 env?

  Thanks again,

  Ann

   From: Edward Kirton eskir...@lbl.gov
 Date: Thu, 12 Jan 2012 13:00:27 -0800
 To: Ann Black annbl...@eng.uiowa.edu
 Cc: galaxy-dev@lists.bx.psu.edu
 Subject: Re: [galaxy-dev] Galaxy Hang after DrmCommunicationException

  sometimes the scheduler can't keep up with all the work in its 15sec
 cycle, so it doesn't respond to some messages.  here's a fix i've been
 trying that seems to work.

  in lib/galaxy/jobs/runners/drmaa.py:

  def check_watched_items( self ):
      """
      Called by the monitor thread to look at each watched job and deal
      with state changes.
      """
      new_watched = []
      for drm_job_state in self.watched:
          job_id = drm_job_state.job_id
          galaxy_job_id = drm_job_state.job_wrapper.job_id
          old_state = drm_job_state.old_state
          try:
              state = self.ds.jobStatus( job_id )
          # InternalException was reported to be necessary on some DRMs, but
          # this could cause failures to be detected as completion!  Please
          # report if you experience problems with this.
          except ( drmaa.InvalidJobException, drmaa.InternalException ), e:
              # we should only get here if an orphaned job was put into the queue at app startup
              log.debug( "(%s/%s) job left DRM queue with following message: %s" % ( galaxy_job_id, job_id, e ) )
              self.work_queue.put( ( 'finish', drm_job_state ) )
              continue
          # BEGIN DRM TIMEOUT: Don't fail on scheduler communication error (probably just too busy)
          except ( drmaa.DrmCommunicationException ), e:
              log.warning( "(%s/%s) DRM Communication Exception" % ( galaxy_job_id, job_id ) )
              continue
          # END DRM TIMEOUT

 On Wed, Jan 11, 2012 at 9:18 AM, Ann Black annbl...@eng.uiowa.edu wrote:

  Good Morning galaxy group!

  I was hoping that someone might have some ideas on a problem we have
 experienced a handful of times running galaxy on our local cluster.

  Occasionally we experience some communication timeouts between our
 cluster head node and a compute node which will self heal. However, this in
 turn will hang galaxy.  Below you will see output from our galaxy log file.
  When the ERROR happens (which is not often) it consistently seems to hang
 galaxy.  We have to kill it off and restart it. We are running galaxy as a
 single PID at this time (we are still just testing it out, etc) and it is
 running on our head node (which we plan to move off of in the future).

  galaxy.jobs.runners.drmaa DEBUG 2012-01-10 19:19:58,800 (1654/698075)
 state change: job is running
 galaxy.jobs.runners.drmaa ERROR 2012-01-10 20:57:47,021 (1654/698075)
 Unable

Re: [galaxy-dev] Galaxy Hang after DrmCommunicationException

2012-01-13 Thread Edward Kirton
i had seen the job process die with this error:

if state != old_state:
UnboundLocalError: local variable 'state' referenced before assignment
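
(the mechanism, for anyone hitting the same traceback: if jobStatus() raises and the handler logs the error but falls through without a continue, 'state' is never assigned before the comparison. a tiny standalone reproduction with hypothetical names:)

{{{
def check_once( poll_status, old_state ):
    try:
        state = poll_status()        # raises on a DRM communication timeout
    except Exception, e:
        print "warning: %s" % e      # logged, but execution falls through...
    if state != old_state:           # ...with 'state' never bound
        print "state changed"        # -> UnboundLocalError

def broken_poll():
    raise RuntimeError( "failed receiving gdi request response" )

check_once( broken_poll, "running" )  # dies with UnboundLocalError
}}}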

since the DRM timeout is an intermittent error, i'm not absolutely positive
i have nailed it (a bit more time will tell -- i'm keeping an eye on it
with debug log messages), but it seems so.  i intended to share this as a
patched clone when i became sure but when i read your email i just sent
what i had immediately.  let us know if that seems to solve the problem for
you, so we'll have some confirmation.

glad to help,
ed

p.s. i have another patch for "output not returned from cluster" errors
that i'm also still validating; it's due to an NFS race, since we don't have
inode metadata caching turned off as the galaxy developers suggest.
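
(ed's actual patch isn't shown here; for anyone hitting the same symptom, the general shape of such a workaround is simply to re-check the expected output file for a while before declaring it missing, rather than failing on the first stale stat. a hypothetical sketch:)

{{{
import os, time

def wait_for_output( path, timeout=60, poll=5 ):
    # re-check an expected job output for a while before giving up, to ride
    # out stale NFS attribute caches on the submission host
    deadline = time.time() + timeout
    while time.time() < deadline:
        if os.path.exists( path ):
            return True
        time.sleep( poll )
    return False
}}}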

On Fri, Jan 13, 2012 at 8:06 AM, Ann Black annbl...@eng.uiowa.edu wrote:

 Thanks so much! I have applied the fix to our env.

 In looking over the logic, there was an existing exception block that
 would have caught the communication exception generically – but it moved
 the job (erroneously, in this scenario) into a failure workflow.  I
 would like to understand what ended up hanging galaxy – so it must be
 related to a valid job being moved into a failure state? Did you follow it
 down the rabbit hole by any chance to see what caused the hang in your
 env?

 Thanks again,

 Ann

 From: Edward Kirton eskir...@lbl.gov
 Date: Thu, 12 Jan 2012 13:00:27 -0800
 To: Ann Black annbl...@eng.uiowa.edu
 Cc: galaxy-dev@lists.bx.psu.edu
 Subject: Re: [galaxy-dev] Galaxy Hang after DrmCommunicationException

 sometimes the scheduler can't keep up with all the work in its 15sec
 cycle, so it doesn't respond to some messages.  here's a fix i've been
 trying that seems to work.

 in lib/galaxy/jobs/runners/drmaa.py:

 def check_watched_items( self ):
     """
     Called by the monitor thread to look at each watched job and deal
     with state changes.
     """
     new_watched = []
     for drm_job_state in self.watched:
         job_id = drm_job_state.job_id
         galaxy_job_id = drm_job_state.job_wrapper.job_id
         old_state = drm_job_state.old_state
         try:
             state = self.ds.jobStatus( job_id )
         # InternalException was reported to be necessary on some DRMs, but
         # this could cause failures to be detected as completion!  Please
         # report if you experience problems with this.
         except ( drmaa.InvalidJobException, drmaa.InternalException ), e:
             # we should only get here if an orphaned job was put into the queue at app startup
             log.debug( "(%s/%s) job left DRM queue with following message: %s" % ( galaxy_job_id, job_id, e ) )
             self.work_queue.put( ( 'finish', drm_job_state ) )
             continue
         # BEGIN DRM TIMEOUT: Don't fail on scheduler communication error (probably just too busy)
         except ( drmaa.DrmCommunicationException ), e:
             log.warning( "(%s/%s) DRM Communication Exception" % ( galaxy_job_id, job_id ) )
             continue
         # END DRM TIMEOUT

 On Wed, Jan 11, 2012 at 9:18 AM, Ann Black annbl...@eng.uiowa.edu wrote:

 Good Morning galaxy group!

 I was hoping that someone might have some ideas on a problem we have
 experienced a handful of times running galaxy on our local cluster.

 Occasionally we experience some communication timeouts between our
 cluster head node and a compute node which will self heal. However, this in
 turn will hang galaxy.  Below you will see output from our galaxy log file.
  When the ERROR happens (which is not often) it consistently seems to hang
 galaxy.  We have to kill it off and restart it. We are running galaxy as a
 single PID at this time (we are still just testing it out, etc) and it is
 running on our head node (which we plan to move off of in the future).

 galaxy.jobs.runners.drmaa DEBUG 2012-01-10 19:19:58,800 (1654/698075)
 state change: job is running
 galaxy.jobs.runners.drmaa ERROR 2012-01-10 20:57:47,021 (1654/698075)
 Unable to check job status
 Traceback (most recent call last):
   File "/data/galaxy-dist/lib/galaxy/jobs/runners/drmaa.py", line 236, in check_watched_items
     state = self.ds.jobStatus( job_id )
   File "/data/galaxy-dist/eggs/drmaa-0.4b3-py2.7.egg/drmaa/__init__.py", line 522, in jobStatus
     _h.c(_w.drmaa_job_ps, jobName, _ct.byref(status))
   File "/data/galaxy-dist/eggs/drmaa-0.4b3-py2.7.egg/drmaa/helpers.py", line 213, in c
     return f(*(args + (error_buffer, sizeof(error_buffer))))
   File "/data/galaxy-dist/eggs/drmaa-0.4b3-py2.7.egg/drmaa/errors.py", line 90, in error_check
     raise _ERRORS[code-1]("code %s: %s" % (code, error_buffer.value))
 DrmCommunicationException: code 2: failed receiving gdi request response for mid=24442 (got syncron message receive timeout error).
 galaxy.jobs.runners.drmaa WARNING

Re: [galaxy-dev] Galaxy Hang after DrmCommunicationException

2012-01-12 Thread Edward Kirton
sometimes the scheduler can't keep up with all the work in its 15sec
cycle, so it doesn't respond to some messages.  here's a fix i've been
trying that seems to work.

in lib/galaxy/jobs/runners/drmaa.py:

def check_watched_items( self ):
    """
    Called by the monitor thread to look at each watched job and deal
    with state changes.
    """
    new_watched = []
    for drm_job_state in self.watched:
        job_id = drm_job_state.job_id
        galaxy_job_id = drm_job_state.job_wrapper.job_id
        old_state = drm_job_state.old_state
        try:
            state = self.ds.jobStatus( job_id )
        # InternalException was reported to be necessary on some DRMs, but
        # this could cause failures to be detected as completion!  Please
        # report if you experience problems with this.
        except ( drmaa.InvalidJobException, drmaa.InternalException ), e:
            # we should only get here if an orphaned job was put into the queue at app startup
            log.debug( "(%s/%s) job left DRM queue with following message: %s" % ( galaxy_job_id, job_id, e ) )
            self.work_queue.put( ( 'finish', drm_job_state ) )
            continue
        # BEGIN DRM TIMEOUT: Don't fail on scheduler communication error (probably just too busy)
        except ( drmaa.DrmCommunicationException ), e:
            log.warning( "(%s/%s) DRM Communication Exception" % ( galaxy_job_id, job_id ) )
            continue
        # END DRM TIMEOUT

On Wed, Jan 11, 2012 at 9:18 AM, Ann Black annbl...@eng.uiowa.edu wrote:

 Good Morning galaxy group!

 I was hoping that someone might have some ideas on a problem we have
 experienced a handful of times running galaxy on our local cluster.

 Occasionally we experience some communication timeouts between our cluster
 head node and a compute node which will self heal. However, this in turn
 will hang galaxy.  Below you will see output from our galaxy log file.
  When the ERROR happens (which is not often) it consistently seems to hang
 galaxy.  We have to kill it off and restart it. We are running galaxy as a
 single PID at this time (we are still just testing it out, etc) and it is
 running on our head node (which we plan to move off of in the future).

 galaxy.jobs.runners.drmaa DEBUG 2012-01-10 19:19:58,800 (1654/698075)
 state change: job is running
 galaxy.jobs.runners.drmaa ERROR 2012-01-10 20:57:47,021 (1654/698075)
 Unable to check job status
 Traceback (most recent call last):
   File "/data/galaxy-dist/lib/galaxy/jobs/runners/drmaa.py", line 236, in check_watched_items
     state = self.ds.jobStatus( job_id )
   File "/data/galaxy-dist/eggs/drmaa-0.4b3-py2.7.egg/drmaa/__init__.py", line 522, in jobStatus
     _h.c(_w.drmaa_job_ps, jobName, _ct.byref(status))
   File "/data/galaxy-dist/eggs/drmaa-0.4b3-py2.7.egg/drmaa/helpers.py", line 213, in c
     return f(*(args + (error_buffer, sizeof(error_buffer))))
   File "/data/galaxy-dist/eggs/drmaa-0.4b3-py2.7.egg/drmaa/errors.py", line 90, in error_check
     raise _ERRORS[code-1]("code %s: %s" % (code, error_buffer.value))
 DrmCommunicationException: code 2: failed receiving gdi request response for mid=24442 (got syncron message receive timeout error).
 galaxy.jobs.runners.drmaa WARNING 2012-01-10 20:58:05,090 (1654/698075)
 job will now be errored
 galaxy.jobs.runners.drmaa DEBUG 2012-01-10 20:59:06,396 (1654/698075)
 User killed running job, but error encountered removing from DRM queue:
 code 2: failed receiving gdi request response for mid=2 (got syncron
 message receive timeout error).
 galaxy.datatypes.metadata DEBUG 2012-01-10 20:59:06,896 Cleaning up
 external metadata files
 galaxy.datatypes.metadata DEBUG 2012-01-10 20:59:06,947 Failed to cleanup
 MetadataTempFile temp files from
 database/tmp/metadata_out_HistoryDatasetAssociation_2913_ZUTgBy: No JSON
 object could be decoded: line 1 column 0 (char 0)
 galaxy.datatypes.metadata DEBUG 2012-01-10 20:59:09,640 Cleaning up
 external metadata files
 galaxy.jobs INFO 2012-01-10 20:59:09,697 job 1656 unable to run: one or
 more inputs in error state
 galaxy.datatypes.metadata DEBUG 2012-01-10 20:59:10,121 Cleaning up
 external metadata files
 galaxy.jobs INFO 2012-01-10 20:59:10,159 job 1655 unable to run: one or
 more inputs in error state
 galaxy.datatypes.metadata DEBUG 2012-01-10 20:59:12,076 Cleaning up
 external metadata files
 galaxy.jobs INFO 2012-01-10 20:59:12,126 job 1657 unable to run: one or
 more inputs in error state
 galaxy.datatypes.metadata DEBUG 2012-01-10 20:59:13,601 Cleaning up
 external metadata files
 galaxy.jobs INFO 2012-01-10 20:59:13,650 job 1658 unable to run: one or
 more inputs in error state


 Has anyone else experienced this or have some ideas on how we can further
 debug to figure out why galaxy hangs?

 Thanks much!

 Ann Black-Ziegelbein


[galaxy-dev] Galaxy Hang after DrmCommunicationException

2012-01-11 Thread Ann Black
Good Morning galaxy group!

I was hoping that someone might have some ideas on a problem we have
experienced a handful of times running galaxy on our local cluster.

Occasionally we experience some communication timeouts between our cluster
head node and a compute node which will self heal. However, this in turn
will hang galaxy.  Below you will see output from our galaxy log file.  When
the ERROR happens (which is not often) it consistently seems to hang galaxy.
We have to kill it off and restart it. We are running galaxy as a single PID
at this time (we are still just testing it out, etc) and it is running on
our head node (which we plan to move off of in the future).

galaxy.jobs.runners.drmaa DEBUG 2012-01-10 19:19:58,800 (1654/698075) state
change: job is running
galaxy.jobs.runners.drmaa ERROR 2012-01-10 20:57:47,021 (1654/698075) Unable
to check job status
Traceback (most recent call last):
  File "/data/galaxy-dist/lib/galaxy/jobs/runners/drmaa.py", line 236, in check_watched_items
    state = self.ds.jobStatus( job_id )
  File "/data/galaxy-dist/eggs/drmaa-0.4b3-py2.7.egg/drmaa/__init__.py", line 522, in jobStatus
    _h.c(_w.drmaa_job_ps, jobName, _ct.byref(status))
  File "/data/galaxy-dist/eggs/drmaa-0.4b3-py2.7.egg/drmaa/helpers.py", line 213, in c
    return f(*(args + (error_buffer, sizeof(error_buffer))))
  File "/data/galaxy-dist/eggs/drmaa-0.4b3-py2.7.egg/drmaa/errors.py", line 90, in error_check
    raise _ERRORS[code-1]("code %s: %s" % (code, error_buffer.value))
DrmCommunicationException: code 2: failed receiving gdi request response for
mid=24442 (got syncron message receive timeout error).
galaxy.jobs.runners.drmaa WARNING 2012-01-10 20:58:05,090 (1654/698075) job
will now be errored
galaxy.jobs.runners.drmaa DEBUG 2012-01-10 20:59:06,396 (1654/698075) User
killed running job, but error encountered removing from DRM queue: code 2:
failed receiving gdi request response for mid=2 (got syncron message
receive timeout error).
galaxy.datatypes.metadata DEBUG 2012-01-10 20:59:06,896 Cleaning up external
metadata files
galaxy.datatypes.metadata DEBUG 2012-01-10 20:59:06,947 Failed to cleanup
MetadataTempFile temp files from
database/tmp/metadata_out_HistoryDatasetAssociation_2913_ZUTgBy: No JSON
object could be decoded: line 1 column 0 (char 0)
galaxy.datatypes.metadata DEBUG 2012-01-10 20:59:09,640 Cleaning up external
metadata files
galaxy.jobs INFO 2012-01-10 20:59:09,697 job 1656 unable to run: one or more
inputs in error state
galaxy.datatypes.metadata DEBUG 2012-01-10 20:59:10,121 Cleaning up external
metadata files
galaxy.jobs INFO 2012-01-10 20:59:10,159 job 1655 unable to run: one or more
inputs in error state
galaxy.datatypes.metadata DEBUG 2012-01-10 20:59:12,076 Cleaning up external
metadata files
galaxy.jobs INFO 2012-01-10 20:59:12,126 job 1657 unable to run: one or more
inputs in error state
galaxy.datatypes.metadata DEBUG 2012-01-10 20:59:13,601 Cleaning up external
metadata files
galaxy.jobs INFO 2012-01-10 20:59:13,650 job 1658 unable to run: one or more
inputs in error state


Has anyone else experienced this or have some ideas on how we can further
debug to figure out why galaxy hangs?

Thanks much!

Ann Black-Ziegelbein


___
Please keep all replies on the list by using "reply all"
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:

  http://lists.bx.psu.edu/