Hi Nate,

Many thanks for these ideas - our HPC guys are going to try a few things. 
Hopefully we'll nail the problem and be able to report back in case someone 
else has the same issues.


Best Wishes,
David.

__________________________________
Dr David A. Matthews

Senior Lecturer in Virology
Room E49
Department of Cellular and Molecular Medicine,
School of Medical Sciences
University Walk,
University of Bristol
Bristol.
BS8 1TD
U.K.

Tel. +44 117 3312058
Fax. +44 117 3312091

d.a.matth...@bristol.ac.uk






On 19 Dec 2011, at 15:56, Nate Coraor wrote:

> On Dec 14, 2011, at 6:13 PM, David Matthews wrote:
> 
>> Hi Guys,
>> 
>> Sorry to be a pain but this seems to be getting worse for us. Here are the 
>> latest tracebacks - any suggestions would be gratefully received!!
> 
> Hi David,
> 
> As the MemoryError indicates, the Galaxy process is running out of memory.  
> debug = False is preferable, actually.  I asked because having debug = True 
> could easily result in the behavior you're seeing.
> 
> The pbs code definitely has a memory leak, which I believe is within libtorque 
> or pbs_python.  Because of this, I restart my job runner process when it reaches 
> a certain amount of memory usage.  However, this may not be the cause of your 
> errors.  To figure it out, we'll need to know exactly which thread is 
> consuming the memory.  You may want to enable the heartbeat log and look 
> there to see which threads are active.
> 
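
For anyone following along, here is a rough Python sketch of the two ideas in
the paragraph above: checking whether a process has passed a memory threshold
(the trigger for a restart) and listing what every thread is doing at that
moment. It is not Galaxy's heartbeat code or Nate's restart script. The names
over_memory_limit, dump_thread_stacks and LIMIT_KB, and the 4 GB figure, are
made up for illustration. It also assumes Linux, where ru_maxrss is reported in
kilobytes, and note that ru_maxrss is the peak resident size rather than the
current one.

```python
import resource
import sys
import threading
import traceback

# Hypothetical limit for illustration: 4 GB expressed in kilobytes.
LIMIT_KB = 4 * 1024 * 1024


def over_memory_limit(limit_kb, peak_kb=None):
    """Return True if this process's peak resident memory exceeds limit_kb.

    ru_maxrss is the peak (not current) resident set size; it is reported
    in kilobytes on Linux, which is what this sketch assumes.
    """
    if peak_kb is None:
        peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return peak_kb > limit_kb


def dump_thread_stacks():
    """Return a text report with the current stack of every live thread."""
    # Map thread identifiers to names so the report is readable.
    names = dict((t.ident, t.name) for t in threading.enumerate())
    report = []
    for ident, frame in sys._current_frames().items():
        report.append("Thread %s (%s):" % (ident, names.get(ident, "unknown")))
        report.extend(line.rstrip() for line in traceback.format_stack(frame))
    return "\n".join(report)


if __name__ == "__main__":
    if over_memory_limit(LIMIT_KB):
        sys.stderr.write("memory limit exceeded, a restart would go here\n")
    sys.stderr.write(dump_thread_stacks() + "\n")
```

A supervisor loop could call over_memory_limit() periodically, write the
output of dump_thread_stacks() to a log when it returns True, and only then
restart the runner, so there is a record of which threads were active.
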
> The question about the path was in reference to whether these errors occur 
> immediately upon running a tophat job, without any interaction, or if they 
> occur when you try to click to view the job's output, or on some other part 
> of the Galaxy interface.
> 
> Thanks,
> --nate
> 
>> 
>> Cheers
>> David
>> 
>> 
>> 
>>> galaxy.jobs.runners.pbs ERROR 2011-12-13 19:57:57,689 Uncaught exception 
>>> checking jobs
>>> Traceback (most recent call last):
>>> File 
>>> "/gpfs/cluster/isys/galaxy/Galaxy/galaxy-dist/lib/galaxy/jobs/runners/pbs.py",
>>>  line 338, in monitor
>>>  self.check_watched_items()
>>> File 
>>> "/gpfs/cluster/isys/galaxy/Galaxy/galaxy-dist/lib/galaxy/jobs/runners/pbs.py",
>>>  line 351, in check_watched_items
>>>  ( failures, statuses ) = self.check_all_jobs()
>>> File 
>>> "/gpfs/cluster/isys/galaxy/Galaxy/galaxy-dist/lib/galaxy/jobs/runners/pbs.py",
>>>  line 462, in check_all_jobs
>>>  statuses.update( self.convert_statjob_to_bunches( jobs ) )
>>> File 
>>> "/gpfs/cluster/isys/galaxy/Galaxy/galaxy-dist/lib/galaxy/jobs/runners/pbs.py",
>>>  line 476, in convert_statjob_to_bunches
>>>  statuses[ job.name ] = Bunch( **status )
>>> MemoryError
>>> Unhandled exception in thread started by
>>> Traceback (most recent call last):
>>> File "/gpfs/cluster/isys/galaxy/Galaxy/lib/python2.6/threading.py", line 
>>> 504, in __bootstrap
>>> File "/gpfs/cluster/isys/galaxy/Galaxy/lib/python2.6/threading.py", line 
>>> 580, in __bootstrap_inner
>>> MemoryError
>>> Unhandled exception in thread started by <bound method Thread.__bootstrap 
>>> of <Thread(Thread-11, stopped 1111390528)>>
>>> Traceback (most recent call last):
>>> File "/gpfs/cluster/isys/galaxy/Galaxy/lib/python2.6/threading.py", line 
>>> 504, in __bootstrap
>>> File "/gpfs/cluster/isys/galaxy/Galaxy/lib/python2.6/threading.py", line 
>>> 545, in __bootstrap_inner
>>> MemoryError
>>> Unexpected exception in worker <function <lambda> at 0x883acf8>
>>> Traceback (most recent call last):
>>> File 
>>> "/gpfs/cluster/isys/galaxy/Galaxy/galaxy-dist/eggs/Paste-1.6-py2.6.egg/paste/httpserver.py",
>>>  line 863, in worker_thread_callback
>>> File 
>>> "/gpfs/cluster/isys/galaxy/Galaxy/galaxy-dist/eggs/Paste-1.6-py2.6.egg/paste/httpserver.py",
>>>  line 1037, in <lambda>
>>> File 
>>> "/gpfs/cluster/isys/galaxy/Galaxy/galaxy-dist/eggs/Paste-1.6-py2.6.egg/paste/httpserver.py",
>>>  line 1056, in process_request_in_thread
>>> File 
>>> "/gpfs/cluster/isys/galaxy/Galaxy/galaxy-dist/eggs/Paste-1.6-py2.6.egg/paste/httpserver.py",
>>>  line 1044, in handle_error
>>> File "/gpfs/cluster/isys/galaxy/Galaxy/lib/python2.6/SocketServer.py", line 
>>> 334, in handle_error
>>> MemoryError
>>> Unhandled exception in thread started by <bound method Thread.__bootstrap 
>>> of <Thread(Thread-10, stopped 1109289280)>>
>>> Traceback (most recent call last):
>>> File "/gpfs/cluster/isys/galaxy/Galaxy/lib/python2.6/threading.py", line 
>>> 504, in __bootstrap
>>> File "/gpfs/cluster/isys/galaxy/Galaxy/lib/python2.6/threading.py", line 
>>> 545, in __bootstrap_inner
>>> MemoryError
>>> ----------------------------------------
>>> Exception happened during processing of request from ('xxx.xxx.xxx.xxx', 
>>> 44389)
>>> Traceback (most recent call last):
>>> File 
>>> "/gpfs/cluster/isys/galaxy/Galaxy/galaxy-dist/eggs/Paste-1.6-py2.6.egg/paste/httpserver.py",
>>>  line 1053, in process_request_in_thread
>>> File "/gpfs/cluster/isys/galaxy/Galaxy/lib/python2.6/SocketServer.py", line 
>>> 322, in finish_request
>>> File "/gpfs/cluster/isys/galaxy/Galaxy/lib/python2.6/SocketServer.py", line 
>>> 616, in __init__
>>> File "/gpfs/cluster/isys/galaxy/Galaxy/lib/python2.6/SocketServer.py", line 
>>> 657, in setup
>>> MemoryError
>>> ----------------------------------------
>>> ----------------------------------------
>>> Exception happened during processing of request from ('xxx.xxx.xx.xx', 
>>> 60069)
>>> Unexpected exception in worker <function <lambda> at 0x883a2a8>Traceback 
>>> (most recent call last):
>>> 
>>> File 
>>> "/gpfs/cluster/isys/galaxy/Galaxy/galaxy-dist/eggs/Paste-1.6-py2.6.egg/paste/httpserver.py",
>>>  line 1053, in process_request_in_thread
>>> Unhandled exception in thread started by <bound method Thread.__bootstrap 
>>> of <Thread(worker 9, stopped 1130301760)>>
>>> Traceback (most recent call last):
>>> File "/gpfs/cluster/isys/galaxy/Galaxy/lib/python2.6/threading.py", line 
>>> 504, in __bootstrap
>>> File "/gpfs/cluster/isys/galaxy/Galaxy/lib/python2.6/threading.py", line 
>>> 545, in __bootstrap_inner
>>> MemoryError  File 
>>> "/gpfs/cluster/isys/galaxy/Galaxy/lib/python2.6/SocketServer.py", line 322, 
>>> in finish_request
>>> 
>>> Unexpected exception in worker <function <lambda> at 0x8721410>
>>> Traceback (most recent call last):
>>> File 
>>> "/gpfs/cluster/isys/galaxy/Galaxy/galaxy-dist/eggs/Paste-1.6-py2.6.egg/paste/httpserver.py",
>>>  line 863, in worker_thread_callback
>>> Unhandled exception in thread started by <bound method Thread.__bootstrap 
>>> of <Thread(worker 0, stopped 1086265664)>>
>>> Traceback (most recent call last):
>>> File "/gpfs/cluster/isys/galaxy/Galaxy/lib/python2.6/threading.py", line 
>>> 504, in __bootstrap
>>> File "/gpfs/cluster/isys/galaxy/Galaxy/lib/python2.6/threading.py", line 
>>> 545, in __bootstrap_inner
>>> File "/gpfs/cluster/isys/galaxy/Galaxy/lib/python2.6/traceback.py", line 
>>> 242, in format_exc
>>> File "/gpfs/cluster/isys/galaxy/Galaxy/lib/python2.6/traceback.py", line 
>>> 142, in format_exception
>>> File "/gpfs/cluster/isys/galaxy/Galaxy/lib/python2.6/traceback.py", line 
>>> 76, in format_tb
>>> File "/gpfs/cluster/isys/galaxy/Galaxy/lib/python2.6/traceback.py", line 
>>> 101, in extract_tb
>>> File "/gpfs/cluster/isys/galaxy/Galaxy/lib/python2.6/linecache.py", line 
>>> 14, in getline
>>> File "/gpfs/cluster/isys/galaxy/Galaxy/lib/python2.6/linecache.py", line 
>>> 40, in getlines
>>> MemoryError
>>> ----------------------------------------
>>> Exception happened during processing of request from ('xxx.xxx.xx.xx', 
>>> 60071)
>>> Traceback (most recent call last):
>>> File 
>>> "/gpfs/cluster/isys/galaxy/Galaxy/galaxy-dist/eggs/Paste-1.6-py2.6.egg/paste/httpserver.py",
>>>  line 1053, in process_request_in_thread
>>> Unexpected exception in worker <function <lambda> at 0x8721410>
>>> Traceback (most recent call last):
>>> File 
>>> "/gpfs/cluster/isys/galaxy/Galaxy/galaxy-dist/eggs/Paste-1.6-py2.6.egg/paste/httpserver.py",
>>>  line 863, in worker_thread_callback
>>> Unhandled exception in thread started by <bound method Thread.__bootstrap 
>>> of <Thread(worker 6, stopped 1123998016)>>
>>> Traceback (most recent call last):
>>> File "/gpfs/cluster/isys/galaxy/Galaxy/lib/python2.6/threading.py", line 
>>> 504, in __bootstrap
>>>  self.__bootstrap_inner()
>>> File "/gpfs/cluster/isys/galaxy/Galaxy/lib/python2.6/threading.py", line 
>>> 545, in __bootstrap_inner
>>>  (self.name, _format_exc()))
>>> File "/gpfs/cluster/isys/galaxy/Galaxy/lib/python2.6/traceback.py", line 
>>> 242, in format_exc
>>>  return ''.join(format_exception(etype, value, tb, limit))
>>> File "/gpfs/cluster/isys/galaxy/Galaxy/lib/python2.6/traceback.py", line 
>>> 142, in format_exception
>>>  list = list + format_tb(tb, limit)
>>> File "/gpfs/cluster/isys/galaxy/Galaxy/lib/python2.6/traceback.py", line 
>>> 76, in format_tb
>>>  return format_list(extract_tb(tb, limit))
>>> File "/gpfs/cluster/isys/galaxy/Galaxy/lib/python2.6/traceback.py", line 
>>> 101, in extract_tb
>>>  line = linecache.getline(filename, lineno, f.f_globals)
>>> File "/gpfs/cluster/isys/galaxy/Galaxy/lib/python2.6/linecache.py", line 
>>> 14, in getline
>>>  lines = getlines(filename, module_globals)
>>> File "/gpfs/cluster/isys/galaxy/Galaxy/lib/python2.6/linecache.py", line 
>>> 40, in getlines
>>>  return updatecache(filename, module_globals)
>>> File "/gpfs/cluster/isys/galaxy/Galaxy/lib/python2.6/linecache.py", line 
>>> 131, in updatecache
>>>  lines = fp.readlines()
>>> MemoryError
>>> ----------------------------------------
>>> Exception happened during processing of request from ('xxx.xxx.xxx.xxx', 
>>> 44416)
>>> Traceback (most recent call last):
>>> File 
>>> "/gpfs/cluster/isys/galaxy/Galaxy/galaxy-dist/eggs/Paste-1.6-py2.6.egg/paste/httpserver.py",
>>>  line 1053, in process_request_in_thread
>>> Unexpected exception in worker <function <lambda> at 0x8721410>
>>> Traceback (most recent call last):
>>> File 
>>> "/gpfs/cluster/isys/galaxy/Galaxy/galaxy-dist/eggs/Paste-1.6-py2.6.egg/paste/httpserver.py",
>>>  line 863, in worker_thread_callback
>>> Unhandled exception in thread started by <bound method Thread.__bootstrap 
>>> of <Thread(worker 7, stopped 1126099264)>>
>>> Traceback (most recent call last):
>>> File "/gpfs/cluster/isys/galaxy/Galaxy/lib/python2.6/threading.py", line 
>>> 504, in __bootstrap
>>>  self.__bootstrap_inner()
>>> File "/gpfs/cluster/isys/galaxy/Galaxy/lib/python2.6/threading.py", line 
>>> 545, in __bootstrap_inner
>>>  (self.name, _format_exc()))
>>> File "/gpfs/cluster/isys/galaxy/Galaxy/lib/python2.6/traceback.py", line 
>>> 242, in format_exc
>>>  return ''.join(format_exception(etype, value, tb, limit))
>>> File "/gpfs/cluster/isys/galaxy/Galaxy/lib/python2.6/traceback.py", line 
>>> 142, in format_exception
>>>  list = list + format_tb(tb, limit)
>>> File "/gpfs/cluster/isys/galaxy/Galaxy/lib/python2.6/traceback.py", line 
>>> 76, in format_tb
>>>  return format_list(extract_tb(tb, limit))
>>> File "/gpfs/cluster/isys/galaxy/Galaxy/lib/python2.6/traceback.py", line 
>>> 101, in extract_tb
>>>  line = linecache.getline(filename, lineno, f.f_globals)
>>> File "/gpfs/cluster/isys/galaxy/Galaxy/lib/python2.6/linecache.py", line 
>>> 14, in getline
>>>  lines = getlines(filename, module_globals)
>>> File "/gpfs/cluster/isys/galaxy/Galaxy/lib/python2.6/linecache.py", line 
>>> 40, in getlines
>>>  return updatecache(filename, module_globals)
>>> File "/gpfs/cluster/isys/galaxy/Galaxy/lib/python2.6/linecache.py", line 
>>> 131, in updatecache
>>>  lines = fp.readlines()
>>> MemoryError
>>> ----------------------------------------
>>> 
>>> -- 
>>> -----------------------------------------------------------
>>> Callum Wright                               
>>> HPC Systems Administrator           
>>> High Performance Computing
>>> University of Bristol
>>> 
>>> Phone:         0117 331 4429
>>> email:         c.wri...@bristol.ac.uk
>>> web:            www.acrc.bristol.ac.uk
>>> -----------------------------------------------------------
>>> 
>> 
>> 
> 

___________________________________________________________
Please keep all replies on the list by using "reply all"
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:

  http://lists.bx.psu.edu/
