Re: [galaxy-dev] galaxy on ubuntu 14.04: hangs on metadata cleanup

2014-05-08 Thread Jorrit Boekel
It seems to be an NFS-related issue. When I run a separate VM as an NFS server 
that hosts the Galaxy data (files, job workdir, tmp, ftp), the problems are gone. 
There is probably an explanation for that, but I’m going to leave it at this.

cheers,
— 
Jorrit Boekel
Proteomics systems developer
BILS / Lehtiö lab
Scilifelab Stockholm, Sweden



On 07 May 2014, at 16:03, Jorrit Boekel jorrit.boe...@scilifelab.se wrote:

 I should probably mention that the data filesystem is NFS, exported by the 
 master from /mnt/galaxy/data and mounted on the worker. There is no separate 
 file server; the master is the one that hangs.
 
 
 cheers,
 — 
 Jorrit Boekel
 Proteomics systems developer
 BILS / Lehtiö lab
 Scilifelab Stockholm, Sweden
 
 
 
 On 07 May 2014, at 15:57, Jorrit Boekel jorrit.boe...@scilifelab.se wrote:
 
 Dear all,
 
 Has anyone tried running Galaxy on Ubuntu 14.04?
 
 I’m trying a test setup on two virtual machines (worker + master) with a SLURM 
 queue. I’m running into strange problems when jobs finish: the master hangs, 
 completely unresponsive, with CPU at 100% (as reported by virt-manager, not 
 by top). Only drmaa jobs seem to be affected. After hanging, a reboot shows 
 the job has finished (and is green in the history).
 
 It took some debugging to figure out where things go wrong, but it seems to 
 happen when os.remove is called in the cleanup_external_metadata method of 
 lib/galaxy/datatypes/metadata.py. I can reproduce the problem by using pdb 
 to set a breakpoint just before the call and then calling 
 os.remove(metadatafile) by hand in the interactive python shell. If I comment 
 out the os.remove, it runs on until it hits another delete call in 
 lib/galaxy/jobs/__init__.py:
 self.app.object_store.delete(self.get_job(), base_dir='job_work', 
 entire_dir=True, dir_only=True, extra_dir=str(self.job_id))
 It’s in the cleanup() method of the JobWrapper class. I should mention here 
 that my Galaxy version is a bit old, since I’m running my own fork with local 
 modifications to the datatypes.
 
 This object_store.delete also ends up calling shutil.rmtree and os.remove. 
 So remove calls on the filesystem seem to hang the whole thing, but only at 
 this point in time. Rebooting and removing the files by hand is no problem, 
 and stepping through with pdb also sometimes fixes it (but if I just press 
 continue, it hangs). I don’t know where to go from here with debugging; has 
 anyone seen anything similar? Right now it feels like it may be caused by 
 timing rather than an actual code problem.
 
 cheers,
 — 
 Jorrit Boekel
 Proteomics systems developer
 BILS / Lehtiö lab
 Scilifelab Stockholm, Sweden
 
 
 
 



[galaxy-dev] galaxy on ubuntu 14.04: hangs on metadata cleanup

2014-05-07 Thread Jorrit Boekel
Dear all,

Has anyone tried running Galaxy on Ubuntu 14.04?

I’m trying a test setup on two virtual machines (worker + master) with a SLURM 
queue. I’m running into strange problems when jobs finish: the master hangs, 
completely unresponsive, with CPU at 100% (as reported by virt-manager, not by 
top). Only drmaa jobs seem to be affected. After hanging, a reboot shows the 
job has finished (and is green in the history).

It took some debugging to figure out where things go wrong, but it seems to 
happen when os.remove is called in the cleanup_external_metadata method of 
lib/galaxy/datatypes/metadata.py. I can reproduce the problem by using pdb to 
set a breakpoint just before the call and then calling os.remove(metadatafile) 
by hand in the interactive python shell. If I comment out the os.remove, it 
runs on until it hits another delete call in lib/galaxy/jobs/__init__.py:
self.app.object_store.delete(self.get_job(), base_dir='job_work', 
entire_dir=True, dir_only=True, extra_dir=str(self.job_id))
It’s in the cleanup() method of the JobWrapper class. I should mention here 
that my Galaxy version is a bit old, since I’m running my own fork with local 
modifications to the datatypes.
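
For anyone who wants to try the same reproduction, the by-hand session boils 
down to roughly the following standalone sketch. The directory is an 
assumption (the NFS-backed data area mentioned elsewhere in this thread), and 
the real cleanup_external_metadata call removes Galaxy's external metadata 
temp files rather than a throwaway file:

# Rough sketch of the by-hand reproduction: break just before the remove,
# then issue it from the (Pdb) prompt. /mnt/galaxy/data is assumed here.
import os
import pdb
import tempfile

DATA_DIR = '/mnt/galaxy/data'  # assumed NFS-backed data directory

fd, metadatafile = tempfile.mkstemp(dir=DATA_DIR, prefix='metadata_repro_')
os.close(fd)

pdb.set_trace()          # at the (Pdb) prompt, try: os.remove(metadatafile)
os.remove(metadatafile)  # on the affected setup, this is the call that hangs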

This object_store.delete also ends up calling shutil.rmtree and os.remove. So 
remove calls on the filesystem seem to hang the whole thing, but only at this 
point in time. Rebooting and removing the files by hand is no problem, and 
stepping through with pdb also sometimes fixes it (but if I just press 
continue, it hangs). I don’t know where to go from here with debugging; has 
anyone seen anything similar? Right now it feels like it may be caused by 
timing rather than an actual code problem.
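
One way to see which remove call blocks, without stepping through in pdb, 
would be to wrap os.remove and shutil.rmtree so every call is logged with its 
path and duration; a call that never returns then shows up as a "starting" 
line with no matching "finished" line. A minimal sketch, assuming it gets 
imported somewhere early in Galaxy's startup:

# Tracing sketch: log every os.remove / shutil.rmtree call with its duration
# so a call that never returns is easy to spot in the log. Where to import
# this (early in Galaxy startup) is left as an assumption.
import logging
import os
import shutil
import time

log = logging.getLogger('remove_trace')

def _traced(name, func):
    def wrapper(path, *args, **kwargs):
        log.warning('%s starting on %s', name, path)
        start = time.time()
        result = func(path, *args, **kwargs)
        log.warning('%s finished on %s in %.3fs', name, path, time.time() - start)
        return result
    return wrapper

os.remove = _traced('os.remove', os.remove)
shutil.rmtree = _traced('shutil.rmtree', shutil.rmtree)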

cheers,
— 
Jorrit Boekel
Proteomics systems developer
BILS / Lehtiö lab
Scilifelab Stockholm, Sweden





Re: [galaxy-dev] galaxy on ubuntu 14.04: hangs on metadata cleanup

2014-05-07 Thread Jorrit Boekel
I should probably mention that the data filesystem is NFS, exported by the 
master from /mnt/galaxy/data and mounted on the worker. There is no separate 
file server; the master is the one that hangs.
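
In case it helps with reproducing the setup, a quick way to check how a given 
host actually reaches the data directory (local disk versus the NFS mount) is 
to look up its mount point in /proc/mounts. A small sketch, with the path 
taken from the layout above:

# Report the filesystem type behind the data directory by finding the
# longest matching mount point in /proc/mounts. /mnt/galaxy/data is the
# path from the layout described above.
DATA_DIR = '/mnt/galaxy/data'

best = ('/', 'unknown')
with open('/proc/mounts') as mounts:
    for line in mounts:
        device, mountpoint, fstype = line.split()[:3]
        if DATA_DIR.startswith(mountpoint) and len(mountpoint) >= len(best[0]):
            best = (mountpoint, fstype)

print('%s is on mount %s (type %s)' % (DATA_DIR, best[0], best[1]))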


cheers,
— 
Jorrit Boekel
Proteomics systems developer
BILS / Lehtiö lab
Scilifelab Stockholm, Sweden



On 07 May 2014, at 15:57, Jorrit Boekel jorrit.boe...@scilifelab.se wrote:

 Dear all,
 
 Has anyone tried running Galaxy on Ubuntu 14.04?
 
 I’m trying a test setup on two virtual machines (worker + master) with a SLURM 
 queue. I’m running into strange problems when jobs finish: the master hangs, 
 completely unresponsive, with CPU at 100% (as reported by virt-manager, not by 
 top). Only drmaa jobs seem to be affected. After hanging, a reboot shows the 
 job has finished (and is green in the history).
 
 It took some debugging to figure out where things go wrong, but it seems to 
 happen when os.remove is called in the cleanup_external_metadata method of 
 lib/galaxy/datatypes/metadata.py. I can reproduce the problem by using pdb to 
 set a breakpoint just before the call and then calling os.remove(metadatafile) 
 by hand in the interactive python shell. If I comment out the os.remove, it 
 runs on until it hits another delete call in 
 lib/galaxy/jobs/__init__.py:
 self.app.object_store.delete(self.get_job(), base_dir='job_work', 
 entire_dir=True, dir_only=True, extra_dir=str(self.job_id))
 It’s in the cleanup() method of the JobWrapper class. I should mention here 
 that my Galaxy version is a bit old, since I’m running my own fork with local 
 modifications to the datatypes.
 
 This object_store.delete also ends up calling shutil.rmtree and os.remove. 
 So remove calls on the filesystem seem to hang the whole thing, but only at 
 this point in time. Rebooting and removing the files by hand is no problem, 
 and stepping through with pdb also sometimes fixes it (but if I just press 
 continue, it hangs). I don’t know where to go from here with debugging; has 
 anyone seen anything similar? Right now it feels like it may be caused by 
 timing rather than an actual code problem.
 
 cheers,
 — 
 Jorrit Boekel
 Proteomics systems developer
 BILS / Lehtiö lab
 Scilifelab Stockholm, Sweden
 
 
 


___
Please keep all replies on the list by using reply all
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:
  http://lists.bx.psu.edu/

To search Galaxy mailing lists use the unified search at:
  http://galaxyproject.org/search/mailinglists/