Re: [galaxy-dev] Jobs remain in queue until restart

2013-08-14 Thread Nate Coraor
On Aug 2, 2013, at 1:06 PM, Thon de Boer wrote:

 I did some more investigation of this issue
  
 I do notice that my 4-core, 8-slot VM has a load of 32 with only my 
 4 handler processes running (plus my web server), yet none of them gets more 
 than 10% of the CPU.
 There seems to be some process in my handlers that takes an incredible amount 
 of resources, even though top is not showing it (shown below).
  
 Does anyone have any idea how to figure out where the bottleneck is?
 Is there a way to turn on more detailed logging, perhaps, to see what each 
 process is doing?
  
 My IT guy suggested there may be some "context switching" going on due to the 
 many threads that are running (I use a thread pool of 7 for each server), but 
 I am not sure how to address that issue…

Hi Thon,

It looks like it's probably the memory use - if you restart the Galaxy 
processes, do you see any change?

--nate

  
 Anyone?
  
 top - 10:00:53 up 37 days, 19:29,  8 users,  load average: 32.10, 32.10, 32.09
 Tasks: 181 total,   1 running, 180 sleeping,   0 stopped,   0 zombie
 Cpu(s):  4.8%us,  2.5%sy,  0.0%ni, 92.5%id,  0.0%wa,  0.0%hi,  0.2%si,  0.0%st
 Mem:  16334504k total, 16164084k used,   170420k free,   127720k buffers
 Swap:  4194296k total,    15228k used,  4179068k free,  2460252k cached
  
   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 7190 svcgalax  20   0 2721m 284m 5976 S  9.9  1.8 142:53.84 python 
 ./scripts/paster.py serve universe_wsgi.ini --server-name=handler3 
 --pid-file=handler3.pid --log-file=handler3.log --daemon
 7183 svcgalax  20   0 2720m 286m 5984 S  6.4  1.8 135:52.63 python 
 ./scripts/paster.py serve universe_wsgi.ini --server-name=handler2 
 --pid-file=handler2.pid --log-file=handler2.log --daemon
 7175 svcgalax  20   0 2720m 287m 5976 S  5.6  1.8 117:59.40 python 
 ./scripts/paster.py serve universe_wsgi.ini --server-name=handler1 
 --pid-file=handler1.pid --log-file=handler1.log --daemon
 7166 svcgalax  20   0 3442m 2.7g 4884 S  4.6 17.5  74:31.66 python 
 ./scripts/paster.py serve universe_wsgi.ini --server-name=web0 
 --pid-file=web0.pid --log-file=web0.log --daemon
 7172 svcgalax  20   0 2720m 294m 5984 S  4.0  1.8 133:17.19 python 
 ./scripts/paster.py serve universe_wsgi.ini --server-name=handler0 
 --pid-file=handler0.pid --log-file=handler0.log --daemon
 1564 root  20   0  291m  13m 7552 S  0.3  0.1   1:49.65 /usr/sbin/httpd
 7890 svcgalax  20   0 17216 1456 1036 S  0.3  0.0   2:15.73 top
 10682 apache    20   0  297m  11m 3516 S  0.3  0.1   0:02.23 /usr/sbin/httpd
 11224 apache    20   0  295m  11m 3236 S  0.3  0.1   0:00.29 /usr/sbin/httpd
 11263 svcgalax  20   0 17248 1460 1036 R  0.3  0.0   0:00.06 top
 1 root  20   0 21320 1040  784 S  0.0  0.0   0:00.95 /sbin/init
     2 root      20   0     0    0    0 S  0.0  0.0   0:00.01 [kthreadd]
     3 root      RT   0     0    0    0 S  0.0  0.0   0:06.35 [migration/0]
  
 Regards,
  
 Thon
  
 Thon deBoer Ph.D., Bioinformatics Guru 
 California, USA |p: +1 (650) 799-6839  |m:  thondeb...@me.com
  
 From: galaxy-dev-boun...@lists.bx.psu.edu 
 [mailto:galaxy-dev-boun...@lists.bx.psu.edu] On Behalf Of Thon Deboer
 Sent: Wednesday, July 17, 2013 11:31 PM
 To: galaxy-dev@lists.bx.psu.edu
 Subject: [galaxy-dev] Jobs remain in queue until restart
  
 Hi,
  
 I have noticed that from time to time the job queue seems to be "stuck" and 
 can only be unstuck by restarting Galaxy.
 The jobs seem to be in the queued state, the Python job handler processes 
 are hardly ticking over, and the cluster is empty.
  
 When I restart, the startup procedure realizes all jobs are in a "new 
 state" and it then assigns a job handler, after which the jobs start fine…
  
 Any ideas?
  
  
 Thon
  
 P.S. I am using the June version of Galaxy and I DO set limits on my users in 
 job_conf.xml, as shown below. (Maybe it is related? Before it went into this dormant 
 mode, this user had started lots of jobs and may have hit the limit, but I assumed 
 this limit was the number of running jobs at one time, right?)
  
 <?xml version="1.0"?>
 <job_conf>
     <plugins workers="4">
         <!-- "workers" is the number of threads for the runner's work queue.
              The default from <plugins> is used if not defined for a plugin.
           -->
         <plugin id="local" type="runner" load="galaxy.jobs.runners.local:LocalJobRunner" workers="2"/>
         <plugin id="drmaa" type="runner" load="galaxy.jobs.runners.drmaa:DRMAAJobRunner" workers="8"/>
         <plugin id="cli" type="runner" load="galaxy.jobs.runners.cli:ShellJobRunner" workers="2"/>
     </plugins>
     <handlers default="handlers">
         <!-- Additional job handlers - the id should match the name of a
              [server:<id>] in universe_wsgi.ini.
           -->
         <handler id="handler0" tags="handlers"/>
         <handler id="handler1" tags="handlers"/>
         <handler id="handler2" tags="handlers"/>
         <handler id="handler3" tags="handlers"/>
         <!-- <handler id="handler10" tags="handlers"/>
         <handler id="handler11"

Re: [galaxy-dev] Jobs remain in queue until restart

2013-08-14 Thread Anthonius deBoer
I don't think it's a memory issue (but what made you say that?), since each process is hardly using any memory. Although VIRT in top shows 2.7 GB per python process, RES only ever reaches about 250 MB and the machine has 16 GB of RAM. Swap is only 4 GB, but none of it is being used either, so I don't think memory is the issue.

I AM running this on a VM, but the physical machine is not doing much either... I'll run a profile on it to see what is causing the massive load.

Thon

On Aug 14, 2013, at 08:51 AM, Nate Coraor n...@bx.psu.edu wrote:

 It looks like it's probably the memory use - if you restart the Galaxy 
 processes, do you see any change?

 --nate
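[For readers who want to take the same kind of snapshot of a busy handler, the sketch below is one way to do it with only the Python standard library. It is not from the original thread, and the choice of SIGUSR1 is an assumption. Added near the top of a long-running process, it lets you dump every thread's stack on demand and see where the time is going before reaching for a full profiler.]

    import signal
    import sys
    import threading
    import traceback

    def dump_threads(signum, frame):
        # Write a stack trace for every live thread to stderr.
        names = dict((t.ident, t.name) for t in threading.enumerate())
        for ident, stack in sys._current_frames().items():
            sys.stderr.write("\n--- thread %s (%s) ---\n" % (ident, names.get(ident, "?")))
            traceback.print_stack(stack, file=sys.stderr)

    # After this, `kill -USR1 <handler pid>` dumps all thread stacks
    # without stopping the process.
    signal.signal(signal.SIGUSR1, dump_threads)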

Re: [galaxy-dev] Jobs remain in queue until restart

2013-08-02 Thread Thon de Boer
I did some more investigation of this issue

 

I do notice that my 4-core, 8-slot VM has a load of 32 with only my
4 handler processes running (plus my web server), yet none of them gets more
than 10% of the CPU.

There seems to be some process in my handlers that takes an incredible
amount of resources, even though top is not showing it (shown below).

 

Does anyone have any idea how to figure out where the bottleneck is?

Is there a way to turn on more detailed logging, perhaps, to see what each
process is doing?
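[One approach, offered here only as a sketch with the option names assumed from the universe_wsgi.ini sample config of that era (check universe_wsgi.ini.sample in your own tree before relying on them): raise the console log level, and consider the heartbeat option, which periodically writes the state of Galaxy's worker threads to a separate log.]

    # universe_wsgi.ini, in the [app:main] section
    # Verbosity of console log messages.
    log_level = DEBUG
    # Periodically write per-thread status to a heartbeat log, which can show
    # what an apparently idle or stuck handler is actually doing.
    use_heartbeat = True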

 

My IT guy suggested there may be some "context switching" going on due to
the many threads that are running (I use a thread pool of 7 for each server),
but I am not sure how to address that issue.
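[The size of that thread pool is set per server section in universe_wsgi.ini, so one cheap way to test the context-switching theory is to shrink it for the handlers and watch the load. The section below is a sketch only; the port number is made up and the Paste server options should be checked against your own config.]

    [server:handler0]
    use = egg:Paste#http
    port = 8090
    use_threadpool = True
    # Try a smaller pool (e.g. 3) on the handlers to see whether the load
    # and the context-switch rate drop.
    threadpool_workers = 7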

 

Anyone?

 

top - 10:00:53 up 37 days, 19:29,  8 users,  load average: 32.10, 32.10,
32.09

Tasks: 181 total,   1 running, 180 sleeping,   0 stopped,   0 zombie

Cpu(s):  4.8%us,  2.5%sy,  0.0%ni, 92.5%id,  0.0%wa,  0.0%hi,  0.2%si,
0.0%st

Mem:  16334504k total, 16164084k used,   170420k free,   127720k buffers

Swap:  4194296k total,    15228k used,  4179068k free,  2460252k cached

 

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND

7190 svcgalax  20   0 2721m 284m 5976 S  9.9  1.8 142:53.84 python
./scripts/paster.py serve universe_wsgi.ini --server-name=handler3
--pid-file=handler3.pid --log-file=handler3.log --daemon

7183 svcgalax  20   0 2720m 286m 5984 S  6.4  1.8 135:52.63 python
./scripts/paster.py serve universe_wsgi.ini --server-name=handler2
--pid-file=handler2.pid --log-file=handler2.log --daemon

7175 svcgalax  20   0 2720m 287m 5976 S  5.6  1.8 117:59.40 python
./scripts/paster.py serve universe_wsgi.ini --server-name=handler1
--pid-file=handler1.pid --log-file=handler1.log --daemon

7166 svcgalax  20   0 3442m 2.7g 4884 S  4.6 17.5  74:31.66 python
./scripts/paster.py serve universe_wsgi.ini --server-name=web0
--pid-file=web0.pid --log-file=web0.log --daemon

7172 svcgalax  20   0 2720m 294m 5984 S  4.0  1.8 133:17.19 python
./scripts/paster.py serve universe_wsgi.ini --server-name=handler0
--pid-file=handler0.pid --log-file=handler0.log --daemon

1564 root  20   0  291m  13m 7552 S  0.3  0.1   1:49.65 /usr/sbin/httpd

7890 svcgalax  20   0 17216 1456 1036 S  0.3  0.0   2:15.73 top

10682 apache    20   0  297m  11m 3516 S  0.3  0.1   0:02.23 /usr/sbin/httpd

11224 apache    20   0  295m  11m 3236 S  0.3  0.1   0:00.29 /usr/sbin/httpd

11263 svcgalax  20   0 17248 1460 1036 R  0.3  0.0   0:00.06 top

1 root  20   0 21320 1040  784 S  0.0  0.0   0:00.95 /sbin/init

    2 root      20   0     0    0    0 S  0.0  0.0   0:00.01 [kthreadd]

    3 root      RT   0     0    0    0 S  0.0  0.0   0:06.35 [migration/0]

 

Regards,

 

Thon

 

Thon deBoer Ph.D., Bioinformatics Guru 
California, USA | p: +1 (650) 799-6839 | m: thondeb...@me.com

 

From: galaxy-dev-boun...@lists.bx.psu.edu
[mailto:galaxy-dev-boun...@lists.bx.psu.edu] On Behalf Of Thon Deboer
Sent: Wednesday, July 17, 2013 11:31 PM
To: galaxy-dev@lists.bx.psu.edu
Subject: [galaxy-dev] Jobs remain in queue until restart

 

Hi,

 

I have noticed that from time to time the job queue seems to be "stuck" and
can only be unstuck by restarting Galaxy.

The jobs seem to be in the queued state, the Python job handler processes
are hardly ticking over, and the cluster is empty.

 

When I restart, the startup procedure realizes all jobs are in a "new
state" and it then assigns a job handler, after which the jobs start fine…

 

Any ideas?

 

 

Thon

 

P.S. I am using the June version of Galaxy and I DO set limits on my users in
job_conf.xml, as shown below. (Maybe it is related? Before it went into this
dormant mode, this user had started lots of jobs and may have hit the limit,
but I assumed this limit was the number of running jobs at one time, right?)
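[For readers wondering where such limits live: in job_conf.xml of this vintage they go in a <limits> section, separate from the handler and destination definitions. The short snippet just below is a sketch only; the limit type name is recalled from the sample advanced config and the value of 20 is purely illustrative, so check job_conf.xml.sample_advanced in your own tree.]

    <limits>
        <!-- Cap on concurrently running jobs per registered user;
             jobs over the cap wait in the queue rather than failing. -->
        <limit type="registered_user_concurrent_jobs">20</limit>
    </limits>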

 

<?xml version="1.0"?>
<job_conf>
    <plugins workers="4">
        <!-- "workers" is the number of threads for the runner's work queue.
             The default from <plugins> is used if not defined for a plugin.
          -->
        <plugin id="local" type="runner" load="galaxy.jobs.runners.local:LocalJobRunner" workers="2"/>
        <plugin id="drmaa" type="runner" load="galaxy.jobs.runners.drmaa:DRMAAJobRunner" workers="8"/>
        <plugin id="cli" type="runner" load="galaxy.jobs.runners.cli:ShellJobRunner" workers="2"/>
    </plugins>
    <handlers default="handlers">
        <!-- Additional job handlers - the id should match the name of a
             [server:<id>] in universe_wsgi.ini.
          -->
        <handler id="handler0" tags="handlers"/>
        <handler id="handler1" tags="handlers"/>
        <handler id="handler2" tags="handlers"/>
        <handler id="handler3" tags="handlers"/>
        <!-- <handler id="handler10" tags="handlers"/>
        <handler id="handler11" tags="handlers"/>
        <handler id="handler12" tags="handlers"/>
        <handler id="handler13" tags="handlers"/>
          -->
    </handlers>
    <destinations default="regularjobs">
        <!-- Destinations define details