Re: Box occasionally pegs one cpu at 100%

2011-01-10 Thread Brian Burke
This sounds like it could be garbage collection related, especially with a heap 
that large.  Depending on your jvm tuning, a FGC could take quite a while, 
effectively 'pausing' the JVM.

Have you looked at something like jstat -gcutil   or similar to monitor the 
garbage collection?


On Jan 10, 2011, at 1:36 PM, Simon Wistow wrote:

 I have a fairly classic master/slave set up.
 
 Response times on the slave are generally good with blips periodically, 
 apparently when replication is happening.
 
 Occasionally however the process will have one incredibly slow query and 
 will peg the CPU at 100%.
 
 The weird thing is that it will remain that way even if we stop querying 
 it and stop replication and then wait for over 20 minutes. The only way 
 to fix the problem at that point is to restart tomcat.
 
 Looking at slow queries around the time of the incident they don't look 
 particularly bad - they're predominantly filter queries running under 
 dismax and there doesn't seem to be anything unusual about them.
 
 The index file is about 266G and has 30G of disk free. The machine has 
 50G of RAM and is running with -Xmx35G.
 
 Looking at the processes running it appears to be the main Java thread 
 that's CPU bound, not the child threads. 
 
 Stracing the process gives a lot of brk instructions (presumably some 
 sort of wait loop) with occasional blips of: 
 
 
 mprotect(0x7fc5721d9000, 4096, PROT_READ) = 0
 futex(0x451c24a4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x451c24a0, 
 {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
 futex(0x4269dd14, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x4269dd10, 
 {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
 futex(0x7fbc941603b4, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 
 325, {1294683789, 614186000}, ) = 0
 futex(0x41d19b28, FUTEX_WAKE_PRIVATE, 1) = 0
 mprotect(0x7fc5721d8000, 4096, PROT_READ) = 0
 mprotect(0x7fc5721d8000, 4096, PROT_READ|PROT_WRITE) = 0
 futex(0x7fbc94eeb5b4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7fbc94eeb5b0, 
 {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
 futex(0x426a6a28, FUTEX_WAKE_PRIVATE, 1) = 1
 mprotect(0x7fc5721d9000, 4096, PROT_NONE) = 0
 futex(0x41cae8f4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x41cae8f0, 
 {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
 futex(0x41cae328, FUTEX_WAKE_PRIVATE, 1) = 1
 futex(0x7fbc941603b4, FUTEX_WAIT_PRIVATE, 327, NULL) = 0
 futex(0x41d19b28, FUTEX_WAKE_PRIVATE, 1) = 0
 mmap(0x7fc2e023, 121962496, PROT_NONE, 
 MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 
 0x7fc2e023
 mmap(0x7fbca58e, 237568, PROT_NONE, 
 MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 
 0x7fbca58e
 
 Any ideas about what's happening and if there's anyway to mitigate it? 
 If the box at least recovered then I could run another slave and load 
 balance between them working on the principle that the second box 
 would pick up the slack whilst the first box restabilised but, as it is, 
 that's not reliable.
 
 Thanks,
 
 Simon
 



Re: Box occasionally pegs one cpu at 100%

2011-01-10 Thread Dennis Gearon
One other possiblity is that the OS or BIOS is doing that, at least on a 
laptop. 
There is a new feature where, if the load is low enough, non multi threaded 
applications can be assigned to one processor and that processor has it's clock 
boosted so the older software will run faster on the new processors - Otherwise 
they run SLOWER!.

My brother has a cad program that runs slower on his new quad core because the 
base clock speed is slower than a single processor CPU. The software company is 
not taking the time to rewrite their code, excpet where they add features or 
fixes. 




- Original Message 

From: Brian Burke bbu...@techtarget.com
To: solr-user@lucene.apache.org solr-user@lucene.apache.org
Sent: Mon, January 10, 2011 10:56:27 AM
Subject: Re: Box occasionally pegs one cpu at 100%

This sounds like it could be garbage collection related, especially with a heap 
that large.  Depending on your jvm tuning, a FGC could take quite a while, 
effectively 'pausing' the JVM.

Have you looked at something like jstat -gcutil   or similar to monitor the 
garbage collection?


On Jan 10, 2011, at 1:36 PM, Simon Wistow wrote:

 I have a fairly classic master/slave set up.
 
 Response times on the slave are generally good with blips periodically, 
 apparently when replication is happening.
 
 Occasionally however the process will have one incredibly slow query and 
 will peg the CPU at 100%.
 
 The weird thing is that it will remain that way even if we stop querying 
 it and stop replication and then wait for over 20 minutes. The only way 
 to fix the problem at that point is to restart tomcat.
 
 Looking at slow queries around the time of the incident they don't look 
 particularly bad - they're predominantly filter queries running under 
 dismax and there doesn't seem to be anything unusual about them.
 
 The index file is about 266G and has 30G of disk free. The machine has 
 50G of RAM and is running with -Xmx35G.
 
 Looking at the processes running it appears to be the main Java thread 
 that's CPU bound, not the child threads. 
 
 Stracing the process gives a lot of brk instructions (presumably some 
 sort of wait loop) with occasional blips of: 
 
 
 mprotect(0x7fc5721d9000, 4096, PROT_READ) = 0
 futex(0x451c24a4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x451c24a0, 
 {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
 futex(0x4269dd14, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x4269dd10, 
 {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
 futex(0x7fbc941603b4, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 
 325, {1294683789, 614186000}, ) = 0
 futex(0x41d19b28, FUTEX_WAKE_PRIVATE, 1) = 0
 mprotect(0x7fc5721d8000, 4096, PROT_READ) = 0
 mprotect(0x7fc5721d8000, 4096, PROT_READ|PROT_WRITE) = 0
 futex(0x7fbc94eeb5b4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7fbc94eeb5b0, 
 {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
 futex(0x426a6a28, FUTEX_WAKE_PRIVATE, 1) = 1
 mprotect(0x7fc5721d9000, 4096, PROT_NONE) = 0
 futex(0x41cae8f4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x41cae8f0, 
 {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
 futex(0x41cae328, FUTEX_WAKE_PRIVATE, 1) = 1
 futex(0x7fbc941603b4, FUTEX_WAIT_PRIVATE, 327, NULL) = 0
 futex(0x41d19b28, FUTEX_WAKE_PRIVATE, 1) = 0
 mmap(0x7fc2e023, 121962496, PROT_NONE, 
 MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 
 0x7fc2e023
 mmap(0x7fbca58e, 237568, PROT_NONE, 
 MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 
 0x7fbca58e
 
 Any ideas about what's happening and if there's anyway to mitigate it? 
 If the box at least recovered then I could run another slave and load 
 balance between them working on the principle that the second box 
 would pick up the slack whilst the first box restabilised but, as it is, 
 that's not reliable.
 
 Thanks,
 
 Simon
 


Re: Box occasionally pegs one cpu at 100%

2011-01-10 Thread Simon Wistow
On Mon, Jan 10, 2011 at 01:56:27PM -0500, Brian Burke said:
 This sounds like it could be garbage collection related, especially 
 with a heap that large.  Depending on your jvm tuning, a FGC could 
 take quite a while, effectively 'pausing' the JVM.
 
 Have you looked at something like jstat -gcutil or similar to monitor 
 the garbage collection?

I think you may have hit the nail on the head. 

Having checked the configuration again I noticed that the -server flag 
didn't appear to be present in the options passed to Java (I'm convinced 
it used to be there). As I understand it, this would mean that the 
Parallel GC wouldn't be implicitly enabled.

If that's true then that's a definite strong candidate for causing the 
root process and only the root process to peg a single CPU.

Anybody have any experience of the differences between

-XX:+UseParallelGC 

and

-XX:+UseConcMarkSweepGC with -XX:+UseParNewGC

?

I believe -XX:+UseParallelGC  is the default with -server so I suppose 
that's a good place to start but I'd appreciate any anecdotes or 
experiences.





Re: Box occasionally pegs one cpu at 100%

2011-01-10 Thread François Schiettecatte
This reminded me of a situation I ran into in the past where the JVM was being 
rendered useless because it was calling FGC repeatedly. Effectively what was 
going on is that a very large array was allocated which swamped the JVM memory 
and caused it to trash, much like an OS.

Here are some links which will help (at least they helped me):

http://www.oracle.com/technetwork/java/gc-tuning-5-138395.html  (you need to 
read this one)

http://java.sun.com/performance/reference/whitepapers/tuning.html   (and 
this one).

http://www.oracle.com/technetwork/java/javase/tech/index-jsp-136373.html

http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html

http://java.sun.com/performance/jvmstat/

http://blogs.sun.com/watt/resource/jvm-options-list.html

jstat is also very good for seeing what is going on in the JVM. I also recall 
there was a way to trace GC in the JVM but cant recall how off the top of my 
head, maybe it was a JVM option.

Hope this helps.

Cheers

François


On Jan 10, 2011, at 5:13 PM, Simon Wistow wrote:

 On Mon, Jan 10, 2011 at 01:56:27PM -0500, Brian Burke said:
 This sounds like it could be garbage collection related, especially 
 with a heap that large.  Depending on your jvm tuning, a FGC could 
 take quite a while, effectively 'pausing' the JVM.
 
 Have you looked at something like jstat -gcutil or similar to monitor 
 the garbage collection?
 
 I think you may have hit the nail on the head. 
 
 Having checked the configuration again I noticed that the -server flag 
 didn't appear to be present in the options passed to Java (I'm convinced 
 it used to be there). As I understand it, this would mean that the 
 Parallel GC wouldn't be implicitly enabled.
 
 If that's true then that's a definite strong candidate for causing the 
 root process and only the root process to peg a single CPU.
 
 Anybody have any experience of the differences between
 
 -XX:+UseParallelGC 
 
 and
 
 -XX:+UseConcMarkSweepGC with -XX:+UseParNewGC
 
 ?
 
 I believe -XX:+UseParallelGC  is the default with -server so I suppose 
 that's a good place to start but I'd appreciate any anecdotes or 
 experiences.
 
 
 



Re: Box occasionally pegs one cpu at 100%

2011-01-10 Thread Simon Wistow
On Mon, Jan 10, 2011 at 05:58:42PM -0500, François Schiettecatte said:
 http://www.oracle.com/technetwork/java/gc-tuning-5-138395.html(you 
 need to read this one)
 
 http://java.sun.com/performance/reference/whitepapers/tuning.html (and 
 this one).

Yeah, I have these two pages bookmarked :)

 jstat is also very good for seeing what is going on in the JVM. I also 
 recall there was a way to trace GC in the JVM but cant recall how off 
 the top of my head, maybe it was a JVM option.

You can use -XX:+PrintGC and -XX:+PrintGCDetail (and 
-XX:+PrintGCTimeStamps) as well as -Xloggc:gc.log to log to a file.

I'm also finding NewRelic's RPM system great for monitoring Solr - the 
integration is really good, I give it two thumbs up.