I'm not sure I understand why this is important. Too much memory would just be unused.
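As a sanity check on the numbers in a `jmap -heap` dump like the one below, the usage percentages can be recomputed directly from the raw byte counts. A minimal Python sketch, using the CMS old-generation figures from the dump below (nothing Solr-specific is assumed):

```python
# Recompute the old-generation usage percentage from the raw byte counts in a
# `jmap -heap` style dump (values taken from the CMS generation figures below).
cms_capacity_bytes = 16896360448  # "capacity = 16896360448 (16113.625MB)"
cms_used_bytes = 12452710200      # "used = 12452710200 (11875.829887390137MB)"

pct_used = cms_used_bytes / cms_capacity_bytes * 100
mb_free = (cms_capacity_bytes - cms_used_bytes) / (1024 * 1024)

# Matches the ~73.70% used / ~4237.8MB free reported in the dump.
print(f"CMS old gen: {pct_used:.2f}% used, {mb_free:.1f}MB free")
```

With a 16GB old generation sitting at ~74% used, a CMS concurrent cycle has several gigabytes of headroom, which is consistent with the full collection being short rather than minutes long.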
This is what the heap looks like now:

Heap Configuration:
   MinHeapFreeRatio = 40
   MaxHeapFreeRatio = 70
   MaxHeapSize      = 17179869184 (16384.0MB)
   NewSize          = 21757952 (20.75MB)
   MaxNewSize       = 283508736 (270.375MB)
   OldSize          = 65404928 (62.375MB)
   NewRatio         = 7
   SurvivorRatio    = 8
   PermSize         = 21757952 (20.75MB)
   MaxPermSize      = 176160768 (168.0MB)

Heap Usage:
New Generation (Eden + 1 Survivor Space):
   capacity = 255197184 (243.375MB)
   used     = 108828496 (103.78694152832031MB)
   free     = 146368688 (139.5880584716797MB)
   42.644865548359654% used
Eden Space:
   capacity = 226885632 (216.375MB)
   used     = 83498424 (79.63030242919922MB)
   free     = 143387208 (136.74469757080078MB)
   36.80198841326365% used
From Space:
   capacity = 28311552 (27.0MB)
   used     = 25330072 (24.156639099121094MB)
   free     = 2981480 (2.8433609008789062MB)
   89.46903370044849% used
To Space:
   capacity = 28311552 (27.0MB)
   used     = 0 (0.0MB)
   free     = 28311552 (27.0MB)
   0.0% used
concurrent mark-sweep generation:
   capacity = 16896360448 (16113.625MB)
   used     = 12452710200 (11875.829887390137MB)
   free     = 4443650248 (4237.795112609863MB)
   73.70054775005708% used
Perm Generation:
   capacity = 70578176 (67.30859375MB)
   used     = 37652032 (35.90777587890625MB)
   free     = 32926144 (31.40081787109375MB)
   53.347981109627995% used

Med venlig hilsen / Best regards

*John Nielsen*
Programmer

*MCB A/S*
Enghaven 15
DK-7500 Holstebro

Kundeservice: +45 9610 2824
p...@mcb.dk
www.mcb.dk


On Thu, Nov 29, 2012 at 4:08 AM, Otis Gospodnetic
<otis.gospodne...@gmail.com> wrote:

> If this is caused by index segment merging, you should be able to see that
> very clearly on the Index report in SPM, where you would see sudden graph
> movement at the time of the spike, corresponding to CPU and disk activity.
> I think uncommenting that infostream in solrconfig would also show it.
>
> Otis
> --
> SOLR Performance Monitoring - http://sematext.com/spm
>
> On Nov 28, 2012 9:20 PM, "Erick Erickson" <erickerick...@gmail.com> wrote:
>
> > Am I reading this right? All you're doing on varnish1 is replicating to
> > it?
> > You're not searching or indexing? I'm sure I'm misreading this.....
> >
> > "The spike, which only lasts for a couple of minutes, sends the disks
> > racing." This _sounds_ suspiciously like segment merging, especially the
> > "disks racing" bit. Or possibly replication. Neither of which makes much
> > sense. But is there any chance that somehow multiple commits are being
> > issued? Of course, if varnish1 is a slave, that shouldn't be happening
> > either.
> >
> > And the whole bit about nothing going to the logs is just bizarre. I'm
> > tempted to claim hardware gremlins, especially if you see nothing similar
> > on varnish2. Or some other process is pegging the machine. All of which
> > is a way of saying "I have no idea"....
> >
> > Yours in bewilderment,
> > Erick
> >
> > On Wed, Nov 28, 2012 at 6:15 AM, John Nielsen <j...@mcb.dk> wrote:
> >
> > > I apologize for the late reply.
> > >
> > > The query load is more or less stable during the spikes. There are
> > > always fluctuations, but nothing of the order of magnitude that could
> > > explain this spike. In fact, the latest spike occurred last night when
> > > there was almost no one using it.
> > >
> > > To test a hunch of mine, I tried to deactivate all caches by commenting
> > > out all cache entries in solrconfig.xml. It still spikes, so I don't
> > > think it has anything to do with cache warming or hits/misses or
> > > anything of the sort.
> > >
> > > One interesting thing about the GC, though.
This is our latest spike with CPU load (this server has 8 cores, so a
> > > load higher than 8 is potentially troublesome):
> > >
> > > 2012.Nov.27 19:58:18  2.27
> > > 2012.Nov.27 19:57:17  4.06
> > > 2012.Nov.27 19:56:18  8.95
> > > 2012.Nov.27 19:55:17  19.97
> > > 2012.Nov.27 19:54:17  32.27
> > > 2012.Nov.27 19:53:18  1.67
> > > 2012.Nov.27 19:52:17  1.6
> > > 2012.Nov.27 19:51:18  1.77
> > > 2012.Nov.27 19:50:17  1.89
> > >
> > > This is what the GC was doing around that time:
> > >
> > > 2012-11-27T19:50:04.933+0100: [GC [PSYoungGen: 4777586K->277641K(4969216K)]
> > > 8887542K->4387693K(9405824K), 0.0856360 secs] [Times: user=0.54 sys=0.00,
> > > real=0.09 secs]
> > > 2012-11-27T19:50:30.785+0100: [GC [PSYoungGen: 4749769K->325171K(5068096K)]
> > > 8859821K->4435320K(9504704K), 0.0992160 secs] [Times: user=0.63 sys=0.00,
> > > real=0.10 secs]
> > > 2012-11-27T19:51:12.293+0100: [GC [PSYoungGen: 4911603K->306181K(5071168K)]
> > > 9021752K->4416617K(9507776K), 0.0957890 secs] [Times: user=0.62 sys=0.00,
> > > real=0.09 secs]
> > > 2012-11-27T19:51:52.817+0100: [GC [PSYoungGen: 4892613K->376175K(5075328K)]
> > > 9003049K->4486755K(9511936K), 0.1099830 secs] [Times: user=0.79 sys=0.01,
> > > real=0.11 secs]
> > > 2012-11-27T19:52:29.454+0100: [GC [PSYoungGen: 4972847K->271468K(4868160K)]
> > > 9083427K->4383520K(9304768K), 0.0699660 secs] [Times: user=0.48 sys=0.01,
> > > real=0.07 secs]
> > > 2012-11-27T19:53:08.176+0100: [GC [PSYoungGen: 4868140K->336421K(5090944K)]
> > > 8980192K->4448572K(9527552K), 0.0824350 secs] [Times: user=0.56 sys=0.01,
> > > real=0.08 secs]
> > > 2012-11-27T19:54:53.534+0100: [GC [PSYoungGen: 4950373K->340513K(5092864K)]
> > > 9062524K->4468215K(9529472K), 0.1016770 secs] [Times: user=0.71 sys=0.00,
> > > real=0.10 secs]
> > > 2012-11-27T19:55:02.906+0100: [GC [PSYoungGen: 4954465K->480488K(4952000K)]
> > > 9082167K->4684537K(9388608K), 0.1813290 secs] [Times: user=1.23 sys=0.09,
> > > real=0.19 secs]
> > > 2012-11-27T19:55:09.114+0100: [GC [PSYoungGen: 4951976K->560434K(5031936K)]
> > > 9156025K->5075285K(9547072K), 0.3511090 secs] [Times: user=2.32 sys=0.12,
> > > real=0.35 secs]
> > > 2012-11-27T19:55:09.465+0100: [Full GC [PSYoungGen: 560434K->0K(5031936K)]
> > > [PSOldGen: 4514851K->2793342K(5047296K)] 5075285K->2793342K(10079232K)
> > > [PSPermGen: 35285K->35285K(44864K)], 5.2310820 secs] [Times: user=5.23
> > > sys=0.00, real=5.23 secs]
> > > 2012-11-27T19:55:21.359+0100: [GC [PSYoungGen: 4471488K->87249K(4647168K)]
> > > 7264830K->2880592K(9694464K), 0.0426640 secs] [Times: user=0.27 sys=0.00,
> > > real=0.04 secs]
> > > 2012-11-27T19:55:35.399+0100: [GC [PSYoungGen: 4173969K->171306K(4839552K)]
> > > 6967312K->2964649K(9886848K), 0.0493570 secs] [Times: user=0.34 sys=0.00,
> > > real=0.05 secs]
> > > 2012-11-27T19:55:41.397+0100: [GC [PSYoungGen: 4258026K->156193K(4802368K)]
> > > 7051369K->2949536K(9849664K), 0.0643190 secs] [Times: user=0.37 sys=0.00,
> > > real=0.06 secs]
> > > 2012-11-27T19:55:55.609+0100: [GC [PSYoungGen: 4254753K->156028K(4826880K)]
> > > 7048096K->2949371K(9874176K), 0.0590410 secs] [Times: user=0.35 sys=0.00,
> > > real=0.06 secs]
> > > 2012-11-27T19:56:20.519+0100: [GC [PSYoungGen: 4254588K->413148K(4860288K)]
> > > 7047931K->3206491K(9907584K), 0.1189810 secs] [Times: user=0.91 sys=0.01,
> > > real=0.12 secs]
> > > 2012-11-27T19:56:56.649+0100: [GC [PSYoungGen: 4564892K->412801K(4866432K)]
> > > 7358235K->3206144K(9913728K), 0.1089870 secs] [Times: user=0.82 sys=0.00,
> > > real=0.11 secs]
> > > 2012-11-27T19:57:24.304+0100: [GC [PSYoungGen: 4564545K->434702K(4870592K)]
> > > 7357888K->3228045K(9917888K), 0.1172090 secs] [Times: user=0.86 sys=0.00,
> > > real=0.12 secs]
> > > 2012-11-27T19:57:41.489+0100: [GC [PSYoungGen: 4591822K->500171K(4657344K)]
> > > 7385165K->3293513K(9704640K), 0.1663600 secs] [Times: user=1.12 sys=0.00,
> > > real=0.16 secs]
> > > 2012-11-27T19:58:24.062+0100: [GC [PSYoungGen: 4657291K->453204K(4849536K)]
> > > 7450633K->3246547K(9896832K), 0.1099200 secs] [Times: user=0.83 sys=0.01,
> > > real=0.11 secs]
> > > 2012-11-27T19:58:46.081+0100: [GC [PSYoungGen: 4560404K->533101K(4640320K)]
> > > 7353747K->3326443K(9687616K), 0.1252060 secs] [Times: user=0.96 sys=0.00,
> > > real=0.13 secs]
> > > 2012-11-27T19:59:01.488+0100: [GC [PSYoungGen: 4640301K->480969K(4822720K)]
> > > 7433643K->3274312K(9870016K), 0.1495860 secs] [Times: user=0.99 sys=0.01,
> > > real=0.15 secs]
> > > 2012-11-27T19:59:25.155+0100: [GC [PSYoungGen: 4534281K->572295K(4625664K)]
> > > 7327624K->3365638K(9672960K), 0.1492270 secs] [Times: user=1.10 sys=0.01,
> > > real=0.15 secs]
> > > 2012-11-27T19:59:39.923+0100: [GC [PSYoungGen: 4625607K->486964K(4800640K)]
> > > 7418950K->3280306K(9847936K), 0.1244440 secs] [Times: user=0.94 sys=0.00,
> > > real=0.12 secs]
> > >
> > > So the GC was doing a full collection around that time, but that only
> > > accounts for around 5 secs, not the couple of minutes the spike lasts.
> > >
> > > I am really at a complete loss as to what this could be. Google has not
> > > given me any clues.
> > >
> > > On Fri, Nov 23, 2012 at 1:56 PM, Otis Gospodnetic
> > > <otis.gospodne...@gmail.com> wrote:
> > >
> > > > Strange indeed. What about query load/rates during that time? What
> > > > about GC? And does cache hit rate drop?
> > > > Otis
> > > >
> > > > On Nov 23, 2012 2:45 AM, "John Nielsen" <j...@mcb.dk> wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > We are seeing a strange CPU spike on one of our solr4 servers which
> > > > > we are unable to explain. The spike, which only lasts for a couple
> > > > > of minutes, sends the disks racing. This happens a few times a day.
> > > > > This is what the load looks like:
> > > > >
> > > > > 2012.Nov.14 13:37:17  2.77
> > > > > 2012.Nov.14 13:36:17  3.65
> > > > > 2012.Nov.14 13:35:18  3.92
> > > > > 2012.Nov.14 13:34:17  3.95
> > > > > 2012.Nov.14 13:33:18  6.56
> > > > > 2012.Nov.14 13:32:17  10.79
> > > > > 2012.Nov.14 13:31:17  24.38
> > > > > 2012.Nov.14 13:30:17  63.35
> > > > > 2012.Nov.14 13:29:17  24.68
> > > > > 2012.Nov.14 13:28:17  2.44
> > > > > 2012.Nov.14 13:27:18  3.51
> > > > > 2012.Nov.14 13:26:17  5.26
> > > > > 2012.Nov.14 13:25:18  5.71
> > > > > 2012.Nov.14 13:24:17  2.7
> > > > >
> > > > > The problem is that out of a 3-minute spike, I get about 40 seconds
> > > > > of silence in the logs. This log usually adds about a thousand lines
> > > > > every second. Not being able to communicate with the server for this
> > > > > long breaks our use case.
> > > > >
> > > > > We have two servers, varnish01 and varnish02. We used to feed data
> > > > > to varnish02 and replicate it to varnish01, where the data is then
> > > > > read from. When we discovered this issue, we moved all traffic to
> > > > > varnish02, so that data is still being replicated to varnish01, but
> > > > > varnish01 otherwise gets zero traffic. The spike did not disappear.
> > > > >
> > > > > The spike we are seeing is on varnish01 only.
> > > > > Please note that our use case requires us to continuously feed
> > > > > large amounts of data from our main system, on the order of up to
> > > > > 1,000 records every minute. Our solrconfig.xml is attached.
> > > > >
> > > > > Has anyone seen this phenomenon before?
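To close the loop on the GC question in this thread: whether collector pauses can account for a multi-minute spike is easy to check mechanically by summing the wall-clock (`real=`) pause times in the GC log over the spike window. A minimal Python sketch, assuming the `-XX:+PrintGCDetails` style log format shown above (the sample lines are abridged from the log in this thread):

```python
import re

# Pull the wall-clock pause ("real=... secs") out of each GC log entry.
PAUSE_RE = re.compile(r"real=([\d.]+) secs\]")

def pause_seconds(log_text):
    """Return the list of real (wall-clock) GC pause durations, in seconds."""
    return [float(m) for m in PAUSE_RE.findall(log_text)]

sample = """\
2012-11-27T19:55:09.465+0100: [Full GC [PSYoungGen: 560434K->0K(5031936K)] \
[PSOldGen: 4514851K->2793342K(5047296K)] 5075285K->2793342K(10079232K) \
[PSPermGen: 35285K->35285K(44864K)], 5.2310820 secs] \
[Times: user=5.23 sys=0.00, real=5.23 secs]
2012-11-27T19:55:21.359+0100: [GC [PSYoungGen: 4471488K->87249K(4647168K)] \
7264830K->2880592K(9694464K), 0.0426640 secs] \
[Times: user=0.27 sys=0.00, real=0.04 secs]
"""

pauses = pause_seconds(sample)
print(f"total pause: {sum(pauses):.2f}s, longest: {max(pauses):.2f}s")
# -> total pause: 5.27s, longest: 5.23s
```

Run over the 19:54-19:58 window in the full log above, this confirms John's reading: the pauses sum to only a handful of seconds, so the GC alone cannot explain a multi-minute period of log silence.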