If you are only after visualising GC, there are several tools: some you can download and run, and some let you upload logs to visualise them. If you would like to monitor the whole stack (host/Solr/JVM), Sematext's SPM also comes in an on-premises version, where you install and host your own monitoring infrastructure: https://sematext.com/spm/#on-premises
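
If you cannot take the logs off the private network at all, a quick pass with standard shell tools still gives rough pause statistics. A minimal sketch, assuming the G1 safepoint lines from the log you posted (gc.log is a placeholder path; grep -P needs GNU grep):

# count, longest and total application stop time from a G1 GC log
grep -oP 'Total time for which application threads were stopped: \K[0-9.]+' gc.log \
  | awk '{ n++; sum += $1; if ($1 > max) max = $1 }
         END { printf "pauses=%d  max=%.3fs  total=%.1fs\n", n, max, sum }'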
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/

> On 28 Feb 2018, at 10:53, 苗海泉 <mseaspr...@gmail.com> wrote:
>
> Thanks for your detailed advice. The monitoring product you describe looks
> good, but our Solr system runs on a private network, so it seems we cannot
> use it at all, and we have not found a single downloadable application for
> analysing specific GC logs.
>
> 2018-02-28 16:57 GMT+08:00 Emir Arnautović <emir.arnauto...@sematext.com>:
>
>> Hi,
>> I would start with the following:
>> 1. Have dedicated nodes for the ZK ensemble - they do not have to be
>> powerful (maybe 2-4 cores and 8GB RAM).
>> 2. Reduce the heap to a value below the margin where the JVM can still
>> use compressed oops - 31GB should be a safe size (see the sketch after
>> this list).
>> 3. Shard each collection across all nodes.
>> 4. Increase the rollover interval to 2h so you keep shard size and shard
>> count as they are today.
>> 5. Experiment with slightly larger rollover intervals (e.g. 3h) if query
>> latency is still acceptable. That will result in fewer, slightly larger
>> shards.
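>>
>> For points 1 and 2, the relevant pieces of solr.in.sh would look roughly
>> like this - a sketch only, with placeholder host names for the dedicated
>> ZK machines:
>>
>> # 31GB keeps the JVM below the compressed-oops cutoff
>> SOLR_JAVA_MEM="-Xms31g -Xmx31g"
>> # ZK ensemble on its own small nodes, not co-located with Solr
>> ZK_HOST="zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181"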
>>
>> In any case, monitor your cluster to see how the changes affect it. Not
>> sure what you currently use for monitoring, but manually scanning GC logs
>> is not fun. You can check out our monitoring tool if you don't have one
>> or if yours does not give you enough visibility: https://sematext.com/spm/
>>
>> HTH,
>> Emir
>> --
>> Monitoring - Log Management - Alerting - Anomaly Detection
>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>>
>>
>>
>>> On 28 Feb 2018, at 02:42, 苗海泉 <mseaspr...@gmail.com> wrote:
>>>
>>> Thank you. I checked the memory footprint: I set the 75% threshold
>>> (InitiatingHeapOccupancyPercent=75), and memory occupancy is at about
>>> 76%. Also, our ZooKeeper is not on dedicated servers; perhaps that is
>>> what causes the instability.
>>>
>>> What else do you recommend I check?
>>>
>>> 2018-02-27 22:37 GMT+08:00 Emir Arnautović <emir.arnauto...@sematext.com>:
>>>
>>>> This does not show much: only that your heap is around 75% (24-25GB). I
>>>> was thinking that you should compare metrics (heap/GC as well) when
>>>> running without issues and when running with issues, and see if
>>>> something can be concluded.
>>>> About instability: do you run ZK on dedicated nodes?
>>>>
>>>> Emir
>>>> --
>>>> Monitoring - Log Management - Alerting - Anomaly Detection
>>>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>>>>
>>>>
>>>>
>>>>> On 27 Feb 2018, at 14:43, 苗海泉 <mseaspr...@gmail.com> wrote:
>>>>>
>>>>> Thank you. We originally ran 49 shards on 49 nodes, but found that with
>>>>> that setup Solr often disconnected from ZooKeeper - too many nodes made
>>>>> Solr unstable - so we reduced it to 25. If performance cannot keep up
>>>>> later, we will need to increase it again.
>>>>>
>>>>> When indexing is very slow, neither Solr nor ZooKeeper reports any
>>>>> errors; building the index is simply slow. The log shows that automatic
>>>>> commits are slow, but the main cause may not lie in the commit itself.
>>>>>
>>>>> I am sorry, I do not know how to check Java heap utilisation. Going by
>>>>> the GC log, GC times are not long. Here is the log:
>>>>>
>>>>>
>>>>> {Heap before GC invocations=1144021 (full 72):
>>>>>  garbage-first heap   total 33554432K, used 26982419K [0x00007f1478000000, 0x00007f1478808000, 0x00007f1c78000000)
>>>>>   region size 8192K, 204 young (1671168K), 26 survivors (212992K)
>>>>>  Metaspace       used 41184K, capacity 41752K, committed 67072K, reserved 67584K
>>>>> 2018-02-27T21:43:01.793+0800: 4668016.044: [GC pause (G1 Evacuation Pause) (young)
>>>>> Desired survivor size 109051904 bytes, new threshold 1 (max 15)
>>>>> - age   1:  113878760 bytes,  113878760 total
>>>>> - age   2:   21264744 bytes,  135143504 total
>>>>> - age   3:   17020096 bytes,  152163600 total
>>>>> - age   4:   26870864 bytes,  179034464 total
>>>>> , 0.0579794 secs]
>>>>>    [Parallel Time: 46.9 ms, GC Workers: 18]
>>>>>       [GC Worker Start (ms): Min: 4668016046.1, Avg: 4668016046.3, Max: 4668016046.4, Diff: 0.3]
>>>>>       [Ext Root Scanning (ms): Min: 2.4, Avg: 6.5, Max: 46.3, Diff: 43.9, Sum: 116.9]
>>>>>       [Update RS (ms): Min: 0.0, Avg: 3.4, Max: 6.0, Diff: 6.0, Sum: 62.0]
>>>>>          [Processed Buffers: Min: 0, Avg: 6.3, Max: 16, Diff: 16, Sum: 113]
>>>>>       [Scan RS (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.1, Sum: 0.5]
>>>>>       [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
>>>>>       [Object Copy (ms): Min: 0.1, Avg: 23.8, Max: 25.5, Diff: 25.5, Sum: 428.1]
>>>>>       [Termination (ms): Min: 0.0, Avg: 12.7, Max: 13.5, Diff: 13.5, Sum: 228.9]
>>>>>          [Termination Attempts: Min: 1, Avg: 1.0, Max: 1, Diff: 0, Sum: 18]
>>>>>       [GC Worker Other (ms): Min: 0.0, Avg: 0.1, Max: 0.4, Diff: 0.4, Sum: 1.2]
>>>>>       [GC Worker Total (ms): Min: 46.4, Avg: 46.6, Max: 46.7, Diff: 0.3, Sum: 838.0]
>>>>>       [GC Worker End (ms): Min: 4668016092.8, Avg: 4668016092.8, Max: 4668016092.8, Diff: 0.0]
>>>>>    [Code Root Fixup: 0.2 ms]
>>>>>    [Code Root Purge: 0.0 ms]
>>>>>    [Clear CT: 0.3 ms]
>>>>>    [Other: 10.7 ms]
>>>>>       [Choose CSet: 0.0 ms]
>>>>>       [Ref Proc: 5.9 ms]
>>>>>       [Ref Enq: 0.2 ms]
>>>>>       [Redirty Cards: 0.2 ms]
>>>>>       [Humongous Register: 2.2 ms]
>>>>>       [Humongous Reclaim: 0.4 ms]
>>>>>       [Free CSet: 0.4 ms]
>>>>>    [Eden: 1424.0M(1424.0M)->0.0B(1552.0M) Survivors: 208.0M->80.0M Heap: 25.7G(32.0G)->24.3G(32.0G)]
>>>>> Heap after GC invocations=1144022 (full 72):
>>>>>  garbage-first heap   total 33554432K, used 25489656K [0x00007f1478000000, 0x00007f1478808000, 0x00007f1c78000000)
>>>>>   region size 8192K, 10 young (81920K), 10 survivors (81920K)
>>>>>  Metaspace       used 41184K, capacity 41752K, committed 67072K, reserved 67584K
>>>>> }
>>>>> [Times: user=0.84 sys=0.01, real=0.05 secs]
>>>>> 2018-02-27T21:43:01.851+0800: 4668016.102: Total time for which application threads were stopped: 0.0661383 seconds, Stopping threads took: 0.0004141 seconds
>>>>> 2018-02-27T21:43:02.092+0800: 4668016.343: [GC concurrent-mark-end, 2.5757061 secs]
>>>>> 2018-02-27T21:43:02.100+0800: 4668016.351: [GC remark 2018-02-27T21:43:02.100+0800: 4668016.351: [Finalize Marking, 0.0016508 secs] 2018-02-27T21:43:02.102+0800: 4668016.352: [GC ref-proc, 0.0277818 secs] 2018-02-27T21:43:02.129+0800: 4668016.380: [Unloading, 0.0118102 secs], 0.0704296 secs]
>>>>> [Times: user=0.85 sys=0.04, real=0.07 secs]
>>>>> 2018-02-27T21:43:02.171+0800: 4668016.422: Total time for which application threads were stopped: 0.0785762 seconds, Stopping threads took: 0.0006159 seconds
>>>>> 2018-02-27T21:43:02.178+0800: 4668016.429: [GC cleanup 24G->24G(32G), 0.0391915 secs]
>>>>> [Times: user=0.64 sys=0.00, real=0.04 secs]
>>>>> 2018-02-27T21:43:02.218+0800: 4668016.469: Total time for which application threads were stopped: 0.0470020 seconds, Stopping threads took: 0.0001684 seconds
>>>>> 2018-02-27T21:43:02.540+0800: 4668016.791: Total time for which application threads were stopped: 0.0074829 seconds, Stopping threads took: 0.0004834 seconds
>>>>> {Heap before GC invocations=1144023 (full 72):
>>>>>  garbage-first heap   total 33554432K, used 27078904K [0x00007f1478000000, 0x00007f1478808000, 0x00007f1c78000000)
>>>>>   region size 8192K, 204 young (1671168K), 10 survivors (81920K)
>>>>>  Metaspace       used 41184K, capacity 41752K, committed 67072K, reserved 67584K
>>>>> 2018-02-27T21:43:04.076+0800: 4668018.326: [GC pause (G1 Evacuation Pause) (young)
>>>>> Desired survivor size 109051904 bytes, new threshold 15 (max 15)
>>>>> - age   1:   47719032 bytes,   47719032 total
>>>>> , 0.0554183 secs]
>>>>>    [Parallel Time: 48.0 ms, GC Workers: 18]
>>>>>       [GC Worker Start (ms): Min: 4668018329.0, Avg: 4668018329.1, Max: 4668018329.3, Diff: 0.3]
>>>>>       [Ext Root Scanning (ms): Min: 2.9, Avg: 5.7, Max: 47.4, Diff: 44.6, Sum: 103.0]
>>>>>       [Update RS (ms): Min: 0.0, Avg: 14.3, Max: 16.2, Diff: 16.2, Sum: 257.6]
>>>>>          [Processed Buffers: Min: 0, Avg: 17.4, Max: 22, Diff: 22, Sum: 314]
>>>>>       [Scan RS (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.1, Sum: 0.5]
>>>>>       [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
>>>>>       [Object Copy (ms): Min: 0.1, Avg: 10.9, Max: 11.9, Diff: 11.8, Sum: 196.9]
>>>>>       [Termination (ms): Min: 0.0, Avg: 16.6, Max: 17.6, Diff: 17.6, Sum: 299.1]
>>>>>          [Termination Attempts: Min: 1, Avg: 1.0, Max: 1, Diff: 0, Sum: 18]
>>>>>       [GC Worker Other (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.0, Sum: 0.5]
>>>>>       [GC Worker Total (ms): Min: 47.5, Avg: 47.6, Max: 47.8, Diff: 0.3, Sum: 857.6]
>>>>>       [GC Worker End (ms): Min: 4668018376.7, Avg: 4668018376.8, Max: 4668018376.8, Diff: 0.0]
>>>>>    [Code Root Fixup: 0.2 ms]
>>>>>    [Code Root Purge: 0.0 ms]
>>>>>    [Clear CT: 0.2 ms]
>>>>>    [Other: 7.1 ms]
>>>>>       [Choose CSet: 0.0 ms]
>>>>>       [Ref Proc: 2.3 ms]
>>>>>       [Ref Enq: 0.2 ms]
>>>>>       [Redirty Cards: 0.2 ms]
>>>>>       [Humongous Register: 2.2 ms]
>>>>>       [Humongous Reclaim: 0.4 ms]
>>>>>       [Free CSet: 0.4 ms]
>>>>>    [Eden: 1552.0M(1552.0M)->0.0B(1488.0M) Survivors: 80.0M->144.0M Heap: 25.8G(32.0G)->24.4G(32.0G)]
>>>>> Heap after GC invocations=1144024 (full 72):
>>>>>  garbage-first heap   total 33554432K, used 25550050K [0x00007f1478000000, 0x00007f1478808000, 0x00007f1c78000000)
>>>>>   region size 8192K, 18 young (147456K), 18 survivors (147456K)
>>>>>  Metaspace       used 41184K, capacity 41752K, committed 67072K, reserved 67584K
>>>>> }
>>>>> [Times: user=0.82 sys=0.00, real=0.05 secs]
>>>>>
>>>>>
>>>>>
>>>>> 2018-02-27 20:58 GMT+08:00 Emir Arnautović <emir.arnauto...@sematext.com>:
>>>>>
>>>>>> Ah, so there are ~560 shards per node and not all nodes are indexing
>>>>>> at the same time. Why is that? You would get better throughput if you
>>>>>> indexed on all nodes. If you are happy with the shard size, you can
>>>>>> create a new collection with 49 shards every 2h, keep everything else
>>>>>> the same, and index on all nodes.
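>>>>>>
>>>>>> For example, something along these lines every 2h - just a sketch;
>>>>>> the collection and config names are placeholders:
>>>>>>
>>>>>> curl "http://localhost:8983/solr/admin/collections?action=CREATE&name=logs_2018022800&numShards=49&replicationFactor=1&maxShardsPerNode=1&collection.configName=logs_conf"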
>>>>>>
>>>>>> Back to the main question: what is the heap utilisation? When you
>>>>>> restart a node, what is the heap utilisation then? Do you see any
>>>>>> errors in your logs? Do you see any errors in the ZK logs?
>>>>>>
>>>>>> Emir
>>>>>> --
>>>>>> Monitoring - Log Management - Alerting - Anomaly Detection
>>>>>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>>>>>>
>>>>>>
>>>>>>
>>>>>>> On 27 Feb 2018, at 13:22, 苗海泉 <mseaspr...@gmail.com> wrote:
>>>>>>>
>>>>>>> Thanks for your reply again.
>>>>>>> I think there may be some misunderstanding: we have 49 Solr nodes,
>>>>>>> each collection has 25 shards, and each shard has only a single
>>>>>>> replica of the data - there are no extra copies - and I have reduced
>>>>>>> part of the cache. If you need the metric data, I can look it up and
>>>>>>> send it to you. Also, ours is an append-only system; there are never
>>>>>>> any update or delete operations.
>>>>>>>
>>>>>>> 2018-02-27 20:05 GMT+08:00 Emir Arnautović <emir.arnauto...@sematext.com>:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>> It is hard to tell without looking more into your metrics. It seems
>>>>>>>> to me that you are reaching the limits of your cluster. I would
>>>>>>>> double-check whether memory is the issue. If I got it right, you
>>>>>>>> have ~1120 shards per node. It takes some heap just to keep them
>>>>>>>> open. If you have some caches enabled and it is an append-only
>>>>>>>> system, old shards will keep their caches until reloaded.
>>>>>>>> Probably will not make much difference, but with 25x2=50 shards and
>>>>>>>> 49 nodes, one node will need to handle double the indexing load.
>>>>>>>>
>>>>>>>> Emir
>>>>>>>> --
>>>>>>>> Monitoring - Log Management - Alerting - Anomaly Detection
>>>>>>>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> On 27 Feb 2018, at 12:54, 苗海泉 <mseaspr...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> In addition, we found that the indexing rate was normal as long as
>>>>>>>>> the number of collections stayed below 936, and became slower and
>>>>>>>>> slower at around 984. So far we could only work around it by
>>>>>>>>> temporarily deleting older collections, but now we need to keep
>>>>>>>>> more collections online. This has puzzled us for a long time with
>>>>>>>>> no good way out, and we would very much appreciate any ideas for
>>>>>>>>> solving it.
>>>>>>>>>
>>>>>>>>> 2018-02-27 19:46 GMT+08:00 苗海泉 <mseaspr...@gmail.com>:
>>>>>>>>>
>>>>>>>>>> Thank you for the reply.
>>>>>>>>>> Each collection has 25 shards with one replica, and one Solr node
>>>>>>>>>> holds about 5TB on disk.
>>>>>>>>>> GC has been checked and modified as follows:
>>>>>>>>>> SOLR_JAVA_MEM="-Xms32768m -Xmx32768m"
>>>>>>>>>> GC_TUNE=" \
>>>>>>>>>> -XX:+UseG1GC \
>>>>>>>>>> -XX:+PerfDisableSharedMem \
>>>>>>>>>> -XX:+ParallelRefProcEnabled \
>>>>>>>>>> -XX:G1HeapRegionSize=8m \
>>>>>>>>>> -XX:MaxGCPauseMillis=250 \
>>>>>>>>>> -XX:InitiatingHeapOccupancyPercent=75 \
>>>>>>>>>> -XX:+AggressiveOpts \
>>>>>>>>>> -XX:+UseLargePages"
>>>>>>>>>>
>>>>>>>>>> 2018-02-27 19:27 GMT+08:00 Emir Arnautović <emir.arnauto...@sematext.com>:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>> To get a more complete picture, can you tell us how many
>>>>>>>>>>> shards/replicas you have per collection? Also, what is the index
>>>>>>>>>>> size on disk? Did you check GC?
>>>>>>>>>>>
>>>>>>>>>>> BTW, using a 32GB heap prevents the JVM from using compressed
>>>>>>>>>>> oops, resulting in less usable memory than with a 31GB heap.
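>>>>>>>>>>>
>>>>>>>>>>> You can verify the cutoff on your own JVM - roughly like this
>>>>>>>>>>> (exact output varies by JVM version):
>>>>>>>>>>>
>>>>>>>>>>> # at 32g the JVM silently disables compressed oops
>>>>>>>>>>> java -Xmx32g -XX:+PrintFlagsFinal -version | grep UseCompressedOops
>>>>>>>>>>> # at 31g it is still enabled
>>>>>>>>>>> java -Xmx31g -XX:+PrintFlagsFinal -version | grep UseCompressedOops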
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Emir
>>>>>>>>>>> --
>>>>>>>>>>> Monitoring - Log Management - Alerting - Anomaly Detection
>>>>>>>>>>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> On 27 Feb 2018, at 11:36, 苗海泉 <mseaspr...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> I have run into a rather serious problem using Solr. We are on
>>>>>>>>>>>> Solr 6.0; our daily data volume is about 500 billion documents;
>>>>>>>>>>>> we create a collection every hour; there are more than a
>>>>>>>>>>>> thousand collections online, across 49 Solr nodes. With fewer
>>>>>>>>>>>> than 800 collections, indexing is still very fast; at around
>>>>>>>>>>>> 1100 collections, the Solr indexing rate drops sharply - a
>>>>>>>>>>>> program that originally ran at about 2-3 million TPS drops to
>>>>>>>>>>>> only a few hundred or even tens of TPS - and we have found no
>>>>>>>>>>>> good lead on the cause. By the way, each Solr node is assigned
>>>>>>>>>>>> 32GB of memory. We checked memory, CPU, disk IO and network IO
>>>>>>>>>>>> utilisation: there is no problem, everything looks normal. If
>>>>>>>>>>>> anyone has encountered a similar problem, please share the
>>>>>>>>>>>> solution. Thank you very much.
>
>
> --
> ==============================
> 联创科技
> 知行如一
> ==============================