If you are only after visualising GC, there are several tools: some you can download and run, and some let you upload logs to visualise them. If you would like to monitor the whole stack (host/Solr/JVM), Sematext's SPM also comes in an on-premises version, where you install and host your own monitoring infrastructure: https://sematext.com/spm/#on-premises
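
If you cannot take the logs off the private network at all, a quick pass with standard shell tools still gives rough pause statistics. A minimal sketch, assuming the G1 safepoint lines from the log you posted (gc.log is a placeholder path; grep -P needs GNU grep):

# count, longest and total application stop time from a G1 GC log
grep -oP 'Total time for which application threads were stopped: \K[0-9.]+' gc.log \
  | awk '{ n++; sum += $1; if ($1 > max) max = $1 }
         END { printf "pauses=%d  max=%.3fs  total=%.1fs\n", n, max, sum }'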
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/

> On 28 Feb 2018, at 10:53, 苗海泉 <mseaspr...@gmail.com> wrote:
>
> Thanks for your detailed advice. The monitoring product you describe looks
> good, but our Solr system runs on a private network, so it seems we cannot
> use it at all, and we have not found a single downloadable application for
> analysing specific GC logs.
>
> 2018-02-28 16:57 GMT+08:00 Emir Arnautović <emir.arnauto...@sematext.com>:
>
>> Hi,
>> I would start with the following:
>> 1. Have dedicated nodes for the ZK ensemble - they do not have to be
>> powerful (maybe 2-4 cores and 8GB RAM).
>> 2. Reduce the heap to a value below the margin where the JVM can still
>> use compressed oops - 31GB should be a safe size (see the sketch after
>> this list).
>> 3. Shard each collection across all nodes.
>> 4. Increase the rollover interval to 2h so you keep shard size and shard
>> count as they are today.
>> 5. Experiment with slightly larger rollover intervals (e.g. 3h) if query
>> latency is still acceptable. That will result in fewer, slightly larger
>> shards.
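>>
>> For points 1 and 2, the relevant pieces of solr.in.sh would look roughly
>> like this - a sketch only, with placeholder host names for the dedicated
>> ZK machines:
>>
>> # 31GB keeps the JVM below the compressed-oops cutoff
>> SOLR_JAVA_MEM="-Xms31g -Xmx31g"
>> # ZK ensemble on its own small nodes, not co-located with Solr
>> ZK_HOST="zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181"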
>>
>> In any case, monitor your cluster to see how the changes affect it. Not
>> sure what you currently use for monitoring, but manually scanning GC logs
>> is not fun. You can check out our monitoring tool if you don't have one
>> or if yours does not give you enough visibility: https://sematext.com/spm/
>>
>> HTH,
>> Emir
>> --
>> Monitoring - Log Management - Alerting - Anomaly Detection
>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>>
>>
>>
>>> On 28 Feb 2018, at 02:42, 苗海泉 <mseaspr...@gmail.com> wrote:
>>>
>>> Thank you. I checked the memory footprint: I set the 75% threshold
>>> (InitiatingHeapOccupancyPercent=75), and memory occupancy is at about
>>> 76%. Also, our ZooKeeper is not on dedicated servers; perhaps that is
>>> what causes the instability.
>>>
>>> What else do you recommend I check?
>>>
>>> 2018-02-27 22:37 GMT+08:00 Emir Arnautović <emir.arnauto...@sematext.com>:
>>>
>>>> This does not show much: only that your heap is around 75% (24-25GB). I
>>>> was thinking that you should compare metrics (heap/GC as well) when
>>>> running without issues and when running with issues, and see if
>>>> something can be concluded.
>>>> About instability: do you run ZK on dedicated nodes?
>>>>
>>>> Emir
>>>> --
>>>> Monitoring - Log Management - Alerting - Anomaly Detection
>>>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>>>>
>>>>
>>>>
>>>>> On 27 Feb 2018, at 14:43, 苗海泉 <mseaspr...@gmail.com> wrote:
>>>>>
>>>>> Thank you. We originally ran 49 shards on 49 nodes, but found that with
>>>>> that setup Solr often disconnected from ZooKeeper - too many nodes made
>>>>> Solr unstable - so we reduced it to 25. If performance cannot keep up
>>>>> later, we will need to increase it again.
>>>>>
>>>>> When indexing is very slow, neither Solr nor ZooKeeper reports any
>>>>> errors; building the index is simply slow. The log shows that automatic
>>>>> commits are slow, but the main cause may not lie in the commit itself.
>>>>>
>>>>> I am sorry, I do not know how to check Java heap utilisation. Going by
>>>>> the GC log, GC times are not long. Here is the log:
>>>>>
>>>>>
>>>>> {Heap before GC invocations=1144021 (full 72):
>>>>>  garbage-first heap   total 33554432K, used 26982419K [0x00007f1478000000, 0x00007f1478808000, 0x00007f1c78000000)
>>>>>   region size 8192K, 204 young (1671168K), 26 survivors (212992K)
>>>>>  Metaspace       used 41184K, capacity 41752K, committed 67072K, reserved 67584K
>>>>> 2018-02-27T21:43:01.793+0800: 4668016.044: [GC pause (G1 Evacuation Pause) (young)
>>>>> Desired survivor size 109051904 bytes, new threshold 1 (max 15)
>>>>> - age   1:  113878760 bytes,  113878760 total
>>>>> - age   2:   21264744 bytes,  135143504 total
>>>>> - age   3:   17020096 bytes,  152163600 total
>>>>> - age   4:   26870864 bytes,  179034464 total
>>>>> , 0.0579794 secs]
>>>>>    [Parallel Time: 46.9 ms, GC Workers: 18]
>>>>>       [GC Worker Start (ms): Min: 4668016046.1, Avg: 4668016046.3, Max: 4668016046.4, Diff: 0.3]
>>>>>       [Ext Root Scanning (ms): Min: 2.4, Avg: 6.5, Max: 46.3, Diff: 43.9, Sum: 116.9]
>>>>>       [Update RS (ms): Min: 0.0, Avg: 3.4, Max: 6.0, Diff: 6.0, Sum: 62.0]
>>>>>          [Processed Buffers: Min: 0, Avg: 6.3, Max: 16, Diff: 16, Sum: 113]
>>>>>       [Scan RS (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.1, Sum: 0.5]
>>>>>       [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
>>>>>       [Object Copy (ms): Min: 0.1, Avg: 23.8, Max: 25.5, Diff: 25.5, Sum: 428.1]
>>>>>       [Termination (ms): Min: 0.0, Avg: 12.7, Max: 13.5, Diff: 13.5, Sum: 228.9]
>>>>>          [Termination Attempts: Min: 1, Avg: 1.0, Max: 1, Diff: 0, Sum: 18]
>>>>>       [GC Worker Other (ms): Min: 0.0, Avg: 0.1, Max: 0.4, Diff: 0.4, Sum: 1.2]
>>>>>       [GC Worker Total (ms): Min: 46.4, Avg: 46.6, Max: 46.7, Diff: 0.3, Sum: 838.0]
>>>>>       [GC Worker End (ms): Min: 4668016092.8, Avg: 4668016092.8, Max: 4668016092.8, Diff: 0.0]
>>>>>    [Code Root Fixup: 0.2 ms]
>>>>>    [Code Root Purge: 0.0 ms]
>>>>>    [Clear CT: 0.3 ms]
>>>>>    [Other: 10.7 ms]
>>>>>       [Choose CSet: 0.0 ms]
>>>>>       [Ref Proc: 5.9 ms]
>>>>>       [Ref Enq: 0.2 ms]
>>>>>       [Redirty Cards: 0.2 ms]
>>>>>       [Humongous Register: 2.2 ms]
>>>>>       [Humongous Reclaim: 0.4 ms]
>>>>>       [Free CSet: 0.4 ms]
>>>>>    [Eden: 1424.0M(1424.0M)->0.0B(1552.0M) Survivors: 208.0M->80.0M Heap: 25.7G(32.0G)->24.3G(32.0G)]
>>>>> Heap after GC invocations=1144022 (full 72):
>>>>>  garbage-first heap   total 33554432K, used 25489656K [0x00007f1478000000, 0x00007f1478808000, 0x00007f1c78000000)
>>>>>   region size 8192K, 10 young (81920K), 10 survivors (81920K)
>>>>>  Metaspace       used 41184K, capacity 41752K, committed 67072K, reserved 67584K
>>>>> }
>>>>> [Times: user=0.84 sys=0.01, real=0.05 secs]
>>>>> 2018-02-27T21:43:01.851+0800: 4668016.102: Total time for which application threads were stopped: 0.0661383 seconds, Stopping threads took: 0.0004141 seconds
>>>>> 2018-02-27T21:43:02.092+0800: 4668016.343: [GC concurrent-mark-end, 2.5757061 secs]
>>>>> 2018-02-27T21:43:02.100+0800: 4668016.351: [GC remark 2018-02-27T21:43:02.100+0800: 4668016.351: [Finalize Marking, 0.0016508 secs] 2018-02-27T21:43:02.102+0800: 4668016.352: [GC ref-proc, 0.0277818 secs] 2018-02-27T21:43:02.129+0800: 4668016.380: [Unloading, 0.0118102 secs], 0.0704296 secs]
>>>>> [Times: user=0.85 sys=0.04, real=0.07 secs]
>>>>> 2018-02-27T21:43:02.171+0800: 4668016.422: Total time for which application threads were stopped: 0.0785762 seconds, Stopping threads took: 0.0006159 seconds
>>>>> 2018-02-27T21:43:02.178+0800: 4668016.429: [GC cleanup 24G->24G(32G), 0.0391915 secs]
>>>>> [Times: user=0.64 sys=0.00, real=0.04 secs]
>>>>> 2018-02-27T21:43:02.218+0800: 4668016.469: Total time for which application threads were stopped: 0.0470020 seconds, Stopping threads took: 0.0001684 seconds
>>>>> 2018-02-27T21:43:02.540+0800: 4668016.791: Total time for which application threads were stopped: 0.0074829 seconds, Stopping threads took: 0.0004834 seconds
>>>>> {Heap before GC invocations=1144023 (full 72):
>>>>>  garbage-first heap   total 33554432K, used 27078904K [0x00007f1478000000, 0x00007f1478808000, 0x00007f1c78000000)
>>>>>   region size 8192K, 204 young (1671168K), 10 survivors (81920K)
>>>>>  Metaspace       used 41184K, capacity 41752K, committed 67072K, reserved 67584K
>>>>> 2018-02-27T21:43:04.076+0800: 4668018.326: [GC pause (G1 Evacuation Pause) (young)
>>>>> Desired survivor size 109051904 bytes, new threshold 15 (max 15)
>>>>> - age   1:   47719032 bytes,   47719032 total
>>>>> , 0.0554183 secs]
>>>>>    [Parallel Time: 48.0 ms, GC Workers: 18]
>>>>>       [GC Worker Start (ms): Min: 4668018329.0, Avg: 4668018329.1, Max: 4668018329.3, Diff: 0.3]
>>>>>       [Ext Root Scanning (ms): Min: 2.9, Avg: 5.7, Max: 47.4, Diff: 44.6, Sum: 103.0]
>>>>>       [Update RS (ms): Min: 0.0, Avg: 14.3, Max: 16.2, Diff: 16.2, Sum: 257.6]
>>>>>          [Processed Buffers: Min: 0, Avg: 17.4, Max: 22, Diff: 22, Sum: 314]
>>>>>       [Scan RS (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.1, Sum: 0.5]
>>>>>       [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
>>>>>       [Object Copy (ms): Min: 0.1, Avg: 10.9, Max: 11.9, Diff: 11.8, Sum: 196.9]
>>>>>       [Termination (ms): Min: 0.0, Avg: 16.6, Max: 17.6, Diff: 17.6, Sum: 299.1]
>>>>>          [Termination Attempts: Min: 1, Avg: 1.0, Max: 1, Diff: 0, Sum: 18]
>>>>>       [GC Worker Other (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.0, Sum: 0.5]
>>>>>       [GC Worker Total (ms): Min: 47.5, Avg: 47.6, Max: 47.8, Diff: 0.3, Sum: 857.6]
>>>>>       [GC Worker End (ms): Min: 4668018376.7, Avg: 4668018376.8, Max: 4668018376.8, Diff: 0.0]
>>>>>    [Code Root Fixup: 0.2 ms]
>>>>>    [Code Root Purge: 0.0 ms]
>>>>>    [Clear CT: 0.2 ms]
>>>>>    [Other: 7.1 ms]
>>>>>       [Choose CSet: 0.0 ms]
>>>>>       [Ref Proc: 2.3 ms]
>>>>>       [Ref Enq: 0.2 ms]
>>>>>       [Redirty Cards: 0.2 ms]
>>>>>       [Humongous Register: 2.2 ms]
>>>>>       [Humongous Reclaim: 0.4 ms]
>>>>>       [Free CSet: 0.4 ms]
>>>>>    [Eden: 1552.0M(1552.0M)->0.0B(1488.0M) Survivors: 80.0M->144.0M Heap: 25.8G(32.0G)->24.4G(32.0G)]
>>>>> Heap after GC invocations=1144024 (full 72):
>>>>>  garbage-first heap   total 33554432K, used 25550050K [0x00007f1478000000, 0x00007f1478808000, 0x00007f1c78000000)
>>>>>   region size 8192K, 18 young (147456K), 18 survivors (147456K)
>>>>>  Metaspace       used 41184K, capacity 41752K, committed 67072K, reserved 67584K
>>>>> }
>>>>> [Times: user=0.82 sys=0.00, real=0.05 secs]
>>>>>
>>>>>
>>>>>
>>>>> 2018-02-27 20:58 GMT+08:00 Emir Arnautović <emir.arnauto...@sematext.com>:
>>>>>
>>>>>> Ah, so there are ~560 shards per node and not all nodes are indexing
>>>>>> at the same time. Why is that? You would get better throughput if you
>>>>>> indexed on all nodes. If you are happy with the shard size, you can
>>>>>> create a new collection with 49 shards every 2h, keep everything else
>>>>>> the same, and index on all nodes.
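>>>>>>
>>>>>> For example, something along these lines every 2h - just a sketch;
>>>>>> the collection and config names are placeholders:
>>>>>>
>>>>>> curl "http://localhost:8983/solr/admin/collections?action=CREATE&name=logs_2018022800&numShards=49&replicationFactor=1&maxShardsPerNode=1&collection.configName=logs_conf"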
>>>>>>
>>>>>> Back to the main question: what is the heap utilisation? When you
>>>>>> restart a node, what is the heap utilisation then? Do you see any
>>>>>> errors in your logs? Do you see any errors in the ZK logs?
>>>>>>
>>>>>> Emir
>>>>>> --
>>>>>> Monitoring - Log Management - Alerting - Anomaly Detection
>>>>>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>>>>>>
>>>>>>
>>>>>>
>>>>>>> On 27 Feb 2018, at 13:22, 苗海泉 <mseaspr...@gmail.com> wrote:
>>>>>>>
>>>>>>> Thanks for your reply again.
>>>>>>> I think there may be some misunderstanding: we have 49 Solr nodes,
>>>>>>> each collection has 25 shards, and each shard has only a single
>>>>>>> replica of the data - there are no extra copies - and I have reduced
>>>>>>> part of the cache. If you need the metric data, I can look it up and
>>>>>>> send it to you. Also, ours is an append-only system; there are never
>>>>>>> any update or delete operations.
>>>>>>>
>>>>>>> 2018-02-27 20:05 GMT+08:00 Emir Arnautović <emir.arnauto...@sematext.com>:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>> It is hard to tell without looking more into your metrics. It seems
>>>>>>>> to me that you are reaching the limits of your cluster. I would
>>>>>>>> double-check whether memory is the issue. If I got it right, you
>>>>>>>> have ~1120 shards per node. It takes some heap just to keep them
>>>>>>>> open. If you have some caches enabled and it is an append-only
>>>>>>>> system, old shards will keep their caches until reloaded.
>>>>>>>> Probably will not make much difference, but with 25x2=50 shards and
>>>>>>>> 49 nodes, one node will need to handle double the indexing load.
>>>>>>>>
>>>>>>>> Emir
>>>>>>>> --
>>>>>>>> Monitoring - Log Management - Alerting - Anomaly Detection
>>>>>>>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> On 27 Feb 2018, at 12:54, 苗海泉 <mseaspr...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> In addition, we found that the indexing rate was normal as long as
>>>>>>>>> the number of collections stayed below 936, and became slower and
>>>>>>>>> slower at around 984. So far we could only work around it by
>>>>>>>>> temporarily deleting older collections, but now we need to keep
>>>>>>>>> more collections online. This has puzzled us for a long time with
>>>>>>>>> no good way out, and we would very much appreciate any ideas for
>>>>>>>>> solving it.
>>>>>>>>>
>>>>>>>>> 2018-02-27 19:46 GMT+08:00 苗海泉 <mseaspr...@gmail.com>:
>>>>>>>>>
>>>>>>>>>> Thank you for the reply.
>>>>>>>>>> Each collection has 25 shards with one replica, and one Solr node
>>>>>>>>>> holds about 5TB on disk.
>>>>>>>>>> GC has been checked and modified as follows:
>>>>>>>>>> SOLR_JAVA_MEM="-Xms32768m -Xmx32768m"
>>>>>>>>>> GC_TUNE=" \
>>>>>>>>>> -XX:+UseG1GC \
>>>>>>>>>> -XX:+PerfDisableSharedMem \
>>>>>>>>>> -XX:+ParallelRefProcEnabled \
>>>>>>>>>> -XX:G1HeapRegionSize=8m \
>>>>>>>>>> -XX:MaxGCPauseMillis=250 \
>>>>>>>>>> -XX:InitiatingHeapOccupancyPercent=75 \
>>>>>>>>>> -XX:+AggressiveOpts \
>>>>>>>>>> -XX:+UseLargePages"
>>>>>>>>>>
>>>>>>>>>> 2018-02-27 19:27 GMT+08:00 Emir Arnautović <emir.arnauto...@sematext.com>:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>> To get a more complete picture, can you tell us how many
>>>>>>>>>>> shards/replicas you have per collection? Also, what is the index
>>>>>>>>>>> size on disk? Did you check GC?
>>>>>>>>>>>
>>>>>>>>>>> BTW, using a 32GB heap prevents the JVM from using compressed
>>>>>>>>>>> oops, resulting in less usable memory than with a 31GB heap.
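>>>>>>>>>>>
>>>>>>>>>>> You can verify the cutoff on your own JVM - roughly like this
>>>>>>>>>>> (exact output varies by JVM version):
>>>>>>>>>>>
>>>>>>>>>>> # at 32g the JVM silently disables compressed oops
>>>>>>>>>>> java -Xmx32g -XX:+PrintFlagsFinal -version | grep UseCompressedOops
>>>>>>>>>>> # at 31g it is still enabled
>>>>>>>>>>> java -Xmx31g -XX:+PrintFlagsFinal -version | grep UseCompressedOops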
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Emir
>>>>>>>>>>> --
>>>>>>>>>>> Monitoring - Log Management - Alerting - Anomaly Detection
>>>>>>>>>>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> On 27 Feb 2018, at 11:36, 苗海泉 <mseaspr...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> I have run into a rather serious problem using Solr. We are on
>>>>>>>>>>>> Solr 6.0; our daily data volume is about 500 billion documents;
>>>>>>>>>>>> we create a collection every hour; there are more than a
>>>>>>>>>>>> thousand collections online, across 49 Solr nodes. With fewer
>>>>>>>>>>>> than 800 collections, indexing is still very fast; at around
>>>>>>>>>>>> 1100 collections, the Solr indexing rate drops sharply - a
>>>>>>>>>>>> program that originally ran at about 2-3 million TPS drops to
>>>>>>>>>>>> only a few hundred or even tens of TPS - and we have found no
>>>>>>>>>>>> good lead on the cause. By the way, each Solr node is assigned
>>>>>>>>>>>> 32GB of memory. We checked memory, CPU, disk IO and network IO
>>>>>>>>>>>> utilisation: there is no problem, everything looks normal. If
>>>>>>>>>>>> anyone has encountered a similar problem, please share the
>>>>>>>>>>>> solution. Thank you very much.
>
>
> --
> ==============================
> 联创科技
> 知行如一
> ==============================