Hi Dominique,

Thanks for the response.

I don't think I would use JVM version 14. OpenJDK 11, in my opinion, is the
best choice as an LTS version.

>> We will try changing it.

You changed a lot of default values. Any specific reasons? It seems very
aggressive!

>> Our product team wants data to be reflected in Near Real Time.
mergePolicyFactory, mergeScheduler - these values are based on our oldest
SOLR cluster, where tweaking these parameters gave good results.

You have to analyze GC on all nodes!

>> I checked the other nodes' GC and found no issues. I shared the GC report
of the node which gets into trouble most frequently.
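
For reference, this is roughly how we capture and compare GC activity across
the nodes before uploading to GCeasy (a sketch; the log path is ours, and we
believe the flags below are what bin/solr passes by default on JDK 9+):

    # unified GC logging flags in effect on each node (rotated files)
    -Xlog:gc*:file=/var/solr/logs/solr_gc.log:time,uptime:filecount=9,filesize=20M

    # quick per-node comparison: how many full collections each node logged
    grep -c "Pause Full" /var/solr/logs/solr_gc.log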

Your heap is very big. Given the full GC frequency, I don't think you really
need such a big heap for indexing only. Maybe you will when you start running
queries.

>> Heap sizing is based on the select requests we are expecting, which we
estimate at around 10 to 15 million per day. We plan to increase CPU before
routing select traffic.

Did you check your network performance?

>> We did check the sar reports but were unable to find an issue; we use a
10 Gbps connection. Is there any SOLR metrics API which gives network-related
information? Please suggest other ways to dig into this further.
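
For reference, the OS-level checks we have run so far look roughly like this
(a sketch of our sar/ss usage; the sampling intervals are arbitrary):

    # per-interface throughput and error counters
    sar -n DEV 2 30
    sar -n EDEV 2 30

    # TCP activity, including retransmissions
    sar -n TCP,ETCP 2 30

    # socket state summary (e.g. how many sockets sit in TIME-WAIT)
    ss -s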

Did you check the Zookeeper logs?

>> We have never looked at the Zookeeper logs; we will check and share them.
Is there any particular kind of information we should watch out for?
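
Unless you have specific messages in mind, we plan to start by scanning the
ZooKeeper logs for session and connection problems, roughly like this (a
sketch; the log path is an assumption based on a default install):

    # look for expired sessions, dropped client connections and connection-limit warnings
    grep -iE "expir|closed socket|too many connections|exception" /opt/zookeeper/logs/zookeeper.out | tail -n 50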

Regards,
Doss


On Monday, August 10, 2020, Dominique Bejean <dominique.bej...@eolya.fr>
wrote:

> Doss,
>
> See below.
>
> Dominique
>
>
> On Mon, Aug 10, 2020 at 17:41, Doss <itsmed...@gmail.com> wrote:
>
>> Hi Dominique,
>>
>> Thanks for your response. Find below the details, please do let me know
>> if anything I missed.
>>
>>
>> *- hardware architecture and sizing*
>> >> CentOS 7, VMs, 4 CPUs, 66 GB RAM, 16 GB heap, 250 GB SSD
>>
>>
>> *- JVM version / settings    *
>> >> Red Hat, Inc. OpenJDK 64-Bit Server VM, version:"14.0.1 14.0.1+7" -
>> Default Settings including GC
>>
>
> I don't think I would use JVM version 14. OpenJDK 11, in my opinion, is
> the best choice as an LTS version.
>
>
>>
>> *- Solr settings    *
>> >> softCommit: 15000 (15 sec), autoCommit: 300000 (5 mins)
>> <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
>>   <int name="maxMergeAtOnce">30</int>
>>   <int name="maxMergeAtOnceExplicit">100</int>
>>   <double name="segmentsPerTier">30.0</double>
>> </mergePolicyFactory>
>>
>> <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
>>   <int name="maxMergeCount">18</int>
>>   <int name="maxThreadCount">6</int>
>> </mergeScheduler>
>>
>
> You changed a lot of default values. Any specific reasons? It seems very
> aggressive!
>
>
>>
>>
>> *- collections and queries information   *
>> >> One collection, with 4 shards, 3 replicas, 3.5 million records, 150
>> columns, mostly integer fields; average doc size is 350 KB. Inserts/updates:
>> 0.5 million spread across the whole day (peak time being 6PM to 10PM);
>> selects not yet started. Once daily we do a delta import of certain
>> multivalued fields with a good amount of data.
>>
>> *- gc logs or gceasy results*
>>
>> The GCeasy report says GC health is good. One server's GC report:
>> https://drive.google.com/file/d/1C2SqEn0iMbUOXnTNlYi46Gq9kF_CmWss/view?usp=sharing
>> CPU load pattern:
>> https://drive.google.com/file/d/1rjRMWv5ritf5QxgbFxDa0kPzVlXdbySe/view?usp=sharing
>>
>>
> You have to analyze GC on all nodes!
> Your heap is very big. Given the full GC frequency, I don't think you
> really need such a big heap for indexing only. Maybe you will when you
> start running queries.
>
> Did you check your network performance?
> Did you check the Zookeeper logs?
>
>
>>
>> Thanks,
>> Doss.
>>
>>
>>
>> On Mon, Aug 10, 2020 at 7:39 PM Dominique Bejean <
>> dominique.bej...@eolya.fr> wrote:
>>
>>> Hi Doss,
>>>
>>> Seeing a lot of TIMED_WAITING connections happens in high TCP traffic
>>> infrastructures, for example in a LAMP solution when the Apache server
>>> can no longer connect to the MySQL/MariaDB database.
>>> In this case, tweaking net.ipv4.tcp_tw_reuse is a possible solution (but
>>> never net.ipv4.tcp_tw_recycle, as you suggested in your previous post).
>>> This is well explained in this great article:
>>> https://vincent.bernat.ch/en/blog/2014-tcp-time-wait-state-linux
>>>
>>> However, in general and more specifically in your case, I would
>>> investigate the root cause of your issue rather than try to find a
>>> workaround.
>>>
>>> Can you provide more information about your use case (we know: 3 node
>>> SOLR (8.3.1 NRT) + 3 Node Zookeeper Ensemble)?
>>>
>>>    - hardware architecture and sizing
>>>    - JVM version / settings
>>>    - Solr settings
>>>    - collections and queries information
>>>    - gc logs or gceasy results
>>>
>>> Regards
>>>
>>> Dominique
>>>
>>>
>>>
>>> On Mon, Aug 10, 2020 at 15:43, Doss <itsmed...@gmail.com> wrote:
>>>
>>> > Hi,
>>> >
>>> > In the Solr 8.3.1 source I see the following, which I assume could be
>>> > the reason for the issue "Max requests queued per destination 3000
>>> > exceeded for HttpDestination":
>>> >
>>> > solr/solrj/src/java/org/apache/solr/client/solrj/impl/Http2SolrClient.java:
>>> >    private static final int MAX_OUTSTANDING_REQUESTS = 1000;
>>> > solr/solrj/src/java/org/apache/solr/client/solrj/impl/Http2SolrClient.java:
>>> >      available = new Semaphore(MAX_OUTSTANDING_REQUESTS, false);
>>> > solr/solrj/src/java/org/apache/solr/client/solrj/impl/Http2SolrClient.java:
>>> >      return MAX_OUTSTANDING_REQUESTS * 3;
>>> >
>>> > how can I increase this?
>>> >
>>> > On Mon, Aug 10, 2020 at 12:01 AM Doss <itsmed...@gmail.com> wrote:
>>> >
>>> > > Hi,
>>> > >
>>> > > We are running a 3 node SOLR (8.3.1 NRT) + 3 Node Zookeeper Ensemble,
>>> > > and now and then we are facing "Max requests queued per destination
>>> > > 3000 exceeded for HttpDestination".
>>> > >
>>> > > After a restart everything starts working fine until the problem
>>> > > occurs again. Once the problem occurs we see a huge number of
>>> > > TIMED_WAITING threads:
>>> > >
>>> > > Server 1:
>>> > >    *7722*  Threads are in TIMED_WAITING
>>> > > ("lock":"java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@151d5f2f")
>>> > > Server 2:
>>> > >    *4046*   Threads are in TIMED_WAITING
>>> > > ("lock":"java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@1e0205c3")
>>> > > Server 3:
>>> > >    *4210*   Threads are in TIMED_WAITING
>>> > > ("lock":"java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@5ee792c0")
>>> > >
>>> > > Please suggest whether net.ipv4.tcp_tw_reuse=1 will help, or how we
>>> > > can increase the 3000 limit.
>>> > >
>>> > > Sorry, since I haven't got any response to my previous query, I am
>>> > > creating this as a new thread.
>>> > >
>>> > > Thanks,
>>> > > Mohandoss.
>>> > >
>>> >
>>>
>>
