Re: tlog keeps growing
Erik, it looks like we're also running into this issue: https://www.mail-archive.com/solr-user@lucene.apache.org/msg153798.html Is there anything we can do to remedy this besides a node restart? The restart causes leader re-election on the good shards, which in turn causes them to also become inoperable. -- Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: Down Replica is elected as Leader (solr v8.7.0)
> Are yours growing always, on all nodes, forever? Or is it one or two who ends up in a bad state?

It happens randomly, on some of the shards and some of the followers in the collection. Whichever tlog was open on a follower while it was still the leader is the one that never stops growing, and that shard had active ingestion at a high rate. If we then add more shards, or do a cluster rebalance, the collection becomes unusable and causes a production query and ingest outage. Very painful manual restoration, twice a day.
Re: Down Replica is elected as Leader (solr v8.7.0)
Looks like the problem is related to tlog rotation on the follower shard. We did the following for a specific shard:

0. Start Solr Cloud.
1. solr-0 (leader), solr-1, solr-2.
2. Rebalance to make solr-1 the preferred leader.
3. solr-0, solr-1 (leader), solr-2.

The tlog file on solr-0 kept growing without bound (100s of GBs) until we shut down the cluster and dropped all shards manually. The only way to restart tlog rotation on solr-0 (the follower) was to issue /admin/cores?action=RELOAD&core=x at least twice while the tlog size was still small (in MBs). Also, if a rebalance is issued to select solr-0 as leader, leader election never completes.

tlog sizes (KB, du output) after step 3 above:

solr-0
  2140856 ./data2/mydata_0_e000-/tlog
  2140712 ./data2/mydata_0_e000-/tlog/tlog.021
solr-1 (leader)
  35268 ./data2/mydata_0_e000-/tlog
  35264 ./data2/mydata_0_e000-/tlog/tlog.055
solr-2
  35256 ./data2/mydata_0_e000-/tlog
  35252 ./data2/mydata_0_e000-/tlog/tlog.054
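The stuck follower already stands out from the du output above. As a quick illustration (a hypothetical monitoring helper, not part of Solr), one can flag any replica whose tlog directory size is far above the median of its peers:

```python
# Flag replicas whose tlog directory size is far above the median --
# a quick way to spot a follower whose tlog rotation has stopped.
# Illustrative helper, not part of Solr.
from statistics import median

def find_stuck_replicas(tlog_kb_by_replica, factor=10):
    """tlog_kb_by_replica: dict of replica name -> tlog dir size in KB."""
    med = median(tlog_kb_by_replica.values())
    return sorted(name for name, kb in tlog_kb_by_replica.items()
                  if kb > factor * med)

# Sizes from the du output above (KB): solr-0 is ~60x the median.
sizes = {"solr-0": 2140856, "solr-1": 35268, "solr-2": 35256}
print(find_stuck_replicas(sizes))  # ['solr-0']
```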
Re: Down Replica is elected as Leader (solr v8.7.0)
We found that for the shard that does not get a leader, the tlog replay did not complete for hours (we don't see the "log replay finished", "creating leader registration node", "I am the new leader" etc. log messages). Also not sure why the tlogs are tens of GBs (anywhere from 30 to 40GB); the collection's shards have 3x replicas, TLOG replication and a 10sec hard commit. The situation is resulting in two outages per day. The current workaround is to stop ingestion, issue a collection rebalance and/or node restarts, and repeat until all shards are online (a couple of hours of manual recovery each day). Any suggestions or workarounds? Not sure if we're running into these issues:
https://issues.apache.org/jira/browse/SOLR-13486
https://issues.apache.org/jira/browse/SOLR-14679

Partial log output from both nodes (sorted by time asc):

myapp-data-solr-0 2021-02-12 19:36:05.965 INFO (zkCallback-14-thread-51) [c:mydata s:0_8000-9fff r:core_node3 x:mydata_0_8000-9fff_replica_t1] o.a.s.c.ShardLeaderElectionContext Replaying tlog before become new leader

myapp-data-solr-0 2021-02-12 19:36:05.966 WARN (recoveryExecutor-96-thread-1-processing-n:myapp-data-solr-0.myapp-data-solr-headless:8983_solr x:mydata_0_8000-9fff_replica_t1 c:mydata s:0_8000-9fff r:core_node3) [c:mydata s:0_8000-9fff r:core_node3 x:mydata_0_8000-9fff_replica_t1] o.a.s.u.UpdateLog Starting log replay tlog{file=/opt/solr/volumes/data1/mydata_0_8000-9fff/tlog/tlog.0003525 refcount=2} active=false starting pos=0 inSortedOrder=true

myapp-data-solr-0 2021-02-12 22:13:03.084 INFO (recoveryExecutor-96-thread-1-processing-n:myapp-data-solr-0.myapp-data-solr-headless:8983_solr x:mydata_0_8000-9fff_replica_t1 c:mydata s:0_8000-9fff r:core_node3) [c:mydata s:0_8000-9fff r:core_node3 x:mydata_0_8000-9fff_replica_t1] o.a.s.u.UpdateLog log replay status tlog{file=/opt/solr/volumes/data1/mydata_0_8000-9fff/tlog/tlog.0003525 refcount=3} active=false starting pos=0 current pos=27101174167 current size=33357447222 % read=81.0

myapp-data-solr-0 2021-02-12 22:13:06.602 ERROR (indexFetcher-4092-thread-1) [ ] o.a.s.h.ReplicationHandler Index fetch failed :org.apache.solr.common.SolrException: No registered leader was found after waiting for 4000ms , collection: mydata slice: 0_8000-9fff saw state=DocCollection(mydata//collections/mydata/state.json/750)={ "pullReplicas":"0", "replicationFactor":"0", "shards":{ "0_8000-9fff":{ "range":null, "state":"active", "replicas":{ "core_node3":{ "core":"mydata_0_8000-9fff_replica_t1", "base_url":"http://myapp-data-solr-0.myapp-data-solr-headless:8983/solr", "node_name":"myapp-data-solr-0.myapp-data-solr-headless:8983_solr", "state":"active", "type":"TLOG", "force_set_state":"false"}, "core_node5":{ "core":"mydata_0_8000-9fff_replica_t2", "base_url":"http://myapp-data-solr-1.myapp-data-solr-headless:8983/solr", "node_name":"myapp-data-solr-1.myapp-data-solr-headless:8983_solr", "state":"active", "type":"TLOG", "force_set_state":"false", "property.preferredleader":"true"}, "core_node6":{ "core":"mydata_0_8000-9fff_replica_t4", "base_url":"http://myapp-data-solr-2.myapp-data-solr-headless:8983/solr", "node_name":"myapp-data-solr-2.myapp-data-solr-headless:8983_solr", "state":"down", "type":"TLOG", "force_set_state":"false"}}},

myapp-data-solr-0 2021-02-12 22:45:51.600 ERROR (indexFetcher-4092-thread-1) [ ] o.a.s.h.ReplicationHandler Index fetch failed :org.apache.solr.common.SolrException: No registered leader was found after waiting for 4000ms , collection: mydata slice: 0_8000-9fff saw state=DocCollection(mydata//collections/mydata/state.json/754)={ "pullReplicas":"0", "replicationFactor":"0", "shards":{ "0_8000-9fff":{ "range":null, "state":"active", "replicas":{ "core_node3":{ "core":"mydata_0_8000-9fff_replica_t1", "base_url":"http://myapp-data-solr-0.myapp-data-solr-headless:8983/solr", "node_name":"myapp-data-solr-0.myapp-data-solr-headless:8983_solr", "state":"active", "type":"TLOG", "force_set_state":"false"}, "core_node5":{ "core":"mydata_0_8000-9fff_replica_t2", "base_url":"http://myapp-data-solr-1.myapp-data-solr-headless:8983/solr", "node_name":"myapp-data-solr-1.myapp-data-solr-headless:8983_solr", "state":"down", "type":"TLOG", "force_set_state":"false", "property.preferredleader":"true"}, "core_node6":{ "core":"mydata_0_8000-9fff_replica_t4", "base_url":"http://myapp-data-solr-2.myapp-data-solr-headless:8983/solr", "node_name":"myapp-data-solr-2.myapp-data-solr-headless:8983_solr", "state":"down", "type":"TLOG", "force_set_state":"false"}}},...

myapp-data-solr-1 2021-02-12 22:45:56.600 ERROR (indexFetcher-4092-thread-1) [ ] o.a.s.h.ReplicationHandler Index fetch failed :org.apache.solr.common.SolrException: No registered leader was found after waiting for 4000ms , collection: mydata slice: 0_8000-9fff saw
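The "log replay status" line above contains enough to estimate replay progress: it reports current pos and current size in bytes. A small sketch of the arithmetic (Python purely for illustration; the byte counts and the ~2h37m elapsed time are taken from the log timestamps above):

```python
# Estimate tlog replay progress and remaining time from an UpdateLog
# "log replay status" line. Values below are from the log lines above:
# start 19:36:05, status at 22:13:03 -> ~9418 seconds elapsed.
def replay_progress(current_pos, total_size, elapsed_secs):
    frac = current_pos / total_size
    remaining_secs = elapsed_secs * (1 - frac) / frac  # naive linear estimate
    return frac * 100, remaining_secs

pct, eta = replay_progress(27_101_174_167, 33_357_447_222, 9418)
print(round(pct, 1))  # 81.2 -- matching the "% read=81.0" in the log
```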
Re: Down Replica is elected as Leader (solr v8.7.0)
By tracing the output in the log files we see the following sequence. The Overseer role list has POD-1, POD-2, POD-3 in that order; POD-3 has 2 shard leaders; POD-3 restarts.

A) Logs for the shard whose leader moves successfully from POD-3 to POD-1:

On POD-1: o.a.s.c.ShardLeaderElectionContext Replaying tlog before become new leader
On POD-1: o.a.s.u.UpdateLog Starting log replay
On POD-1: o.a.s.u.UpdateLog Log replay finished.
On POD-1: o.a.s.c.SolrCore Registered new searcher autowarm time: 0 ms
On POD-1: o.a.s.c.ShardLeaderElectionContextBase Creating leader registration node ... after winning as ...
On POD-1: o.a.s.c.ShardLeaderElectionContext I am the new leader:

B) Logs for the shard whose leader does not move from POD-3 to POD-1:

On POD-1: o.a.s.c.ShardLeaderElectionContext Replaying tlog before become new leader
On POD-1: o.a.s.u.UpdateLog Starting log replay...
On POD-1: o.a.s.h.ReplicationHandler Index fetch failed :org.apache.solr.common.SolrException: No registered leader was found after waiting for 4000ms
< POD-3 is back up at this time >
On POD-3: o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: No registered leader was found after waiting for 4000ms

It was odd to see no INFO, WARN or ERROR log message after "Starting log replay" on POD-1 for the shard which did not get its leader elected.
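The two sequences above differ only in whether the replay/registration messages appear. As an illustration (a hypothetical log-scanning helper, using the marker strings quoted in this thread), one can check per shard whether the election sequence completed:

```python
# Given the log messages for one shard's election attempt, report
# whether the leader-election sequence ran to completion. The marker
# strings are the actual Solr messages quoted above; the helper itself
# is illustrative, not part of Solr.
def election_completed(messages):
    started = any("Starting log replay" in m for m in messages)
    finished = any("Log replay finished" in m for m in messages)
    registered = any("Creating leader registration node" in m for m in messages)
    return started and finished and registered

good = ["Replaying tlog before become new leader",
        "Starting log replay",
        "Log replay finished.",
        "Creating leader registration node",
        "I am the new leader:"]
bad = ["Replaying tlog before become new leader",
       "Starting log replay",
       "Index fetch failed: No registered leader was found"]
print(election_completed(good), election_completed(bad))  # True False
```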
Down Replica is elected as Leader (solr v8.7.0)
Hello, on reboot of one of the solr nodes in the cluster, we often see a collection's shards with:

1. a LEADER replica in DOWN state, and/or
2. a shard with no LEADER.

Output from /solr/admin/collections?action=CLUSTERSTATUS is below. Even after 5 to 10 minutes, the collection often does not recover. It is unclear why this is happening and what we can try to prevent or remedy it.

ps: perReplicaState=true in solr v8.8.0 didn't work well, because after a rebalance all replicas somehow get a "leader:true" status even though states.json looked ok.

{ "responseHeader": { "status": 0, "QTime": 2 }, "cluster": { "collections": { "datacore": { "pullReplicas": "0", "replicationFactor": "0", "shards": { "__": { "range": null, "state": "active", "replicas": { "core_node1": { "core": "datacore____replica_t187", "base_url": "http://solr-0.solr-headless:8983/solr", "node_name": "solr-0.solr-headless:8983_solr", "state": "down", "type": "TLOG", "force_set_state": "false", "property.preferredleader": "true", "leader": "true" }, "core_node2": { "core": "datacore____replica_t188", "base_url": "http://solr-1.solr-headless:8983/solr", "node_name": "solr-1.solr-headless:8983_solr", "state": "active", "type": "TLOG", "force_set_state": "false" }, "core_node3": { "core": "datacore____replica_t189", "base_url": "http://solr-2.solr-headless:8983/solr", "node_name": "solr-2.solr-headless:8983_solr", "state": "active", "type": "TLOG", "force_set_state": "false" } } }, "__j": { "range": null, "state": "active", "replicas": { "core_node19": { "core": "datacore___j_replica_t187", "base_url": "http://solr-0.solr-headless:8983/solr", "node_name": "solr-0.solr-headless:8983_solr", "state": "down", "type": "TLOG", "force_set_state": "false", "property.preferredleader": "true" }, "core_node20": { "core": "datacore___j_replica_t188", "base_url": "http://solr-1.solr-headless:8983/solr", "node_name": "solr-1.solr-headless:8983_solr", "state": "active", "type": "TLOG", "force_set_state": "false" }, "core_node21": { "core": "datacore___j_replica_t189", "base_url": "http://solr-2.solr-headless:8983/solr", "node_name": "solr-2.solr-headless:8983_solr", "state": "active", "type": "TLOG", "force_set_state": "false" } } }, "__": { "range": null, "state": "active", "replicas": { "core_node4": { "core": "datacore____replica_t91", "base_url": "http://solr-0...
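Both symptoms above (leader flagged but DOWN, and no leader at all) can be detected mechanically from the CLUSTERSTATUS JSON. A sketch (an illustrative helper, not a Solr API), run against a trimmed-down version of the output above:

```python
# Walk CLUSTERSTATUS output and report shards whose leader is missing
# or whose leader replica is not in the "active" state.
def unhealthy_shards(clusterstatus):
    problems = {}
    for coll, cdata in clusterstatus["cluster"]["collections"].items():
        for shard, sdata in cdata["shards"].items():
            leaders = [r for r in sdata["replicas"].values()
                       if r.get("leader") == "true"]
            if not leaders:
                problems[(coll, shard)] = "no leader"
            elif any(r["state"] != "active" for r in leaders):
                problems[(coll, shard)] = "leader down"
    return problems

# Trimmed-down version of the CLUSTERSTATUS above: shard s1 has a DOWN
# leader, shard s2 has no leader flag at all.
status = {"cluster": {"collections": {"datacore": {"shards": {
    "s1": {"replicas": {
        "core_node1": {"state": "down", "leader": "true"},
        "core_node2": {"state": "active"}}},
    "s2": {"replicas": {
        "core_node19": {"state": "down"},
        "core_node20": {"state": "active"}}}}}}}}
print(unhealthy_shards(status))
# {('datacore', 's1'): 'leader down', ('datacore', 's2'): 'no leader'}
```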
Re: Json Faceting Performance Issues on solr v8.7.0
> Does this happen on a warm searcher (are subsequent requests with no intervening updates _ever_ fast?)?

Subsequent response times are very fast if the searcher remains open. As a control test, I faceted on the same field that I used in the q param:

1. Start solr
2. Execute q=resultId:x&rows=0 => 500ms
3. Execute q=resultId:x&rows=0 => 40,000ms
4. Execute q=resultId:x&rows=0 => 150ms
5. Execute q=processId:x&rows=0 => 2,500ms
6. Execute q=processId:x&rows=0 => 200ms

curl 'http://localhost:8983/solr/TestCollection_shard1_replica_t3/query?q=processId:-xxx-xxx-xxx-x&rows=0' -d '
json.facet={ categories:{ "type": "terms", "field" : "processId", "limit" : 1 } }'
Re: Json Faceting Performance Issues on solr v8.7.0
Ok, I'll try that. Meanwhile, the query on resultId is a subsecond response, but the immediate next query adding faceting takes 40+ secs. The core has 185 million docs and a 63GB index.

curl 'http://localhost:8983/solr/TestCollection_shard1_replica_t3/query?q=resultId:x&cache=false&rows=0'
{ "responseHeader":{ "zkConnected":true, "status":0, "QTime":558, "params":{ "q":"resultId:x", "cache":"false", "rows":"0"}}, "response":{"numFound":943,"start":0,"numFoundExact":true,"docs":[] }}

curl 'http://localhost:8983/solr/TestCollection_shard1_replica_t3/query?q=resultId:x&cache=false&rows=0' -d '
json.facet={ categories:{ "type": "terms", "field" : "resultId", "limit" : 1 } }'
{ "responseHeader":{ "zkConnected":true, "status":0, "QTime":43834, "params":{ "q":"resultId:x", "json.facet":"{\ncategories:{\n \"type\": \"terms\",\n \"field\" : \"resultId\",\n \"limit\" : 1\n}\n}", "cache":"false", "rows":"0"}}, "response":{"numFound":943,"start":0,"numFoundExact":true,"docs":[] }, "facets":{ "count":943, "categories":{ "buckets":[{ "val":"x", "count":943}]}}}
Json Faceting Performance Issues on solr v8.7.0
Hello, I am seeing very slow responses from json faceting against a single core (though the core is a shard leader in a collection). The fields processId and resultId are non-multivalued, indexed, docValues string fields (not text). Soft commit = 5sec (openSearcher=true) and hard commit = 10sec, because new docs are constantly being indexed: 95% new and 5% overwritten (overwrite=true; no atomic updates). Caches are not considered useful due to the commit frequency. Solr is v8.7.0 on openjdk11. Is there any way to improve the json facet QTime?

## query only
curl 'http://localhost:8983/solr/TestCollection_shard1_replica_t3/query?q=processId:-xxx-xxx-xxx-x&rows=0'
{ "responseHeader":{ "zkConnected":true, "status":0, "QTime":552, "params":{ "q":"processId:-xxx-xxx-xxx-x", "cache":"false", "rows":"0"}}, "response":{"numFound":231311,"start":0,"numFoundExact":true,"docs":[] }}

## json facet takes 46secs
curl 'http://localhost:8983/solr/TestCollection_shard1_replica_t3/query?q=processId:-xxx-xxx-xxx-x&rows=0' -d '
json.facet={ categories:{ "type": "terms", "field" : "resultId", "limit" : 1 } }'
{ "responseHeader":{ "zkConnected":true, "status":0, "QTime":46972, "params":{ "q":"processId:-xxx-xxx-xxx-x", "json.facet":"{categories:{ \"type\": \"terms\", \"field\" : \"resultId\", \"limit\" : 1}}", "rows":"0"}}, "response":{"numFound":231311,"start":0,"numFoundExact":true,"docs":[] }, "facets":{ "count":231311, "categories":{ "buckets":[{ "val":"x", "count":943}]}}}

## visualvm CPU sampling: almost all time spent in lucene
org.apache.lucene.util.PriorityQueue.downHeap() 23,009 ms
org.apache.lucene.codecs.lucene80.Lucene80DocValuesProducer$TermsDict.next() 13,268 ms
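The sampler output above (PriorityQueue.downHeap plus TermsDict.next) is consistent with the terms facet enumerating the field's docValues terms dictionary to rank buckets, which is expensive on a high-cardinality field over ~185M docs regardless of numFound. For reference, a minimal builder for the json.facet body used in these requests (a sketch; the parameter names are the standard JSON Facet API ones):

```python
# Build the json.facet body sent in the curl requests above.
import json

def terms_facet(field, limit=1):
    # "terms" facet over a docValues string field. Note that limit
    # controls how many buckets come back, not how much counting work
    # the facet does across the field's terms.
    return json.dumps({"categories": {"type": "terms",
                                      "field": field,
                                      "limit": limit}})

print(terms_facet("resultId"))
```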
Dynamic schema failure for child docs not using "_childDocuments_" key
I am running into an exception where creating child docs fails unless the field already exists in the schema (stacktrace is at the bottom of this post). My solr is v8.5.1 running in standard/non-cloud mode.

$> curl -X POST -H 'Content-Type: application/json' 'http://localhost:8983/solr/mycore/update' --data-binary '[{ "id": "3dae27db6ee43e878b9d0e8e", "phone": "+1 (123) 456-7890", "myChildDocuments": [{ "id": "3baf27db6ee43387849d0e8e", "enabled": false }] }]'
{ "responseHeader":{ "status":400, "QTime":285}, "error":{ "metadata":[ "error-class","org.apache.solr.common.SolrException", "root-error-class","org.apache.solr.common.SolrException"], "msg":"ERROR: [doc=3baf27db6ee43387849d0e8e] unknown field 'enabled'", "code":400}}

However, using the "_childDocuments_" key it succeeds, and the child doc fields get created in the managed-schema:

$> curl -X POST -H 'Content-Type: application/json' 'http://localhost:8983/solr/mycore/update' --data-binary '[{ "id": "6dae27db6ee43e878b9d0e8e", "phone": "+1 (123) 456-7890", "_childDocuments_": [{ "id": "6baf27db6ee43387849d0e8e", "enabled": false }] }]'
{ "responseHeader":{ "status":0, "QTime":285}}

== stacktrace ==
2020-05-06 01:01:26.762 ERROR (qtp1569435561-19) [ x:standalone] o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: ERROR: [doc=3baf27db6ee43387849d0e8e] unknown field 'enabled'
at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:226)
at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:100)
at org.apache.solr.update.AddUpdateCommand.lambda$null$0(AddUpdateCommand.java:224)
at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
at java.base/java.util.ArrayList$ArrayListSpliterator.tryAdvance(ArrayList.java:1631)
at java.base/java.util.stream.StreamSpliterators$WrappingSpliterator.lambda$initPartialTraversalState$0(StreamSpliterators.java:294)
at java.base/java.util.stream.StreamSpliterators$AbstractWrappingSpliterator.fillBuffer(StreamSpliterators.java:206)
at java.base/java.util.stream.StreamSpliterators$AbstractWrappingSpliterator.doAdvance(StreamSpliterators.java:161)
at java.base/java.util.stream.StreamSpliterators$WrappingSpliterator.tryAdvance(StreamSpliterators.java:300)
at java.base/java.util.Spliterators$1Adapter.hasNext(Spliterators.java:681)
at org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:282)
at org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:451)
at org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1284)
at org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1277)
at org.apache.solr.update.DirectUpdateHandler2.updateDocOrDocValues(DirectUpdateHandler2.java:975)
at org.apache.solr.update.DirectUpdateHandler2.doNormalUpdate(DirectUpdateHandler2.java:345)
at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:292)
at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:239)
at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:76)
at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55)
at org.apache.solr.update.processor.NestedUpdateProcessorFactory$NestedUpdateProcessor.processAdd(NestedUpdateProcessorFactory.java:79)
at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55)
at org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:259)
at org.apache.solr.update.processor.DistributedUpdateProcessor.doVersionAdd(DistributedUpdateProcessor.java:489)
at org.apache.solr.update.processor.DistributedUpdateProcessor.lambda$versionAdd$0(DistributedUpdateProcessor.java:339)
at org.apache.solr.update.VersionBucket.runWithLock(VersionBucket.java:50)
at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:339)
at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:225)
at org.apache.solr.update.processor.LogUpdateProcessorFactory$LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:103)
at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55)
at org.apache.solr.update.processor.AddSchemaFieldsUpdateProcessorFactory$AddSchemaFieldsUpdateProcessor.processAdd(AddSchemaFieldsUpdateProcessorFactory.java:477)
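The two requests differ only in the key under which the children are nested. A small builder for both shapes (illustrative; the ids are the ones from the post). One plausible reading of the stacktrace is that AddSchemaFieldsUpdateProcessor runs before NestedUpdateProcessor expands the named child field, so the child's unknown field never gets auto-added to the schemaless schema, while the "_childDocuments_" path does not hit this:

```python
# Build the two update payloads from the post: one using a named child
# field ("myChildDocuments") and one using the legacy "_childDocuments_"
# key. Only the key differs; the first form triggered the unknown-field
# error in schemaless mode, the second did not.
import json

def update_payload(parent_id, child, child_key):
    return json.dumps([{"id": parent_id,
                        "phone": "+1 (123) 456-7890",
                        child_key: [child]}])

named = update_payload("3dae27db6ee43e878b9d0e8e",
                       {"id": "3baf27db6ee43387849d0e8e", "enabled": False},
                       "myChildDocuments")
anon = update_payload("6dae27db6ee43e878b9d0e8e",
                      {"id": "6baf27db6ee43387849d0e8e", "enabled": False},
                      "_childDocuments_")
print("myChildDocuments" in named, "_childDocuments_" in anon)  # True True
```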
Using Deep Paging with Graph Query Parser
Is there a way to combine deep paging's cursorMark feature with the graph query parser?

Background: I have a hierarchical data structure that is split into N different flat json docs and updated (inserted) into solr with from/to fields. Using the from/to join syntax, a graph query is needed, since some queries need parents (given certain child filters) and other queries need children (given certain parent filters). The graph query parser (though not distributed) looks ideal. However, I need pagination to iterate over the results. Hints for custom code are ok, since the current solr install already has lots of custom admin, core and collection handlers running. -M
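For the pagination half of the question: cursorMark deep paging is driven entirely by request parameters (the sort must include the uniqueKey, and each response's nextCursorMark feeds the next request), so the client-side loop is the same whatever query parser produces the result set. A sketch of that loop with the Solr call stubbed out (`fetch` is a stand-in, not a real client):

```python
# Sketch of a cursorMark deep-paging loop. `fetch(cursor, rows)` stands
# in for a Solr request that sends q, sort (including the uniqueKey),
# rows and cursorMark, and returns docs plus nextCursorMark.
def iterate_all(fetch, rows=2):
    cursor = "*"  # initial cursorMark
    while True:
        resp = fetch(cursor, rows)
        yield from resp["docs"]
        if resp["nextCursorMark"] == cursor:  # unchanged cursor -> done
            return
        cursor = resp["nextCursorMark"]

# Stub "server": three docs over two pages, then a final empty page.
pages = {"*": {"docs": [{"id": 1}, {"id": 2}], "nextCursorMark": "A"},
         "A": {"docs": [{"id": 3}], "nextCursorMark": "B"},
         "B": {"docs": [], "nextCursorMark": "B"}}
docs = list(iterate_all(lambda cursor, rows: pages[cursor]))
print([d["id"] for d in docs])  # [1, 2, 3]
```

Whether the graph query itself can serve as the q for such a loop is the open question in the post; the cursor mechanics do not change.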
Re: Solr core corrupted for version 7.4.0, please help!
Thank you for https://issues.apache.org/jira/browse/SOLR-12691. I see it's marked as minor. Can we bump up the priority please? The example of 2 cores ingesting with transientCacheSize==1 was provided for reproduction reference only, and is not what runs in production. The production setup on AWS uses the biggest non-baremetal EC2 instances available, and there are quite a few of them.
Re: Solr core corrupted for version 7.4.0, please help!
> Having 100+ cores on a Solr node and a transient cache size of 1

The original post clarified the current state: "we have about 75 cores with "transientCacheSize" set to 32". If transientCacheSize is increased to match the current core count, we'll only defer the issue; it's going to hit hundreds of cores per solr instance shortly. The majority of the cores do not get ingest requests, so it would be a waste to spend memory resources on them, which is why every core is configured with: transient=true loadOnStartup=false. The JVM heap is 30GB on 64bit, inside a solr docker container running in kubernetes on aws. Are there any other things that can be done to avoid the core corruption, or pointers to what could be causing the index segment loss?
Re: Solr core corrupted for version 7.4.0, please help!
> The problem here is that you may have M requests queued up for the _same_ core, each with a new update request.

With transientCacheSize == 1, as soon as the update request for Core B is received, Core B encounters data corruption, not Core A. Both Core A and Core B are receiving update requests. I am presuming this happens on core close, after the ref count is decremented for Core B in order to process the request for Core A. In production the number of cores per solr node is 100+. The transient cache is sized so that the jvm heap and os file cache are aligned to a manageable number of open cores we can serve (isOpen != false). So when a random update request comes in, corruption is seen. Very painful, since a restore then needs to be invoked, and this is happening 5 to 10 times a day.
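The suspected failure mode (a core evicted from the transient LRU while it still has an in-flight request) can be sketched abstractly. This is a toy model of the hazard, not Solr's actual CoreContainer logic:

```python
# Toy model of a transient-core LRU: opening core B with cache size 1
# evicts core A; if A still has a positive ref count (an in-flight
# update), the close happens while the core is in use -- the dangerous
# window described above. Not Solr code, just the shape of the hazard.
from collections import OrderedDict

class TransientCoreCache:
    def __init__(self, size):
        self.size = size
        self.cores = OrderedDict()       # core name -> refcount
        self.closed_while_busy = []      # cores closed with refs held

    def open_core(self, name):
        self.cores[name] = self.cores.pop(name, 0) + 1
        self.cores.move_to_end(name)
        while len(self.cores) > self.size:
            victim, refs = self.cores.popitem(last=False)  # evict LRU
            if refs > 0:
                self.closed_while_busy.append(victim)

    def release(self, name):
        if name in self.cores:
            self.cores[name] -= 1

cache = TransientCoreCache(size=1)
cache.open_core("A")            # update for A in flight (refcount 1)
cache.open_core("B")            # evicts A while A's refcount is still 1
print(cache.closed_while_busy)  # ['A']
```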
Re: Is commit for SOLR-11444 for SolrCloudClient#sendRequest(SolrRequest, List) ok?
In the below mentioned git commit, I see CloudSolrClient has been changed to generate solr core urls differently than before.

In the previous version, solr urls were computed using "url = coreNodeProps.getCoreUrl()". This concatenated "base_url" + "core" name from the clusterstate for a tenant's shards, i.e. http://<host>:<port>/solr/<core_name>.

Now the solr url is computed using "url = ZkCoreNodeProps.getCoreUrl(nodeProps.getStr(ZkStateReader.BASE_URL_PROP), joinedInputCollections)". This concatenates "base_url" + "collection" name from the clusterstate for a tenant's shards, i.e. http://<host>:<port>/solr/<collection_name>.

This change broke a custom client which previously worked for me. Can someone provide guidance? Thank you.

> commit da71587196eccefd8b58923dc162916c808762a9
> Author: David Smiley
> Date: Thu Oct 19 00:02:24 2017 -0400
>
> SOLR-11444: Improve consistency of collection alias handling and collection list references.
> Other refactorings of nearby code too.
>
> (cherry picked from commit e001f35)
>
> diff --git a/solr/solrj/src/java/org/apache/solr/client/solrj/impl/CloudSolrClient.java b/solr/solrj/src/java/org/apache/solr/client/solrj/impl/CloudSolrClient.java
> index 6a208f48c0..f63eedd742 100644
> --- a/solr/solrj/src/java/org/apache/solr/client/solrj/impl/CloudSolrClient.java
> +++ b/solr/solrj/src/java/org/apache/solr/client/solrj/impl/CloudSolrClient.java
> ...
> - if (!liveNodes.contains(coreNodeProps.getNodeName())
> -     || Replica.State.getState(coreNodeProps.getState()) != Replica.State.ACTIVE) continue;
> - if (nodes.put(node, nodeProps) == null) {
> -   if (!sendToLeaders || coreNodeProps.isLeader()) {
> -     String url;
> -     if (reqParams.get(UpdateParams.COLLECTION) == null) {
> -       url = ZkCoreNodeProps.getCoreUrl(nodeProps.getStr(ZkStateReader.BASE_URL_PROP), collection);
> -     } else {
> -       url = coreNodeProps.getCoreUrl();
> -     }
> -     urlList2.add(url);
> + if (!liveNodes.contains(node) // Must be a live node to continue
> +     || Replica.State.getState(coreNodeProps.getState()) != Replica.State.ACTIVE) // Must be an ACTIVE replica to continue
> +   continue;
> + if (seenNodes.add(node)) { // if we haven't yet collected a URL to this node...
> +   String url = ZkCoreNodeProps.getCoreUrl(nodeProps.getStr(ZkStateReader.BASE_URL_PROP), joinedInputCollections);
> +   if (sendToLeaders && coreNodeProps.isLeader()) {
> +     theUrlList.add(url); // put leaders here eagerly (if sendToLeader mode)
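The practical effect of the diff above, in miniature: both before and after, the URL is base_url plus one path segment, but the segment changed from the core name to the joined collection list. A toy sketch (not SolrJ code) of the two results:

```python
# Before SOLR-11444 the request URL for the custom client resolved to
# base_url + "/" + core name; after, base_url + "/" + collection name.
# Toy illustration of the two URL shapes, not SolrJ code.
def core_url(base_url, name):
    return base_url.rstrip("/") + "/" + name

base = "http://host:8983/solr"
before = core_url(base, "mydata_0_8000-9fff_replica_t1")  # per-core URL
after = core_url(base, "mydata")                          # per-collection URL
print(before)  # http://host:8983/solr/mydata_0_8000-9fff_replica_t1
print(after)   # http://host:8983/solr/mydata
```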
Re: 9000+ CLOSE_WAIT connections in solr v6.2.2 causing it to "die"
FYI, this issue went away after solrconfig.xml was tuned. The "Hard commits blocked | non-solrcloud v6.6.2" thread has the details: http://lucene.472066.n3.nabble.com/Hard-commits-blocked-non-solrcloud-v6-6-2-td4374386.html
Re: Hard commits blocked | non-solrcloud v6.6.2
The below solrconfig.xml settings resolved the TIMED_WAIT in ConcurrentMergeScheduler.doStall(). Thanks to Shawn and Erik for their pointers.

... 30 100 30.0 18 6 300 ... ${solr.autoCommit.maxTime:3} false ${solr.autoSoftCommit.maxTime:1} ${solr.autoSoftCommit.maxDocs:-1} true
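The mail archive stripped the XML element names from the snippet above, leaving only the values. For orientation, a plausible shape is sketched below; the element-to-value mapping is an assumption, not a reconstruction of the original file (maxMergeCount/maxThreadCount are the ConcurrentMergeScheduler knobs relevant to doStall, and the commit blocks follow from the surviving `${...}` property defaults; the remaining values are omitted rather than guessed at):

```xml
<!-- Hypothetical mapping; the original element names were lost in the
     archive, and the other surviving values (30, 100, 30.0, 300) are
     deliberately not mapped here. -->
<mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
  <int name="maxMergeCount">18</int>
  <int name="maxThreadCount">6</int>
</mergeScheduler>
<autoCommit>
  <maxTime>${solr.autoCommit.maxTime:3}</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
  <maxTime>${solr.autoSoftCommit.maxTime:1}</maxTime>
  <maxDocs>${solr.autoSoftCommit.maxDocs:-1}</maxDocs>
</autoSoftCommit>
```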
Re: Hard commits blocked | non-solrcloud v6.6.2
> https://github.com/mohsinbeg/datadump/tree/master/solr58f449cec94a2c75_core_256

I had uploaded the output at the above link. The OS has no swap configured. There are other processes on the host (cumulatively <1GB memory and <5% CPU), but none inside the docker container, as `top` shows. The Solr JVM heap is 30GB, and JVM monitoring shows the heap never went over 10GB. Also, there were no queries at that time; a few hours earlier I did run quite a few heavy-hitting distributed paging and json facet queries. The machine has 32 vCPU and 240GB memory. Please let me know if vmstat and/or iostat would be helpful. I didn't save the gc logs at that time but can capture them if you like.
Re: Hard commits blocked | non-solrcloud v6.6.2
Hi Shawn, Erik

> updates should slow down but not deadlock.

The net effect is the same. As the CLOSE_WAITs increase, the jvm ultimately stops accepting new socket requests, at which point `kill` is the only option. This means that if the replication handler is invoked (which sets the deletion policy), the number of blocked threads rises even faster and the system fails even faster. Each solr POST is a blocking call, hence the CLOSE_WAITs; each POST body is a gzipped json array of 100 json objects (1 json object = 1 solr doc). All custom AbstractSolrEventListener listeners were disabled so that no post-commit events are processed; those threads are in WAITING state, which is ok.

I then ran /solr/58f449cec94a2c75-core-256/admin/luke at 10:30pm PST. It showed "lastModified: 2018-02-11T04:46:54.540Z", indicating the commit has been blocked for about 2 hours; hard commit is set to 10secs in solrconfig.xml. Other cores are also blocked for a while. Thread dump and top output from that condition are at https://github.com/mohsinbeg/datadump/tree/master/solr58f449cec94a2c75_core_256

The netstat CLOSE_WAITs are correlated with DirectUpdateHandler2 / UpdateRequestProcessor.processAdd() requests.

solr [ /tmp ]$ sudo netstat -ptan | awk '{print $6 " " $7 }' | sort | uniq -c; TZ=PST8PDT date;
   7728 CLOSE_WAIT -
     65 ESTABLISHED -
      1 FIN_WAIT2 -
      1 Foreign Address
      6 LISTEN -
     36 TIME_WAIT -
      1 established)
Sat Feb 10 22:27:07 PST 2018

http://fastthread.io shows some 6,700 threads in TIMED_WAIT. https://jstack.review shows 6584 threads with this stack:

at java.lang.Object.wait(Native Method)
at org.apache.lucene.index.ConcurrentMergeScheduler.doStall(ConcurrentMergeScheduler.java:616)
at org.apache.lucene.index.ConcurrentMergeScheduler.maybeStall(ConcurrentMergeScheduler.java:602)
...
at org.apache.solr.update.DirectUpdateHandler2.allowDuplicateUpdate(DirectUpdateHandler2.java:276)
...
at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55)
...

The only `top` available on Photon OS is https://github.com/vmware/photon/blob/1.0/SPECS/procps-ng/procps-ng.spec. Those screenshots are attached.
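The netstat | awk | sort | uniq -c pipeline above can also be reproduced programmatically when watching for this condition; a sketch (illustrative helper, with made-up sample lines in netstat's column format):

```python
# Summarize TCP connection states from `netstat -ptan`-style output,
# equivalent to the awk | sort | uniq -c pipeline above. Useful for
# alerting on a CLOSE_WAIT pileup before the JVM stops accepting sockets.
from collections import Counter

def state_counts(netstat_lines):
    counts = Counter()
    for line in netstat_lines:
        parts = line.split()
        # netstat -ptan rows: proto recvq sendq local foreign state pid/prog
        if len(parts) >= 6 and parts[0].startswith("tcp"):
            counts[parts[5]] += 1
    return counts

sample = [  # made-up lines in netstat's format
    "tcp 0 0 10.0.0.5:8983 10.0.0.9:51432 CLOSE_WAIT -",
    "tcp 0 0 10.0.0.5:8983 10.0.0.9:51433 CLOSE_WAIT -",
    "tcp 0 0 10.0.0.5:8983 10.0.0.7:44210 ESTABLISHED -",
]
print(state_counts(sample))  # CLOSE_WAIT: 2, ESTABLISHED: 1
```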
Re: Hard commits blocked | non-solrcloud v6.6.2
Ran /solr/58f449cec94a2c75-core-248/admin/luke at 7:05pm PST. It showed "lastModified: 2018-02-10T02:25:08.231Z", indicating the commit has been blocked for about 41 mins; hard commit is set to 10secs in solrconfig.xml. Other cores are also now blocked.

https://jstack.review analysis of the thread dump says "Potential Deadlock" for 3 cores:

0x00027afca8e0 ConcurrentMergeScheduler Possible deadlock Held by commitScheduler-26-thread-1 awaiting notification , holding [ 0x00027afca8e0 ]
0x00027fd19080 ConcurrentMergeScheduler Possible deadlock Held by commitScheduler-19-thread-1 awaiting notification , holding [ 0x00027fd19080 0x0002615ada20 ]
0x000280da09c8 ConcurrentMergeScheduler Possible deadlock Held by commitScheduler-24-thread-1 awaiting notification , holding [ 0x000280da09c8 0x00028e0cbf10 ]

Thread dump and top output from that condition are available at https://github.com/mohsinbeg/datadump/tree/master/solr58f449cec94a2c75_core_248/solr58f449cec94a2c75_core_248

848 thread(s) have this kind of stack:
---
at java.lang.Object.wait(Native Method)
at org.apache.lucene.index.ConcurrentMergeScheduler.doStall(ConcurrentMergeScheduler.java:616)
at org.apache.lucene.index.ConcurrentMergeScheduler.maybeStall(ConcurrentMergeScheduler.java:602)
...
at org.apache.solr.update.DirectUpdateHandler2.allowDuplicateUpdate(DirectUpdateHandler2.java:276)
...
Re: Hard commits blocked | non-solrcloud v6.6.2
Shawn, Erik, were you able to look at the thread dump? https://github.com/mohsinbeg/datadump/blob/master/threadDump-7pjql_1.zip Or is there additional data I can provide?
Re: Hard commits blocked | non-solrcloud v6.6.2
> Setting openSearcher to false on autoSoftCommit makes no sense.

That was my mistake in my solrconfig.xml. Thank you for identifying it; I have corrected it. I then removed my custom element from my solrconfig.xml, and both the hard-commit and /solr/admin/core hang issues seemed to go away for a couple of hours. Then I increased the ingestion load and the issue came back. Taking a thread dump I now see lots of these, and commits are hanging again. The full thread dump is at https://github.com/mohsinbeg/datadump/blob/master/threadDump-7pjql_1.zip

"commitScheduler-24-thread-1" #187 prio=5 os_prio=0 tid=0x7f7c9401d000 nid=0x369 runnable [0x7f7c1a664000]
   java.lang.Thread.State: TIMED_WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for <0x00026b4317f0> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
	at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
	at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.tryLock(ReentrantReadWriteLock.java:871)
	at org.apache.solr.update.DefaultSolrCoreState.lock(DefaultSolrCoreState.java:176)
	at org.apache.solr.update.DefaultSolrCoreState.getIndexWriter(DefaultSolrCoreState.java:121)
	at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:618)
	at org.apache.solr.update.CommitTracker.run(CommitTracker.java:217)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
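When many threads park on the same lock address like this, a quick triage step (a hedged sketch; the sample file and path below are made up for illustration) is to count waiters per lock address in the jstack output — the commit lock in DefaultSolrCoreState will show up as one address with many waiters:

```shell
# Build a tiny sample jstack-style dump to demonstrate the pipeline.
# Against a real node you would use the output of `jstack <pid>` instead.
cat > /tmp/sample_dump.txt <<'EOF'
"commitScheduler-24-thread-1" #187 prio=5
   java.lang.Thread.State: TIMED_WAITING (parking)
	- parking to wait for <0x00026b4317f0> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
"qtp310656974-1022" #1022 prio=5
   java.lang.Thread.State: TIMED_WAITING (parking)
	- parking to wait for <0x00026b4317f0> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
EOF
# Count waiters per lock address, busiest lock first.
grep -o 'parking to wait for <0x[0-9a-f]*>' /tmp/sample_dump.txt \
  | sort | uniq -c | sort -rn
```

A single address with dozens of waiters points at one holder (here, whatever holds the IndexWriter lock) rather than general thread-pool exhaustion.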
Re: Hard commits blocked | non-solrcloud v6.6.2
> If you issue a manual commit
> (http://blah/solr/core/update?commit=true) what happens?

That call never returned to the client browser. So I also tried a core reload, and captured that in the thread dump too. It also never returned.

"qtp310656974-1022" #1022 prio=5 os_prio=0 tid=0x7ef25401 nid=0x6bf runnable [0x7f0138394000]
   java.lang.Thread.State: TIMED_WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for <0x000271e49b00> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
	at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireNanos(AbstractQueuedSynchronizer.java:934)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireNanos(AbstractQueuedSynchronizer.java:1247)
	at java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.tryLock(ReentrantReadWriteLock.java:1115)
	at org.apache.solr.update.DefaultSolrCoreState.lock(DefaultSolrCoreState.java:176)
	at org.apache.solr.update.DefaultSolrCoreState.newIndexWriter(DefaultSolrCoreState.java:227)
	at org.apache.solr.core.SolrCore.reload(SolrCore.java:644)
	- locked <0x000271f08ac8> (a java.lang.Object)
	at org.apache.solr.core.CoreContainer.reload(CoreContainer.java:1181)
	at org.apache.solr.handler.admin.CoreAdminOperation.lambda$static$2(CoreAdminOperation.java:111)
	at org.apache.solr.handler.admin.CoreAdminOperation$$Lambda$95/234250762.execute(Unknown Source)
	at org.apache.solr.handler.admin.CoreAdminOperation.execute(CoreAdminOperation.java:384)
	at org.apache.solr.handler.admin.CoreAdminHandler$CallInfo.call(CoreAdminHandler.java:388)
	at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:174)
	at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:173)
	at org.apache.solr.servlet.HttpSolrCall.handleAdmin(HttpSolrCall.java:748)
	at org.apache.solr.servlet.HttpSolrCall.handleAdminRequest(HttpSolrCall.java:729)
	at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:510)
	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:361)
	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:305)
	at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1691)
	at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
	at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
	at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
	at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
	at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
	at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
	at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
	at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
	at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)
	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
	at org.eclipse.jetty.server.Server.handle(Server.java:534)
	at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320)
	at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
	at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)
	at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95)
	at org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
	at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
	at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
	at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
	at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
	at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
	at java.lang.Thread.run(Thread.java:748)
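Since both the manual commit and the RELOAD hang indefinitely, a hedged way to probe them without tying up a browser tab is to wrap each call in coreutils `timeout` (which exits with status 124 when the command is killed for overrunning). Host, port, and core name below are placeholders, not taken from this thread:

```shell
# check_returns CMD... : run CMD, print RETURNED if it finishes within 5s,
# HUNG if `timeout` had to kill it (exit status 124 per coreutils).
check_returns() {
  timeout 5 "$@" >/dev/null 2>&1
  if [ $? -eq 124 ]; then echo HUNG; else echo RETURNED; fi
}

# Usage against a live Solr (not run here; hypothetical host/core):
# check_returns curl -s 'http://localhost:8983/solr/x-core/update?commit=true'
# check_returns curl -s 'http://localhost:8983/solr/admin/cores?action=RELOAD&core=x-core'
```

Running this from cron and logging the result gives a timeline of exactly when commits start hanging, which can then be lined up against the thread dumps.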
Hard commits blocked | non-solrcloud v6.6.2
I am seeing that after some time hard commits in all my solr cores stop, and each one's searcher has an "opened at" date hours in the past, even though they are continuing to ingest data successfully (index size increasing continuously).

http://localhost:8983/solr/#/x-core/plugins?type=core=searcher
"openedAt: 2018-02-08T01:52:24.950Z"

Is there something I am doing incorrectly in my config? I set up my solrconfig.xml without updateLog for my "bulk indexing" use cases.

solrConfig.xml:

none 200 1 5
<autoCommit>
   <maxTime>${solr.autoCommit.maxTime:1}</maxTime>
   <openSearcher>true</openSearcher>
</autoCommit>
<autoSoftCommit>
   <maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
   <maxDocs>${solr.autoSoftCommit.maxDocs:-1}</maxDocs>
   <openSearcher>false</openSearcher>
</autoSoftCommit>

Thread dump:

"commitScheduler-20-thread-1" #391 prio=5 os_prio=0 tid=0x7ef194011000 nid=0x43a in Object.wait() [0x7ec99533f000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
	at java.lang.Object.wait(Native Method)
	at org.apache.lucene.index.ConcurrentMergeScheduler.doStall(ConcurrentMergeScheduler.java:616)
	- eliminated <0x00027005a0f0> (a org.apache.lucene.index.ConcurrentMergeScheduler)
	at org.apache.lucene.index.ConcurrentMergeScheduler.maybeStall(ConcurrentMergeScheduler.java:602)
	- locked <0x00027005a0f0> (a org.apache.lucene.index.ConcurrentMergeScheduler)
	at org.apache.lucene.index.ConcurrentMergeScheduler.merge(ConcurrentMergeScheduler.java:524)
	- locked <0x00027005a0f0> (a org.apache.lucene.index.ConcurrentMergeScheduler)
	at org.apache.lucene.index.IndexWriter.maybeMerge(IndexWriter.java:2083)
	at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:487)
	at org.apache.lucene.index.StandardDirectoryReader.doOpenFromWriter(StandardDirectoryReader.java:291)
	at org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:276)
	at org.apache.lucene.index.DirectoryReader.openIfChanged(DirectoryReader.java:235)
	at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1980)
	at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:2189)
	at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1926)
	at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:675)
	- locked <0x00026f20bb88> (a java.lang.Object)
	at org.apache.solr.update.CommitTracker.run(CommitTracker.java:217)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
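The ConcurrentMergeScheduler.doStall frame in that dump means Lucene is deliberately stalling the committing thread because too many merges are already pending — which is exactly how heavy bulk indexing blocks commits. One knob that controls when this stall kicks in (the values here are illustrative, not tuned for this cluster) lives in the indexConfig section of solrconfig.xml:

```xml
<indexConfig>
  <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
    <!-- incoming threads are stalled only once this many merges are pending -->
    <int name="maxMergeCount">18</int>
    <!-- merge threads allowed to run at once; keep well below maxMergeCount -->
    <int name="maxThreadCount">6</int>
  </mergeScheduler>
</indexConfig>
```

Raising maxMergeCount trades a bigger merge backlog for fewer stalls; whether that is safe depends on disk throughput, so treat it as a sketch to experiment with rather than a fix.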
Re: 9000+ CLOSE_WAIT connections in solr v6.2.2 causing it to "die"
> Maybe this is the issue: https://github.com/eclipse/jetty.project/issues/2169

Looks like that is the issue. (I've redacted IP addresses below for security reasons.)

solr [ /opt/solr ]$ netstat -ptan | awk '{print $6 " " $7 }' | sort | uniq -c
   8425 CLOSE_WAIT -
     92 ESTABLISHED -
      1 FIN_WAIT2 -
      1 Foreign Address
      6 LISTEN -
    712 TIME_WAIT -
      1 established)

solr [ /opt/solr ]$ echo "run -b org.eclipse.jetty.server:context=HTTP/1.1@63e2203c,id=0,type=serverconnector dump" | java -jar jmxterm-1.0.0-uber.jar -l localhost:18983 -v silent -n > jettyJmx.out

solr [ /opt/solr ]$ netstat -anp | grep CLOSE_WAIT | head -n1
tcp    1      0 10.xxx.x.xx:8983        10.xxx.x.xxx:53873      CLOSE_WAIT  1/java

solr [ /opt/solr ]$ grep "10.xxx.x.xxx:53873" jettyJmx.out
 |   |   +- SelectionKey@5ee4ef9f{i=0}->SelectChannelEndPoint@69feb476{/10.xxx.x.xxx:53873<->8983,Open,in,out,-,-,1712/5000,HttpConnection@c93d7fa}{io=0/0,kio=0,kro=1}

solr [ /opt/solr ]$ cat jettyJmx.out | grep 8983 | grep SelectChannelEndPoint | grep "{io=0/0,kio=0,kro=1}" | wc -l
8220
Re: 9000+ CLOSE_WAIT connections in solr v6.2.2 causing it to "die"
Maybe this is the issue: https://github.com/eclipse/jetty.project/issues/2169

I have noticed that when the number of http requests/sec is increased, CLOSE_WAITs increase linearly until solr stops accepting socket connections. Netstat output is:

$ netstat -ptan | awk '{print $6 " " $7 }' | sort | uniq -c
   9784 CLOSE_WAIT 1/java
      1 ESTABLISHED -
      3 ESTABLISHED 1/java
      1 Foreign Address
      2 LISTEN -
      4 LISTEN 1/java
     27 TIME_WAIT -
      1 established)

The custom client code is:

RequestConfig config = RequestConfig.custom()
    .setConnectionRequestTimeout(4000)
    .setConnectTimeout(4000)
    .setSocketTimeout(4000)
    .build();
ConnectingIOReactor ioReactor = new DefaultConnectingIOReactor();
PoolingNHttpClientConnectionManager cmAsync = new PoolingNHttpClientConnectionManager(ioReactor);
cmAsync.setMaxTotal(1);
cmAsync.setDefaultMaxPerRoute(1000);
asynClient = HttpAsyncClients.custom()
    .setDefaultRequestConfig(config)
    .setConnectionManager(cmAsync)
    .build();
asynClient.start();

executor = Executors.newScheduledThreadPool(1);
idleConnectionFuture = executor.scheduleAtFixedRate(() -> {
    cmAsync.closeExpiredConnections();
    cmAsync.closeIdleConnections(1, TimeUnit.SECONDS);
}, 1, 1, TimeUnit.SECONDS);

Also /solr/admin/cores takes a very long time to respond (QTime = 300 secs+), so curl with a timeout also causes an additional CLOSE_WAIT (as expected):

curl -m 5 'http://:/solr/admin/cores'

The custom handler's QTime is ~150 ms or lower, even under max load, for the active cores. Note there are 100s of solr cores on each solr jvm, but only a few need to be open at any given time in each solr jvm, to avoid heap memory bloat. solr.xml therefore has this setting (we're not running solr cloud):

<int name="transientCacheSize">${transientCacheSize:30}</int>

Also updateLog was removed from solrconfig.xml, because we saw 1000s of threads BLOCKED on VersionBucket even with:

${solr.ulog.numVersionBuckets:655360}

The side effect is lots of merges [ we'll tackle that when solr stops dying :-) ]
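To see whether the leak really tracks request rate, the netstat pipeline above can be reduced to a small helper that counts sockets in one TCP state and can be sampled on a schedule (a hedged sketch; on Linux net-tools output, the state is column 6):

```shell
# count_state STATE : read netstat -tan style lines on stdin, print how
# many connections are in STATE (column 6 of net-tools netstat output).
count_state() {
  awk -v s="$1" '$6 == s { n++ } END { print n + 0 }'
}

# Live usage (not run here):  netstat -tan | count_state CLOSE_WAIT
# Self-contained demonstration with canned netstat-like lines:
printf '%s\n' \
  'tcp 1 0 10.0.0.1:8983 10.0.0.2:53873 CLOSE_WAIT' \
  'tcp 0 0 10.0.0.1:8983 10.0.0.3:53874 ESTABLISHED' \
  'tcp 1 0 10.0.0.1:8983 10.0.0.4:53875 CLOSE_WAIT' \
  | count_state CLOSE_WAIT   # prints 2
```

Sampling this every few seconds and logging it next to the client's request counter would show directly whether CLOSE_WAITs grow linearly with request rate, as described above.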
Re: 9000+ CLOSE_WAIT connections in solr v6.2.2 causing it to "die"
> You said that you're running Solr 6.2.2, but there is no 6.2.2 version.
> but the JVM argument list includes "-Xmx512m" which is a 512MB heap

My typos. They're 6.6.2 and -Xmx30g respectively.

> many open connections causes is a large number of open file handles,

solr [ /opt/solr/server/logs ]$ sysctl -a | grep vm.max_map_count
vm.max_map_count = 262144

The only thing I notice right before the solr shutdown messages in solr.log is that the /update QTime goes from ~500ms to ~25. There is an automated health check that issues a kill on the process due to http connection timeout.
9000+ CLOSE_WAIT connections in solr v6.2.2 causing it to "die"
Hello,

In our solr non-cloud env, we are seeing lots of CLOSE_WAITs, causing the jvm to stop "working" within 3 mins of solr start.

solr [ /opt/solr ]$ netstat -anp | grep 8983 | grep CLOSE_WAIT | grep 10.xxx.xxx.xxx | wc -l
9453

The only option then is `kill -9`, because even `jcmd Thread.print` is unable to connect to the jvm. The problem can be reproduced at will. Any suggestions on what could be causing this, or the fix?

Details of the system are as follows; it has been set up for "bulk indexing".

Solr / server:
- v6.2.2 non-solrcloud in a docker with kubernetes
- java: 1.8.0_151 25.151-b12 HotSpot 64bit | Oracle
- jvm: heap 30GB
- os: Linux 3.16.0-4-amd64 #1 SMP Debian 3.16.43-2+deb8u5 (2017-09-19) x86_64 GNU/Linux
- os memory: 230GB | no swap configured
- os cpu: 32vCPU
- jvm flags: "-XX:+UseLargePages", "-XX:LargePageSizeInBytes=2m", "-Xms512m", "-Xmx512m", "-XX:NewRatio=3", "-XX:SurvivorRatio=4", "-XX:TargetSurvivorRatio=90", "-XX:MaxTenuringThreshold=8", "-XX:+UseConcMarkSweepGC", "-XX:+UseParNewGC", "-XX:ConcGCThreads=4", "-XX:ParallelGCThreads=4", "-XX:+CMSScavengeBeforeRemark", "-XX:PretenureSizeThreshold=64m", "-XX:+UseCMSInitiatingOccupancyOnly", "-XX:CMSInitiatingOccupancyFraction=50", "-XX:CMSMaxAbortablePrecleanTime=6000", "-XX:+CMSParallelRemarkEnabled", "-XX:+ParallelRefProcEnabled"
- non-cloud solr.xml: transientCacheSize = 30, shareSchema = true
- Only 4 cores are POSTed to.

Client / java8 app:
- An AsyncHTTPClient POST-ing gzip payloads (PoolingNHttpClientConnectionManager maxtotal=10,000 and maxperroute=1000)
- ConnectionRequestTimeout = ConnectTimeout = SocketTimeout = 4000 (4 secs)
- Gzip payloads: about 800 json messages like this: [ {id:"abcdefx", datetimestamp:"xx", key1:"xx", key2:"z", }, ]
- POST rate: each of the 4 solr cores receives ~32 payloads per second from the custom java app (the plugin handler metrics in solr report the same). Approx ~102,000 docs per sec in total (32 payloads x 800 docs x 4 solr cores).
- Document uniqueness: no doc or id is ever repeated or concurrently sent. No atomic updates needed (overwrite=false in AddUpdateCommand was set in the solr handler).

Solrconfig.xml: for the bulk indexing requirement, updateLog and softCommit were minimized / removed.

none 200 1 6
<autoCommit>
   <maxTime>${solr.autoCommit.maxTime:1}</maxTime>
   <openSearcher>true</openSearcher>
</autoCommit>
<autoSoftCommit>
   <maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
   <maxDocs>${solr.autoSoftCommit.maxDocs:-1}</maxDocs>
   <openSearcher>false</openSearcher>
</autoSoftCommit>

-M
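The quoted ingest rate can be sanity-checked from the numbers in the message (32 payloads/sec per core, 800 docs per payload, 4 cores):

```shell
# Recompute the aggregate doc rate from the figures stated above.
payloads_per_sec=32
docs_per_payload=800
cores=4
echo $(( payloads_per_sec * docs_per_payload * cores ))   # 102400 docs/sec
```

102,400 docs/sec agrees with the "~102,000" figure in the message, so the per-core and aggregate numbers are at least internally consistent.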