Update: after I removed the shards.preference parameter from solrconfig.xml, the issue is gone and internal shard requests are now balanced. The same parameter works fine with Solr 7.6. I'm still not sure of the root cause, but I observed a strange coincidence: the nodes most frequently picked for shard requests are the first node in each shard as returned by the CLUSTERSTATUS API. It seems something is wrong with the shuffling of equally ranked nodes when shards.preference is set. Will report back if I find more.
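To make the hypothesis concrete, here is a toy simulation in Python. It is not Solr's code; the names pick_replica and simulate are made up for illustration, and it assumes every replica gets the same preference rank (all TLOG, none local to the requesting node). It only shows that a stable sort with no shuffle among ties always returns the first replica in list order, while shuffling before the sort spreads the picks evenly. The real skew I saw was about 80%, not 100%, so this is a simplification.

```python
import random
from collections import Counter


def pick_replica(replicas, preference_key, shuffle_first, rng):
    """Return the replica that sorts first under preference_key."""
    candidates = list(replicas)
    if shuffle_first:
        # Randomize the order so that ties are broken differently each time.
        rng.shuffle(candidates)
    # list.sort() is stable: replicas that compare equal keep their input order.
    candidates.sort(key=preference_key)
    return candidates[0]


def same_rank(replica):
    # Assumption: every replica matches the preference rules equally well.
    return 0


def simulate(shuffle_first, requests=10000, seed=0):
    """Count how often each replica of one shard is picked."""
    rng = random.Random(seed)
    # Ten replicas of one shard, in a fixed order (as in a cluster state listing).
    replicas = [f"node{i}" for i in range(10)]
    return Counter(
        pick_replica(replicas, same_rank, shuffle_first, rng)
        for _ in range(requests)
    )


if __name__ == "__main__":
    print("no shuffle:  ", dict(simulate(shuffle_first=False)))
    print("with shuffle:", dict(simulate(shuffle_first=True)))
```

Without the shuffle, node0 receives every request; with it, all ten nodes receive a roughly equal share.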

On Mon, Apr 27, 2020 at 5:59 PM Wei <weiwan...@gmail.com> wrote:

> Hi Eric,
>
> I am measuring the number of shard requests, and it's for query only, no
> indexing requests. I have an external load balancer and see each node
> received about the equal number of external queries. However for the
> internal shard queries, the distribution is uneven: 6 nodes (one in
> each shard, some of them are leaders and some are non-leaders ) gets about
> 80% of the shard requests, the other 54 nodes gets about 20% of the shard
> requests. I checked a few other parameters set:
>
> -Dsolr.disable.shardsWhitelist=true
> shards.preference=replica.location:local,replica.type:TLOG
>
> Nothing seems to cause the strange behavior. Any suggestions how to
> debug this?
>
> -Wei
>
>
> On Mon, Apr 27, 2020 at 5:42 PM Erick Erickson <erickerick...@gmail.com>
> wrote:
>
>> Wei:
>>
>> How are you measuring utilization here? The number of incoming requests
>> or CPU?
>>
>> The leader for each shard are certainly handling all of the indexing
>> requests since they’re TLOG replicas, so that’s one thing that might
>> skewing your measurements.
>>
>> Best,
>> Erick
>>
>> > On Apr 27, 2020, at 7:13 PM, Wei <weiwan...@gmail.com> wrote:
>> >
>> > Hi everyone,
>> >
>> > I have a strange issue after upgrade from 7.6.0 to 8.4.1. My cloud has 6
>> > shards with 10 TLOG replicas each shard. After upgrade I noticed that one
>> > of the replicas in each shard is handling most of the distributed shard
>> > requests, so 6 nodes are heavily loaded while other nodes are idle. There
>> > is no change in shard handler configuration:
>> >
>> > <shardHandlerFactory name="shardHandlerFactory" class="HttpShardHandlerFactory">
>> >   <int name="socketTimeout">30000</int>
>> >   <int name="connTimeout">30000</int>
>> >   <int name="maxConnectionsPerHost">500</int>
>> > </shardHandlerFactory>
>> >
>> > What could cause the unbalanced internal distributed request?
>> >
>> > Thanks in advance.
>> >
>> > Wei
>>