Thanks Michael! Yes, in each shard I have 10 TLOG replicas and no other replica types, and each TLOG replica is an individual Solr instance on its own physical machine. In the JIRA you mentioned 'when "last place matches" == "first place matches" – e.g. when shards.preference specified matches *all* available replicas'. My setting is shards.preference=replica.location:local,replica.type:TLOG. I also tried just shards.preference=replica.location:local, and the issue is still there. Can you explain a bit more?
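
One way to picture the situation described in the issue, as a toy Python sketch and not Solr's actual code: if every replica compares equal under the preference rules and no random ordering is applied to those ties, a stable sort leaves the replicas in their listed order, so the first one is always picked. The names pick_replica and request_distribution, the constant sort key, and the replica labels are all hypothetical and exist only for illustration.

```python
import random
from collections import Counter


def pick_replica(replicas, preference_key, rng, shuffle_first):
    """Return the replica chosen for a single shard request."""
    candidates = list(replicas)
    if shuffle_first:
        # Randomize the order first so that ties are broken randomly.
        rng.shuffle(candidates)
    # list.sort() is stable: replicas with equal keys keep their prior order.
    candidates.sort(key=preference_key)
    return candidates[0]


def request_distribution(replicas, preference_key, n_requests, shuffle_first, seed=0):
    """Count how often each replica is chosen over n_requests requests."""
    rng = random.Random(seed)
    return Counter(
        pick_replica(replicas, preference_key, rng, shuffle_first)
        for _ in range(n_requests)
    )


# Hypothetical shard with 10 replicas, all of the same type.
replicas = [f"replica{i}" for i in range(10)]


def all_equal(replica):
    # Assumption: every replica matches the preference equally well.
    return 0


skewed = request_distribution(replicas, all_equal, 1000, shuffle_first=False)
balanced = request_distribution(replicas, all_equal, 1000, shuffle_first=True)
```

In this sketch, skewed sends every request to the first listed replica, while balanced spreads requests across all ten. That would be consistent with the first replica per shard from CLUSTERSTATUS receiving most of the shard requests, though the real code path in Solr may differ.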

On Mon, May 11, 2020 at 12:26 PM Michael Gibney <mich...@michaelgibney.net> wrote:

> FYI: https://issues.apache.org/jira/browse/SOLR-14471
> Wei, assuming you have only TLOG replicas, your "last place" matches
> (to which the random fallback ordering would not be applied -- see
> above issue) would be the same as the "first place" matches selected
> for executing distributed requests.
>
>
> On Mon, May 11, 2020 at 1:49 PM Michael Gibney
> <mich...@michaelgibney.net> wrote:
> >
> > Wei, probably no need to answer my earlier questions; I think I see
> > the problem here, and believe it is indeed a bug, introduced in 8.3.
> > Will file an issue and submit a patch shortly.
> > Michael
> >
> > On Mon, May 11, 2020 at 12:49 PM Michael Gibney
> > <mich...@michaelgibney.net> wrote:
> > >
> > > Hi Wei,
> > >
> > > In considering this problem, I'm stumbling a bit on terminology
> > > (particularly, where you mention "nodes", I think you're referring to
> > > "replicas"?). Could you confirm that you have 10 TLOG replicas per
> > > shard, for each of 6 shards? How many *nodes* (i.e., running solr
> > > server instances) do you have, and what is the replica placement like
> > > across those nodes? What, if any, non-TLOG replicas do you have per
> > > shard (not that it's necessarily relevant, but just to get a complete
> > > picture of the situation)?
> > >
> > > If you're able without too much trouble, can you determine what the
> > > behavior is like on Solr 8.3? (there were different changes introduced
> > > to potentially relevant code in 8.3 and 8.4, and knowing whether the
> > > behavior you're observing manifests on 8.3 would help narrow down
> > > where to look for an explanation).
> > >
> > > Michael
> > >
> > > On Fri, May 8, 2020 at 7:34 PM Wei <weiwan...@gmail.com> wrote:
> > > >
> > > > Update: after I remove the shards.preference parameter from
> > > > solrconfig.xml, issue is gone and internal shard requests are now
> > > > balanced. The same parameter works fine with solr 7.6. Still not sure of
> > > > the root cause, but I observed a strange coincidence: the nodes that are
> > > > most frequently picked for shard requests are the first node in each shard
> > > > returned from the CLUSTERSTATUS api. Seems something wrong with shuffling
> > > > equally compared nodes when shards.preference is set. Will report back if
> > > > I find more.
> > > >
> > > > On Mon, Apr 27, 2020 at 5:59 PM Wei <weiwan...@gmail.com> wrote:
> > > >
> > > > > Hi Eric,
> > > > >
> > > > > I am measuring the number of shard requests, and it's for query only, no
> > > > > indexing requests. I have an external load balancer and see each node
> > > > > received about the equal number of external queries. However for the
> > > > > internal shard queries, the distribution is uneven: 6 nodes (one in
> > > > > each shard, some of them are leaders and some are non-leaders ) gets about
> > > > > 80% of the shard requests, the other 54 nodes gets about 20% of the shard
> > > > > requests. I checked a few other parameters set:
> > > > >
> > > > > -Dsolr.disable.shardsWhitelist=true
> > > > > shards.preference=replica.location:local,replica.type:TLOG
> > > > >
> > > > > Nothing seems to cause the strange behavior. Any suggestions how to
> > > > > debug this?
> > > > >
> > > > > -Wei
> > > > >
> > > > >
> > > > > On Mon, Apr 27, 2020 at 5:42 PM Erick Erickson <erickerick...@gmail.com>
> > > > > wrote:
> > > > >
> > > > >> Wei:
> > > > >>
> > > > >> How are you measuring utilization here? The number of incoming requests
> > > > >> or CPU?
> > > > >>
> > > > >> The leader for each shard are certainly handling all of the indexing
> > > > >> requests since they’re TLOG replicas, so that’s one thing that might
> > > > >> skewing your measurements.
> > > > >>
> > > > >> Best,
> > > > >> Erick
> > > > >>
> > > > >> > On Apr 27, 2020, at 7:13 PM, Wei <weiwan...@gmail.com> wrote:
> > > > >> >
> > > > >> > Hi everyone,
> > > > >> >
> > > > >> > I have a strange issue after upgrade from 7.6.0 to 8.4.1. My cloud has 6
> > > > >> > shards with 10 TLOG replicas each shard. After upgrade I noticed that one
> > > > >> > of the replicas in each shard is handling most of the distributed shard
> > > > >> > requests, so 6 nodes are heavily loaded while other nodes are idle. There
> > > > >> > is no change in shard handler configuration:
> > > > >> >
> > > > >> > <shardHandlerFactory name="shardHandlerFactory" class="HttpShardHandlerFactory">
> > > > >> >   <int name="socketTimeout">30000</int>
> > > > >> >   <int name="connTimeout">30000</int>
> > > > >> >   <int name="maxConnectionsPerHost">500</int>
> > > > >> > </shardHandlerFactory>
> > > > >> >
> > > > >> > What could cause the unbalanced internal distributed request?
> > > > >> >
> > > > >> > Thanks in advance.
> > > > >> >
> > > > >> > Wei