I've run into this (or a similar) issue in the past (Solr 6? I don't remember exactly), where tlogs get stuck, either growing indefinitely or refusing to commit on restart.
What I ended up doing was writing a monitor to check the number of tlog files and alert if it went over some limit (100 or whatever), so I could stay ahead of the issue by rebuilding individual nodes as needed; a rough sketch of that kind of monitor is below the quoted message. Are yours always growing, on all nodes, forever? Or is it just one or two that end up in a bad state?

On Tue, Feb 16, 2021 at 3:57 PM mmb1234 <m...@vmware.com> wrote:
>
> Looks like the problem is related to tlog rotation on the follower shard.
>
> We did the following for a specific shard:
>
> 0. start Solr Cloud
> 1. solr-0 (leader), solr-1, solr-2
> 2. rebalance to make solr-1 the preferred leader
> 3. solr-0, solr-1 (leader), solr-2
>
> The tlog file on solr-0 kept growing indefinitely (100s of GBs) until we
> shut down the cluster and dropped all shards (manually).
>
> The only way to "restart" tlog rotation on solr-0 (the follower) was to
> issue /admin/cores?action=RELOAD&core=xxxxx at least twice while the tlog
> size was still small (in MBs).
>
> Also, if a rebalance is issued to select solr-0 as the leader, leader
> election never completes.
>
> solr-0 output after step (3) above:
>
> solr-0
> 2140856 ./data2/mydata_0_e0000000-ffffffff/tlog
> 2140712 ./data2/mydata_0_e0000000-ffffffff/tlog/tlog.0000000000000000021
>
> solr-1 (leader)
> 35268 ./data2/mydata_0_e0000000-ffffffff/tlog
> 35264 ./data2/mydata_0_e0000000-ffffffff/tlog/tlog.0000000000000000055
>
> solr-2
> 35256 ./data2/mydata_0_e0000000-ffffffff/tlog
> 35252 ./data2/mydata_0_e0000000-ffffffff/tlog/tlog.0000000000000000054
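For what it's worth, the monitor was nothing fancy. A minimal sketch in Python, assuming a directory layout like the one in your du output (DATA_ROOT, the threshold, and the tlog naming below are placeholders for your setup, not taken from your post):

#!/usr/bin/env python3
# Minimal tlog-count monitor sketch. Point DATA_ROOT at the directory
# that holds your core directories (each with a tlog/ subdirectory).
import sys
from pathlib import Path

DATA_ROOT = Path("./data2")  # placeholder: your Solr data root
MAX_TLOGS = 100              # placeholder: alert threshold

def check(root: Path, limit: int) -> int:
    over = 0
    for tlog_dir in root.glob("*/tlog"):
        if not tlog_dir.is_dir():
            continue
        # Count files named tlog.<sequence> in each core's tlog directory.
        count = sum(1 for f in tlog_dir.iterdir() if f.name.startswith("tlog."))
        if count > limit:
            print(f"ALERT: {tlog_dir} has {count} tlog files (limit {limit})")
            over += 1
    return over

if __name__ == "__main__":
    sys.exit(1 if check(DATA_ROOT, MAX_TLOGS) else 0)

Hook that up to cron or whatever alerting you already have, and rebuild a node when it fires.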
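And for reference, the RELOAD you mention is the standard CoreAdmin API call; the same request from Python looks something like this (host, port, and core name are placeholders):

#!/usr/bin/env python3
# CoreAdmin RELOAD call, as described above. SOLR and CORE are
# placeholders for your host/port and the affected core's name.
import urllib.request

SOLR = "http://localhost:8983/solr"  # placeholder base URL
CORE = "mydata_0_e0000000-ffffffff"  # placeholder core name

with urllib.request.urlopen(f"{SOLR}/admin/cores?action=RELOAD&core={CORE}") as resp:
    print(resp.status, resp.read().decode())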