I've run into this (or a similar) issue in the past (Solr 6? I don't
remember exactly) where tlogs get stuck, either growing indefinitely
or refusing to commit on restart.

What I ended up doing was writing a monitor that checked the number of
tlog files and alerted if it went over some limit (100 or whatever), so
I could stay ahead of the issue by rebuilding individual nodes as
needed.
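
Roughly along these lines (a quick Python sketch, not what I actually
ran; the data dir path and the 100-file limit are placeholders for
whatever your install uses):

import glob
import os
import sys

# Placeholders: point this at your Solr data dir(s) and pick a limit.
SOLR_DATA_DIRS = ["/var/solr/data"]
TLOG_LIMIT = 100

over_limit = False
for data_dir in SOLR_DATA_DIRS:
    # Look for tlog/ dirs one level below the data dir; adjust the
    # glob to match your core layout.
    for tlog_dir in glob.glob(os.path.join(data_dir, "*", "tlog")):
        count = len(glob.glob(os.path.join(tlog_dir, "tlog.*")))
        if count > TLOG_LIMIT:
            over_limit = True
            print("WARN %s: %d tlog files (limit %d)"
                  % (tlog_dir, count, TLOG_LIMIT))

# Non-zero exit so cron / your alerting tool can flag it.
sys.exit(1 if over_limit else 0)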

Are yours always growing, on all nodes, forever?  Or is it just one or
two nodes that end up in a bad state?

On Tue, Feb 16, 2021 at 3:57 PM mmb1234 <m...@vmware.com> wrote:
>
> Looks like the problem is related to tlog rotation on the follower shard.
>
> We did the following for a specific shard.
>
> 0. start solr cloud
> 1. solr-0 (leader), solr-1, solr-2
> 2. rebalance to make solr-1 the preferred leader
> 3. solr-0, solr-1 (leader), solr-2
>
> The tlog file on solr-0 kept growing indefinitely (100s of GBs) until we
> shut down the cluster and manually dropped all shards.
>
> The only way to "restart" tlog rotation on solr-0 (follower) was to issue
> /admin/cores?action=RELOAD&core=xxxxx at least twice while the tlog size
> was still small (in MBs).
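>
> (For reference, a rough sketch of that call in Python; host, port and
> core name are placeholders, not our real values:)
>
> import urllib.request
>
> # CoreAdmin RELOAD; "CORE_NAME" is a placeholder.
> url = ("http://localhost:8983/solr/admin/cores"
>        "?action=RELOAD&core=CORE_NAME")
> print(urllib.request.urlopen(url).read().decode())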
>
> Also, if a rebalance is issued to select solr-0 as the leader, leader
> election never completes.
>
> Output after step (3) above (tlog sizes per node):
>
> solr-0
> 2140856 ./data2/mydata_0_e0000000-ffffffff/tlog
> 2140712 ./data2/mydata_0_e0000000-ffffffff/tlog/tlog.0000000000000000021
>
> solr-1 (leader)
> 35268   ./data2/mydata_0_e0000000-ffffffff/tlog
> 35264   ./data2/mydata_0_e0000000-ffffffff/tlog/tlog.0000000000000000055
>
> solr-2
> 35256   ./data2/mydata_0_e0000000-ffffffff/tlog
> 35252   ./data2/mydata_0_e0000000-ffffffff/tlog/tlog.0000000000000000054
>
