Thanks, Shawn.
autoCommit is enabled, and it also has openSearcher set to true: TLOG / PULL
replicas have no softCommit, so we need to open a new searcher during
autoCommit.
<autoCommit>
  <maxTime>${solr.autoCommit.maxTime:300000}</maxTime>
  <maxDocs>${solr.autoCommit.maxDocs:1000}</maxDocs>
  <openSearcher>true</openSearcher>
</autoCommit>
When we tried to reload the collection, the node in question (node-7) timed out
without any errors (general timeout, 180s).
We have multiple clusters that run a similar setup (they differ in the number of
nodes, docs, and node size); none of them has ended up in such a weird state.
This is a bit worrying as in bigger clusters, without proper monitoring and
alerting[*], one might end up serving outdated content.
We are planning to upgrade to 9.2.1 and actively monitor the state of the
nodes.
[*] - we still need to figure out which metrics could tell us that the active
index is lagging behind the leader. We have an idea, though:
"sum(rate(solr_metrics_core_searcher_documents{namespace!=""}[10m])) by (pod, collection)"
could give us at least some understanding of the index state on each node.
Careful: this works only if the collections receive continuous updates.
Does anyone have better ideas on how to monitor the lag of the active index?
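
One possible alternative to the rate-based query, as a rough sketch only: ask each
core's replication handler for its current index generation and compare the
follower against the leader. This assumes that "/replication?command=details"
returns a "details" object with a "generation" field, and that a follower reports
the same generation as the leader once it has caught up; both should be checked
against your Solr version. The helper names, base URLs and the max_lag threshold
are made up for illustration.

```python
import json
from urllib.request import urlopen


def fetch_details(base_url, core, timeout=10):
    """Fetch replication details for one core.

    Assumption: the response is JSON with a top-level "details" object.
    """
    url = f"{base_url}/solr/{core}/replication?command=details&wt=json"
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)["details"]


def generation_lag(leader_details, follower_details):
    """Number of index generations the follower is behind the leader."""
    return int(leader_details["generation"]) - int(follower_details["generation"])


def is_stale(leader_details, follower_details, max_lag=1):
    """True if the follower is more than max_lag generations behind."""
    return generation_lag(leader_details, follower_details) > max_lag
```

Unlike the rate-based query, a check like this does not depend on continuous
updates, since it only fires when the leader has commits the follower lacks.
The threshold leaves room for a follower that is simply between two polls.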
Thanks,
---
Nick Vladiceanu
[email protected]
> On 9. Jun 2023, at 10:34, Shawn Heisey <[email protected]> wrote:
>
> On 6/9/23 01:43, Nick Vladiceanu wrote:
>> We noticed that we get inconsistent results for the same query if run
>> multiple times. Out of 4 requests, one of them was returning empty response
>> when we were running “/select?q=id:12345&distrib=true”.
>> Started checking each core and we noticed that the core on node-7 had "Last
>> Modified: 9 days ago" (Solr UI -> selected the core -> Overview). On the
>> right side, "Instance details" were showing that we are using "Index:
>> /var/solr/data/collection_0_shard2_replica_t15/data/index.20230530170400660".
>> Something is wrong.
>
> I suspect that you may have turned off autoCommit. If so, that's a bad idea.
>
> The solrconfig.xml should always have autoCommit configured with a relatively
> short maxTime. The configs Solr ships with have maxTime set to 15000
> milliseconds. In most cases, the autoCommit should have openSearcher set to
> false. I personally increase the maxTime to 60000 milliseconds, so there is
> less overall system load, but the 15 second interval in the example configs
> works very well. If it didn't, it wouldn't be in the example configs.
>
> TLOG and PULL followers query the leader for changes on an interval that's
> half of the autoCommit maxTime setting.
>
> Note that autoCommit serves a very different purpose than autoSoftCommit. If
> you're going to disable one of them, it should be autoSoftCommit.
>
> https://lucidworks.com/post/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
>
> If you haven't disabled autoCommit then I don't know what might be wrong.
>
> You should upgrade to 9.2.1. The list of bugs fixed between 9.1 and 9.2 is
> very extensive, and I have actually run into a number of them on 9.1.1.
>
> I have never understood what makes Solr use an "index.NNNNNNNN" directory
> instead of just "index" or when it switches to a new directory. I know it
> has something to do with replication, which is the Solr feature that
> SolrCloud uses to copy TLOG/PULL replica data.
>
> If you are finding that you've got extra data in some of your replicas from
> multiple index directories, just reload the collection. That should get
> straightened out.
>
> Thanks,
> Shawn