I recently observed some strange behavior from a multi-shard TLOG/PULL SolrCloud cluster (running 9.8.1): it intermittently returned out-of-date results for an extended period of time. At first this looked like a possible bug, because the cluster reported all replicas as active. After digging into it, I found that one PULL replica core had been failing to replicate continuously for hours. The logs did not make the core's exact state clear, so the only way I could confirm that the replica was out of sync was by checking the timesFailed field in the core's replication.properties file.
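For anyone who wants to run the same check, here is a minimal Python sketch of it. It assumes replication.properties is a plain Java-style properties file containing a timesFailed=<integer> line; the function names and the simplified parser are my own illustration, not anything shipped with Solr, and the file path has to be supplied by the caller.

```python
from pathlib import Path


def parse_properties(text: str) -> dict[str, str]:
    """Parse simple key=value lines, skipping blank lines and comments.

    This is a simplified reader, not a full Java .properties parser
    (no line continuations or escape handling).
    """
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def times_failed(text: str) -> int:
    """Return the timesFailed counter, or 0 if the key is absent."""
    return int(parse_properties(text).get("timesFailed", "0"))


def read_times_failed(path: str) -> int:
    """Read timesFailed from a replication.properties file on disk."""
    return times_failed(Path(path).read_text(encoding="utf-8"))


def failures_increased(previous: int, current: int) -> bool:
    """True if the counter grew between two polls of the same core."""
    return current > previous
```

Polling read_times_failed on each PULL replica's core and alerting when failures_increased returns True would be one crude way to spot a replica that keeps failing to replicate.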
As I understand it, PULL replicas are designed to be resilient against leader election failures, so once they are active they continue to serve query traffic even if replication from the leader fails. However, what I saw on this 9.8.1 cluster has left me with a couple of questions:
1) Would it make sense to add logging or telemetry so that SolrCloud users can track PULL replica replication more easily? Or would that kind of change conflict with the purpose of PULL replicas (i.e. if users want more consistent query results, they shouldn't use PULL replicas)?
2) Has anyone else observed these back-to-back replication failures on Solr 9 clusters? If so, have you adopted any particular strategies to track them?
