I recently observed some strange behavior from a multi-sharded TLOG/PULL Solr 
cloud (running 9.8.1), where the cloud was intermittently returning out-of-date 
results for an extended period of time. At first, the out-of-date results 
seemed like a mysterious potential bug, because the cloud was reporting that 
all replicas were active. After digging into it, I eventually found that this 
cloud had a PULL replica core which had been experiencing hours of continuous 
replication failures. Since it was hard to tell the exact state of the core 
from the logs, I was only able to confirm that the replica was out of sync by 
checking the timesFailed field in the core’s replication.properties file.
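
A minimal sketch of how that check could be scripted. It assumes 
replication.properties is a plain Java-style key=value file with a numeric 
timesFailed entry. The function name and the example path are hypothetical and 
not part of Solr.

```python
from pathlib import Path


def read_times_failed(properties_text: str) -> int:
    """Return the timesFailed counter from replication.properties text (0 if absent)."""
    for raw_line in properties_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and Java-properties comment lines ('#' or '!').
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep and key.strip() == "timesFailed":
            return int(value.strip())
    return 0


if __name__ == "__main__":
    # Hypothetical path: the real location depends on the core's data directory.
    path = Path("/var/solr/data/mycore/data/replication.properties")
    if path.exists():
        print("timesFailed =", read_times_failed(path.read_text(encoding="utf-8")))
```

Reading the value twice, some minutes apart, and comparing the two numbers 
would show whether failures are still accumulating.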


As I understand it, PULL replicas are designed to be resilient against leader 
election failures, so once they are active they will continue to serve query 
traffic even if replicating from the leader fails. However, my observations 
from this 9.8.1 cloud have left me with a couple of questions:


  1) Would it make sense to add some logs/telemetry to allow SolrCloud users to 
track PULL node replication better? Or would that type of change conflict with 
the purpose of PULL nodes (i.e. if users want more consistent query results, 
they shouldn’t use PULLs)?


  2) Has anyone else observed these back-to-back replication failures with Solr 
9 clouds? If so, have you adopted any particular strategies to track them?
