Hi All,

In our SolrCloud cluster (8.4.1), committed documents are sometimes not
visible to subsequent requests sent after an apparently successful
commit(waitFlush=true, waitSearcher=true). This behaviour does not happen
if all nodes are stable, but will happen eventually if we kill off random
nodes using a chaosMonkey script.

According to Solr's documentation, a commit with openSearcher=true,
waitSearcher=true and waitFlush=true only returns once everything is
persisted AND the new searcher is visible.

To me this sounds like any subsequent request after a successful
commit MUST hit the new searcher and is guaranteed to see the committed
changes, regardless of node failures or restarts.

Is this assumption on strong-consistency for commits with
openSearcher=true, waitSearcher=true and waitFlush=true correct?

If so, we discovered a bug.

TestSetup:
============
Infrastructure

   - 3 Solr (8.4.1) nodes in Docker containers
   - Each Solr node on its own host (the same hosts run the 3 ZooKeeper nodes)
   - A persistent host volume is mounted inside each Docker container
   - Solr instances are pinned to their hosts.
   - A test collection with 1 shard and 2 NRT replicas.
   - Using SolrJ (8.4.1) and CloudSolrClient for communication.
   - Containers are automatically restarted on errors
   - autoCommit maxDocs=10000, openSearcher=false
   - autoSoftCommit -never-
   - (We fairly often commit ourselves)
   - the solrconfig.xml <https://pastebin.com/kJpGh3yj>


Scenario
After adding an initial batch of documents we perform multiple
"transactions".
Each "transaction" adds, modifies and deletes documents, and we ensure that
each response has an "rf=2" (achieved replication factor = 2) attribute.
A transaction has to become visible atomically, or not at all.
We achieve this by storing a CurrentVersion counter attribute in each
document.
This makes verifying this corner case easier, as we can search for
and count all documents having a specific transaction-id-counter value.
After a "transaction" has been performed without errors, we first send a
hardCommit and then a softCommit, both with waitFlush=true and
waitSearcher=true, and ensure that both return without errors.
Only after everything has happened without errors do we start to verify
the visibility and correctness of the committed "transaction" by sending
counting queries to Solr, filtering on our transaction-id-counter.
This works fine as long as all nodes are stable. However ...

ErrorCase
If we periodically kill (SIGTERM) random Solr nodes every 30, eventually
the aforementioned visibility guarantees after commit(waitFlush=true,
waitSearcher=true) break, and documents that should be there/visible are not.
Sometimes this happens after minutes, sometimes it takes hours to hit this
case.

In the error case the verification counting queries return with ZERO hits.

We suspect that commits do not reach all replicas or that commits are
lost/ignored.
Unfortunately, the response to a commit request does not include the "rf"
attribute, which would allow us to assert the achieved replication factor.
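
To narrow this down, one option might be to query every replica core
directly with distrib=false once the commits have returned, and to compare
the per-replica counts. The following is only a minimal sketch in plain
Java: the core URLs and the class/method names are hypothetical,
CurrentVersion is the counter field described above, and actually fetching
the URLs and reading numFound from the responses is left out.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class ReplicaCheck {

    // Builds a count-only query URL. With distrib=false the query is answered
    // by the addressed core alone, so numFound reflects that replica's
    // currently registered searcher.
    static String countUrl(String coreBaseUrl, String field, long version) {
        String q = URLEncoder.encode(field + ":" + version, StandardCharsets.UTF_8);
        return coreBaseUrl + "/select?q=" + q + "&rows=0&distrib=false&wt=json";
    }

    // Returns the names of all replicas whose numFound differs from the
    // number of documents the transaction is expected to have made visible.
    static List<String> laggingReplicas(Map<String, Long> numFoundByReplica, long expected) {
        List<String> lagging = new ArrayList<>();
        for (Map.Entry<String, Long> e : numFoundByReplica.entrySet()) {
            if (e.getValue() != expected) {
                lagging.add(e.getKey());
            }
        }
        return lagging;
    }

    public static void main(String[] args) {
        // Hypothetical core base URLs; the real ones would come from the
        // cluster state of the test collection.
        String[] cores = {
            "http://host1:8983/solr/test_shard1_replica_n1",
            "http://host2:8983/solr/test_shard1_replica_n2"
        };
        long version = 42L; // hypothetical transaction-id-counter value
        for (String core : cores) {
            System.out.println(countUrl(core, "CurrentVersion", version));
        }
    }
}
```

If one replica reports the expected count and the other reports zero, that
would at least tell us which replica never opened a searcher on the
committed data.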

We hope someone has an idea or clue how to fix this or why this happens, as
this is a showstopper for us and we require strong-consistency guarantees.
(Once a commit was successful at time T, ALL subsequent requests after T
MUST see the new documents)


Some notes:

   - The observed errors can be reproduced regardless of these settings in
   the solrconfig.xml <https://pastebin.com/kJpGh3yj>
      - useColdSearcher=true/false
      - the caches' autowarmCount=0 or any other value
   - Errors appear to happen more frequently if we have more load (more
   collections with the same test)


Cheers,
Michael
