Hi,
I run a small 3-node test cluster with the Solr Operator and Solr 9.6.1. I have
configured the affinity placement plugin as follows:
{
  "plugin": {
    ".placement-plugin": {
      "name": ".placement-plugin",
      "class": "org.apache.solr.cluster.placement.plugins.AffinityPlacementFactory",
      "config": {"minimalFreeDiskGB": 2, "prioritizedFreeDiskGB": 100}
    }
  }
}
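For anyone who wants to reproduce this setup: one way to register such a configuration is the cluster plugin API (POST to /api/cluster/plugin with an "add" command). The snippet below is only a minimal sketch of that route, not necessarily how my cluster was configured. The helper names build_add_plugin_payload and register_plugin and the base URL http://localhost:8983 are assumptions made for illustration.

```python
import json
import urllib.request

PLUGIN_CLASS = "org.apache.solr.cluster.placement.plugins.AffinityPlacementFactory"


def build_add_plugin_payload(minimal_free_disk_gb, prioritized_free_disk_gb):
    """Build the 'add' command body for the cluster plugin API (hypothetical helper)."""
    return {
        "add": {
            "name": ".placement-plugin",
            "class": PLUGIN_CLASS,
            "config": {
                "minimalFreeDiskGB": minimal_free_disk_gb,
                "prioritizedFreeDiskGB": prioritized_free_disk_gb,
            },
        }
    }


def register_plugin(base_url, payload):
    """POST the payload to <base_url>/api/cluster/plugin and return the HTTP status."""
    request = urllib.request.Request(
        base_url.rstrip("/") + "/api/cluster/plugin",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status


if __name__ == "__main__":
    payload = build_add_plugin_payload(2, 100)
    print(json.dumps(payload, indent=2))
    # Uncomment to send to a running cluster (the URL is an assumption):
    # print(register_plugin("http://localhost:8983", payload))
```

If a placement plugin is already registered under that name, the same body would presumably need to be sent with "update" instead of "add".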
There is plenty of free disk and all three PODs are healthy.
Now I can create one or a few collections with 3 NRT replicas successfully. The
affinity plugin makes sure that each replica is placed on a different POD (as
opposed to the default, which is round-robin). Also, if one of the PODs is down,
the plugin throws an error so the client can retry creating the collection once
all three PODs are online.
Now, after some time, creating another collection fails with the message "Not
enough eligible nodes to place 3 replica(s) of type NRT for shard shard1 of
collection foo", even though the cluster is healthy with three nodes online and
all three nodes listed in "live_nodes". The full stack trace is here:
https://gist.github.com/janhoy/a50e48d93be6b849cbf0a6722a89ba21
Looks like the OrderedNodePlacementPlugin somehow believes that two nodes are
down or otherwise not eligible.
I have to restart/delete one or two PODs for it to work again. I first thought
it would be enough to restart the overseer node, but the last time I tried, the
error message only became worse: "Only able to place 0 replicas". One or two
more restarts may make it work again, before it becomes locked once more.
Debug logging does not reveal much more.
I see a few similar test failures on the builds mailing list:
- BATS test "Affinity placement plugin using sysprop" fails three times in 2023
- PlacementPluginIntegrationTest fails three times in 2023 and once June 1st
Anyone have any insight?
Jan