Hello,
I have an 8-node cluster. Under heavy load a tserver goes down. We have a
systemd unit file that restarts it automatically, but after the restart
tablets stay unassigned for about an hour.
In the log of the restarted tserver I see:
WARN: Saw (possibly) transient exception communicating with zookeeper
and then the error:
KeeperErrorCode = ConnectionLoss for /accumulo/<instance>/xxx
KeeperErrorCode = ConnectionLoss
at KeeperException.create(KeeperException.java:102)
at KeeperException.create(KeeperException.java:54)
at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:2736)
at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:2762)
at org.apache.accumulo.fate.zookeeper.ZooReader.getChildren(ZooReader.java:159)
xxxxx
Any suggestions?
-S