Hello,

A bit of detail: we have a 5-node cluster, which we use for configuration distribution and for monitoring active instances of our applications. Each application creates its own ephemeral node, so we know which apps are alive, how many of them there are, and what they are doing.
The problem happened on 4th November, the first time around 4AM and the second time around 12PM.

The first time it was the middle of the night when I got woken up; the support guys told me that something was wrong with config distribution. First I checked the apps for errors but didn't find anything interesting, then I looked at what was in ZooKeeper (using node-zk-browser). I noticed that there were 3 ephemeral nodes which had been created on 1st Nov (while the oldest application had been started on 3rd Nov). I could read their data but was not able to delete them - I was getting a NONODE exception. I thought wtf - why can't I delete these nodes, something very bad must have happened with ZK. So I sshed to the leader and tried to read these nodes using the CLI, but I was not able to - the leader was telling me that such nodes don't exist. After this I sshed to the rest of the nodes in the cluster and tried to read these nodes there. Finally I found the server which did let me read the data of these nodes. Because of the inconsistency I decided to restart it. The restart helped and everything went back to normal. The ephemeral nodes disappeared.

A similar situation happened at 12PM, but this time I had a lot more time to look at what was wrong. The second time the problem was about 3 ephemeral nodes which had been created on 1st Nov (again?). This time I dug a bit deeper and looked into the logs and the 4-letter commands, but could not find anything interesting except that all 3 of these nodes were created under different session IDs, and ZK had no hosts connected under these session IDs. The solution was similar to the one from 4AM, but this time I deleted all files in the ZK data directory. Oddly enough, the problem happened twice on the same ZK node; the final solution was to clear its ZK data directory. After clearing the directory the problem didn't happen again.

I tried to look for a solution or similar problems, and I found posts where people were complaining about ephemeral nodes not being removed after the client session gets closed.
But I was not able to find any posts about ZK not being consistent. What do you think about this? Can we do something to fix this?

Sorry for my English, I was doing my best. :)

Thanks,
Kuba.
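
For anyone who wants to check for the same two symptoms (ephemeral nodes whose owning session is no longer alive, and nodes visible on only some servers), here is a minimal sketch of the comparison. It does not talk to ZooKeeper at all: it assumes you have already collected, per server, the znode paths and each znode's `ephemeralOwner` value from its stat, plus the set of session IDs that are currently live. The function names, server names, and sample paths are hypothetical and only for illustration.

```python
from typing import Dict, List, Set


def find_orphaned_ephemerals(
    ephemeral_owners: Dict[str, int], live_sessions: Set[int]
) -> List[str]:
    """Return paths of ephemeral znodes whose owning session is not live.

    ephemeral_owners maps a znode path to its ephemeralOwner value.
    A value of 0 means the znode is not ephemeral, so it is skipped.
    """
    return sorted(
        path
        for path, owner in ephemeral_owners.items()
        if owner != 0 and owner not in live_sessions
    )


def find_inconsistent_paths(views: Dict[str, Set[str]]) -> Dict[str, List[str]]:
    """Map each path that is NOT visible on every server to the servers that see it.

    views maps a server name to the set of znode paths readable on that server.
    """
    all_paths: Set[str] = set().union(*views.values())
    result: Dict[str, List[str]] = {}
    for path in sorted(all_paths):
        seen_on = sorted(server for server, paths in views.items() if path in paths)
        if len(seen_on) != len(views):
            result[path] = seen_on
    return result


if __name__ == "__main__":
    # Hypothetical sample data, not taken from the cluster described above.
    owners = {"/apps/app-1": 0x10, "/apps/app-2": 0x20, "/config": 0}
    live = {0x10}
    print(find_orphaned_ephemerals(owners, live))

    views = {
        "zk1": {"/apps/app-1", "/config"},
        "zk2": {"/apps/app-1", "/apps/app-2", "/config"},
    }
    print(find_inconsistent_paths(views))
```

A path reported by both functions (owner session gone, and visible on only one server) would match the situation described in this post.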
