Hi,

We are looking at how ZooKeeper handles silent data corruption caused by underlying problems in disks and the file systems on top of them [1,2].
We set up a 3-node ZooKeeper cluster and introduced silent data corruption into different blocks of the on-disk files. In all cases, ZooKeeper was able to detect the corruption in the log file using checksums. However, on detecting a corruption, the ZooKeeper node on which the corruption occurred crashes instead of trying to repair the corrupted data automatically from the replicas.

Why does ZooKeeper not fix the corrupted entry automatically using the replicas? What is the reason for this design decision? It would be helpful if anyone could share some insight on this.

[1] https://research.cs.wisc.edu/wind/Publications/zfs-corruption-fast10.pdf
[2] http://www.cs.toronto.edu/~bianca/papers/fast08.pdf

Thanks,
Aishwarya
