Hi,

To be honest, I am not aware of this being a design decision. AFAIK, it is a long-standing bug. :( I have compiled a handful of JIRA issues that all describe basically this same problem, scattered across multiple repetitive reports. I hope to aggregate them soon. We, the community, should raise the priority of this issue and tackle it to make the ZK server more resilient and robust in the face of log/txn file corruption. Any suggestion is more than welcome, by the way!

Cheers,
Eddie

On Fri, Jan 6, 2017 at 6:38 PM, Aishwarya Ganesan <ash8as...@gmail.com> wrote:

> Hi,
>
> We are looking at how ZooKeeper handles silent data corruptions resulting
> from underlying problems in disks and file systems atop them [1,2].
>
> We set up a 3-node ZooKeeper cluster and introduce silent data corruptions
> to different blocks in the on-disk files. In all the cases, ZooKeeper is
> able to detect corruptions in the log file using checksums.
>
> However, on detecting a corruption, the ZooKeeper node in which corruption
> occurred crashes instead of trying to fix the corrupted data automatically
> using the replicas. Why does ZooKeeper not fix the corrupted entry
> automatically using replicas? What is the reason for this design decision?
> It would be helpful if anyone could give some insights on this.
>
> [1] https://research.cs.wisc.edu/wind/Publications/zfs-corruption-fast10.pdf
> [2] http://www.cs.toronto.edu/~bianca/papers/fast08.pdf
>
> Thanks,
> Aishwarya