Hi,

I am not aware that this was a design decision, to be honest. AFAIK, this
has been a long-standing bug. :( I have compiled a handful of JIRA issues
that all describe this same problem, scattered across multiple repetitive
reports. I hope to aggregate them soon. We, the community, should raise the
priority and tackle this issue to make the ZK server more resilient and
robust in the face of log/txn file corruption. Any suggestions are more than
welcome, by the way!

Cheers,
Eddie


On Fri, Jan 6, 2017 at 6:38 PM, Aishwarya Ganesan <ash8as...@gmail.com>
wrote:

> Hi,
>
> We are looking at how ZooKeeper handles silent data corruptions resulting
> from underlying problems in disks and file systems atop them [1,2].
>
> We set up a 3-node ZooKeeper cluster and introduce silent data corruptions
> to different blocks in the on-disk files. In all the cases, ZooKeeper is
> able to detect corruptions in the log file using checksums.
>
> However, on detecting a corruption, the ZooKeeper node in which corruption
> occurred crashes instead of trying to fix the corrupted data automatically
> using the replicas. Why does ZooKeeper not fix the corrupted entry
> automatically using replicas? What is the reason for this design decision?
> It would be helpful if anyone could give some insights on this.
>
> [1] https://research.cs.wisc.edu/wind/Publications/zfs-corruption-fast10.pdf
> [2] http://www.cs.toronto.edu/~bianca/papers/fast08.pdf
>
> Thanks,
> Aishwarya
>
