On Thu, Aug 2, 2018 at 9:55 AM Colin McCabe <cmcc...@apache.org> wrote:
> On Wed, Aug 1, 2018, at 11:35, James Cheng wrote:
> > I’m a little confused about something. Is this KIP focused on log
> > cleaner exceptions in general, or focused on log cleaner exceptions due
> > to disk failures?
> >
> > Will max.uncleanable.partitions apply to all exceptions (including log
> > cleaner logic errors) or will it apply only to disk I/O exceptions?
>
> There is no difference between "log cleaner exceptions in general" and
> "log cleaner exceptions due to disk failures."
>
> For example, if the data on disk is corrupted we might read a 4-byte size
> as -1 instead of 100. Then we would get a BufferUnderflowException later
> on. This is a subclass of RuntimeException rather than IOException, of
> course, but it does result from a disk problem. Or we might get exceptions
> while validating checksums, which may or may not be IOE (I haven't looked).
>
> Of course, the log cleaner itself may have a bug, which results in it
> throwing an exception even if the disk does not have a problem. We clearly
> want to fix these bugs. But there's no way for the program itself to know
> that it has a bug and act differently. If an exception occurs, we must
> assume there is a disk problem.

Hey Colin,

This is inconsistent with how we deal with disk failures outside of the log
cleaner. We should follow the same approach across the board so that we can
reason about how the system works. If we think the approach of using
specific exception types for disk-related errors doesn't work, we should do
a KIP for that. For this KIP, I suggest we use the same approach we use to
mark disks as offline.

Ismael
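To make the disagreement concrete, here is a minimal Java sketch, not Kafka code. It assumes a hypothetical record layout (a 4-byte size prefix followed by that many payload bytes) and shows that a corrupted size surfaces as a RuntimeException (BufferUnderflowException or NegativeArraySizeException), never an IOException. It then compares two hypothetical classification policies: "any exception is a disk problem" (Colin's position) versus "only I/O-typed errors mark the disk offline" (Ismael's position), where the latter is approximated here, as an assumption, by searching the cause chain for an IOException. The class and method names (CleanerErrorClassification, readRecord, readCorrupted, anyExceptionIsDiskFailure, onlyIoErrorsAreDiskFailures) are invented for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;

public class CleanerErrorClassification {

    // Hypothetical record layout for illustration only (NOT Kafka's real format):
    // a 4-byte size followed by `size` payload bytes.
    static byte[] readRecord(ByteBuffer buf) {
        int size = buf.getInt();          // garbage if the disk returned corrupted bytes
        byte[] payload = new byte[size];  // negative size -> NegativeArraySizeException
        buf.get(payload);                 // size > remaining -> BufferUnderflowException
        return payload;
    }

    // Builds a buffer holding 100 payload bytes but a (possibly corrupted) size
    // field, reads it, and returns the exception thrown (or null if none).
    static Throwable readCorrupted(int sizeField) {
        ByteBuffer buf = ByteBuffer.allocate(4 + 100);
        buf.putInt(sizeField);
        buf.put(new byte[100]);
        buf.flip();
        try {
            readRecord(buf);
            return null;
        } catch (RuntimeException e) {
            return e;
        }
    }

    // Policy A (hypothetical): every cleaner exception is assumed to be a disk problem.
    static boolean anyExceptionIsDiskFailure(Throwable t) {
        return true;
    }

    // Policy B (hypothetical approximation): only errors caused by an IOException
    // count as disk failures; everything else is treated as a logic error.
    static boolean onlyIoErrorsAreDiskFailures(Throwable t) {
        for (Throwable c = t; c != null; c = c.getCause()) {
            if (c instanceof IOException) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Throwable[] cases = {
            readCorrupted(1000),  // size too large for the data actually present
            readCorrupted(-1),    // size decoded as -1
            new UncheckedIOException(new IOException("read failed")),  // a genuine I/O error
        };
        for (Throwable t : cases) {
            System.out.println(t.getClass().getSimpleName()
                + " policyA=" + anyExceptionIsDiskFailure(t)
                + " policyB=" + onlyIoErrorsAreDiskFailures(t));
        }
    }
}
```

Under policy B only the wrapped IOException is flagged as a disk failure; the two corruption-induced exceptions are not, which is exactly the gap Colin describes and the consistency trade-off Ismael raises.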