[ https://issues.apache.org/jira/browse/KUDU-2469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16508309#comment-16508309 ]
Andrew Wong edited comment on KUDU-2469 at 6/11/18 4:25 PM:
------------------------------------------------------------

The difficulty with failing a specific tablet from a CFile error is that CFileReader (the component that yields the checksum error) is unaware of the tablet to which it belongs. Plumbing the tablet id to the CFiles seems excessive considering how many CFiles we might expect in a tablet server.

Alternatively, we might want to audit the current usages of CFileReader::Init() (which is where the checksum check currently fails) and catch these errors at the tablet layer, where the tablet id is known.

Another approach might attempt to trigger the Fs::ReadableBlock's (or its underlying log block container's) disk error handling when returning with a CFile checksum error. Given what's currently in place, this would fail all tablets configured to stripe data across the directory in which the block resides, which is much coarser-grained than the behavior described in the Jira.

was (Author: andrew.wong):
The difficulty with failing a specific tablet from a CFile error is that CFileReaders (the component that yields the checksum error) is unaware of the tablet to which it belongs. Plumbing the tablet id to the CFiles seems excessive considering how many CFiles we might expect in a tablet server.

We might want to audit of the current usages of CFileReader::Init() and catch these errors at the tablet layer.

Another approach might attempt to trigger the Fs::ReadableBlock's (or its underlying log block container's) disk error handling when returning with a CFile checksum error. Given what's currently in place, this would fail all tablets configured to stripe data across the directory in which the block resides, which is much coarser grained than the behavior described in the Jira.
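
For illustration only, here is a minimal sketch of the tablet-layer option from the edited comment: the reader reports a corruption status without knowing its tablet, and the caller that owns the tablet id catches it and marks that one tablet as failed. Everything in it is a hypothetical stand-in (Status, FakeCFileReader, FakeTablet, OpenCFile, and the boolean checksum flag); none of it is Kudu's actual API.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical stand-in for a status type; not Kudu's actual Status class.
class Status {
 public:
  enum Code { kOk, kCorruption };

  static Status OK() { return Status(kOk, ""); }
  static Status Corruption(std::string msg) {
    return Status(kCorruption, std::move(msg));
  }

  bool ok() const { return code_ == kOk; }
  bool IsCorruption() const { return code_ == kCorruption; }

  std::string msg;

 private:
  Status(Code code, std::string m) : msg(std::move(m)), code_(code) {}
  Code code_;
};

// Hypothetical reader: it knows its block id but not the owning tablet,
// mirroring the constraint described in the comment.
class FakeCFileReader {
 public:
  FakeCFileReader(std::string block_id, bool checksum_ok)
      : block_id_(std::move(block_id)), checksum_ok_(checksum_ok) {}

  // Assumption: checksum verification happens during Init().
  Status Init() const {
    if (!checksum_ok_) {
      return Status::Corruption("checksum mismatch in block " + block_id_);
    }
    return Status::OK();
  }

 private:
  std::string block_id_;
  bool checksum_ok_;
};

// Hypothetical tablet-layer caller: the tablet id is known here, so a
// corruption can be attributed to (and fail) exactly one tablet.
class FakeTablet {
 public:
  explicit FakeTablet(std::string tablet_id) : id_(std::move(tablet_id)) {
    assert(!id_.empty());
  }

  Status OpenCFile(const FakeCFileReader& reader) {
    Status s = reader.Init();
    if (s.IsCorruption()) {
      failed_ = true;  // a real system might also stop reads and re-replicate
      s.msg = "tablet " + id_ + ": " + s.msg;
    }
    return s;
  }

  bool failed() const { return failed_; }

 private:
  std::string id_;
  bool failed_ = false;
};
```

In this sketch the reader needs no tablet id at all; the cost is that every call site of Init() has to handle the corruption case, which is why the comment suggests auditing those usages.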

> Handle CFile checksum failures
> ------------------------------
>
>                 Key: KUDU-2469
>                 URL: https://issues.apache.org/jira/browse/KUDU-2469
>             Project: Kudu
>          Issue Type: Improvement
>          Components: cfile, tablet
>            Reporter: Andrew Wong
>            Priority: Major
>
> Today, there is no special handling for CFile checksum failures, other than
> returning an error. It would be nice if the behavior for such a failure
> marked the tablet as "failed": making it unavailable for reads, marking it
> for eviction/re-replication, etc.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)