[ https://issues.apache.org/jira/browse/HAWQ-1075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Shivram Mani updated HAWQ-1075:
-------------------------------
Description:
Currently the HdfsTextSimple profile, the optimized PXF profile for reading Text/CSV, uses ChunkRecordReader to read chunks of records (as opposed to individual records). Here dfs.client.read.shortcircuit.skip.checksum is explicitly set to true to avoid incurring any delay from checksum verification while opening/reading the file/block.

Background information:
PXF uses a two-stage process to access HDFS data.
Stage 1: it fetches all the target blocks for the given file (along with replica information).
Stage 2: after HAWQ prepares an optimized access plan based on locality, PXF agents read the blocks in parallel.

In almost all scenarios Hadoop internally catches block corruption, and such blocks are never returned to a client requesting block locations (Stage 1). In certain scenarios, such as block corruption without a change in size, Stage 1 can still return the location of the corrupted block, so Stage 2 needs to perform an additional checksum check. With client-side checksum verification on read (the default HDFS behavior), we are resilient to such checksum errors on read as well.

was:
Currently the HdfsTextSimple profile, the optimized profile for reading Text/CSV, uses ChunkRecordReader to read chunks of records (as opposed to individual records). Here dfs.client.read.shortcircuit.skip.checksum is explicitly set to true to avoid incurring any delay from checksum verification while opening/reading the file/block. This configuration needs to be exposed as an option, and by default the client-side checksum check must occur, so that we are resilient to data corruption that isn't caught internally by the datanode block-reporting mechanism (even fsck misses certain kinds of block corruption).
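A minimal sketch of how the skip-checksum behavior could be exposed as an option rather than hard-coded, assuming a hypothetical property name `pxf.read.skip.checksum` (not an actual PXF parameter). The key point is the default: `false`, so client-side checksum verification stays on unless explicitly disabled, and the resolved value would then be forwarded to the HDFS client's dfs.client.read.shortcircuit.skip.checksum setting. Plain `java.util.Properties` stands in for the real configuration object to keep the sketch self-contained:

```java
import java.util.Properties;

// Illustrative sketch only, not actual PXF code.
public class ChecksumOption {
    // Hypothetical option name for the skip-checksum toggle; the real
    // change would wire this through to
    // dfs.client.read.shortcircuit.skip.checksum on the HDFS client.
    static final String SKIP_CHECKSUM_KEY = "pxf.read.skip.checksum";

    // Default is false: checksums ARE verified on read, restoring the
    // HDFS client's default behavior and keeping Stage 2 resilient to
    // corrupted blocks that Stage 1 block-location lookup cannot detect.
    static boolean skipChecksum(Properties conf) {
        return Boolean.parseBoolean(conf.getProperty(SKIP_CHECKSUM_KEY, "false"));
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        System.out.println(skipChecksum(conf));   // prints false: verify by default
        conf.setProperty(SKIP_CHECKSUM_KEY, "true");
        System.out.println(skipChecksum(conf));   // prints true: explicit opt-out
    }
}
```

Defaulting to verification trades a small per-read checksum cost for protection against the silent-corruption case described above, while users who trust datanode-side reporting can still opt out.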
> Restore default behavior of client-side (PXF) checksum validation when reading
> blocks from HDFS
> ----------------------------------------------------------------------------------------------
>
>                 Key: HAWQ-1075
>                 URL: https://issues.apache.org/jira/browse/HAWQ-1075
>             Project: Apache HAWQ
>          Issue Type: Improvement
>          Components: PXF
>            Reporter: Shivram Mani
>            Assignee: Shivram Mani
>
> Currently the HdfsTextSimple profile, the optimized PXF profile for reading
> Text/CSV, uses ChunkRecordReader to read chunks of records (as opposed to
> individual records). Here dfs.client.read.shortcircuit.skip.checksum is
> explicitly set to true to avoid incurring any delay from checksum
> verification while opening/reading the file/block.
>
> Background information:
> PXF uses a two-stage process to access HDFS data.
> Stage 1: it fetches all the target blocks for the given file (along with
> replica information).
> Stage 2: after HAWQ prepares an optimized access plan based on locality,
> PXF agents read the blocks in parallel.
>
> In almost all scenarios Hadoop internally catches block corruption, and
> such blocks are never returned to a client requesting block locations
> (Stage 1). In certain scenarios, such as block corruption without a change
> in size, Stage 1 can still return the location of the corrupted block, so
> Stage 2 needs to perform an additional checksum check.
> With client-side checksum verification on read (the default HDFS behavior),
> we are resilient to such checksum errors on read as well.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)