[ 
https://issues.apache.org/jira/browse/HAWQ-1075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivram Mani updated HAWQ-1075:
-------------------------------
    Description: 
Currently the HdfsTextSimple profile, which is the optimized PXF profile for 
reading Text/CSV, uses ChunkRecordReader to read chunks of records (as opposed 
to individual records). Here dfs.client.read.shortcircuit.skip.checksum is 
explicitly set to true to avoid incurring any delay from checksum verification 
while opening/reading the file/block. 
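For illustration, this is roughly what that override amounts to on the HDFS 
client configuration (a minimal sketch only, not the actual ChunkRecordReader 
source; the class name is made up):

  import org.apache.hadoop.conf.Configuration;

  public class SkipChecksumSketch {
      // Sketch of the current behavior: client-side checksum verification is
      // disabled for short-circuit reads before the file/block is opened.
      public static Configuration currentBehavior() {
          Configuration conf = new Configuration();
          conf.setBoolean("dfs.client.read.shortcircuit.skip.checksum", true);
          return conf;
      }
  }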

Background Information:
PXF uses a 2-stage process to access HDFS data. 
In Stage 1, it fetches all the target blocks for the given file (along with 
replica information). 
In Stage 2 (after HAWQ prepares an optimized access plan based on locality), 
PXF agents read the blocks in parallel.
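For reference, the two stages map roughly onto the following HDFS client calls 
(a simplified sketch, assuming a single reader instead of parallel PXF agents; 
the class name is made up):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.BlockLocation;
  import org.apache.hadoop.fs.FSDataInputStream;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class TwoStageAccessSketch {
      public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          FileSystem fs = FileSystem.get(conf);
          Path file = new Path(args[0]);

          // Stage 1: fetch the block locations (with replica hosts) for the file.
          FileStatus status = fs.getFileStatus(file);
          BlockLocation[] blocks =
                  fs.getFileBlockLocations(status, 0, status.getLen());

          // Stage 2: the blocks are read; in PXF the agents do this in parallel,
          // here a single client reads the first bytes of each block in turn.
          try (FSDataInputStream in = fs.open(file)) {
              for (BlockLocation block : blocks) {
                  byte[] buf = new byte[(int) Math.min(block.getLength(), 4096)];
                  in.readFully(block.getOffset(), buf, 0, buf.length);
              }
          }
      }
  }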

In almost all scenarios Hadoop internally catches block corruption issues, and 
such blocks are never returned to any client requesting block locations 
(Stage 1). In certain scenarios, such as a block corruption without a change in 
size, Stage 1 can still return the location of the corrupted block, and hence 
Stage 2 will need to perform an additional checksum check.

With client-side checksum checking on read (the default behavior), we are 
resilient to such checksum errors on read as well.
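A minimal sketch of what restoring the default could look like, assuming the 
skip behavior is exposed behind a PXF-side option (the property name 
pxf.read.skip.checksum is hypothetical, used here only for illustration):

  import org.apache.hadoop.conf.Configuration;

  public class ChecksumDefaultSketch {
      public static Configuration withDefaultVerification(Configuration conf) {
          // Hypothetical option; defaults to false so that client-side
          // checksum verification stays on unless explicitly disabled.
          boolean skip = conf.getBoolean("pxf.read.skip.checksum", false);
          conf.setBoolean("dfs.client.read.shortcircuit.skip.checksum", skip);
          return conf;
      }
  }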

  was:
Currently the HdfsTextSimple profile, which is the optimized profile for 
reading Text/CSV, uses ChunkRecordReader to read chunks of records (as opposed 
to individual records). Here dfs.client.read.shortcircuit.skip.checksum is 
explicitly set to true to avoid incurring any delay from checksum verification 
while opening/reading the file/block. 
This configuration needs to be exposed as an option, and by default the client 
side checksum check must occur in order to be resilient to any data corruption 
issues which aren't caught internally by the datanode block reporting mechanism 
(even fsck doesn't catch certain block corruption issues).


> Restore default behavior of client side (PXF) checksum validation when reading 
> blocks from HDFS
> ----------------------------------------------------------------------------------------------
>
>                 Key: HAWQ-1075
>                 URL: https://issues.apache.org/jira/browse/HAWQ-1075
>             Project: Apache HAWQ
>          Issue Type: Improvement
>          Components: PXF
>            Reporter: Shivram Mani
>            Assignee: Shivram Mani
>
> Currently the HdfsTextSimple profile, which is the optimized PXF profile for 
> reading Text/CSV, uses ChunkRecordReader to read chunks of records (as opposed 
> to individual records). Here dfs.client.read.shortcircuit.skip.checksum is 
> explicitly set to true to avoid incurring any delay from checksum verification 
> while opening/reading the file/block. 
> Background Information:
> PXF uses a 2-stage process to access HDFS data. 
> In Stage 1, it fetches all the target blocks for the given file (along with 
> replica information). 
> In Stage 2 (after HAWQ prepares an optimized access plan based on locality), 
> PXF agents read the blocks in parallel.
> In almost all scenarios Hadoop internally catches block corruption issues, and 
> such blocks are never returned to any client requesting block locations 
> (Stage 1). In certain scenarios, such as a block corruption without a change 
> in size, Stage 1 can still return the location of the corrupted block, and 
> hence Stage 2 will need to perform an additional checksum check.
> With client-side checksum checking on read (the default behavior), we are 
> resilient to such checksum errors on read as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
