[jira] [Commented] (HDFS-4710) Turning off HDFS short-circuit checksums unexpectedly slows down Hive

Colin Patrick McCabe (JIRA) Wed, 24 Apr 2013 14:23:16 -0700

    [ 
https://issues.apache.org/jira/browse/HDFS-4710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13640970#comment-13640970
 ]


Colin Patrick McCabe commented on HDFS-4710:
--------------------------------------------

I do agree that {{BufferedFSInputStream}} could be used to work around the 
problem.  Unfortunately it's fairly buggy at the moment due to HADOOP-9307.

I think the fix here should be in {{BlockReaderLocal}}.  It's inconsistent 
that we buffer when checksums are enabled but not when they are disabled, 
especially given that we have an explicit parameter for setting the buffer 
size, which we currently ignore in no-checksum mode.
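
To illustrate why the missing buffer matters, here is a minimal standalone 
sketch. It is not {{BlockReaderLocal}} code: {{ReadCountDemo}}, 
{{CountingStream}}, {{readIntBytewise}} and {{countReads}} are hypothetical 
names made up for this example, and an in-memory array stands in for the block 
file. It only counts how many calls reach the underlying stream when ints are 
assembled one byte at a time, with and without a {{BufferedInputStream}} in 
between.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

/** Hypothetical demo class; not part of Hadoop. */
final class ReadCountDemo {

    /** Counts calls that reach the wrapped stream (a stand-in for read(2) syscalls). */
    static final class CountingStream extends InputStream {
        private final InputStream in;
        int calls = 0;

        CountingStream(InputStream in) {
            this.in = in;
        }

        @Override
        public int read() throws IOException {
            calls++;
            return in.read();
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            calls++;
            return in.read(b, off, len);
        }
    }

    /** Assembles a big-endian int from four single-byte reads, as in the quoted readInt(). */
    static int readIntBytewise(InputStream in) throws IOException {
        int ch1 = in.read();
        int ch2 = in.read();
        int ch3 = in.read();
        int ch4 = in.read();
        if ((ch1 | ch2 | ch3 | ch4) < 0) {
            throw new EOFException();
        }
        return (ch1 << 24) + (ch2 << 16) + (ch3 << 8) + ch4;
    }

    /**
     * Reads numInts ints and returns how many calls reached the underlying stream.
     * A bufferSize of 0 means "no buffering" (an assumption of this sketch only).
     */
    static int countReads(int numInts, int bufferSize) {
        CountingStream raw =
            new CountingStream(new ByteArrayInputStream(new byte[numInts * 4]));
        InputStream source =
            bufferSize > 0 ? new BufferedInputStream(raw, bufferSize) : raw;
        try {
            for (int i = 0; i < numInts; i++) {
                readIntBytewise(source);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return raw.calls;
    }

    public static void main(String[] args) {
        System.out.println("unbuffered calls: " + countReads(1024, 0));
        System.out.println("buffered calls:   " + countReads(1024, 4096));
    }
}
```

Without the buffer, every {{in.read()}} reaches the underlying stream, so 1024 
ints cost 4096 calls. With the buffer, the same reads are served from memory 
after a few bulk fills. Honoring the configured buffer size in no-checksum 
mode should have the same effect on the real file descriptor.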
                
> Turning off HDFS short-circuit checksums unexpectedly slows down Hive
> ---------------------------------------------------------------------
>
>                 Key: HDFS-4710
>                 URL: https://issues.apache.org/jira/browse/HDFS-4710
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs-client
>    Affects Versions: 2.0.4-alpha
>         Environment: Centos (EC2) + short-circuit reads on
>            Reporter: Gopal V
>            Priority: Minor
>              Labels: perfomance
>
> When short-circuit reads are on, HDFS client slows down when checksums are 
> turned off.
> With checksums on, the query takes 45.341 seconds and with it turned off, it 
> takes 56.345 seconds. This is slower than the speeds observed when 
> short-circuiting is turned off.
> The issue seems to be that FSDataInputStream.readByte() calls are directly 
> transferred to the disk fd when the checksums are turned off.
> Even though all the columns are integers, the data being read will be read 
> via DataInputStream which does
> {code}
> public final int readInt() throws IOException {
>         int ch1 = in.read();
>         int ch2 = in.read();
>         int ch3 = in.read();
>         int ch4 = in.read();
> {code}
> To confirm, an strace of the Yarn container shows
> {code}
> 26690 read(154, "B", 1)                 = 1
> 26690 read(154, "\250", 1)              = 1
> 26690 read(154, ".", 1)                 = 1
> 26690 read(154, "\24", 1)               = 1
> {code}
> To emulate this without the entirety of Hive code, I have written a simpler 
> test app 
> https://github.com/t3rmin4t0r/shortcircuit-reader
> The jar will read a file in -bs <n> sized buffers. Running it with 1 byte 
> blocks gives similar results to the Hive test run.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
