[jira] Commented: (HDFS-755) Read multiple checksum chunks at once in DFSInputStream

Todd Lipcon (JIRA) Wed, 23 Dec 2009 06:13:02 -0800

    [ 
https://issues.apache.org/jira/browse/HDFS-755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12794055#action_12794055
 ]


Todd Lipcon commented on HDFS-755:
----------------------------------

bq. User code should use buffering for application specific reasons. May be 
'bufferSize' argument for FSInputStream is flawed to start with.

Personally, I agree, but I think it's out of scope for this JIRA to fix that.

bq. My impression is that main purpose of this patch is to reduce a copy. 
keeping the large buffer prohibits that.

That's true, but I think we need to thoroughly benchmark SequenceFile.Reader 
there, and do it in a separate JIRA. This one as it stands is not a breaking 
change, in that it should not reduce performance for any workload. Having a 
small internal buffer can potentially be breaking, so we should benchmark how 
big that break could be and weigh it vs the improvements.

Aside from making a smaller internal buffer, there are a couple of other options 
that might be less "dangerous", e.g. using a small buffer for the initial reads, 
then creating a _new_ BufferedInputStream with a fresh buffer to start the data 
reads. This would get rid of the "misalignment" issue here. ChecksumFileSystem 
has this same problem, so another option is to introduce our own 
BufferedInputStream implementation that has some tricks to re-align its reads 
against the buffer.
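
As a rough illustration of the "fresh buffer" option, here is a minimal sketch. 
It is not the patch's code and not Hadoop API: the class name, HEADER_LEN, 
DATA_BUF, and RecordingStream are all hypothetical, and it assumes the header 
length is known up front so the initial read can be done without read-ahead. 
The point is only that the large buffer's first fill starts at the first data 
byte instead of straddling the header.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

public class RealignedReadSketch {
    // Hypothetical sizes, chosen only for illustration.
    static final int HEADER_LEN = 20;
    static final int DATA_BUF = 512;

    /** In-memory stand-in for the underlying stream; records each bulk read request size. */
    static class RecordingStream extends ByteArrayInputStream {
        final List<Integer> requests = new ArrayList<>();

        RecordingStream(byte[] data) {
            super(data);
        }

        @Override
        public synchronized int read(byte[] b, int off, int len) {
            requests.add(len);
            return super.read(b, off, len);
        }
    }

    /**
     * Reads exactly the header from the raw stream (no read-ahead), then
     * returns a new BufferedInputStream whose buffer starts at the data.
     */
    static InputStream openAligned(InputStream raw, byte[] headerOut) throws IOException {
        // Initial read: only the header bytes are consumed from the raw stream.
        new DataInputStream(raw).readFully(headerOut);
        // Fresh buffer: its first fill begins at the first data byte.
        return new BufferedInputStream(raw, DATA_BUF);
    }

    public static void main(String[] args) throws IOException {
        RecordingStream raw = new RecordingStream(new byte[4096]);
        byte[] header = new byte[HEADER_LEN];
        InputStream data = openAligned(raw, header);
        data.read(); // triggers the first fill of the fresh buffer
        System.out.println("read request sizes seen by the raw stream: " + raw.requests);
    }
}
```

Whether this is preferable to a custom re-aligning BufferedInputStream would 
still depend on the benchmarks discussed above.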

bq. Even when a sequencefile has very small records (avg < 1k?)

I've seen SequenceFiles used for even smaller records - down to a few bytes (e.g. 
IntWritable keys and values). Syscalls are cheap but not *that* cheap compared 
to an 8-byte copy. So, I don't think we should do optimizations that would 
destroy performance in this scenario.

bq. ...but not been able to see improvement. will verify if I am really running 
the patch. 

Did you run this patch with a core jar that was compiled with HADOOP-3205? To 
test, you need to do "ant -Dresolvers=internal mvn-install" from Common, with 
HADOOP-3205 applied. Then, in the HDFS tree, "ant -Dresolvers=internal 
clean-cache binary" to make sure it pulls your local common build.

> Read multiple checksum chunks at once in DFSInputStream
> -------------------------------------------------------
>
>                 Key: HDFS-755
>                 URL: https://issues.apache.org/jira/browse/HDFS-755
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: hdfs client
>    Affects Versions: 0.22.0
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>         Attachments: alldata-hdfs.tsv, benchmark-8-256.png, benchmark.png, 
> hdfs-755.txt, hdfs-755.txt, hdfs-755.txt, hdfs-755.txt, hdfs-755.txt
>
>
> HADOOP-3205 adds the ability for FSInputChecker subclasses to read multiple 
> checksum chunks in a single call to readChunk. This is the HDFS-side use of 
> that new feature.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
