[ https://issues.apache.org/jira/browse/HADOOP-3205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon updated HADOOP-3205:
--------------------------------

    Attachment: hadoop-3205.txt

Here's a patch which fixes the bugs that caused the unit test failures.

There's one TODO still in the code to figure out a good setting for MAX_CHUNKS 
(ie the max number of checksum chunks that should be read in one call to the 
underlying stream).
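
To make that concrete, here's roughly the shape of the bounded read (a 
simplified sketch with illustrative names, not the actual code in the patch; 
a real FSInputChecker verifies each chunk's checksum before returning it):

    import java.io.IOException;
    import java.io.InputStream;

    class ChunkedReader {
        static final int BYTES_PER_CHUNK = 512; // one checksum chunk
        static final int MAX_CHUNKS = 4;        // the constant in question

        private final InputStream in;

        ChunkedReader(InputStream in) { this.in = in; }

        // Read at most MAX_CHUNKS chunks into the user's buffer per call.
        int readChunks(byte[] buf, int off, int len) throws IOException {
            int maxBytes = Math.min(len, MAX_CHUNKS * BYTES_PER_CHUNK);
            // A real checker would verify each 512-byte chunk's checksum
            // here before handing the bytes back to the caller.
            return in.read(buf, off, maxBytes);
        }
    }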

This is still a TODO because I made an odd discovery here - the logic we were 
going on was that the performance improvement came from an eliminated buffer 
copy when the size of the read was >= the size of the buffer in the underlying 
BufferedInputStream. That would mean the correct size for MAX_CHUNKS is 
ceil(io.file.buffer.size / 512) (ie 256 for the 128KB buffer I was testing 
with). If MAX_CHUNKS is less than that, reads to the BIS would be smaller than 
its buffer size and you'd incur a copy.
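
That sizing rule is just integer arithmetic; a hypothetical helper (not 
something in the patch) would look like:

    // ceil(io.file.buffer.size / 512), per the reasoning above.
    class ChunkSizing {
        static final int BYTES_PER_CHUNK = 512;

        static int maxChunksFor(int ioFileBufferSize) {
            return (ioFileBufferSize + BYTES_PER_CHUNK - 1) / BYTES_PER_CHUNK;
        }

        public static void main(String[] args) {
            // 128KB buffer -> 256 chunks, matching the example above.
            System.out.println(maxChunksFor(128 * 1024)); // prints 256
        }
    }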

However, my benchmarking shows that this *isn't* the source of the gain. Even 
with MAX_CHUNKS set to 4, there's a significant performance gain over 
MAX_CHUNKS set to 1. And there is no significant difference between 
MAX_CHUNKS=127 and MAX_CHUNKS=128 for a 64K buffer, even though the theory 
above predicts that 128 would eliminate a copy while 127 would not.

So, I think this is actually improving performance through some other effect, 
like better cache locality from operating in larger chunks. Admittedly, cache 
locality is always the fallback excuse for a performance win, but I don't 
have a better explanation yet. Anyone care to hazard a guess?
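
For anyone who wants to poke at this, here's a rough standalone sketch of 
this kind of measurement (not the actual benchmark I ran - it only bounds the 
read size, with no checksumming, so it approximates the effect rather than 
reproducing the patch; point it at any large local file):

    import java.io.BufferedInputStream;
    import java.io.FileInputStream;
    import java.io.IOException;

    class ReadBench {
        public static void main(String[] args) throws IOException {
            byte[] userBuf = new byte[128 * 1024];
            for (int maxChunks : new int[] {1, 4, 127, 128}) {
                int maxBytes = maxChunks * 512;
                long start = System.nanoTime();
                try (BufferedInputStream in = new BufferedInputStream(
                        new FileInputStream(args[0]), 64 * 1024)) {
                    int n;
                    // Cap each read at maxChunks chunks, mimicking the
                    // bounded loop in the patch.
                    while ((n = in.read(userBuf, 0,
                            Math.min(userBuf.length, maxBytes))) != -1) {
                        // discard the data; we only care about read cost
                    }
                }
                System.out.printf("MAX_CHUNKS=%d: %.1f ms%n",
                        maxChunks, (System.nanoTime() - start) / 1e6);
            }
        }
    }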

> Read multiple chunks directly from FSInputChecker subclass into user buffers
> ----------------------------------------------------------------------------
>
>                 Key: HADOOP-3205
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3205
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: fs
>            Reporter: Raghu Angadi
>            Assignee: Todd Lipcon
>         Attachments: hadoop-3205.txt, hadoop-3205.txt
>
>
> Implementations of FSInputChecker and FSOutputSummer like DFS do not have 
> access to the full user buffer. At any time DFS can access only up to 512 
> bytes even though the user usually reads with a much larger buffer (often 
> controlled by io.file.buffer.size). This forces an implementation to double 
> buffer data if it wants to read or write larger chunks of data from the 
> underlying storage.
> We could split the FSInputChecker and FSOutputSummer changes into two 
> separate jiras.
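
To make the double buffering concrete, a minimal sketch (illustrative names, 
not the DFS code) of what a 512-byte-capped checker forces on a large read:

    import java.io.IOException;
    import java.io.InputStream;

    class DoubleBufferedRead {
        static final int CHUNK = 512;

        // Fill the user's buffer one 512-byte chunk at a time.
        static int read(InputStream checked, byte[] userBuf) throws IOException {
            byte[] staging = new byte[CHUNK]; // the extra buffer to avoid
            int total = 0;
            while (total < userBuf.length) {
                int n = checked.read(staging, 0, CHUNK);
                if (n == -1) break;
                System.arraycopy(staging, 0, userBuf, total, n); // extra copy
                total += n;
            }
            return total == 0 ? -1 : total;
        }
    }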

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
