[ https://issues.apache.org/jira/browse/HDFS-347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13551755#comment-13551755 ]

Todd Lipcon commented on HDFS-347:
----------------------------------

I believe all of the component pieces have now been committed to the HDFS-347 
branch. I ran a number of benchmarks yesterday on the branch in progress, and 
just re-confirmed the results from the code committed in SVN. Here's a report 
of the benchmarks and results:

h1. Benchmarks

To validate the branch, I ran a series of before/after benchmarks, specifically 
focused on random-read. In particular, I ran benchmarks based on 
TestParallelRead, which has different variants which run the same workload 
through the different read paths.

On the trunk ("before") branch, I ran TestParallelRead (normal read path) and 
TestParallelLocalRead (read path based on HDFS-2246). On the HDFS-347 branch, I 
ran TestParallelRead (normal read path) and TestParallelShortCircuitRead (new 
short-circuit path).

I made the following modifications to the test cases so they act as a better 
benchmark (a rough code sketch of these tweaks follows the list):

1) Set PROPORTION_NON_READ to 0%:

Without this modification, I found that both the 'before' and 'after' tests 
became lock-bound, since the 'seek-and-read' workload holds a lock on the 
DFSInputStream. So, this obscured the actual performance differences between 
the data paths.

2) Increased the iteration count to 30,000

Simply jacked up the number of iterations to get more reproducible results and 
to ensure that the JIT had plenty of time to kick in (with this change each 
benchmark ran for ~50 seconds instead of only ~5 seconds).

3) Added a variation which has two target blocks

I suspected there could be a regression for workloads which frequently switch 
back and forth between two different blocks of the same file. This variation is 
the same test, but with the DFS block size set to 128KB, so that the 256KB test 
file is split into two equal-sized blocks. This causes a good percentage of the 
random reads to span block boundaries, and ensures that the various caches in 
the code work correctly even when moving between different blocks.
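
For concreteness, here's a rough sketch of those three tweaks in Java. Only 
PROPORTION_NON_READ is an identifier from the actual test; the other names and 
the config key are illustrative assumptions, not the exact code on the branch:

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.HdfsConfiguration;

// Illustrative sketch of the benchmark tweaks described above.
// Only PROPORTION_NON_READ appears in the real test; everything
// else here is a stand-in.
public class ParallelReadBenchmarkTweaks {
  // 1) No non-read workers, so the runs aren't serialized on the
  //    DFSInputStream lock.
  static final int PROPORTION_NON_READ = 0;

  // 2) More iterations per worker, so the JIT warms up and each run
  //    lasts ~50 seconds instead of ~5.
  static final int N_ITERATIONS = 30000;

  // 3) Two-block variant: with a 128KB block size, the 256KB test
  //    file splits into two blocks and many random reads cross the
  //    block boundary.
  static Configuration twoBlockConf() {
    Configuration conf = new HdfsConfiguration();
    conf.setLong("dfs.blocksize", 128 * 1024);
    return conf;
  }
}
{code}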

h2. Comparing non-local read

When the new code path is disabled, or when the DN is not local, we continue to 
use the existing code path. We expect this code path's performance to be 
unaffected.
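
(For reference, here is a minimal sketch of the client configuration used to 
select between the two paths; the property keys reflect my understanding of the 
branch and should be treated as assumptions, and the socket path is just an 
example:)

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.HdfsConfiguration;

public class ReadPathConfigSketch {
  public static Configuration remoteReadConf() {
    // Normal (non-short-circuit) path: the one benchmarked in this
    // section. Short-circuit reads stay disabled.
    Configuration conf = new HdfsConfiguration();
    conf.setBoolean("dfs.client.read.shortcircuit", false);
    return conf;
  }

  public static Configuration shortCircuitConf() {
    // HDFS-347 path: enable short-circuit reads and point the client
    // at the DataNode's domain socket (example path).
    Configuration conf = new HdfsConfiguration();
    conf.setBoolean("dfs.client.read.shortcircuit", true);
    conf.set("dfs.domain.socket.path", "/var/run/hdfs/dn_socket");
    return conf;
  }
}
{code}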

Results:

|| Test || #Threads || #Files || Trunk MB/sec || HDFS-347 MB/sec ||
| TestParallelRead | 4 | 1 | 428.4 | 423.0 |
| TestParallelRead | 16 | 1 | 669.5 | 651.1 |
| TestParallelRead | 8 | 2 | 603.4 | 582.7 |
| TestParallelRead 2-blocks | 4 | 1 | 354.0 | 345.9 |
| TestParallelRead 2-blocks | 16 | 1 | 534.9 | 520.0 |
| TestParallelRead 2-blocks | 8 | 2 | 483.1 | 460.8 |

The above numbers seem to show a 2-4% regression, but I think it's within the 
noise on my machine (other software was running, etc). Colin also has one or 
two ideas for micro-optimizations which might win back a couple percent here 
and there, if it's not just noise.

To put this in perspective, here are results for the same test against branch-1:

|| Test || #Threads || #Files || Branch-1 MB/sec ||
| TestParallelRead | 4 | 1 | 229.7 |
| TestParallelRead | 16 | 1 | 264.4 |
| TestParallelRead | 8 | 2 | 260.1 |

(so trunk is 2-3x as fast as branch-1)

h2. Comparing local read

Here we expect the performance to be as good or better than the old (HDFS-2246) 
implementation. Results:

|| Test || #Threads || #Files || Trunk MB/sec || HDFS-347 MB/sec ||
| TestParallelLocalRead | 4 | 1 | 901.4 | 1033.6 |
| TestParallelLocalRead | 16 | 1 | 1079.8 | 1203.9 |
| TestParallelLocalRead | 8 | 2 | 1087.4 | 1274.0 |
| TestParallelLocalRead 2-blocks | 4 | 1 | 856.6 | 919.2 |
| TestParallelLocalRead 2-blocks | 16 | 1 | 1045.8 | 1137.0 |
| TestParallelLocalRead 2-blocks | 8 | 2 | 966.7 | 1392.9 |

These results show that the new implementation is indeed between 10% and 44% 
faster than the HDFS-2246 implementation. Our theory is that the old 
implementation cached block paths, but not open file descriptors. Because every 
positional read creates a new BlockReader, it had to issue a new {{open()}} 
syscall for each read, even when the location was cached.
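
To make that concrete, here's a toy sketch (not the actual HDFS classes; the 
class and method names are made up for illustration) contrasting a path cache 
with a file-descriptor cache:

{code}
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Toy illustration: with only a path cache, each new BlockReader still
// pays an open() syscall; with an fd cache it reuses the open descriptor.
class LocalBlockCacheSketch {
  // HDFS-2246-style: block ID -> local block file path.
  private final Map<Long, String> pathCache = new ConcurrentHashMap<>();

  // HDFS-347-style: block ID -> already-open stream (the real code
  // receives the descriptor from the DataNode over a domain socket).
  private final Map<Long, FileInputStream> fdCache = new ConcurrentHashMap<>();

  FileInputStream readerViaPathCache(long blockId) throws IOException {
    // Even on a cache hit, constructing the stream costs an open() syscall.
    return new FileInputStream(pathCache.get(blockId));
  }

  FileInputStream readerViaFdCache(long blockId) {
    // Cache hit returns the already-open descriptor: no syscall needed.
    return fdCache.get(blockId);
  }
}
{code}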

h2. Comparing sequential read

I used the BenchmarkThroughput tool, configured to write a 1GB file, and then 
read it back 100 times. This ensures that it's in buffer cache, so that we're 
benchmarking CPU overhead (since the actual disk access didn't change in the 
patch, and we're looking for a potential regression in CPU resource usage). I 
recorded the MB/sec rate for short-circuit reads before and after the patch, 
then loaded the data into R and ran a t-test:

{code}
> d.before <- read.table(file="/tmp/before-patch.txt")
> d.after <- read.table(file="/tmp/after-patch.txt")
> t.test(d.before, d.after)

        Welch Two Sample t-test

data:  d.before and d.after 
t = 0.5936, df = 199.777, p-value = 0.5535
alternative hypothesis: true difference in means is not equal to 0 
95 percent confidence interval:
 -62.39975 116.14431 
sample estimates:
mean of x mean of y 
 2939.456  2912.584 
{code}

The p-value of 0.55 indicates no statistically significant difference in the 
performance of the two data paths for sequential workloads.

I did the same thing with short-circuit disabled and got the following t-test 
results for the RemoteBlockReader code path:

{code}
> d.before <- read.table(file="/tmp/before-patch-rbr.txt")
> d.after <- read.table(file="/tmp/after-patch-rbr.txt")
> t.test(d.before, d.after)

        Welch Two Sample t-test

data:  d.before and d.after 
t = 1.155, df = 199.89, p-value = 0.2495
alternative hypothesis: true difference in means is not equal to 0 
95 percent confidence interval:
 -18.69172  71.54320 
sample estimates:
mean of x mean of y 
 1454.653  1428.228 
{code}

Again, the p-value of 0.25 indicates no significant difference in performance.

h2. Summary

The patch provides a good speedup (up to 44% in one case) for some random-read 
workloads, and has no discernible negative impact on others.

                
> DFS read performance suboptimal when client co-located on nodes with data
> -------------------------------------------------------------------------
>
>                 Key: HDFS-347
>                 URL: https://issues.apache.org/jira/browse/HDFS-347
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: datanode, hdfs-client, performance
>            Reporter: George Porter
>            Assignee: Colin Patrick McCabe
>         Attachments: all.tsv, BlockReaderLocal1.txt, HADOOP-4801.1.patch, 
> HADOOP-4801.2.patch, HADOOP-4801.3.patch, HDFS-347-016_cleaned.patch, 
> HDFS-347.016.patch, HDFS-347.017.clean.patch, HDFS-347.017.patch, 
> HDFS-347.018.clean.patch, HDFS-347.018.patch2, HDFS-347.019.patch, 
> HDFS-347.020.patch, HDFS-347.021.patch, HDFS-347.022.patch, 
> HDFS-347.024.patch, HDFS-347.025.patch, HDFS-347.026.patch, 
> HDFS-347.027.patch, HDFS-347.029.patch, HDFS-347.030.patch, 
> HDFS-347.033.patch, HDFS-347.035.patch, HDFS-347-branch-20-append.txt, 
> hdfs-347.png, hdfs-347.txt, local-reads-doc
>
>
> One of the major strategies Hadoop uses to get scalable data processing is to 
> move the code to the data.  However, putting the DFS client on the same 
> physical node as the data blocks it acts on doesn't improve read performance 
> as much as expected.
> After looking at Hadoop and O/S traces (via HADOOP-4049), I think the problem 
> is due to the HDFS streaming protocol causing many more read I/O operations 
> (iops) than necessary.  Consider the case of a DFSClient fetching a 64 MB 
> disk block from the DataNode process (running in a separate JVM) running on 
> the same machine.  The DataNode will satisfy the single disk block request by 
> sending data back to the HDFS client in 64-KB chunks.  In BlockSender.java, 
> this is done in the sendChunk() method, relying on Java's transferTo() 
> method.  Depending on the host O/S and JVM implementation, transferTo() is 
> implemented as either a sendfilev() syscall or a pair of mmap() and write().  
> In either case, each chunk is read from the disk by issuing a separate I/O 
> operation for each chunk.  The result is that the single request for a 64-MB 
> block ends up hitting the disk as over a thousand smaller requests for 64-KB 
> each.
> Since the DFSClient runs in a different JVM and process than the DataNode, 
> shuttling data from the disk to the DFSClient also results in context 
> switches each time network packets get sent (in this case, the 64-KB chunk 
> turns into a large number of 1500-byte packet send operations).  Thus we see 
> a large number of context switches for each block send operation.
> I'd like to get some feedback on the best way to address this, but I think 
> the answer is to provide a mechanism for a DFSClient to directly open data 
> blocks that happen to be on the same machine.  It could do this by examining the set of 
> LocatedBlocks returned by the NameNode, marking those that should be resident 
> on the local host.  Since the DataNode and DFSClient (probably) share the 
> same Hadoop configuration, the DFSClient should be able to find the files 
> holding the block data, and it could directly open them and send data back to 
> the client.  This would avoid the context switches imposed by the network 
> layer, and would allow for much larger read buffers than 64KB, which should 
> reduce the number of iops imposed by each read block operation.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
