[ https://issues.apache.org/jira/browse/HDFS-347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13551755#comment-13551755 ]
Todd Lipcon commented on HDFS-347:
----------------------------------

I believe all of the component pieces have now been committed to the HDFS-347 branch. I ran a number of benchmarks yesterday on the branch in progress, and have just re-confirmed the results against the code committed in SVN. Here's a report of the benchmarks and results:

h1. Benchmarks

To validate the branch, I ran a series of before/after benchmarks, specifically focused on random read. In particular, I ran benchmarks based on TestParallelRead, which has variants that run the same workload through the different read paths. On the trunk ("before") branch, I ran TestParallelRead (normal read path) and TestParallelLocalRead (read path based on HDFS-2246). On the HDFS-347 branch, I ran TestParallelRead (normal read path) and TestParallelShortCircuitRead (new short-circuit path).

I made the following modifications to the test cases so they act as a better benchmark:

1) Set PROPORTION_NON_READ to 0%: without this modification, both the 'before' and 'after' tests became lock-bound, since the 'seek-and-read' workload holds a lock on the DFSInputStream. This obscured the actual performance differences between the data paths.

2) Increased the iteration count to 30,000: this gives more reproducible results and ensures the JIT has plenty of time to kick in (each benchmark ran for ~50 seconds with this change, instead of only ~5 seconds).

3) Added a variation with two target blocks: I suspected there could be a regression for workloads that frequently switch back and forth between two different blocks of the same file. This variation is the same test, but with the DFS block size set to 128KB, so that the 256KB test file is split into two equal-sized blocks. A good percentage of the random reads then span block boundaries, which ensures the various caches in the code work correctly even when moving between different blocks.
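As a side note, the geometry of the 2-block variation is easy to sanity-check. Here's a tiny standalone sketch (plain Java, not Hadoop code; the names are made up for illustration) of which reads span the 128KB block boundary:

```java
// Standalone sketch of the 2-block test geometry described above (not Hadoop code).
public class BlockGeometry {
    static final long BLOCK_SIZE = 128 * 1024;  // 128KB DFS block size used in the test
    static final long FILE_SIZE = 256 * 1024;   // 256KB test file => exactly 2 blocks

    // Which block does a byte offset fall into?
    static long blockIndex(long offset) {
        return offset / BLOCK_SIZE;
    }

    // Does a read of `len` bytes starting at `offset` cross a block boundary?
    static boolean spansBoundary(long offset, int len) {
        return blockIndex(offset) != blockIndex(offset + len - 1);
    }

    public static void main(String[] args) {
        assert FILE_SIZE / BLOCK_SIZE == 2;
        // a 16KB read starting 8KB before the boundary crosses into block 1...
        assert spansBoundary(120 * 1024, 16 * 1024);
        // ...while the same-size read at the start of the file stays in block 0
        assert !spansBoundary(0, 16 * 1024);
        System.out.println("geometry checks passed");
    }
}
```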
h2. Comparing non-local read

When the new code path is disabled, or when the DN is not local, we continue to use the existing code path. We expect this code path's performance to be unaffected. Results:

|| Test || #Threads || #Files || Trunk MB/sec || HDFS-347 MB/sec ||
| TestParallelRead | 4 | 1 | 428.4 | 423.0 |
| TestParallelRead | 16 | 1 | 669.5 | 651.1 |
| TestParallelRead | 8 | 2 | 603.4 | 582.7 |
| TestParallelRead 2-blocks | 4 | 1 | 354.0 | 345.9 |
| TestParallelRead 2-blocks | 16 | 1 | 534.9 | 520.0 |
| TestParallelRead 2-blocks | 8 | 2 | 483.1 | 460.8 |

The above numbers appear to show a 2-4% regression, but I think it's within the noise on my machine (other software was running, etc.). Colin also has one or two ideas for micro-optimizations which might win back a couple of percent here and there, if it's not just noise. To put this in perspective, here are results for the same test against branch-1:

|| Test || #Threads || #Files || Branch-1 MB/sec ||
| TestParallelRead | 4 | 1 | 229.7 |
| TestParallelRead | 16 | 1 | 264.4 |
| TestParallelRead | 8 | 2 | 260.1 |

(so trunk is 2-3x as fast as branch-1)

h2. Comparing local read

Here we expect the performance to be as good as or better than the old (HDFS-2246) implementation. Results:

|| Test || #Threads || #Files || Trunk MB/sec || HDFS-347 MB/sec ||
| TestParallelLocalRead | 4 | 1 | 901.4 | 1033.6 |
| TestParallelLocalRead | 16 | 1 | 1079.8 | 1203.9 |
| TestParallelLocalRead | 8 | 2 | 1087.4 | 1274.0 |
| TestParallelLocalRead 2-blocks | 4 | 1 | 856.6 | 919.2 |
| TestParallelLocalRead 2-blocks | 16 | 1 | 1045.8 | 1137.0 |
| TestParallelLocalRead 2-blocks | 8 | 2 | 966.7 | 1392.9 |

The results show that the new implementation is indeed between 10% and 44% faster than the HDFS-2246 implementation. Our theory is that the old implementation would cache block paths, but not open file descriptors.
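That theory is easy to illustrate with a toy model. The sketch below (plain Java; hypothetical names, not the actual BlockReaderLocal code) contrasts a path-style cache, which still pays an open() per read, with holding the descriptor open:

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Toy model of the caching difference (hypothetical; not the HDFS implementation).
public class FdCacheSketch {
    static int openCalls = 0;  // stands in for the number of open() syscalls

    static FileInputStream open(Path p) throws IOException {
        openCalls++;
        return new FileInputStream(p.toFile());
    }

    public static void main(String[] args) throws IOException {
        Path block = Files.createTempFile("blk_", ".data");
        Files.write(block, new byte[4096]);

        // Path-cache style (old): the path is known, but every positional read
        // still constructs a new reader and reopens the file.
        openCalls = 0;
        for (int i = 0; i < 100; i++) {
            try (FileInputStream in = open(block)) {
                in.skip(i);
                in.read();
            }
        }
        int pathCacheOpens = openCalls;

        // Fd-cache style (new): open once, reuse the descriptor for every read.
        openCalls = 0;
        try (FileInputStream in = open(block)) {
            for (int i = 0; i < 100; i++) {
                in.getChannel().position(i);  // positional read on the cached fd
                in.read();
            }
        }
        int fdCacheOpens = openCalls;

        assert pathCacheOpens == 100;
        assert fdCacheOpens == 1;
        System.out.println(pathCacheOpens + " opens vs " + fdCacheOpens + " open");
        Files.delete(block);
    }
}
```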
So, because every positional read creates a new BlockReader, it would have to issue new {{open()}} syscalls, even if the location was cached.

h2. Comparing sequential read

I used the BenchmarkThroughput tool, configured to write a 1GB file and then read it back 100 times. This ensures the file is in the buffer cache, so we're benchmarking CPU overhead (the actual disk access didn't change in the patch, and we're looking for a potential regression in CPU resource usage). I recorded the MB/sec rate for short-circuit before and short-circuit after, then loaded the data into R and ran a t-test:

{code}
> d.before <- read.table(file="/tmp/before-patch.txt")
> d.after <- read.table(file="/tmp/after-patch.txt")
> t.test(d.before, d.after)

	Welch Two Sample t-test

data:  d.before and d.after
t = 0.5936, df = 199.777, p-value = 0.5535
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -62.39975 116.14431
sample estimates:
mean of x mean of y
 2939.456  2912.584
{code}

The p-value of 0.55 means there's no statistically significant difference in the performance of the two data paths for sequential workloads. I did the same thing with short-circuit disabled and got the following t-test results for the RemoteBlockReader code path:

{code}
> d.before <- read.table(file="/tmp/before-patch-rbr.txt")
> d.after <- read.table(file="/tmp/after-patch-rbr.txt")
> t.test(d.before, d.after)

	Welch Two Sample t-test

data:  d.before and d.after
t = 1.155, df = 199.89, p-value = 0.2495
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -18.69172  71.54320
sample estimates:
mean of x mean of y
 1454.653  1428.228
{code}

Again, the p-value of 0.25 means there's no significant difference in performance.
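For readers without R handy: the t-statistic above follows the standard Welch formula, which is simple to reproduce. A minimal standalone sketch (the sample data here is made up for illustration, not the benchmark data):

```java
// Standalone Welch two-sample t-statistic, the same quantity R's t.test() reports:
//   t = (mean1 - mean2) / sqrt(s1^2/n1 + s2^2/n2)
public class WelchT {
    static double mean(double[] x) {
        double s = 0;
        for (double v : x) s += v;
        return s / x.length;
    }

    // Sample variance (n - 1 denominator), as used by t.test()
    static double variance(double[] x) {
        double m = mean(x), s = 0;
        for (double v : x) s += (v - m) * (v - m);
        return s / (x.length - 1);
    }

    static double welchT(double[] a, double[] b) {
        return (mean(a) - mean(b))
                / Math.sqrt(variance(a) / a.length + variance(b) / b.length);
    }

    public static void main(String[] args) {
        // Made-up samples for illustration only (NOT the benchmark data):
        double[] before = {2939.0, 2941.0, 2938.5};
        double[] after  = {2912.0, 2913.5, 2912.0};
        // comparing a sample against itself gives t = 0, as expected
        assert Math.abs(welchT(before, before)) < 1e-12;
        System.out.printf("t = %.3f%n", welchT(before, after));
    }
}
```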
h2. Summary

The patch provides a good speedup (up to 40% in one case) for some random-read workloads, and has no discernible negative impact on others.

> DFS read performance suboptimal when client co-located on nodes with data
> -------------------------------------------------------------------------
>
>                 Key: HDFS-347
>                 URL: https://issues.apache.org/jira/browse/HDFS-347
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: datanode, hdfs-client, performance
>            Reporter: George Porter
>            Assignee: Colin Patrick McCabe
>         Attachments: all.tsv, BlockReaderLocal1.txt, HADOOP-4801.1.patch, HADOOP-4801.2.patch, HADOOP-4801.3.patch, HDFS-347-016_cleaned.patch, HDFS-347.016.patch, HDFS-347.017.clean.patch, HDFS-347.017.patch, HDFS-347.018.clean.patch, HDFS-347.018.patch2, HDFS-347.019.patch, HDFS-347.020.patch, HDFS-347.021.patch, HDFS-347.022.patch, HDFS-347.024.patch, HDFS-347.025.patch, HDFS-347.026.patch, HDFS-347.027.patch, HDFS-347.029.patch, HDFS-347.030.patch, HDFS-347.033.patch, HDFS-347.035.patch, HDFS-347-branch-20-append.txt, hdfs-347.png, hdfs-347.txt, local-reads-doc
>
>
> One of the major strategies Hadoop uses to get scalable data processing is to move the code to the data. However, putting the DFS client on the same physical node as the data blocks it acts on doesn't improve read performance as much as expected.
>
> After looking at Hadoop and O/S traces (via HADOOP-4049), I think the problem is due to the HDFS streaming protocol causing many more read I/O operations (iops) than necessary. Consider the case of a DFSClient fetching a 64MB disk block from the DataNode process (running in a separate JVM) on the same machine. The DataNode will satisfy the single disk-block request by sending data back to the HDFS client in 64KB chunks. In BlockSender.java, this is done in the sendChunk() method, relying on Java's transferTo() method.
> Depending on the host O/S and JVM implementation, transferTo() is implemented as either a sendfilev() syscall or a pair of mmap() and write() calls. In either case, each chunk is read from the disk by issuing a separate I/O operation, so the single request for a 64MB block ends up hitting the disk as over a thousand smaller requests of 64KB each.
>
> Since the DFSClient runs in a different JVM and process than the DataNode, shuttling data from the disk to the DFSClient also results in context switches each time network packets get sent (in this case, each 64KB chunk turns into a large number of 1500-byte packet send operations). Thus we see a large number of context switches for each block send operation.
>
> I'd like to get some feedback on the best way to address this, but I think the answer is to provide a mechanism for a DFSClient to directly open data blocks that happen to be on the same machine. It could do this by examining the set of LocatedBlocks returned by the NameNode, marking those that should be resident on the local host. Since the DataNode and DFSClient (probably) share the same Hadoop configuration, the DFSClient should be able to find the files holding the block data, open them directly, and read the data itself. This would avoid the context switches imposed by the network layer, and would allow for much larger read buffers than 64KB, which should reduce the number of iops imposed by each block read operation.
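For what it's worth, the amplification described above checks out as back-of-the-envelope arithmetic. A standalone sketch using the 64MB block, 64KB chunk, and 1500-byte packet figures from the description:

```java
// Back-of-the-envelope check of the I/O amplification described above.
public class ChunkMath {
    public static void main(String[] args) {
        long blockBytes  = 64L * 1024 * 1024;  // one 64MB HDFS block
        long chunkBytes  = 64L * 1024;         // BlockSender sends 64KB chunks
        long packetBytes = 1500;               // typical Ethernet MTU

        long chunksPerBlock  = blockBytes / chunkBytes;
        long packetsPerChunk = (chunkBytes + packetBytes - 1) / packetBytes;  // ceiling

        // "over a thousand smaller requests": 1024 chunk reads per block
        assert chunksPerBlock == 1024;
        // each 64KB chunk becomes roughly 44 wire packets
        assert packetsPerChunk == 44;

        System.out.println(chunksPerBlock + " chunk reads, ~"
                + chunksPerBlock * packetsPerChunk + " packet sends per block");
    }
}
```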