[ https://issues.apache.org/jira/browse/HDFS-347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13498420#comment-13498420 ]
Colin Patrick McCabe commented on HDFS-347:
-------------------------------------------

Here are some benchmarks I ran locally on a one-node cluster, to confirm that there are no performance regressions with the new implementation.

With HDFS-347, {{dfs.client.read.shortcircuit}} = true, and {{dfs.client.read.shortcircuit.skip.checksum}} = false:

cmccabe@keter:/h> /usr/bin/time ./bin/hadoop fs -cat /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g >/dev/null
7.46user 3.38system 0:09.50elapsed 114%CPU (0avgtext+0avgdata 423200maxresident)k
0inputs+104outputs (0major+25697minor)pagefaults 0swaps
cmccabe@keter:/h> /usr/bin/time ./bin/hadoop fs -cat /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g >/dev/null
7.39user 3.37system 0:09.43elapsed 114%CPU (0avgtext+0avgdata 430352maxresident)k
0inputs+144outputs (0major+24399minor)pagefaults 0swaps
cmccabe@keter:/h> /usr/bin/time ./bin/hadoop fs -cat /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g >/dev/null
7.41user 3.39system 0:09.51elapsed 113%CPU (0avgtext+0avgdata 439536maxresident)k
0inputs+144outputs (0major+25609minor)pagefaults 0swaps

=========================================

With unmodified trunk, {{dfs.client.read.shortcircuit}} = true, {{dfs.client.read.shortcircuit.skip.checksum}} = false, and {{dfs.block.local-path-access.user}} = cmccabe:

cmccabe@keter:/h> /usr/bin/time ./bin/hadoop fs -cat /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g >/dev/null
7.60user 3.58system 0:09.89elapsed 113%CPU (0avgtext+0avgdata 444848maxresident)k
0inputs+64outputs (0major+25903minor)pagefaults 0swaps
cmccabe@keter:/h> /usr/bin/time ./bin/hadoop fs -cat /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g >/dev/null
7.65user 3.44system 0:09.57elapsed 115%CPU (0avgtext+0avgdata 443824maxresident)k
0inputs+64outputs (0major+24054minor)pagefaults 0swaps
cmccabe@keter:/h> /usr/bin/time ./bin/hadoop fs -cat /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g >/dev/null
7.50user 3.43system 0:09.42elapsed 116%CPU (0avgtext+0avgdata 422624maxresident)k
0inputs+64outputs (0major+25918minor)pagefaults 0swaps

=========================================

With HDFS-347 and {{dfs.client.read.shortcircuit}} = false:

cmccabe@keter:/h> /usr/bin/time ./bin/hadoop fs -cat /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g >/dev/null
10.15user 8.83system 0:17.88elapsed 106%CPU (0avgtext+0avgdata 412512maxresident)k
0inputs+224outputs (0major+24449minor)pagefaults 0swaps
cmccabe@keter:/h> /usr/bin/time ./bin/hadoop fs -cat /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g >/dev/null
10.19user 8.55system 0:17.23elapsed 108%CPU (0avgtext+0avgdata 449248maxresident)k
0inputs+184outputs (0major+24109minor)pagefaults 0swaps
cmccabe@keter:/h> /usr/bin/time ./bin/hadoop fs -cat /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g >/dev/null
10.24user 8.38system 0:17.16elapsed 108%CPU (0avgtext+0avgdata 439568maxresident)k
0inputs+144outputs (0major+23957minor)pagefaults 0swaps

=========================================

With unmodified trunk and {{dfs.client.read.shortcircuit}} = false:

cmccabe@keter:/h> /usr/bin/time ./bin/hadoop fs -cat /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g >/dev/null
10.76user 8.64system 0:18.18elapsed 106%CPU (0avgtext+0avgdata 483872maxresident)k
0inputs+64outputs (0major+28735minor)pagefaults 0swaps
cmccabe@keter:/h> /usr/bin/time ./bin/hadoop fs -cat /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g >/dev/null
10.59user 8.54system 0:17.46elapsed 109%CPU (0avgtext+0avgdata 491216maxresident)k
0inputs+64outputs (0major+27868minor)pagefaults 0swaps
cmccabe@keter:/h> /usr/bin/time ./bin/hadoop fs -cat /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g /1g >/dev/null
9.81user 8.95system 0:17.24elapsed 108%CPU (0avgtext+0avgdata 422144maxresident)k
0inputs+64outputs (0major+25726minor)pagefaults 0swaps

> DFS read performance suboptimal when client co-located on nodes with data
> -------------------------------------------------------------------------
>
>                 Key: HDFS-347
>                 URL: https://issues.apache.org/jira/browse/HDFS-347
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: data-node, hdfs client, performance
>            Reporter: George Porter
>            Assignee: Colin Patrick McCabe
>         Attachments: all.tsv, BlockReaderLocal1.txt, HADOOP-4801.1.patch, HADOOP-4801.2.patch, HADOOP-4801.3.patch, HDFS-347-016_cleaned.patch, HDFS-347.016.patch, HDFS-347.017.clean.patch, HDFS-347.017.patch, HDFS-347.018.clean.patch, HDFS-347.018.patch2, HDFS-347.019.patch, HDFS-347.020.patch, HDFS-347.021.patch, HDFS-347.022.patch, HDFS-347.024.patch, HDFS-347.025.patch, HDFS-347-branch-20-append.txt, hdfs-347.png, hdfs-347.txt, local-reads-doc
>
> One of the major strategies Hadoop uses to get scalable data processing is to move the code to the data. However, putting the DFS client on the same physical node as the data blocks it acts on doesn't improve read performance as much as expected.
>
> After looking at Hadoop and O/S traces (via HADOOP-4049), I think the problem is due to the HDFS streaming protocol causing many more read I/O operations (iops) than necessary. Consider the case of a DFSClient fetching a 64-MB disk block from the DataNode process (running in a separate JVM) on the same machine. The DataNode will satisfy the single disk-block request by sending data back to the HDFS client in 64-KB chunks. In BlockSender.java, this is done in the sendChunk() method, relying on Java's transferTo() method. Depending on the host O/S and JVM implementation, transferTo() is implemented as either a sendfilev() syscall or a pair of mmap() and write().
> In either case, each chunk is read from the disk by issuing a separate I/O operation. The result is that the single request for a 64-MB block ends up hitting the disk as over a thousand smaller requests of 64 KB each.
>
> Since the DFSClient runs in a different JVM and process than the DataNode, shuttling data from the disk to the DFSClient also results in context switches each time network packets get sent (in this case, each 64-KB chunk turns into a large number of 1500-byte packet send operations). Thus we see a large number of context switches for each block send operation.
>
> I'd like to get some feedback on the best way to address this, but I think the right approach is to provide a mechanism for a DFSClient to directly open data blocks that happen to be on the same machine. It could do this by examining the set of LocatedBlocks returned by the NameNode and marking those that should be resident on the local host. Since the DataNode and DFSClient (probably) share the same Hadoop configuration, the DFSClient should be able to find the files holding the block data, open them, and read them directly. This would avoid the context switches imposed by the network layer, and would allow for much larger read buffers than 64 KB, which should reduce the number of iops imposed by each block read operation.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
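[Editorial note] For anyone reproducing the benchmark runs above: the three client-side settings they reference are ordinarily set in hdfs-site.xml. A minimal sketch, using the property names quoted in the comment (the values shown are those of the first benchmark configuration, and the whitelisted user name is obviously specific to that setup; {{dfs.block.local-path-access.user}} applies only to the pre-HDFS-347 trunk implementation):

```xml
<configuration>
  <!-- enable short-circuit local reads -->
  <property>
    <name>dfs.client.read.shortcircuit</name>
    <value>true</value>
  </property>
  <!-- keep checksum verification on, as in the benchmarked runs -->
  <property>
    <name>dfs.client.read.shortcircuit.skip.checksum</name>
    <value>false</value>
  </property>
  <!-- pre-HDFS-347 trunk additionally required whitelisting the reading user -->
  <property>
    <name>dfs.block.local-path-access.user</name>
    <value>cmccabe</value>
  </property>
</configuration>
```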
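[Editorial note] The per-chunk transferTo() pattern the quoted report describes can be sketched in plain Java NIO. This is not the actual HDFS BlockSender code, just an illustration of why one block request multiplies into many I/O operations; {{ChunkedTransfer}} and {{sendChunked}} are made-up names:

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ChunkedTransfer {
    // 64-KB chunk size, matching the per-chunk sends described in the report
    static final int CHUNK = 64 * 1024;

    // Hypothetical stand-in for the DataNode's sendChunk() loop: push the
    // whole file through transferTo() one chunk at a time, counting the calls.
    static long sendChunked(Path src, FileChannel out) throws IOException {
        long calls = 0;
        try (FileChannel in = FileChannel.open(src, StandardOpenOption.READ)) {
            long pos = 0;
            long size = in.size();
            while (pos < size) {
                long sent = in.transferTo(pos, Math.min(CHUNK, size - pos), out);
                if (sent <= 0) {
                    break;  // defensive: avoid spinning if the channel stalls
                }
                pos += sent;
                calls++;
            }
        }
        return calls;
    }

    public static void main(String[] args) throws IOException {
        Path src = Files.createTempFile("block", ".dat");
        Path dst = Files.createTempFile("sink", ".dat");
        Files.write(src, new byte[1 << 20]);  // stand-in 1-MB "block"
        try (FileChannel out = FileChannel.open(dst, StandardOpenOption.WRITE)) {
            // 1 MB / 64 KB = 16 separate transfers; a 64-MB block takes 1024
            System.out.println("transferTo calls: " + sendChunked(src, out));
        }
        Files.delete(src);
        Files.delete(dst);
    }
}
```

Each of those transferTo() calls becomes its own sendfile (or mmap()/write() pair) at the OS level, which is exactly the iops multiplication the report is describing: one logical block read fans out into pos/CHUNK separate kernel-level I/O requests.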