[ https://issues.apache.org/jira/browse/HDFS-959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12831600#action_12831600 ]
Todd Lipcon commented on HDFS-959:
----------------------------------

bq. I have been thinking of making this configurable

+1 - I opened HDFS-962.

> Performance improvements to DFSClient and DataNode for faster DFS write at replication factor of 1
> --------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-959
>                 URL: https://issues.apache.org/jira/browse/HDFS-959
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: data-node, hdfs client
>    Affects Versions: 0.20.2, 0.22.0
>         Environment: RHEL5 on Dual CPU quad-core Intel servers, 16 GB RAM, 4 SATA disks.
>            Reporter: Naredula Janardhana Reddy
>             Fix For: 0.20.2, 0.22.0
>
>
> The following improvements are suggested to DFSClient and DataNode to improve
> DFS write throughput, based on experimental verification with a replication
> factor of 1.
> The changes are useful in principle for replication factors of 2 and 3 as
> well, but they do not currently demonstrate a noticeable performance
> improvement in our test-bed because a network throughput bottleneck hides
> the benefit of these changes.
> All changes are applicable to 0.20.2. Some of them are applicable to trunk,
> as noted below. I have not verified applicability to 0.21.
>
> List of Improvements
> -----------------------------
> Item 1: DFSClient. Finer grain locks in WriteChunk(). Currently the lock is
> held at the data block level (512 bytes). It can be moved to the packet level
> (64 KB), to lower the frequency of locking.
> This optimization applies to 0.20.2. It already appears in trunk.
>
> Item 2: Misc. improvements to DataNode
>
> 2.1: Concurrency of Disk Writes: Checksum verification and writing data to
> disk can be moved to a separate thread ("Disk Write Thread"). This will allow
> the existing "network thread" to trigger faster acks to the DFSClient. This
> will also allow the packet to be transmitted to the replication node faster.
> In effect, this will allow DataNode to consume packets at higher speeds.
> This optimization applies to 0.20.2 and trunk.
>
> 2.2: Bulk Receive and Bulk Send: This optimization is enabled by doing 2.1.
> We can now have DataNode receive more than one packet at a time, since we have
> added a buffer between the (existing) network thread and the (newly added)
> Disk Write thread.
> This optimization applies to 0.20.2 and trunk.
>
> 2.3: Early Ack: The proposed optimization is to send out acks to the client
> as soon as possible instead of waiting for the disk write. Note that the
> last ack is an exception: it will be sent only after data has been flushed to
> the OS.
> This optimization applies to 0.20.2. It already appears in trunk.
>
> 2.4: lseek optimization: Currently lseek (the system call) is called before
> every disk write, which is not necessary when the write is sequential. The
> proposed optimization calls lseek only when necessary.
> This optimization applies to 0.20.2. I was unable to tell if it is already in
> trunk.
>
> 2.5: Checksum buffered writes: Currently the checksum is written through a
> buffered stream of size 512 bytes. This can be increased to a larger value,
> such as 4 KB, to lower the number of write() system calls. This will save
> context switch overhead.
> This optimization applies to 0.20.2. I was unable to tell if it is already in
> trunk.
>
> Item 3: Applying HADOOP-6166 - PureJavaCrc32() - from trunk to 0.20.2
> This is applicable to 0.20.2. It already appears in trunk.
>
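To make items 2.1 and 2.2 concrete, here is a minimal sketch of the hand-off they describe: a receiving ("network") thread places packets on a bounded queue, and a separate disk write thread drains the queue in batches. This is only an illustration, not the patch attached to this issue. The class name, the method name, the queue capacity parameter, and the use of an in-memory stream in place of the block file are all assumptions made for the example.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch: none of these names come from the HDFS code base.
class DiskWriteThreadSketch {
    // Sentinel object that tells the disk write thread the block is complete.
    private static final byte[] END_OF_BLOCK = new byte[0];

    /**
     * Pushes packets through a bounded queue to a separate writer thread and
     * returns the bytes that reached the simulated disk, in arrival order.
     */
    static byte[] receiveBlock(List<byte[]> packets, int queueCapacity) {
        BlockingQueue<byte[]> queue = new LinkedBlockingQueue<>(queueCapacity);
        // Stands in for the block file on disk (assumption for the example).
        ByteArrayOutputStream disk = new ByteArrayOutputStream();

        Thread diskWriter = new Thread(() -> {
            List<byte[]> batch = new ArrayList<>();
            try {
                while (true) {
                    batch.clear();
                    batch.add(queue.take()); // block until at least one packet arrives
                    queue.drainTo(batch);    // then take whatever else is queued (item 2.2)
                    for (byte[] packet : batch) {
                        if (packet == END_OF_BLOCK) {
                            return;
                        }
                        // Checksum verification would happen here, off the network thread.
                        disk.write(packet, 0, packet.length);
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "disk-write-thread");
        diskWriter.start();

        try {
            for (byte[] packet : packets) {
                // "Network thread": hand the packet off, then it is free to ack or forward.
                queue.put(packet);
            }
            queue.put(END_OF_BLOCK);
            // Wait for the writer to finish before the block is reported complete.
            diskWriter.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while receiving block", e);
        }
        return disk.toByteArray();
    }
}
```

The bounded queue is the "buffer" mentioned in 2.2: when the disk falls behind, put() blocks the receiver instead of letting memory grow without limit. The join() at the end mirrors the rule in 2.3 that the last ack waits until the data has been written.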
> Performance Experiments Results
> -----------------------------------------------
> Performance experiments showed the following numbers:
> Hadoop Version: 0.20.2
> Server Configs: RHEL5, Quad-core dual-CPU, 16GB RAM, 4 SATA disks
> $ uname -a
> Linux gsbl90324.blue.ygrid.yahoo.com 2.6.18-53.1.13.el5 #1 SMP Mon Feb 11 13:27:27 EST 2008 x86_64 x86_64 x86_64 GNU/Linux
> $ cat /proc/cpuinfo
> model name : Intel(R) Xeon(R) CPU L5420 @ 2.50GHz
> $ cat /etc/issue
> Red Hat Enterprise Linux Server release 5.1 (Tikanga)
> Kernel \r on an \m
>
> Benchmark Details
> --------------------------
> Benchmark Name: DFSIO
> Benchmark Configuration:
> a) # of maps (writers to DFS per node). Tried the following values: 1, 2, 3
> b) # of nodes: Single-node test and 15-node cluster test
>
> Results Summary
> --------------------------
> a) With all the above optimizations turned on
> All these tests were done with a replication factor of 1. Tests with
> replication factors of 2 and 3 showed no noticeable improvement, because
> these improvements are masked by the network bandwidth limit, as noted above.
> What was measured: Write throughput per client (in MB/s)
>
> | Test Description | Baseline (MB/s) | With improvements (MB/s) | % improvement |
> | 15-node cluster with 1 map (writer) per node | 103 | 147 | ~43 % |
> | Single node test with 1 map (writer) per node | 102 | 148 | ~45 % |
> | Single node test with 2 maps (writers) per node | 86 | 101 | ~16 % |
> | Single node test with 3 maps (writers) per node | 67 | 76 | ~13 % |
>
> b) With above optimizations turned on individually
> I ran some experiments by adding and removing items individually to
> understand the approximate range of performance contribution from each item.
> These are the numbers I got (they are approximate).
>
> | ITEM | Title | Improvement in 0.20 | Improvement in trunk |
> | Item 1 | DFSClient. Finer grain locks in WriteChunk() | 30% | Already in trunk |
> | Item 2.1 | Concurrency of Disk Writes | 25% | 15-20% |
> | Item 2.2 | Bulk Receive and Bulk Send | 2% | (Have not yet tried) |
> | Item 2.3 | Early Ack | 2% | Already in trunk |
> | Item 2.4 | lseek optimization | 2% | (Have not yet tried) |
> | Item 2.5 | Checksum buffered writes | 2% | (Have not yet tried) |
> | Item 3 | Applying HADOOP-6166 - PureJavaCrc32() | 15% | Already in trunk |
>
> Patches
> -----------
> I will submit a patch for 0.20.2 shortly (in a day).
> I expect to submit a patch for trunk after receiving review comments on the
> above patch.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.