Gopal V commented on HDFS-4070:
-------------------------------

I agree that the kernel should be doing I/O merging and handling concurrent disk writes well, because it has global knowledge of I/O patterns. Ideally, userland code should just write large, page-aligned chunks of data and let the kernel do the actual magic. In this case, the userland layer is being unnecessarily chatty with syscalls (a minimal demonstration of the batching idea is sketched after the quoted description below). I'd hazard a guess that the performance boost comes simply from the reduced number of syscalls, and from the kernel doing fewer wake-ups of threads running CPU-bound code (checksums and such).

Before I benchmarked it, I was more annoyed that the system outright overrides my buffer settings, which would otherwise be a decent tunable (a configuration sketch also follows below). I don't think the 15% number translates directly to any other benchmark; this test isolates DFS writes, with no other operations in the middle.

This was tested on an RHEL 5.5 box on EC2 (with 4 EBS volumes backing HDFS), kernel 2.6.18-194.32.1.el5xen #1 SMP. A real bare-metal box would behave differently (I've ordered a new SSD-backed box; it'll arrive in parts and should be working by next week, I guess). In the meantime, if you can bench this on some real hardware, I'll have some foundation for my theories (beyond averaging runs on EC2).

> DFSClient ignores bufferSize argument & always performs small writes
> --------------------------------------------------------------------
>
>                 Key: HDFS-4070
>                 URL: https://issues.apache.org/jira/browse/HDFS-4070
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs client
>    Affects Versions: 1.0.3, 2.0.3-alpha
>         Environment: RHEL 5.5 x86_64 (ec2)
>            Reporter: Gopal V
>            Priority: Minor
>         Attachments: gistfe319436b880026cbad4-aad495d50e0d6b538831327752b984e0fdcc74db.tar.gz
>
>
> The following code illustrates the issue at hand:
> {code}
> protected void map(LongWritable offset, Text value, Context context)
>     throws IOException, InterruptedException {
>   // fs (a FileSystem) and buffer (a byte[] of at least 1 KB) are fields,
>   // assumed to be initialized elsewhere, e.g. in setup()
>   OutputStream out = fs.create(new Path("/tmp/benchmark/", value.toString()),
>       true, 1024 * 1024);           // bufferSize argument: 1 MB
>   int i;
>   for (i = 0; i < 1024 * 1024; i++) {
>     out.write(buffer, 0, 1024);     // 1 GB total, written 1 KB at a time
>   }
>   out.close();
>   context.write(value, new IntWritable(i));
> }
> {code}
> This code is run as a single map-only task with an input file on disk and map output to disk:
> {{# su - hdfs -c 'hadoop jar /tmp/dfs-test-1.0-SNAPSHOT-job.jar file:///tmp/list file:///grid/0/hadoop/hdfs/tmp/benchmark'}}
> In the datanode's disk access patterns, the following pattern was observed consistently, irrespective of the bufferSize provided:
> {code}
> 21119 read(58, <unfinished ...>
> 21119 <... read resumed> "\0\1\0\0\0\0\0\0\0034\212\0\0\0\0\0\0\0+\220\0\0\0\376\0\262\252ux\262\252u"..., 65557) = 65557
> 21119 lseek(107, 0, SEEK_CUR <unfinished ...>
> 21119 <... lseek resumed> ) = 53774848
> 21119 write(107, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65024 <unfinished ...>
> 21119 <... write resumed> ) = 65024
> 21119 write(108, "\262\252ux\262\252ux\262\252ux\262\252ux\262\252ux\262\252ux\262\252ux\262\252ux"..., 508 <unfinished ...>
> 21119 <... write resumed> ) = 508
> {code}
> Here fd 58 is the incoming socket, 107 is the blk file and 108 is the .meta file (the 65024-byte write is 127 × 512-byte data chunks, and the 508-byte write is the matching 127 × 4-byte checksums).
> The DFS packet size ignores the bufferSize argument; with the default 64 KB value it suffers from suboptimal syscall & disk performance, as is obvious from the interrupted read/write operations above.
> Changing the packet size to a more optimal 1056405 bytes results in a decent spike in performance, by cutting down on disk & network IOPS.
> h3. Average time (milliseconds) for a 10 GB write as 10 files in a single map task
> ||timestamp||65536-byte packets||1056252-byte packets||
> |1350469614|88530|78662|
> |1350469827|88610|81680|
> |1350470042|92632|78277|
> |1350470261|89726|79225|
> |1350470476|92272|78265|
> |1350470696|89646|81352|
> |1350470913|92311|77281|
> |1350471132|89632|77601|
> |1350471345|89302|81530|
> |1350471564|91844|80413|
> On average that is an increase from ~115 MB/s to ~130 MB/s, obtained by modifying the global packet size setting.
> This suggests that there is value in adapting the user-provided bufferSize to Hadoop packet sizing, per stream (see the sizing sketch at the end of this comment).
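To make the syscall-chattiness point concrete, here is a minimal stand-alone sketch — not HDFS code; the file name, buffer capacity, and write counts are all arbitrary. It shows how a large userland buffer turns 1 KB application writes into roughly one {{write(2)}} per megabyte:

{code}
import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;

public class BatchedWrites {
  public static void main(String[] args) throws IOException {
    byte[] chunk = new byte[1024];   // 1 KB application-level writes
    int bufSize = 1024 * 1024;       // 1 MB buffer, a multiple of the 4 KB page size
    OutputStream out = new BufferedOutputStream(
        new FileOutputStream("/tmp/batched.out"), bufSize);
    for (int i = 0; i < 1024 * 1024; i++) {
      // coalesced in the buffer; flushed to the kernel as ~1 MB write() calls
      out.write(chunk, 0, chunk.length);
    }
    out.close();
  }
}
{code}

Running that under {{strace -f -e trace=write}} against an unbuffered {{FileOutputStream}} should show the syscall count drop by roughly three orders of magnitude for the same 1 GB of data.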
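On the overridden tunable: the 64 KB figure above is the client's global write-packet-size setting, which the bufferSize argument to {{fs.create()}} never reaches. A sketch of raising it per job, assuming the 2.x key {{dfs.client-write-packet-size}} (branch-1 reads {{dfs.write.packet.size}}); the key names are from memory, so verify them against the version in use:

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PacketSizeOverride {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // The packet size is per-client, not per-stream; the bufferSize
    // argument to create() below has no effect on it.
    conf.setInt("dfs.client-write-packet-size", 1056252); // 2.x key (assumed)
    // conf.setInt("dfs.write.packet.size", 1056252);     // branch-1 equivalent
    FileSystem fs = FileSystem.get(conf);
    FSDataOutputStream out =
        fs.create(new Path("/tmp/benchmark/packet-test"), true, 1024 * 1024);
    out.write(new byte[1024]);
    out.close();
  }
}
{code}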
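Finally, on the per-stream suggestion: the client already builds each packet out of a whole number of checksum chunks, so adapting would mean deriving the chunk count from the caller's bufferSize instead of the global setting. A hypothetical helper, for illustration only — the method name and the 25-byte header constant are invented here, not HDFS API, while 512-byte chunks with 4-byte CRC32 checksums are the usual defaults:

{code}
public class PacketSizing {
  static final int PKT_HEADER_LEN = 25;       // assumed header length, illustration only
  static final int BYTES_PER_CHECKSUM = 512;  // io.bytes.per.checksum default
  static final int CHECKSUM_SIZE = 4;         // 4-byte CRC32 per chunk

  // Round a user-supplied bufferSize down to a whole number of
  // checksum chunks, mirroring how the client sizes its packets.
  static int packetSizeFor(int bufferSize) {
    int chunkOnWire = BYTES_PER_CHECKSUM + CHECKSUM_SIZE;  // data + checksum
    int chunks = Math.max(1, (bufferSize - PKT_HEADER_LEN) / chunkOnWire);
    return PKT_HEADER_LEN + chunks * chunkOnWire;
  }

  public static void main(String[] args) {
    // A 1 MB bufferSize maps to ~1 MB packets instead of the 64 KB default.
    System.out.println(packetSizeFor(1024 * 1024)); // prints 1048537
  }
}
{code}

For what it's worth, the 1056252 bytes benchmarked above is exactly 2047 such 516-byte chunks, so whole-chunk sizing is already compatible with the numbers in the table.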