[ https://issues.apache.org/jira/browse/HDFS-296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Allen Wittenauer resolved HDFS-296.
-----------------------------------
    Resolution: Incomplete

I'm going to close this as stale given how many changes have happened to both streaming and HDFS since this was filed.

> Serial streaming performance should be Math.min(ideal client performance, ideal serial hdfs performance)
> --------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-296
>                 URL: https://issues.apache.org/jira/browse/HDFS-296
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>         Environment: Mac OS X 10.5.2, Java 6
>            Reporter: Sam Pullara
>
> I looked at all the code long and hard and this was my analysis (could be wrong, I'm not an expert on this codebase):
> Current Serial HDFS performance = Average Datanode Performance
> Average Datanode Performance = Average Disk Performance (even if you have more than one)
> We should have:
> Ideal Serial HDFS Performance = Sum of Ideal Datanode Performance
> Ideal Datanode Performance = Sum of disk performance
> When you read a single file serially from HDFS there are a number of limitations that come into play:
> 1) Blocks on multiple datanodes will be load balanced between them - averaging the performance of the datanodes
> 2) Blocks on multiple disks in a single datanode are load balanced between them - averaging the performance of the disks
> I think that all this could be fixed if we actually prefetched fully read blocks on the client until the client can no longer keep up with the data or there is another bottleneck like network bandwidth.
> This seems like a reasonably common use case though not the typical MapReduce case.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
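
For reference, the throughput model in the quoted description can be written out as a small back-of-envelope calculation. This is only a sketch of the reporter's arithmetic, not code from HDFS: the function names, the cluster layout, and the MB/s figures are all hypothetical and chosen for illustration.

```python
from statistics import mean

# Each datanode is modelled as a list of per-disk throughputs in MB/s.
# All names and numbers here are illustrative assumptions.

def current_datanode(disks):
    # Reporter's claim: a datanode delivers the average of its disks.
    return mean(disks)

def ideal_datanode(disks):
    # Reporter's proposal: a datanode should deliver the sum of its disks.
    return sum(disks)

def current_serial_hdfs(cluster):
    # Reporter's claim: a serial read sees the average datanode performance.
    return mean([current_datanode(disks) for disks in cluster])

def ideal_serial_hdfs(cluster):
    # Reporter's proposal: a serial read should see the sum over datanodes.
    return sum([ideal_datanode(disks) for disks in cluster])

def target_streaming(client_rate, cluster):
    # Issue title: Math.min(ideal client performance, ideal serial hdfs performance).
    return min(client_rate, ideal_serial_hdfs(cluster))

# Hypothetical cluster: two datanodes, each with two 50 MB/s disks.
cluster = [[50.0, 50.0], [50.0, 50.0]]
print(current_serial_hdfs(cluster))      # 50.0
print(ideal_serial_hdfs(cluster))        # 200.0
print(target_streaming(120.0, cluster))  # 120.0
```

With these made-up numbers the client, not HDFS, becomes the bottleneck once blocks are prefetched from all disks in parallel, which is the situation the issue title describes.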