[ https://issues.apache.org/jira/browse/HADOOP-4801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

George Porter updated HADOOP-4801:
----------------------------------

    Attachment: HADOOP-4801.1.patch

This patch contains a proof-of-concept implementation of "HDFS Direct-I/O", a 
mechanism that lets HDFS clients access data blocks residing on the local host 
directly.  Instead of streaming block data from the local DataNode over the 
standard HDFS block streaming protocol, the client locates the raw block files 
on disk and opens them itself.

The changes are mostly contained to DFSClient.java.  BlockReader is now an 
interface with two implementing classes: RemoteBlockReader and 
DirectBlockReader.  RemoteBlockReader is the baseline and works exactly as 
before.  DirectBlockReader instances are created in blockSeekTo(long), which 
checks whether the requested offset resides in a block stored locally on the 
same machine as the DFSClient.  If so, the DirectBlockReader opens that block 
file and passes I/O operations straight through to the Java filesystem layer.
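
Roughly, the split looks like the sketch below (names and signatures here are 
illustrative, not necessarily the exact ones in HADOOP-4801.1.patch):

    import java.io.File;
    import java.io.IOException;
    import java.io.RandomAccessFile;

    // Illustrative sketch only; not the actual patch code.
    interface BlockReader {
      // Read up to len bytes of block data into buf starting at off.
      int read(byte[] buf, int off, int len) throws IOException;
    }

    // Baseline path: streams block data from the DataNode socket, as before.
    class RemoteBlockReader implements BlockReader {
      public int read(byte[] buf, int off, int len) throws IOException {
        /* unchanged streaming behaviour */ return -1;
      }
    }

    // Direct path: blockSeekTo(long) has decided the target block is stored
    // on the local machine, so open the raw block file and let the OS do
    // the work.
    class DirectBlockReader implements BlockReader {
      private final RandomAccessFile blockFile;

      DirectBlockReader(File localBlockPath, long startOffset)
          throws IOException {
        blockFile = new RandomAccessFile(localBlockPath, "r");
        blockFile.seek(startOffset);
      }

      public int read(byte[] buf, int off, int len) throws IOException {
        return blockFile.read(buf, off, len);  // pass through to the local FS
      }
    }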

I performed two main sets of performance testing: a streaming test and a 
random read test.  Both tests ran on a single machine running the Hadoop trunk 
code in pseudo-distributed mode (one DataNode, one NameNode, dfs.replication 
set to 1, and the default block size).

In the streaming test, I opened a 1 GB file and read it from start to finish 
into memory; a rough sketch of this loop follows the numbers below.
  Baseline: 8.730 seconds with std. dev. of 0.052
  DirectIO: 5.266 seconds with std. dev. of 0.116
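
(The streaming test is essentially the loop below; the file path and buffer 
size are mine for illustration, not necessarily what the actual test harness 
used:)

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class StreamTest {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path p = new Path(args[0]);          // e.g. a 1 GB file in HDFS
        byte[] buf = new byte[64 * 1024];
        FSDataInputStream in = fs.open(p);
        long start = System.currentTimeMillis();
        long total = 0;
        int n;
        while ((n = in.read(buf)) > 0) {     // stream the whole file into memory
          total += n;
        }
        in.close();
        double secs = (System.currentTimeMillis() - start) / 1000.0;
        System.out.println(total + " bytes in " + secs + " seconds");
      }
    }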

In the random read test, I opened a 1 GB or 4 GB file and performed a series 
of random reads: pick a random offset in the file, seek to that offset, and 
read 1 KB of data.  A sketch of this loop follows the numbers below.

For the 1 GB file
  Baseline with 1024 reads: 861 reads/second
  DirectIO with 1024 reads: 5988 reads/second

  Baseline with 4096 reads: 1065 reads/second
  DirectIO with 4096 reads: 9287 reads/second

For the 4 GB file
  Baseline with 65,535 reads: 535 reads/second
  DirectIO with 65,535 reads: 17,852 reads/second
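
(The random read test is essentially the loop below; the file name and 
argument handling are again illustrative only:)

    import java.util.Random;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class RandomReadTest {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path p = new Path(args[0]);
        int numReads = Integer.parseInt(args[1]);  // e.g. 1024, 4096, 65535
        long fileLen = fs.getFileStatus(p).getLen();
        byte[] buf = new byte[1024];               // 1 KB per read
        Random rand = new Random();
        FSDataInputStream in = fs.open(p);
        long start = System.currentTimeMillis();
        for (int i = 0; i < numReads; i++) {
          long off = (long) (rand.nextDouble() * (fileLen - buf.length));
          in.seek(off);                            // jump to a random offset
          in.readFully(buf, 0, buf.length);        // read 1 KB from there
        }
        double secs = (System.currentTimeMillis() - start) / 1000.0;
        in.close();
        System.out.println((numReads / secs) + " reads/second");
      }
    }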

It's hard to tell how much these results are affected by the OS page cache and 
disk caches, so I wanted to put this patch out there to get your experiences 
with it.  Obviously, faster local block reads will only improve application 
performance to the extent that the application actually reads from locally 
resident blocks.

Your feedback is appreciated!  Thanks.

> DFS read performance suboptimal when client co-located on nodes with data
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-4801
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4801
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: dfs
>    Affects Versions: 0.19.0
>            Reporter: George Porter
>         Attachments: HADOOP-4801.1.patch
>
>
> One of the major strategies Hadoop uses to get scalable data processing is to 
> move the code to the data.  However, putting the DFS client on the same 
> physical node as the data blocks it acts on doesn't improve read performance 
> as much as expected.
>
> After looking at Hadoop and O/S traces (via HADOOP-4049), I think the problem 
> is due to the HDFS streaming protocol causing many more read I/O operations 
> (iops) than necessary.  Consider the case of a DFSClient fetching a 64 MB 
> disk block from the DataNode process (running in a separate JVM) running on 
> the same machine.  The DataNode will satisfy the single disk block request by 
> sending data back to the HDFS client in 64-KB chunks.  In BlockSender.java, 
> this is done in the sendChunk() method, relying on Java's transferTo() 
> method.  Depending on the host O/S and JVM implementation, transferTo() is 
> implemented as either a sendfilev() syscall or a pair of mmap() and write().  
> In either case, each chunk is read from the disk with a separate I/O 
> operation.  The result is that a single request for a 64-MB block ends up 
> hitting the disk as over a thousand smaller 64-KB requests.
>
> Since the DFSClient runs in a different JVM and process than the DataNode, 
> shuttling data from the disk to the DFSClient also results in context 
> switches each time network packets get sent (in this case, each 64-KB chunk 
> turns into a large number of 1500-byte packet send operations).  Thus we see 
> a large number of context switches for each block send operation.
>
> I'd like to get some feedback on the best way to address this, but I think 
> the answer is to provide a mechanism for a DFSClient to directly open data 
> blocks that happen to be on the same machine.  It could do this by examining 
> the set of 
> LocatedBlocks returned by the NameNode, marking those that should be resident 
> on the local host.  Since the DataNode and DFSClient (probably) share the 
> same hadoop configuration, the DFSClient should be able to find the files 
> holding the block data, and it could directly open them and send data back to 
> the client.  This would avoid the context switches imposed by the network 
> layer, and would allow for much larger read buffers than 64KB, which should 
> reduce the number of iops imposed by each read block operation.
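
For context, the chunked-send pattern described in the quoted report boils 
down to something like the sketch below (an illustration of the I/O shape 
only, not the actual BlockSender.sendChunk() code):

    import java.nio.channels.FileChannel;
    import java.nio.channels.SocketChannel;

    class ChunkedSendSketch {
      // A 64-MB block is pushed to the client as roughly a thousand separate
      // 64-KB transferTo() calls, each of which can become its own disk read
      // plus a burst of ~1500-byte packet sends and context switches.
      static void sendBlock(FileChannel blockFile, SocketChannel sock,
                            long blockLen) throws Exception {
        final int CHUNK = 64 * 1024;
        long pos = 0;
        while (pos < blockLen) {
          long n = Math.min(CHUNK, blockLen - pos);
          pos += blockFile.transferTo(pos, n, sock);
        }
      }
    }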

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
