[ https://issues.apache.org/jira/browse/HDFS-347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13495963#comment-13495963 ]

Todd Lipcon commented on HDFS-347:
----------------------------------

This is no longer insecure - it uses file descriptor passing over a unix socket 
so that the DN is the one arbitrating all access.
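
For reference, the fd-passing mechanism is just sendmsg() with an SCM_RIGHTS 
control message over the unix socket: the DN opens the block file itself (so it 
can enforce access checks) and hands only the open descriptor to the client. A 
minimal C sketch of that idea (illustrative only, not the actual code in the 
patch):
{code}
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* DN side: send an already-open block file descriptor to a connected
 * unix domain socket.  Illustrative sketch only. */
int send_block_fd(int client_sock, int block_fd) {
    char dummy = '*';
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };

    union {
        struct cmsghdr hdr;
        char buf[CMSG_SPACE(sizeof(int))];
    } ctrl;
    memset(&ctrl, 0, sizeof(ctrl));

    struct msghdr msg;
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);

    /* attach the fd as an SCM_RIGHTS ancillary message */
    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &block_fd, sizeof(int));

    return sendmsg(client_sock, &msg, 0) == 1 ? 0 : -1;
}
{code}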

I implemented a couple of the optimizations mentioned above (avoiding 
CallIntMethod and adding sendfile() support) and now the unix data path is a 
little bit faster than the TCP path:

{code}
over unix sockets:
todd@todd-w510:~/git/hadoop-common/hadoop-dist/target/hadoop-3.0.0-SNAPSHOT$ time ./bin/hadoop fs -Ddfs.datanode.domain.socket.path=/tmp/dn-sock -cat $(for x in $(seq 1 20) ; do echo /user/todd/1GB ; done) | wc -c

datanode utime: 2.02
datanode stime: 11.22
real    0m24.137s
user    0m12.530s
sys     0m16.270s

over TCP:
todd@todd-w510:~/git/hadoop-common/hadoop-dist/target/hadoop-3.0.0-SNAPSHOT$ time ./bin/hadoop fs -cat $(for x in $(seq 1 20) ; do echo /user/todd/1GB ; done) | wc -c
20971520000
datanode utime: 5.47
datanode stime: 6.52
real    0m26.473s
user    0m12.750s
sys     0m21.010s
{code}
One odd thing about the results above is that the DataNode's system time is 
lower for the TCP path than for the local-socket path. I'm guessing a little 
investigation there will make it clearer - perhaps a similar improvement to the 
writeBuffer code would yield a speedup as well.
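
For context, the sendfile() optimization mentioned above lets the kernel copy 
straight from the block file to the socket without bouncing the data through a 
userspace buffer in the JVM. Roughly (a sketch of the syscall usage, not the 
JNI code in the patch):
{code}
#include <sys/sendfile.h>
#include <sys/types.h>

/* Copy `count` bytes of the block file straight to the socket with no
 * userspace buffer.  Illustrative sketch only. */
ssize_t send_block_range(int sock_fd, int block_fd, off_t offset, size_t count) {
    size_t remaining = count;
    while (remaining > 0) {
        ssize_t n = sendfile(sock_fd, block_fd, &offset, remaining);
        if (n <= 0)
            return -1;            /* error or unexpected EOF */
        remaining -= (size_t)n;   /* sendfile advances `offset` itself */
    }
    return (ssize_t)count;
}
{code}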


Something seems to be wrong with the fd-passing (short-circuit) path in this 
patch, though. When I enabled it, I could tell from jstacks that it was 
"working" but I got really slow performance:
{code}
real    1m5.366s
user    0m35.710s
sys     0m37.700s
{code}

I couldn't tell from the code why BlockReaderLocal was substantially rewritten. 
I'd expect it to be pretty much identical after the point where the files are 
open. I'm guessing the rewrite is what killed performance here.
                
> DFS read performance suboptimal when client co-located on nodes with data
> -------------------------------------------------------------------------
>
>                 Key: HDFS-347
>                 URL: https://issues.apache.org/jira/browse/HDFS-347
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: data-node, hdfs client, performance
>            Reporter: George Porter
>            Assignee: Colin Patrick McCabe
>         Attachments: all.tsv, BlockReaderLocal1.txt, HADOOP-4801.1.patch, 
> HADOOP-4801.2.patch, HADOOP-4801.3.patch, HDFS-347-016_cleaned.patch, 
> HDFS-347.016.patch, HDFS-347.017.clean.patch, HDFS-347.017.patch, 
> HDFS-347.018.clean.patch, HDFS-347.018.patch2, HDFS-347.019.patch, 
> HDFS-347.020.patch, HDFS-347.021.patch, HDFS-347.022.patch, 
> HDFS-347-branch-20-append.txt, hdfs-347.png, hdfs-347.txt, local-reads-doc
>
>
> One of the major strategies Hadoop uses to get scalable data processing is to 
> move the code to the data.  However, putting the DFS client on the same 
> physical node as the data blocks it acts on doesn't improve read performance 
> as much as expected.
> After looking at Hadoop and O/S traces (via HADOOP-4049), I think the problem 
> is due to the HDFS streaming protocol causing many more read I/O operations 
> (iops) than necessary.  Consider the case of a DFSClient fetching a 64 MB 
> disk block from the DataNode process (running in a separate JVM) running on 
> the same machine.  The DataNode will satisfy the single disk block request by 
> sending data back to the HDFS client in 64-KB chunks.  In BlockSender.java, 
> this is done in the sendChunk() method, relying on Java's transferTo() 
> method.  Depending on the host O/S and JVM implementation, transferTo() is 
> implemented as either a sendfilev() syscall or a pair of mmap() and write().  
> In either case, each chunk is read from disk with a separate I/O operation.  
> The result is that a single request for a 64-MB block ends up hitting the disk 
> as over a thousand smaller requests of 64 KB each.
> Since the DFSClient runs in a different JVM and process than the DataNode, 
> shuttling data from the disk to the DFSClient also results in context switches 
> each time network packets get sent (in this case, each 64-KB chunk turns into 
> a large number of 1500-byte packet send operations).  Thus we see a large 
> number of context switches for each block send operation.
> I'd like to get some feedback on the best way to address this, but I think the 
> right approach is to provide a mechanism for the DFSClient to directly open 
> data blocks that happen to be on the same machine.  It could do this by 
> examining the set of LocatedBlocks returned by the NameNode, marking those 
> that should be resident on the local host.  Since the DataNode and DFSClient 
> (probably) share the same hadoop configuration, the DFSClient should be able 
> to find the files holding the block data, and it could open and read them 
> directly.  This would avoid the context switches imposed by the network layer, 
> and would allow for much larger read buffers than 64 KB, which should reduce 
> the number of iops imposed by each read block operation.
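> As a rough illustration only: the local fast path could be as simple as 
> resolving the local block file and reading it with a much larger buffer. The 
> path argument and buffer size below are made-up assumptions for the sketch, 
> not actual HDFS layout or configuration:
> {code}
> #include <fcntl.h>
> #include <stdlib.h>
> #include <unistd.h>
>
> #define READ_BUF_SIZE (4 * 1024 * 1024)    /* e.g. 4 MB instead of 64 KB */
>
> /* Hypothetical sketch: open the block file on the local filesystem and
>  * read it with a large buffer, instead of streaming 64-KB chunks from
>  * the DataNode over TCP. */
> long read_local_block(const char *block_path) {
>     int fd = open(block_path, O_RDONLY);
>     if (fd < 0)
>         return -1;
>
>     char *buf = malloc(READ_BUF_SIZE);
>     if (!buf) {
>         close(fd);
>         return -1;
>     }
>
>     long total = 0;
>     ssize_t n;
>     while ((n = read(fd, buf, READ_BUF_SIZE)) > 0)
>         total += n;              /* hand the data to the caller here */
>
>     free(buf);
>     close(fd);
>     return n < 0 ? -1 : total;
> }
> {code}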

