[ 
https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15187755#comment-15187755
 ] 

Colin Patrick McCabe commented on HDFS-9924:
--------------------------------------------

Currently the NameNode can handle between 10k and 100k operations per second, 
depending on configuration and the nature of the operations.  It seems like you 
should be able to comfortably dispatch that many operations from a few thousand 
client threads performing synchronous RPCs, bearing in mind that each 
operation will take a few milliseconds on average.  This is assuming that you 
want to consume all the available NN RPC bandwidth from a single client node.
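
As a rough sanity check on the thread count, Little's law (calls in flight = 
throughput x latency) can be applied to the figures above.  The sketch below is 
illustrative only: the 3 ms latency is an assumed stand-in for "a few 
milliseconds", and {{ThreadEstimate}} is a hypothetical helper, not part of HDFS.

```java
public class ThreadEstimate {
    // Little's law: concurrent calls in flight = throughput * latency.
    // With one blocking call per thread, this is also the number of threads needed.
    static long threadsNeeded(long opsPerSecond, double latencyMillis) {
        return (long) Math.ceil(opsPerSecond * latencyMillis / 1000.0);
    }

    public static void main(String[] args) {
        // Assumed latency of 3 ms per operation (not a measured value).
        System.out.println(threadsNeeded(100_000, 3.0)); // prints 300
        System.out.println(threadsNeeded(10_000, 3.0));  // prints 30
    }
}
```

Under these assumptions even the upper bound of 100k operations per second 
needs only a few hundred blocking calls in flight, so a few thousand threads 
leaves ample headroom.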

Perhaps I'm missing something, but I don't see how async operations will 
improve performance here.  The overhead of a few thousand threads on the client 
is small, and certainly not what is limiting HDFS performance.  Rather, 
performance is limited by considerations like locking on the NameNode, Java 
garbage collection on the NameNode, and serialization/deserialization 
overhead.

Please keep in mind that you don't need async operations to reuse connections 
and sockets... we do that already via mechanisms like the {{PeerCache}} 
(formerly {{SocketCache}}).  Clearly, Hive can also dispatch operations in 
parallel using standard mechanisms like an {{Executor}} or a thread pool.  I certainly 
don't object to implementing this, but if the goal is better performance, I 
think you are going to be disappointed.  Perhaps I have missed something, 
though... I'm curious if there are reasons for implementing this that I have 
not considered.
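
For reference, a minimal sketch of the thread-pool approach mentioned above, 
using only {{java.util.concurrent}}.  The {{ParallelDispatch}} class and the 
sleeping tasks are hypothetical stand-ins for blocking NameNode calls; this is 
not Hive or HDFS code, and the pool size is an arbitrary assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelDispatch {
    // Runs every blocking call on a fixed-size pool and returns the results
    // in the same order the calls were supplied.
    static <T> List<T> runAll(List<Callable<T>> calls, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // invokeAll blocks until all tasks have completed.
            List<Future<T>> futures = pool.invokeAll(calls);
            List<T> results = new ArrayList<>();
            for (Future<T> f : futures) {
                results.add(f.get()); // rethrows a task failure as ExecutionException
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<Callable<String>> calls = new ArrayList<>();
        for (int i = 0; i < 8; i++) {
            final int id = i;
            calls.add(() -> {
                Thread.sleep(5); // hypothetical stand-in for a blocking NameNode RPC
                return "status-" + id;
            });
        }
        System.out.println(runAll(calls, 4));
    }
}
```

The caller still gets {{Future}} objects back from the pool, so independent 
calls overlap without any change to the blocking client API.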

> [umbrella] Asynchronous HDFS Access
> -----------------------------------
>
>                 Key: HDFS-9924
>                 URL: https://issues.apache.org/jira/browse/HDFS-9924
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: fs
>            Reporter: Tsz Wo Nicholas Sze
>            Assignee: Xiaobing Zhou
>
> This is an umbrella JIRA for supporting Asynchronous HDFS Access.
> Currently, all the API methods are blocking calls -- the caller is blocked 
> until the method returns.  It is very slow if a client makes a large number 
> of independent calls in a single thread since each call has to wait until the 
> previous call is finished.  It is inefficient if a client needs to create a 
> large number of threads to invoke the calls.
> We propose adding a new API to support asynchronous calls, i.e. the caller is 
> not blocked.  The methods in the new API immediately return a Java Future 
> object.  The return value can be obtained by the usual Future.get() method.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
