[jira] [Commented] (HDFS-9924) [umbrella] Asynchronous HDFS Access

2016-03-08 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15186512#comment-15186512
 ] 

Chris Nauroth commented on HDFS-9924:
-

+1 for the proposal.  I expect this will be very helpful for workloads like 
Hive partition renames, which consist of a large set of rename operations, all 
independent of one another with no ordering dependencies between the 
operations.  There will be some tricky details to work out in the RPC client 
layer, but it's well worth the effort in my opinion.

> [umbrella] Asynchronous HDFS Access
> ---
>
> Key: HDFS-9924
> URL: https://issues.apache.org/jira/browse/HDFS-9924
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: fs
>Reporter: Tsz Wo Nicholas Sze
>Assignee: Xiaobing Zhou
>
> This is an umbrella JIRA for supporting Asynchronous HDFS Access.
> Currently, all the API methods are blocking calls -- the caller is blocked 
> until the method returns.  It is very slow if a client makes a large number 
> of independent calls in a single thread since each call has to wait until the 
> previous call is finished.  It is inefficient if a client needs to create a 
> large number of threads to invoke the calls.
> We propose adding a new API to support asynchronous calls, i.e. the caller is 
> not blocked.  The methods in the new API immediately return a Java Future 
> object.  The return value can be obtained by the usual Future.get() method.
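To make the proposal above concrete, here is a minimal sketch of the calling pattern it describes: issue many independent calls, then collect each result with Future.get().  The names AsyncRenameSketch and renameAsync are hypothetical and are not part of the HDFS client API, and the small executor only simulates server latency; it is not meant to suggest how the real RPC layer would work.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: none of these names come from the HDFS client API.
class AsyncRenameSketch {
  // Stand-in for the RPC layer, used here only to simulate server latency.
  private final ExecutorService rpc = Executors.newFixedThreadPool(4);

  // Returns immediately; the result is obtained later via Future.get().
  Future<Boolean> renameAsync(String src, String dst) {
    return rpc.submit(() -> {
      Thread.sleep(10); // simulated round trip to the server
      return !src.equals(dst);
    });
  }

  void close() {
    rpc.shutdown();
  }

  // Issues all renames first, then collects the results one by one.
  static int renameAll(int count) {
    AsyncRenameSketch fs = new AsyncRenameSketch();
    try {
      List<Future<Boolean>> pending = new ArrayList<>();
      for (int i = 0; i < count; i++) {
        pending.add(fs.renameAsync("/tmp/part-" + i, "/warehouse/part-" + i));
      }
      int succeeded = 0;
      for (Future<Boolean> f : pending) {
        try {
          if (f.get()) { // blocks only until this particular call completes
            succeeded++;
          }
        } catch (InterruptedException | ExecutionException e) {
          throw new IllegalStateException(e);
        }
      }
      return succeeded;
    } finally {
      fs.close();
    }
  }

  public static void main(String[] args) {
    System.out.println("renamed " + renameAll(100) + " paths");
  }
}
```

The caller thread never blocks while issuing calls; it blocks only when it asks for a result.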



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9924) [umbrella] Asynchronous HDFS Access

2016-03-09 Thread Bob Hansen (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15187193#comment-15187193
 ] 

Bob Hansen commented on HDFS-9924:
--

We are currently implementing a native C++ asynchronous API for HDFS under the 
HDFS-8707 umbrella.  While I don't see tremendous value in making them 
equivalent, perhaps there are some lessons learned that could translate.

Futures are a good match for the use case where the consumer wants to kick off a 
multitude of async requests and wait until they are all done to make progress, 
but we've found that there are also compelling use cases where you want a small 
amount of logic and further async I/O in a completion handler, so I would 
recommend supporting both Future-based results and callback-based 
results.
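The two styles can be compared in a short sketch.  It uses the JDK's CompletableFuture purely for illustration; deleteAsync is a hypothetical stand-in and not an HDFS or libhdfs++ call.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch; deleteAsync is not an HDFS method.
class CallbackVsFuture {
  // Simulated async call that completes on a common-pool thread.
  static CompletableFuture<Boolean> deleteAsync(String path) {
    return CompletableFuture.supplyAsync(() -> !path.isEmpty());
  }

  // Future style: start everything, then wait until all calls are done.
  static int waitForAll(String... paths) {
    CompletableFuture<?>[] calls = new CompletableFuture<?>[paths.length];
    for (int i = 0; i < paths.length; i++) {
      calls[i] = deleteAsync(paths[i]);
    }
    CompletableFuture.allOf(calls).join();
    return calls.length;
  }

  // Callback style: a completion handler runs a little logic per result.
  static int countWithCallbacks(String... paths) {
    AtomicInteger deleted = new AtomicInteger();
    CompletableFuture<?>[] calls = new CompletableFuture<?>[paths.length];
    for (int i = 0; i < paths.length; i++) {
      calls[i] = deleteAsync(paths[i]).thenAccept(ok -> {
        if (ok) {
          deleted.incrementAndGet();
        }
      });
    }
    CompletableFuture.allOf(calls).join();
    return deleted.get();
  }

  public static void main(String[] args) {
    System.out.println(waitForAll("/a", "/b"));
    System.out.println(countWithCallbacks("/a", "", "/c"));
  }
}
```

In the callback variant the handler runs as each call completes, so the caller does not have to poll or block per result.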



[jira] [Commented] (HDFS-9924) [umbrella] Asynchronous HDFS Access

2016-03-09 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15187273#comment-15187273
 ] 

Chris Nauroth commented on HDFS-9924:
-

[~bobhansen], thank you for the comment.  It's an interesting thought.  I was 
going to suggest providing a model of futures + promises as a further 
enhancement to be done later.  That would give applications an elegant model 
for chaining together a sequence of async operations that have specific 
ordering dependencies.

Our main motivation right now is the Hive use case I described in my last 
comment.  Futures alone are a good fit for that, hence the motivation to defer 
consideration of promises to a later enhancement, outside the scope of the 
current issue.



[jira] [Commented] (HDFS-9924) [umbrella] Asynchronous HDFS Access

2016-03-09 Thread Tsz Wo Nicholas Sze (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15187732#comment-15187732
 ] 

Tsz Wo Nicholas Sze commented on HDFS-9924:
---

Thanks [~bobhansen], supporting callbacks is a good idea.  We may allow users to 
register callbacks when they make an async call.  For example, we may support 
[ListenableFuture|http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/util/concurrent/ListenableFuture.html].
  Let me put it as a subtask.



[jira] [Commented] (HDFS-9924) [umbrella] Asynchronous HDFS Access

2016-03-09 Thread Bob Hansen (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15187742#comment-15187742
 ] 

Bob Hansen commented on HDFS-9924:
--

Dandy.  I wanted to capture the use case and leave it up to you fine, smart 
people to come up with a great solution.  I look forward to seeing your 
progress and stealing your best ideas.



[jira] [Commented] (HDFS-9924) [umbrella] Asynchronous HDFS Access

2016-03-09 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15187755#comment-15187755
 ] 

Colin Patrick McCabe commented on HDFS-9924:


Currently the NameNode can handle between 10k and 100k operations per second, 
depending on configuration and the nature of the operations.  It seems like you 
should be able to comfortably dispatch that many operations from a few thousand 
client threads performing synchronous RPC calls... bearing in mind that each 
operation will take a few milliseconds on average.  This is assuming that you 
want to consume all the available NN RPC bandwidth from a single client node.

Perhaps I'm missing something, but I don't see how async operations will 
improve performance here.  The overhead of a few thousand threads on the client 
is small, and certainly not what is limiting HDFS performance.  Rather, 
performance is limited by considerations like the locking on the NameNode, Java 
garbage collections on the NameNode, and serialization/deserialization 
overheads.

Please keep in mind that you don't need async operations to reuse connections 
and sockets... we do that already via mechanisms like the {{PeerCache}} 
(formerly {{SocketCache}}).  Clearly, Hive can also dispatch operations in 
parallel using standard mechanisms like an Executor or ThreadPool.  I certainly 
don't object to implementing this, but if the goal is better performance, I 
think you are going to be disappointed.  Perhaps I have missed something, 
though... I'm curious if there are reasons for implementing this that I have 
not considered.
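For reference, the thread-pool alternative mentioned above might look roughly like the following.  This is only a sketch: blockingRename is a hypothetical stand-in for a synchronous client call, and the pool size and sleep time are arbitrary assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch of dispatching synchronous calls from a thread pool.
class PooledSyncCalls {
  // Stand-in for a blocking client call such as a rename.
  static boolean blockingRename(String src, String dst) {
    try {
      Thread.sleep(5); // simulated round trip
      return true;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
  }

  static int renameInPool(int calls, int threads) {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      List<Callable<Boolean>> tasks = new ArrayList<>();
      for (int i = 0; i < calls; i++) {
        final int n = i;
        tasks.add(() -> blockingRename("/src/" + n, "/dst/" + n));
      }
      int ok = 0;
      // invokeAll() returns once every task has finished.
      for (Future<Boolean> f : pool.invokeAll(tasks)) {
        if (f.get()) {
          ok++;
        }
      }
      return ok;
    } catch (InterruptedException | ExecutionException e) {
      throw new IllegalStateException(e);
    } finally {
      pool.shutdown();
    }
  }

  public static void main(String[] args) {
    System.out.println(renameInPool(1000, 32) + " renames completed");
  }
}
```

Each pool thread spends most of its time blocked inside the call, which is the cost being weighed against an async API in this thread.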



[jira] [Commented] (HDFS-9924) [umbrella] Asynchronous HDFS Access

2016-03-09 Thread Tsz Wo Nicholas Sze (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15187886#comment-15187886
 ] 

Tsz Wo Nicholas Sze commented on HDFS-9924:
---

> ...  It seems like you should be able to comfortably dispatch that many 
> operations from a few thousand client threads ...

As mentioned in the description, it is inefficient if a client needs to create 
a large number of threads to invoke the calls.  Indeed, there is a limit on the 
number of threads in a JVM.  It wastes resources to create threads and use 
them only for waiting.



[jira] [Commented] (HDFS-9924) [umbrella] Asynchronous HDFS Access

2016-03-10 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15189497#comment-15189497
 ] 

Steve Loughran commented on HDFS-9924:
--

As I note in HADOOP-12910, someone is going to have to very clearly define 
concurrency here, at least with the same FS instance. If you can lay down that 
async operations happen in the order requested, then that's a stronger 
guarantee than having a pool of threads doing synchronous calls of runnables. 

If you can't make any guarantees about ordering of calls in a client, then I 
find it really hard to see how I would use this as a developer except for the 
odd cleanup operation like an async delete, or a series of mkdir calls I knew 
were independent.

As I also stated there, this sounds like a good chance to play with Java 8's 
new language features, like Bob's suggestion of calling a function afterwards. 
For example, if the async operation took a function ()->T, you could have a chain 
of operations:

{code}
fs.mkdirs("/a/b", rename("/c", "/a/b/c", delete("/d", () -> log("done!"))));
{code}

If something like that could be done, so that it'd be possible to write sequential 
code in a way that wasn't so complex as to be unusable or simply incorrect, 
then you could have some fun.
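One way such a chain could be expressed with Java 8 is via CompletableFuture.thenCompose, sketched below.  The mkdirs/rename/delete methods here are hypothetical and only record what they were asked to do; the ordering shown is enforced on the client side by the chain and says nothing about what a server would guarantee.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Hypothetical sketch: these operations only record themselves in a log.
class ChainedOps {
  private final List<String> log =
      Collections.synchronizedList(new ArrayList<>());

  // Each operation completes asynchronously on a common-pool thread.
  private CompletableFuture<Void> op(String description) {
    return CompletableFuture.runAsync(() -> log.add(description));
  }

  CompletableFuture<Void> mkdirs(String path) {
    return op("mkdirs " + path);
  }

  CompletableFuture<Void> rename(String src, String dst) {
    return op("rename " + src + " " + dst);
  }

  CompletableFuture<Void> delete(String path) {
    return op("delete " + path);
  }

  // Each step starts only after the previous one has completed.
  static List<String> runChain() {
    ChainedOps fs = new ChainedOps();
    fs.mkdirs("/a/b")
        .thenCompose(v -> fs.rename("/c", "/a/b/c"))
        .thenCompose(v -> fs.delete("/d"))
        .thenRun(() -> fs.log.add("done!"))
        .join();
    return new ArrayList<>(fs.log);
  }

  public static void main(String[] args) {
    runChain().forEach(System.out::println);
  }
}
```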



[jira] [Commented] (HDFS-9924) [umbrella] Asynchronous HDFS Access

2016-03-10 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15189626#comment-15189626
 ] 

Chris Nauroth commented on HDFS-9924:
-

bq. If you can't make any guarantees about ordering of calls in a client, then 
I find it really hard to see how I would use this as a developer except for the 
odd cleanup operation like an async delete, or a series of mkdir calls I knew 
were independent.

If you buy into my argument over on HADOOP-12910, then even without ordering 
guarantees, this is helpful for use cases like mass Hive partition renames.

bq. Example, if the async operation took a function ()->T, you could have a 
chain of operations...

This is like futures + promises that I mentioned earlier.  I had been thinking 
of it as a future (no pun intended) enhancement outside the scope of this JIRA, 
but perhaps it's not too much effort to lay down the right method signatures to 
support it upfront.  That's worth exploring.



[jira] [Commented] (HDFS-9924) [umbrella] Asynchronous HDFS Access

2016-03-10 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15189749#comment-15189749
 ] 

Colin Patrick McCabe commented on HDFS-9924:


Can you quantify what the performance improvements will be here for Hive?  What 
is the performance delta of an async API versus just making vanilla synchronous 
HDFS calls from a thread pool?



[jira] [Commented] (HDFS-9924) [umbrella] Asynchronous HDFS Access

2016-03-10 Thread Tsz Wo Nicholas Sze (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15189945#comment-15189945
 ] 

Tsz Wo Nicholas Sze commented on HDFS-9924:
---

> ... If you can lay down that async operations happen in the order requested, 
> then that's a stronger guarantee than having a pool of threads doing 
> synchronous calls of runnables.

As the first step, we won't provide an ordering guarantee, as [~cnauroth] 
mentioned.  It is a nice thing to have but it needs more work, in particular, 
server-side changes.  We will put down the ordering guarantee as future work.

> ... then I find it really hard to see how I would use this as a developer ...

As you pointed out, it is for independent operations such as parallel deletes 
and parallel (independent) renames.  The number of operations may be large 
(hundreds or thousands) so that the parallelism really helps.

On the other hand, I don't see a need to support running dependent operations 
in asynchronous mode since the number of operations is usually small.

> ... this sounds like a good chance to play with Java 8's new language 
> features ...

Personally, I would also like to play with Java 8's new features.  Unfortunately, 
that is future work and definitely outside the scope of the first step.



[jira] [Commented] (HDFS-9924) [umbrella] Asynchronous HDFS Access

2016-03-10 Thread Tsz Wo Nicholas Sze (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15190112#comment-15190112
 ] 

Tsz Wo Nicholas Sze commented on HDFS-9924:
---

> Can you quantify what the performance improvements will be here for Hive? 
> What is the performance delta of an async API versus just making vanilla 
> synchronous HDFS calls from a thread pool?

Compared with a single thread, the performance improvement is obvious.  
I expect operations taking a few hours in a single thread can be reduced to a 
few minutes using asynchronous calls.  Of course, I don't yet have asynchronous 
calls to test.

The main problem with vanilla synchronous HDFS calls from a thread pool is not 
performance -- it is the thread creation.  As mentioned in HADOOP-12909, the 
underlying RPC mechanism already supports asynchronous calls.   Currently, 
a synchronous call is implemented by invoking wait() in the caller thread in 
order to wait for the server response.  Now, if we use threads to emulate 
asynchronous calls, each thread will just be blocked by wait().  This probably 
is going to be taught in universities as a don't-create-threads-to-wait 
anti-pattern, I guess.  :)
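The wait()-based pattern described above can be illustrated with a simplified sketch.  This is not the Hadoop RPC code; PendingCall, complete, and awaitResponse are hypothetical names, and a short-lived thread stands in for the connection's receiver.

```java
// Simplified, hypothetical sketch of a blocking call layered on an async one.
class BlockingOverAsync {
  static final class PendingCall {
    private String response;
    private boolean done;

    // Invoked by the receiver thread when the server response arrives.
    synchronized void complete(String value) {
      response = value;
      done = true;
      notifyAll();
    }

    // Synchronous style: the caller thread parks here until complete() runs.
    synchronized String awaitResponse() throws InterruptedException {
      while (!done) {
        wait();
      }
      return response;
    }
  }

  static String callAndWait(String request) {
    PendingCall call = new PendingCall();
    // Simulated receiver thread delivering the server response.
    Thread receiver = new Thread(() -> call.complete("ack:" + request));
    receiver.start();
    try {
      return call.awaitResponse();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(callAndWait("rename"));
  }
}
```

An async API would hand the pending call (or a Future wrapping it) back to the caller instead of parking the caller thread in awaitResponse().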



[jira] [Commented] (HDFS-9924) [umbrella] Asynchronous HDFS Access

2016-03-10 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15190179#comment-15190179
 ] 

Steve Loughran commented on HDFS-9924:
--

So the use cases are one or more rename/delete operations to be fired at the 
NN, with async notifications as they complete? Other things: stat(), 
getBlockLocations, open(), aren't considered operations you would want to do? I 
concur, though a copy op would be nice to have. 

But do we need async calls, or could we do something like {{Future<> 
exec(List)}} with the NN given a list of actions for it to execute, 
ideally in order? That would be one RPC, and perhaps, if locks were held 
across ops, an efficient one. 

Was something like that considered?



[jira] [Commented] (HDFS-9924) [umbrella] Asynchronous HDFS Access

2016-03-10 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15190211#comment-15190211
 ] 

Colin Patrick McCabe commented on HDFS-9924:


bq. Comparing with a single thread, the performance improvement gains is 
obvious. I expect operations taking a few hours in single thread can be 
improved to a few minutes using asynchronous calls. Of course, I don't yet have 
asynchronous calls to test.

Thanks for replying.  I wasn't trying to compare making the calls from a single 
thread with making the calls asynchronously.  I was asking for a comparison 
between making the calls from a thread pool and making them asynchronously.

bq. The main problem of vanilla synchronous HDFS calls from a thread pool is 
not performance – it is the thread creations.

Since Hive is a long-running process, it doesn't need to create new threads all 
the time.  It can just have some long-running threads that it uses for this 
purpose.  There are even Guava Executor classes that make this easier by 
keeping threads around for a configurable length of time and expiring them if 
they're not used within N seconds.
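As an illustration of the keep-threads-around-then-expire idea, here is a sketch using the JDK's own ThreadPoolExecutor rather than the Guava classes mentioned above.  The class name ExpiringPool and the pool size and timeout values are arbitrary assumptions.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: a pool whose idle threads expire after a timeout.
class ExpiringPool {
  // idleSeconds must be greater than zero for allowCoreThreadTimeOut(true).
  static ThreadPoolExecutor create(int threads, long idleSeconds) {
    ThreadPoolExecutor pool = new ThreadPoolExecutor(
        threads, threads, idleSeconds, TimeUnit.SECONDS,
        new LinkedBlockingQueue<>());
    // Let even core threads exit once they have been idle for idleSeconds.
    pool.allowCoreThreadTimeOut(true);
    return pool;
  }

  // Runs the given number of trivial tasks and reports how many completed.
  static int runAndCount(int tasks) {
    ThreadPoolExecutor pool = create(4, 30);
    AtomicInteger done = new AtomicInteger();
    for (int i = 0; i < tasks; i++) {
      pool.execute(done::incrementAndGet);
    }
    pool.shutdown();
    try {
      pool.awaitTermination(10, TimeUnit.SECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return done.get();
  }

  public static void main(String[] args) {
    System.out.println(runAndCount(100) + " tasks completed");
  }
}
```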

Anyway, I can create 20,000 threads in about a minute in my simple test program:
{code}
import java.util.concurrent.CountDownLatch;

public class Thready {
  private static final int NUM_THREADS = 20000;

  // Counted down once by each thread as it runs.
  static final CountDownLatch latch = new CountDownLatch(NUM_THREADS);

  private static class NoOpRunnable implements Runnable {
    @Override
    public void run() {
      latch.countDown();
    }
  }

  public static void main(String[] args) {
    try {
      Thread[] threads = new Thread[NUM_THREADS];
      long startTimeNs = System.nanoTime();
      for (int i = 0; i < NUM_THREADS; i++) {
        threads[i] = new Thread(new NoOpRunnable());
        // start() launches a new thread; run() would execute in the caller.
        threads[i].start();
      }
      // Block until every thread has counted down.
      latch.await();
      long endTimeNs = System.nanoTime();
      long deltaNs = endTimeNs - startTimeNs;
      // 1 ms = 1,000,000 ns.
      System.out.println("Created " + NUM_THREADS +
          " threads in " + (deltaNs / 1000000) + " ms");
      for (int i = 0; i < NUM_THREADS; i++) {
        threads[i].join();
      }
    } catch (InterruptedException e) {
      throw new RuntimeException("unexpected InterruptedException", e);
    }
  }
}
{code}

If I can create 20,000 threads in a minute, it seems unlikely that a thread 
pool solution will take "hours" compared to "a few minutes" with an async API.

Like I said earlier, I don't object to an async API for HDFS.  In Kudu we made 
async calls available as part of the RPC system.  See 
https://github.com/cloudera/kudu/tree/master/src/kudu/rpc  But Kudu can handle 
a higher volume of RPCs per second than the NameNode so there's more motivation 
for avoiding the client-thread-per-request paradigm.  Improving the speed at 
which the client can shove requests at the NameNode does no good if the 
NameNode is already processing requests at maximum speed.  You will just end up 
queueing more requests on the NN side instead of the client-side.  Perhaps once 
federation is more widely deployed, an async API will start to make more sense.



[jira] [Commented] (HDFS-9924) [umbrella] Asynchronous HDFS Access

2016-05-03 Thread Andrew Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15269766#comment-15269766
 ] 

Andrew Wang commented on HDFS-9924:
---

Hi all, a few notes:

* Can we do this work on a branch? This is a big addition to the HDFS API, so I 
think it needs some broader buy-in from the community and validation before 
merging. Since performance is a stated goal, a performance evaluation seems 
like a merge requirement.
* Could someone post a design doc with the motivations, proposed API, and 
discussion? It'd help to go over the pros/cons of the different API options. 
ListenableFuture for instance has also been brought up. Reviewing some other 
async RPC interfaces for comparison would also be helpful. This design doc is 
also the place to discuss Colin's question about performance compared to a 
thread pool. If that option is available to us, it's preferable since it does 
not involve expanding the API.

Thanks!


-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-9924) [umbrella] Asynchronous HDFS Access

2016-05-09 Thread Tsz Wo Nicholas Sze (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15277047#comment-15277047
 ] 

Tsz Wo Nicholas Sze commented on HDFS-9924:
---

> Can we do this work on a branch? ...

I think we don't need a branch for the moment since this feature does not 
affect the other components.  This feature mainly adds new code and does not 
change the existing code much.

> Could someone post a design doc with the motivations, proposed API, and 
> discussion? ...

Motivations can be found in the description of this JIRA.  For the proposed API, 
it makes more sense to discuss it in HADOOP-12910.  We could post a design doc 
if it helps the discussion.

Thanks!




[jira] [Commented] (HDFS-9924) [umbrella] Asynchronous HDFS Access

2016-05-09 Thread Zhe Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15277091#comment-15277091
 ] 

Zhe Zhang commented on HDFS-9924:
-

Echoing Andrew that a design doc should be added. In particular, does the scope 
include read/write, or only metadata operations as shown in the current subtask 
list? Thanks.




[jira] [Commented] (HDFS-9924) [umbrella] Asynchronous HDFS Access

2016-05-10 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15278651#comment-15278651
 ] 

Colin Patrick McCabe commented on HDFS-9924:


Hi all,

I am +1 for doing this work, based on possible performance improvements we 
might see, and the need for a convenient asynchronous API for applications.

However, I am concerned that there has been no design doc posted, but already 
code committed to trunk.  I am -1 on committing anything more to trunk until we 
have a design document explaining how the API will work and what changes it 
will require in HDFS.  Just to clarify, I would be fine with committing code to a 
feature branch without a design document, since we can review it later prior to 
the merge.  However, it is concerning to see such a large feature proceed on 
trunk without either a branch or a design that the community can review.




[jira] [Commented] (HDFS-9924) [umbrella] Asynchronous HDFS Access

2016-05-10 Thread Tsz Wo Nicholas Sze (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15278717#comment-15278717
 ] 

Tsz Wo Nicholas Sze commented on HDFS-9924:
---

> Echoing Andrew that a design doc should be added. ...

Sure, I will post a design doc soon.  Thanks.




[jira] [Commented] (HDFS-9924) [umbrella] Asynchronous HDFS Access

2016-05-10 Thread Tsz Wo Nicholas Sze (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15278776#comment-15278776
 ] 

Tsz Wo Nicholas Sze commented on HDFS-9924:
---

> ... In particular, does the scope include read/write? Or only metadata 
> operations as shown in current subtask list? ...

The first step includes metadata operations such as rename, setPermission, 
etc.  It does not include read/write.




[jira] [Commented] (HDFS-9924) [umbrella] Asynchronous HDFS Access

2016-05-10 Thread Tsz Wo Nicholas Sze (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15278786#comment-15278786
 ] 

Tsz Wo Nicholas Sze commented on HDFS-9924:
---

> However, I am concerned that there has been no design doc posted, but already 
> code committed to trunk. I am -1 on committing anything more to trunk until 
> we have a design document explaining how the API will work and what changes 
> it will require in HDFS. ...

The main changes (except the FileSystem API) required in HDFS and Common are 
already done.  They are only small changes since, as mentioned previously, 
the underlying RPC mechanism already supports asynchronous calls.  You 
would easily see this if you read the code in more detail.

Anyway, I will post a design doc soon as mentioned before.




[jira] [Commented] (HDFS-9924) [umbrella] Asynchronous HDFS Access

2016-05-11 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15280617#comment-15280617
 ] 

Colin Patrick McCabe commented on HDFS-9924:


With regard to error handling, why not handle all errors as exceptions thrown 
from {{Future#get}}?  Handling some errors in a different way because they 
happened "earlier" (let's say, on the client side rather than server side) 
forces the client to put error checking code in two places.

Does the {{Future#get}} callback get made without holding any locks?  Can other 
asynchronous calls be made from this context?

{code}
public boolean rename(Path src, Path dst) throws IOException {
  if (isAsynchronousMode()) {
    return getFutureDistributedFileSystem().rename(src, dst).get();
  } else {
    ... // current implementation
  }
}
{code}
It seems concerning that we would have to make such a large change to the 
synchronous {{DistributedFileSystem}} code.  This would also result in more GC 
load since we'd be creating lots of {{Future}} objects.  Shouldn't it be 
possible to avoid this?  I do not think having some kind of global async bit is 
a good idea.

bq. In order to avoid client abusing the server by asynchronous calls. The RPC 
client should have a configurable limit in order to limit the outstanding 
asynchronous calls. The caller may be blocked if the number of outstanding 
calls hits the limit so that the caller is slowed down.

Blocking the client seems like it could be problematic for code which expects 
to be asynchronous.  There should be an option to throw an exception in this 
case.

I also think that we could maintain a queue of async calls that we have not 
submitted to the IPC layer yet, to avoid being limited by issues at the IPC 
layer.

bq. Support asynchronous FileContext (client API)

{{AsynchronousFileSystem}} is a separate API from {{FileSystem}}.  If there are 
issues with {{FileSystem}}, surely we can fix them in 
{{AsynchronousFileSystem}} rather than creating a fourth API?

bq. Use Java 8’s new language feature in the API (client API).

Given that Hadoop 3.x will probably be Java 8 (based on the mailing list 
discussion), why not just make the async API use jdk8's {{CompletableFuture}} 
from day 1, rather than hacking it in later?

> [umbrella] Asynchronous HDFS Access
> ---
>
> Key: HDFS-9924
> URL: https://issues.apache.org/jira/browse/HDFS-9924
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: fs
>Reporter: Tsz Wo Nicholas Sze
>Assignee: Xiaobing Zhou
> Attachments: AsyncHdfs20160510.pdf
>
>
> This is an umbrella JIRA for supporting Asynchronous HDFS Access.
> Currently, all the API methods are blocking calls -- the caller is blocked 
> until the method returns.  It is very slow if a client makes a large number 
> of independent calls in a single thread since each call has to wait until the 
> previous call is finished.  It is inefficient if a client needs to create a 
> large number of threads to invoke the calls.
> We propose adding a new API to support asynchronous calls, i.e. the caller is 
> not blocked.  The methods in the new API immediately return a Java Future 
> object.  The return value can be obtained by the usual Future.get() method.






[jira] [Commented] (HDFS-9924) [umbrella] Asynchronous HDFS Access

2016-05-11 Thread Tsz Wo Nicholas Sze (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15280686#comment-15280686
 ] 

Tsz Wo Nicholas Sze commented on HDFS-9924:
---

> With regard to error handling, why not handle all errors as exceptions thrown 
> from Future#get? ...

For some cases, such as network connection errors, if we do not throw an 
exception until Future#get, the client could submit a large number of calls and 
then catch a lot of exceptions in Future#get.  It is fail-fast if the client 
catches an exception in the first async call.
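
As an illustration of this distinction (a minimal sketch, not code from the 
patch; the class FailFastClient and its methods are hypothetical names), a 
client-side connection error can be thrown by the submitting call itself, while 
a server-side failure only surfaces later from Future#get:

```java
import java.io.IOException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.FutureTask;

/** Hypothetical client used only for illustration; not a Hadoop class. */
final class FailFastClient {
  private final boolean connected;

  FailFastClient(boolean connected) {
    this.connected = connected;
  }

  /**
   * Client-side errors are thrown here, at submission time (fail-fast).
   * Server-side errors are stored in the returned Future instead.
   */
  Future<Boolean> submit(final boolean serverWillFail) throws IOException {
    if (!connected) {
      throw new IOException("connection error detected at submission");
    }
    FutureTask<Boolean> task = new FutureTask<Boolean>(() -> {
      if (serverWillFail) {
        throw new IOException("server-side failure");
      }
      return Boolean.TRUE;
    });
    task.run(); // completed synchronously to keep the sketch deterministic
    return task;
  }

  /** Reports where the error surfaced: "submit", "get" or "none". */
  static String whereErrorSurfaces(boolean connected, boolean serverWillFail) {
    FailFastClient client = new FailFastClient(connected);
    Future<Boolean> future;
    try {
      future = client.submit(serverWillFail);
    } catch (IOException e) {
      return "submit";
    }
    try {
      future.get();
      return "none";
    } catch (ExecutionException e) {
      return "get";
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return "interrupted";
    }
  }

  public static void main(String[] args) {
    System.out.println(whereErrorSurfaces(false, false));
    System.out.println(whereErrorSurfaces(true, true));
    System.out.println(whereErrorSurfaces(true, false));
  }
}
```

With this split, a caller that submits many calls stops at the first 
submission that fails on the client side rather than collecting one exception 
per Future.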

> Does the Future#get callback get made without holding any locks? ...

Yes, it does not hold any locks.

> It seems concerning that we would have to make such a large change to the 
> synchronous DistributedFileSystem code. ...

I agree.

> Blocking the client seems like it could be problematic for code which expects 
> to be asynchronous. There should be an option to throw an exception in this 
> case.

Throwing an exception is indeed better.

> I also think that we could maintain a queue of async calls that we have not 
> submitted to the IPC layer yet, to avoid being limited by issues at the IPC 
> layer.

This is a good idea although it may not be easy to implement.  Will check that.

> Given that Hadoop 3.x will probably be Java 8 (based on the mailing list 
> discussion), why not just make the async API use jdk8's CompletableFuture 
> from day 1, rather than hacking it in later?

Because Java 7 does not support it.  We would like to have a larger audience 
for the new async feature.

Will revise the design doc.  Thanks for the comments.




[jira] [Commented] (HDFS-9924) [umbrella] Asynchronous HDFS Access

2016-05-11 Thread Xiaobing Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15280708#comment-15280708
 ] 

Xiaobing Zhou commented on HDFS-9924:
-

{quote}
 Blocking the client seems like it could be problematic for code which expects 
to be asynchronous. There should be an option to throw an exception in this 
case.
{quote}
It actually throws AsyncCallLimitExceededException, so the client can catch it 
and keep trying to send more requests.




[jira] [Commented] (HDFS-9924) [umbrella] Asynchronous HDFS Access

2016-05-11 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15280809#comment-15280809
 ] 

Colin Patrick McCabe commented on HDFS-9924:


bq. For some cases, such as network connection errors, if we do not throw an 
exception until Future#get, the client could submit a large number of calls and 
then catch a lot of exceptions in Future#get. It is fail-fast if the client 
catches an exception in the first async call.

That makes sense.  Thanks for the explanation.

bq. It actually throws AsyncCallLimitExceededException, so the client can catch 
it and keep trying to send more requests.

Hmm.  Is there a way for the client to wait until more async calls are 
available, without polling?
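
One way such waiting could look, sketched here under assumptions rather than 
taken from the RPC client: a counting semaphore holds one permit per allowed 
outstanding call, so a caller can either fail immediately or block until a 
slot is released. AsyncCallPermits and its method names are hypothetical.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

/** Hypothetical limiter for outstanding async calls; not a Hadoop class. */
final class AsyncCallPermits {
  private final Semaphore permits;

  AsyncCallPermits(int limit) {
    this.permits = new Semaphore(limit);
  }

  /** Non-blocking: false means the limit is reached (a caller could throw). */
  boolean tryBegin() {
    return permits.tryAcquire();
  }

  /** Blocks, without polling, until a slot frees up or the timeout expires. */
  boolean awaitBegin(long timeout, TimeUnit unit) throws InterruptedException {
    return permits.tryAcquire(timeout, unit);
  }

  /** Called when an outstanding call has completed. */
  void end() {
    permits.release();
  }

  /** Returns true if a waiter is admitted once another call completes. */
  static boolean demo() {
    AsyncCallPermits permits = new AsyncCallPermits(1);
    if (!permits.tryBegin()) {
      return false; // the first call must fit under the limit
    }
    if (permits.tryBegin()) {
      return false; // the second call must be rejected
    }
    Thread completer = new Thread(permits::end); // simulates a call finishing
    completer.start();
    try {
      boolean admitted = permits.awaitBegin(5, TimeUnit.SECONDS);
      completer.join();
      return admitted;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
  }

  public static void main(String[] args) {
    System.out.println("waiter admitted: " + demo());
  }
}
```

The same object supports both behaviours discussed here: tryBegin for callers 
that prefer an exception, awaitBegin for callers that prefer to wait.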




[jira] [Commented] (HDFS-9924) [umbrella] Asynchronous HDFS Access

2016-05-11 Thread Xiaobing Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15280839#comment-15280839
 ] 

Xiaobing Zhou commented on HDFS-9924:
-

{quote}
Hmm. Is there a way for the client to wait until more async calls are 
available, without polling?
{quote}
Inside the polling loop, the client can block, e.g.
{code}
for (;;) {
  try {
    Future returnFuture = adfs.rename(src, dst, Rename.OVERWRITE);
    returnFutures.put(i, returnFuture);
    break;
  } catch (AsyncCallLimitExceededException e) {
    /**
     * reached limit of async calls, fetch results of finished async
     * calls to let follow-on calls go
     */
    start = end;
    end = i;
    // call Future#get to read results of complete async calls
    waitForReturnValues(returnFutures, start, end);
  }
}
{code}




[jira] [Commented] (HDFS-9924) [umbrella] Asynchronous HDFS Access

2016-05-11 Thread Xiaobing Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15280844#comment-15280844
 ] 

Xiaobing Zhou commented on HDFS-9924:
-

Calling Future#get to read the results of completed async calls decreases the 
limit counter in the RPC layer, so that clients can submit more async requests 
in the outer loop.
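
The interaction between the limit counter and Future#get can be simulated in 
isolation. The sketch below is an assumption-laden stand-in, not the RPC 
client: LimitExceededException plays the role of 
AsyncCallLimitExceededException, and Handle#get plays the role of Future#get 
by releasing one slot the first time it is read.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical simulation of the submit/drain loop; not Hadoop code. */
final class LimitedAsyncDemo {
  /** Stand-in for AsyncCallLimitExceededException. */
  static final class LimitExceededException extends RuntimeException {
    private static final long serialVersionUID = 1L;
  }

  /** Stand-in for a Future: the first get() frees one outstanding slot. */
  static final class Handle {
    private final AtomicInteger outstanding;
    private final int value;
    private boolean consumed;

    Handle(AtomicInteger outstanding, int value) {
      this.outstanding = outstanding;
      this.value = value;
    }

    int get() {
      if (!consumed) {
        consumed = true;
        outstanding.decrementAndGet();
      }
      return value;
    }
  }

  /** Stand-in client that rejects calls once the limit is reached. */
  static final class Client {
    private final int limit;
    private final AtomicInteger outstanding = new AtomicInteger();

    Client(int limit) {
      if (limit < 1) {
        throw new IllegalArgumentException("limit must be positive");
      }
      this.limit = limit;
    }

    Handle call(int value) {
      if (outstanding.get() >= limit) {
        throw new LimitExceededException();
      }
      outstanding.incrementAndGet();
      return new Handle(outstanding, value);
    }
  }

  /** Submits total calls; returns {sum of results, times the limit was hit}. */
  static int[] run(int total, int limit) {
    Client client = new Client(limit);
    List<Handle> handles = new ArrayList<>();
    int start = 0; // index of the first handle whose result is unread
    int limitHits = 0;
    int sum = 0;
    for (int i = 0; i < total; i++) {
      for (;;) {
        try {
          handles.add(client.call(i));
          break;
        } catch (LimitExceededException e) {
          limitHits++;
          // Read pending results; each read lets one more call through.
          for (; start < handles.size(); start++) {
            sum += handles.get(start).get();
          }
        }
      }
    }
    for (; start < handles.size(); start++) {
      sum += handles.get(start).get();
    }
    return new int[] {sum, limitHits};
  }

  public static void main(String[] args) {
    int[] result = run(10, 3);
    System.out.println("sum=" + result[0] + " limitHits=" + result[1]);
  }
}
```

In this simulation every call completes instantly, so draining never waits; 
with real RPCs the drain step is where the client would block.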




[jira] [Commented] (HDFS-9924) [umbrella] Asynchronous HDFS Access

2016-05-16 Thread Ming Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15285597#comment-15285597
 ] 

Ming Ma commented on HDFS-9924:
---

Some general comments: When MR ran into similar performance issues, it either 
revised how it uses HDFS as in MAPREDUCE-6336, or it used multiple threads as 
in MAPREDUCE-2349. The multiple threads approach might work well for some 
scenarios, but might not be desirable if it is launched inside a YARN container 
where there could be other containers on the same machine. On that note, it 
might be useful to provide async support for listStatus as well to simplify 
MAPREDUCE-2349.




[jira] [Commented] (HDFS-9924) [umbrella] Asynchronous HDFS Access

2016-05-16 Thread Xiaobing Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15285655#comment-15285655
 ] 

Xiaobing Zhou commented on HDFS-9924:
-

I filed HDFS-10413 for this. Thank you, [~mingma].




[jira] [Commented] (HDFS-9924) [umbrella] Asynchronous HDFS Access

2016-05-18 Thread Andrew Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15289952#comment-15289952
 ] 

Andrew Wang commented on HDFS-9924:
---

In my earlier comment, I asked for the following content to be covered in the 
document:

bq. It'd help to go over the pros/cons of the different API options. 
ListenableFuture for instance has also been brought up. Reviewing some other 
async RPC interfaces for comparison would also be helpful. This design doc is 
also the place to discuss Colin's question about performance compared to a 
thread pool. If that option is available to us, it's preferable since it does 
not involve expanding the API.

Neither of the two things I asked for is present in this design doc.

On the topic of API, I don't want to pigeonhole us with the Java Future API. As 
others have mentioned here, it doesn't allow for callback chaining, which makes 
it much less useful for async-style programming. Other popular async APIs use 
StumbleUpon's Deferred to support callbacks (HBase, Kudu). If we were to use 
Deferred, we should shade it so it doesn't lead to any classpath issues.

Another sensible choice would be CompletableFuture from JDK8. This means 
AsyncFileSystem would be 3.x-only, but considering we're actively trying to 
release 3.x, it's not a bad release vehicle. This also would give downstreams 
time to try it out and give feedback.
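
To make the callback-chaining point concrete, here is a minimal sketch using 
only JDK 8 classes. The renameAsync method is a hypothetical stand-in for an 
async file system call, not a proposed signature; it reports success whenever 
source and destination differ.

```java
import java.util.concurrent.CompletableFuture;

/** Hypothetical illustration of chaining; not a proposed HDFS API. */
final class ChainingSketch {
  /** Stand-in async rename: completes with true when src and dst differ. */
  static CompletableFuture<Boolean> renameAsync(String src, String dst) {
    return CompletableFuture.supplyAsync(() -> !src.equals(dst));
  }

  /** Chains a follow-up step and an error handler with no blocking between. */
  static CompletableFuture<String> renameThenReport(String src, String dst) {
    return renameAsync(src, dst)
        .thenApply(ok -> ok ? "renamed " + src + " to " + dst : "rename refused")
        .exceptionally(t -> "failed: " + t.getMessage());
  }

  /** Blocks only at the very end, to hand a plain value to the caller. */
  static String demo(String src, String dst) {
    return renameThenReport(src, dst).join();
  }

  public static void main(String[] args) {
    System.out.println(demo("/a", "/b"));
    System.out.println(demo("/a", "/a"));
  }
}
```

With a plain java.util.concurrent.Future, the follow-up step would need a 
blocking get() or a dedicated thread before it could run.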

On the topic of performance, can you please provide benchmarks?

Also, these patches are already landing in branch-2 and trunk while there are 
outstanding API questions. Can you please revert and move them to a branch? I'm 
-1 on releasing this in 2.8 since that locks the API.




[jira] [Commented] (HDFS-9924) [umbrella] Asynchronous HDFS Access

2016-05-18 Thread Tsz Wo Nicholas Sze (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15290182#comment-15290182
 ] 

Tsz Wo Nicholas Sze commented on HDFS-9924:
---

> Also, these patches are already landing in branch-2 and trunk while there are 
> outstanding API questions. Can you please revert and move them to a branch? 
> I'm -1 on releasing this in 2.8 since that locks the API.

The committed patches only provide an internal @Unstable API, so they do not 
lock the API.  No?




[jira] [Commented] (HDFS-9924) [umbrella] Asynchronous HDFS Access

2016-05-18 Thread Ming Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15290292#comment-15290292
 ] 

Ming Ma commented on HDFS-9924:
---

It seems the thread-pool based solution is a layer on top of the FileSystem 
abstraction, so it is general to all FileSystems.  BTW, is the API design 
somewhat independent of whether we use a thread pool or an async RPC client?
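
For readers following the thread, here is a minimal sketch of what a 
thread-pool layer on top of a blocking file system abstraction could look 
like.  BlockingFs, ThreadPoolAsyncFs and renameAll are hypothetical names 
invented for illustration; none of them is a Hadoop class, and this is not 
the code in the committed patches.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical stand-in for a blocking file system API; not a Hadoop interface. */
interface BlockingFs {
  boolean rename(String src, String dst);
}

/** Wraps any BlockingFs so that rename() returns a Future immediately. */
class ThreadPoolAsyncFs {
  private final BlockingFs fs;
  private final ExecutorService pool;

  ThreadPoolAsyncFs(BlockingFs fs, int threads) {
    this.fs = fs;
    this.pool = Executors.newFixedThreadPool(threads);
  }

  Future<Boolean> rename(String src, String dst) {
    // The blocking call still occupies one pool thread until it returns.
    return pool.submit(() -> fs.rename(src, dst));
  }

  void shutdown() {
    pool.shutdown();
  }

  /** Issues n independent renames, then waits for all; returns how many succeeded. */
  static int renameAll(BlockingFs fs, int n) {
    ThreadPoolAsyncFs async = new ThreadPoolAsyncFs(fs, 4);
    try {
      List<Future<Boolean>> pending = new ArrayList<>();
      for (int i = 0; i < n; i++) {
        pending.add(async.rename("/src/part-" + i, "/dst/part-" + i));
      }
      int ok = 0;
      for (Future<Boolean> f : pending) {
        if (f.get()) {
          ok++;
        }
      }
      return ok;
    } catch (InterruptedException | ExecutionException e) {
      throw new IllegalStateException(e);
    } finally {
      async.shutdown();
    }
  }

  public static void main(String[] args) {
    BlockingFs fake = (src, dst) -> true; // in-memory fake that always succeeds
    System.out.println(renameAll(fake, 10)); // prints 10
  }
}
```

Because the wrapper only needs the blocking interface, it would work for any 
implementation of it.  The cost is one pool thread per outstanding call, 
which is the overhead an async RPC client is meant to avoid.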




[jira] [Commented] (HDFS-9924) [umbrella] Asynchronous HDFS Access

2016-05-18 Thread Xiaobing Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15290464#comment-15290464
 ] 

Xiaobing Zhou commented on HDFS-9924:
-

Thank you [~mingma].
{quote} BTW, is the API design somewhat independent of whether we use thread 
pool or async RPC client?
{quote}
No, the async DFS API goes through the async RPC path.  As you said, the thread 
pool operates at the API level rather than the RPC level.  Of course, a thread 
pool can also be used to schedule async DFS calls if that brings benefits.

[~andrew.wang] I will post performance numbers soon. Thanks.




[jira] [Commented] (HDFS-9924) [umbrella] Asynchronous HDFS Access

2016-05-19 Thread Xiaobing Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15291526#comment-15291526
 ] 

Xiaobing Zhou commented on HDFS-9924:
-

To clarify, my 'No' meant that sync and async calls actually go through 
different RPC paths.  To your question, the answer is of course yes.  [~mingma]




[jira] [Commented] (HDFS-9924) [umbrella] Asynchronous HDFS Access

2016-05-20 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15293809#comment-15293809
 ] 

Colin Patrick McCabe commented on HDFS-9924:


I have to agree with [~andrew.wang] that it makes more sense to put these 
changes in trunk than in branch-2.  The Hadoop 2.8 release has been blocked for 
a very, very long time.  There are tons of features in branch-2 that have been 
waiting for almost a year to be released.  Adding yet another feature to 
branch-2, when we're so far behind on releases, doesn't make sense.

Programmers who are familiar with Node.js will want something that supports 
callback chaining, like CompletableFuture, rather than something like the 
old-style Future API.  If we target this at branch-3, we can use the jdk8 
CompletableFuture.
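
To make the difference concrete, here is a small sketch contrasting the two 
styles using only java.util.concurrent.  FutureStyles and fakeRename are 
hypothetical names for illustration; fakeRename just builds a string and 
stands in for a remote call.  This is not the proposed HDFS API.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class FutureStyles {
  /** Hypothetical stand-in for a remote rename; it only builds a string. */
  static String fakeRename(String src, String dst) {
    return src + "->" + dst;
  }

  /** Old-style Future: the caller must block in get() before running the next step. */
  static String withFuture(ExecutorService pool) {
    Future<String> f = pool.submit(() -> fakeRename("/a", "/b"));
    try {
      return "done:" + f.get();
    } catch (InterruptedException | ExecutionException e) {
      throw new IllegalStateException(e);
    }
  }

  /** CompletableFuture: the next step is chained as a callback and no caller thread waits. */
  static CompletableFuture<String> withCompletableFuture(ExecutorService pool) {
    return CompletableFuture
        .supplyAsync(() -> fakeRename("/a", "/b"), pool)
        .thenApply(result -> "done:" + result);
  }

  public static void main(String[] args) {
    ExecutorService pool = Executors.newSingleThreadExecutor();
    try {
      System.out.println(withFuture(pool));                   // done:/a->/b
      System.out.println(withCompletableFuture(pool).join()); // done:/a->/b
    } finally {
      pool.shutdown();
    }
  }
}
```

Both variants produce the same value.  The difference is where the follow-up 
step runs: after a blocking get() in the first case, and inside a chained 
callback in the second.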

If we are going to backport this to branch-2, we should do it once the feature 
is done, rather than backporting bits and pieces as we go.  This is especially 
true when there are still open questions about the API.




[jira] [Commented] (HDFS-9924) [umbrella] Asynchronous HDFS Access

2016-05-20 Thread Tsz Wo Nicholas Sze (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15293942#comment-15293942
 ] 

Tsz Wo Nicholas Sze commented on HDFS-9924:
---

> ... The Hadoop 2.8 release has been blocked for a very, very long time. ...

I guess you might have misunderstood the release process.  The release manager 
could include/exclude any feature as she/he pleases.

> Programmers who are familiar with Node.js will want something that supports 
> callback chaining, 

Good point!  On the other hand, programmers who are NOT familiar with Node.js 
may NOT want something that supports callback chaining.

Also, you might not have noticed, supporting Future is a step toward supporting 
CompletableFuture.

> If we are going to backport this to branch-2, we should do it once the 
> feature is done, rather than backporting bits and pieces as we go. ...

I disagree.  Async HDFS is a collection of async methods.  We do not have to 
support all the methods on day one.




[jira] [Commented] (HDFS-9924) [umbrella] Asynchronous HDFS Access

2016-05-20 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15294073#comment-15294073
 ] 

Colin Patrick McCabe commented on HDFS-9924:


bq. Good point! On the other hand, programmers who are NOT familiar with 
Node.js may NOT want something that supports callback chaining.

Callback chaining is an optional feature which nobody is forced to use.  I 
don't see why anyone would prefer Future over CompletableFuture.

bq. Also, you might not have noticed, supporting Future is a step toward 
supporting CompletableFuture.

I don't see why supporting Future is a step towards supporting a different API. 
 I think Hadoop has too many APIs with duplicate functionality already, and we 
should try to minimize the cognitive load on new developers.

bq. I guess you might have misunderstood the release process. The release 
manager could include/exclude any feature as she/he pleases.

Which 2.x release do you want this to become part of?




[jira] [Commented] (HDFS-9924) [umbrella] Asynchronous HDFS Access

2016-05-24 Thread Andrew Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15299161#comment-15299161
 ] 

Andrew Wang commented on HDFS-9924:
---

I'm still not convinced enough to change my -1 on Future in 2.8.

Even if what's currently committed is marked Unstable, I don't want to rush 
ahead with an API we know is insufficient for async-style programming. Earlier 
in this JIRA's comments, others were asking about ListenableFuture for the same 
reasons. It's not fair to push the burden of supporting multiple APIs onto our 
downstreams when we have a few possible solutions close at hand:

* Use Deferred, which HBase and Kudu adopted due to the lack of 
CompletableFuture in JDK7. ListenableFuture might be good too.
* Target this for 3.0 and use CompletableFuture. We're actively working on 3.0, 
and the first 3.0.0 alpha is likely coming out around the same time as 2.8.0.




[jira] [Commented] (HDFS-9924) [umbrella] Asynchronous HDFS Access

2016-05-24 Thread Tsz Wo Nicholas Sze (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15299174#comment-15299174
 ] 

Tsz Wo Nicholas Sze commented on HDFS-9924:
---

> ... . It's not fair to push the burden of supporting multiple APIs onto our 
> downstreams, ...

We are not going to support multiple APIs.  Once we have decided on the async 
API, the unstable API can be removed.  That is the meaning of "unstable".

The downstream developers are intelligent people.  They can decide whether they 
want to use the unstable API.  It would be even more unfair to delay providing 
any async API to them.  No?

[~andrew.wang], is it your intention to slow down the async hdfs development?  
I hope not.




[jira] [Commented] (HDFS-9924) [umbrella] Asynchronous HDFS Access

2016-05-24 Thread Andrew Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15299220#comment-15299220
 ] 

Andrew Wang commented on HDFS-9924:
---

Nicholas, I proposed two solutions above, neither of which you have commented 
on. Have you looked into Deferred and CompletableFuture? This is also why I 
asked for a review of other async APIs in the design doc.




[jira] [Commented] (HDFS-9924) [umbrella] Asynchronous HDFS Access

2016-05-25 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15300256#comment-15300256
 ] 

Daryn Sharp commented on HDFS-9924:
---

I'm late to the game due to time constraints, but this feature greatly concerns 
me.

It's true the NN can handle over 100k ops/sec but only with a read-dominated 
workload.  Even then, I've had to do _a lot_ of internal (hopefully soon to be 
published) performance work to prevent blowing the heap under such a sustained 
load - a recent user pushed a NN to 90k ops/sec for most of a weekend and 
barely dented the heap.  BUT it was 81% read ops.  In the past that would have 
been an 8-10 min GC.  I digress.

More on point: The intended use case is for mass write operations.  Consider 
this: on multiple large clusters, offloading just a few thousand write ops/sec 
for log aggregation reduced 95th percentile processing time from 4ms to <.5ms 
and queue time from 20ms to 4ms.  The extremely wild variance in the metrics 
also stabilized.

I've already been having performance concerns with hive's mass 
setOwner/setPermission which I believe is single-threaded.  This feature 
appears intended for hive.  I'm really hesitant about a feature that makes it 
trivial to destroy a NN.
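
As an illustration of the client-side half of this concern, here is a sketch 
of how a caller could cap the number of outstanding async operations with a 
semaphore.  BoundedAsyncClient, run and fakeOp are hypothetical names, fakeOp 
just sleeps for 1 ms in place of an RPC, and nothing here is part of the 
proposed API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

class BoundedAsyncClient {
  /** Runs n fake ops with at most maxInFlight outstanding; returns the peak observed. */
  static int run(int n, int maxInFlight) {
    ExecutorService pool = Executors.newFixedThreadPool(8);
    Semaphore permits = new Semaphore(maxInFlight);
    AtomicInteger inFlight = new AtomicInteger();
    AtomicInteger peak = new AtomicInteger();
    List<CompletableFuture<Void>> all = new ArrayList<>();
    try {
      for (int i = 0; i < n; i++) {
        // Blocks the submitting thread once maxInFlight ops are outstanding.
        permits.acquireUninterruptibly();
        int now = inFlight.incrementAndGet();
        peak.accumulateAndGet(now, Math::max);
        all.add(CompletableFuture
            .runAsync(BoundedAsyncClient::fakeOp, pool)
            .whenComplete((v, err) -> {
              // Decrement before releasing so the counter never exceeds the cap.
              inFlight.decrementAndGet();
              permits.release();
            }));
      }
      CompletableFuture.allOf(all.toArray(new CompletableFuture[0])).join();
      return peak.get();
    } finally {
      pool.shutdown();
    }
  }

  /** Hypothetical stand-in for one RPC: sleeps for 1 ms. */
  static void fakeOp() {
    try {
      Thread.sleep(1);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }

  public static void main(String[] args) {
    System.out.println(run(50, 4) <= 4); // prints true
  }
}
```

A cap like this only limits a single well-behaved client.  It does nothing 
about the aggregate load from many clients, so it does not replace 
protection on the NN side.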




[jira] [Commented] (HDFS-9924) [umbrella] Asynchronous HDFS Access

2016-05-25 Thread Tsz Wo Nicholas Sze (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15301129#comment-15301129
 ] 

Tsz Wo Nicholas Sze commented on HDFS-9924:
---

> Nicholas, I proposed two solutions above, neither of which you have commented 
> on ...

As mentioned previously, please have the API discussion in HADOOP-12910.  
Thanks.




[jira] [Commented] (HDFS-9924) [umbrella] Asynchronous HDFS Access

2016-05-25 Thread Tsz Wo Nicholas Sze (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15301142#comment-15301142
 ] 

Tsz Wo Nicholas Sze commented on HDFS-9924:
---

> ...  I'm really hesitant for a feature that makes it trivial to destroy a NN.

I understand your concern but it is a different problem.  We should not protect 
NN by making the client slow.  We should add protection in NN instead.  For 
example, we recently implemented RPC scheduler/callqueue backoff using response 
times (HADOOP-12916).




[jira] [Commented] (HDFS-9924) [umbrella] Asynchronous HDFS Access

2016-05-27 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15304703#comment-15304703
 ] 

Daryn Sharp commented on HDFS-9924:
---

I agree that we can't protect the NN with a "slow" client.  If hive wasn't 
mentioned, I probably would not have thought to be concerned.  Hive's 
unnecessarily heavy write op load has already placed it on my naughty list 
internally.  This well-intentioned feature is enabling hive to place much more 
stress on the NN.

I glanced at HADOOP-12916 but I don't think it's sufficient; actually I think 
it will have the opposite effect - hopefully I misinterpreted the design.  
Priority level is based on incoming call rate.  The client is told to back off 
if the NN is being "too slow" for that priority.  The problem I see is:
* User1 has a write op heavy load
* User2, perhaps running a wide job, is doing 20X the ops of User1, but is read 
heavy
* User2 is dropped into a lower priority bucket based on call rate
* User1 drives up the avg processing time with costly writes
* User2 is rapidly forced to back off, primarily because of an accounting flaw: 
every write op is cumulatively billed to all subsequent queued calls 
(processing time includes lock wait from prior calls), artificially driving up 
the average
* User2 retries and is double billed on call rate, since a user is charged even 
if forced to back off
* User2 falls into an even lower bucket, enabling User1 to further drive up 
processing time
* Finally... User3 & User4 were running long getContentSummary calls, which 
acquire and release the lock.  The prioritized writes made those summary calls 
appear to take a minute.  That explodes the average.  Users are told to back 
off from mostly empty queues.
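
The accounting effect in the list above can be shown with a toy calculation.  
The sketch below assumes a deliberately simplified model that is not taken 
from the NN code: all calls are already waiting, they take a single lock one 
after another, and each call's measured processing time includes the time 
spent waiting for the calls ahead of it.  LockWaitBilling and the 100 ms / 
1 ms figures are invented for illustration.

```java
class LockWaitBilling {
  /** Mean of the lock hold times alone, in ms. */
  static double meanHold(long[] holdMs) {
    long total = 0;
    for (long h : holdMs) {
      total += h;
    }
    return (double) total / holdMs.length;
  }

  /** Mean when each call is also billed for the hold times of all calls ahead of it. */
  static double meanBilled(long[] holdMs) {
    long elapsed = 0; // time at which the current call releases the lock
    long total = 0;
    for (long h : holdMs) {
      elapsed += h;     // lock wait from prior calls plus this call's own hold time
      total += elapsed; // what this call is billed under the assumed model
    }
    return (double) total / holdMs.length;
  }

  public static void main(String[] args) {
    // One costly write (100 ms) followed by nine cheap reads (1 ms each).
    long[] holds = {100, 1, 1, 1, 1, 1, 1, 1, 1, 1};
    System.out.println(meanHold(holds));   // 10.9
    System.out.println(meanBilled(holds)); // 104.5
  }
}
```

Under these assumptions a single 100 ms write raises the measured average 
from 10.9 ms to 104.5 ms, even though nine of the ten calls each held the 
lock for only 1 ms.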




[jira] [Commented] (HDFS-9924) [umbrella] Asynchronous HDFS Access

2016-06-01 Thread Xiaoyu Yao (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15310837#comment-15310837
 ] 

Xiaoyu Yao commented on HDFS-9924:
--

[~daryn], thanks for the valuable feedback. @Kihwal Lee also mentioned a 
similar issue 
[here|https://issues.apache.org/jira/browse/HADOOP-12916?focusedCommentId=15277342&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15277342].
 But I wasn't able to get clarification on it. The FSN/FSD locking issue is a 
very good point. I tried to find some metrics/logs about it but there were 
none. I will open a separate ticket to add more metrics and WARN/DEBUG logs for 
long locking operations on the namenode, similar to what we have for slow 
write/network WARN/metrics on the datanode.  

As you mentioned above, the priority level is assigned by the scheduler. As 
part of HADOOP-12916, we separated the scheduler from the call queue and made 
it pluggable so that priority assignment can be customized as appropriate for 
different workloads. For the example of a mixed write-intensive and read 
workload, I agree that the DecayRpcScheduler, which uses call rate to determine 
priority, may not be a good choice. We have thought of adding a different 
scheduler that combines the weight of an RPC call with its rate. But it is 
tricky to assign weights. For example, getContentSummary on a directory with 
millions of files/dirs and on a directory with a few files/dirs won't have the 
same impact on the NN. 

Backoff based on response time allows all users to stop overloading the 
namenode when the high priority RPC calls experience longer than normal 
end-to-end delay. User2/User3/User4 (low priority based on call rate) will have 
a much wider response time threshold for backing off. In this case, User1 will 
be backed off first, because it breaks the relatively smaller response time 
threshold, and that gets the namenode out of the state in which other users 
cannot use it "fairly". 
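
One possible reading of the paragraph above, written as a sketch.  It assumes 
that priority 0 is the highest, that each priority level has its own 
response time threshold, and that a call backs off when any level at or above 
its own priority is over its threshold.  ResponseTimeBackoff, shouldBackOff 
and all the numbers are hypothetical and are not copied from HADOOP-12916.

```java
class ResponseTimeBackoff {
  /**
   * Returns true if a call at the given priority should back off.
   * Assumption: index 0 is the highest priority, and lower priorities
   * have wider (larger) thresholds.
   */
  static boolean shouldBackOff(int priority, double[] avgResponseMs, double[] thresholdMs) {
    for (int level = 0; level <= priority; level++) {
      if (avgResponseMs[level] > thresholdMs[level]) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    double[] thresholdMs = {10, 20, 40}; // invented thresholds, wider for lower priority
    double[] avgMs = {5, 25, 30};        // invented averages; level 1 is over its threshold
    System.out.println(shouldBackOff(0, avgMs, thresholdMs)); // false
    System.out.println(shouldBackOff(1, avgMs, thresholdMs)); // true
    System.out.println(shouldBackOff(2, avgMs, thresholdMs)); // true
  }
}
```

With these invented numbers, callers at priority 1 and below back off while 
priority 0 callers keep going.  Whether this matches the real scheduler 
depends on how the averages are measured, which is the accounting question 
raised earlier in the thread.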

We are also proposing a scheduler that offers better namenode resource 
management via YARN integration in HADOOP-13128. I would appreciate it if you 
could share your thoughts and comments on the proposal there as well. Thanks!





[jira] [Commented] (HDFS-9924) [umbrella] Asynchronous HDFS Access

2016-06-01 Thread Andrew Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15311128#comment-15311128
 ] 

Andrew Wang commented on HDFS-9924:
---

I asked for this earlier, haven't seen any action yet: can we move all the 
patches involving user-facing APIs to a branch? We still haven't converged on 
the API, and I don't want this appearing in a release until that's settled.

I can do the git work if that's helpful, it looks like the new code is pretty 
separate.




[jira] [Commented] (HDFS-9924) [umbrella] Asynchronous HDFS Access

2016-06-01 Thread Jitendra Nath Pandey (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15311199#comment-15311199
 ] 

Jitendra Nath Pandey commented on HDFS-9924:


What is the expected timeframe for the 2.8 release? Hopefully, we will settle 
on the API by then. The code in trunk or branch-2 need not move out at all, as 
the 3.0 and 2.9 releases are still far out. In its current shape, the code 
works and doesn't destabilize the branches, and most of the work is complete. 
Therefore, we don't need to hurry to move it out and add the overhead of 
merging it back.
Instead, we would try to expedite the convergence on the API. Based on 
discussion in HADOOP-12910, it does seem like there is a lot of demand for 
Future with callback. We should plan to add that, ideally in a way that works 
on both 3.x and 2.x. Reposting [~szetszwo]'s comment here:
bq. It seems that people really want Future with callback. I will think about 
how to do it... 
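
For readers unfamiliar with the term, the difference between a plain Future 
and a "Future with callback" can be sketched as below. This is an assumption 
made purely for illustration: it uses java.util.concurrent.CompletableFuture 
from Java 8, and the class CallbackFutureSketch and the method renameAsync are 
hypothetical. It is not the API discussed in HADOOP-12910 and says nothing 
about which approach the project would choose.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicReference;

/** Illustrative only: a future that can notify the caller on completion. */
final class CallbackFutureSketch {
  /** Hypothetical async call; completes on a background thread. */
  static CompletableFuture<Boolean> renameAsync(String src, String dst) {
    return CompletableFuture.supplyAsync(() -> !src.equals(dst));
  }

  /** Registers a callback and returns the message the callback produced. */
  static String renameWithCallback(String src, String dst) {
    AtomicReference<String> seen = new AtomicReference<>();
    // With a callback, the caller says what to do when the result arrives
    // instead of blocking a thread in Future.get().
    CompletableFuture<Void> done = renameAsync(src, dst)
        .thenAccept(ok -> seen.set(src + " -> " + dst + ": " + ok));
    done.join(); // waits here only so this sketch can return the message
    return seen.get();
  }

  public static void main(String[] args) {
    System.out.println(renameWithCallback("/a", "/b"));
  }
}
```

A plain java.util.concurrent.Future offers no equivalent of thenAccept, which 
is why adding callbacks is an API design question and not a small tweak.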




[jira] [Commented] (HDFS-9924) [umbrella] Asynchronous HDFS Access

2016-06-01 Thread Andrew Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15311205#comment-15311205
 ] 

Andrew Wang commented on HDFS-9924:
---

Based on what I've seen, people are actively trying to resolve 2.8 blockers and 
pushing things out to later releases. I'm trying to do the same for the first 
3.0 alpha. We're mainly blocked on HADOOP-12893, which (fingers crossed) is 
getting close.

I'm happy to do the git work if that's the main concern; I think it'll be 
fairly easy to move it out and back in later, since the new stuff is pretty 
separate.




[jira] [Commented] (HDFS-9924) [umbrella] Asynchronous HDFS Access

2016-06-01 Thread Andrew Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15311220#comment-15311220
 ] 

Andrew Wang commented on HDFS-9924:
---

Also to be clear, I'm only talking about backing out the changes that are part 
of the user-facing API. We can leave the RPC engine changes, since like you 
said it seems stable.




[jira] [Commented] (HDFS-9924) [umbrella] Asynchronous HDFS Access

2016-06-03 Thread Andrew Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15315225#comment-15315225
 ] 

Andrew Wang commented on HDFS-9924:
---

I've gone ahead and moved the commits for AsyncFileSystem and TestAsyncDFS to a 
new HDFS-9924 branch. I checked that this branch is identical to the previous 
state of trunk before the reverts, and updated the fix versions of related 
JIRAs. Compile tested all the branches.

Here's the list of commits, which should also be findable via a JIRA query for 
the HDFS-9924 fix version:

{quote}
HDFS-10224
HADOOP-12957
HDFS-10346
HADOOP-13168
HDFS-10390
HDFS-10431
HDFS-10430
HADOOP-13226
{quote}




[jira] [Commented] (HDFS-9924) [umbrella] Asynchronous HDFS Access

2016-06-03 Thread Jitendra Nath Pandey (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15315333#comment-15315333
 ] 

Jitendra Nath Pandey commented on HDFS-9924:


bq. I'm -1 on releasing this in 2.8 since that locks the API.
[~andrew.wang], your objection was for release in 2.8. Why did you revert from 
trunk and branch-2? 




[jira] [Commented] (HDFS-9924) [umbrella] Asynchronous HDFS Access

2016-06-04 Thread Allen Wittenauer (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15315514#comment-15315514
 ] 

Allen Wittenauer commented on HDFS-9924:


bq. your objection was for release in 2.8. Why did you revert from trunk and 
branch-2? 

It's pretty obvious this should be going into a feature branch and requiring a 
branch-merge vote.

I've been noticing a very disturbing trend of big features that require a lot 
of changes getting committed piecemeal over the past year. I really wish the 
PMC was more actively watching what is actually going on in the source tree 
because this type of behavior ultimately hurts stability.




[jira] [Commented] (HDFS-9924) [umbrella] Asynchronous HDFS Access

2016-06-06 Thread Tsz Wo Nicholas Sze (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15316328#comment-15316328
 ] 

Tsz Wo Nicholas Sze commented on HDFS-9924:
---

{quote}
I've gone ahead and moved the commits for AsyncFileSystem and TestAsyncDFS to a 
new HDFS-9924 branch. I checked that this branch is identical to the previous 
state of trunk before the reverts, and updated the fix versions of related 
JIRAs. Compile tested all the branches.
{quote}
[~andrew.wang], we may not arbitrarily revert committed patches even if you 
disagree with the current approach.  We are open to more discussion, but I 
will first revert the reverts you did.  Please respect yourself as a committer 
and respect your commit privilege.  Thanks.




[jira] [Commented] (HDFS-9924) [umbrella] Asynchronous HDFS Access

2016-06-06 Thread Tsz Wo Nicholas Sze (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15316405#comment-15316405
 ] 

Tsz Wo Nicholas Sze commented on HDFS-9924:
---

> It's pretty obvious this should be going into a feature branch and requiring 
> a branch-merge vote. ...  I really wish the PMC was more actively watching 
> what is actually going on in the source tree because this type of behavior 
> ultimately hurts stability.

We did not need a branch here since, as mentioned before, this change was 
adding mostly new code but not changing existing code much.  Therefore, this 
feature won't hurt stability.

You might be upset by other features such as HDFS symlink.  I am sorry about 
that.  I believe the PMC is already closely watching the commits.  In case you 
find any problems, please raise your concern with the PMC.




[jira] [Commented] (HDFS-9924) [umbrella] Asynchronous HDFS Access

2016-06-06 Thread Allen Wittenauer (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15316494#comment-15316494
 ] 

Allen Wittenauer commented on HDFS-9924:


bq. We did not need a branch here since, as mentioned before, this change was 
adding mostly new code but not changing existing code much. Therefore, this 
feature won't hurt stability.

It's a new feature with a significantly large API surface area. It should 
absolutely require extra scrutiny before going in, new code or not.

bq. You might be upset by other features such as HDFS symlink.

I was actually thinking of some of the crazy things that are going on in YARN. 

bq.  I believe the PMC is already closely watching the commits.

... and you'd be very wrong.  Some of the things getting added with no 
resistance are just amazing to me.  (e.g., "let's destroy the NN box by writing 
metrics to a log outside the metrics system using config parameters that don't 
match anything else in HDFS" is my current favorite.)




[jira] [Commented] (HDFS-9924) [umbrella] Asynchronous HDFS Access

2016-06-06 Thread Jitendra Nath Pandey (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15316544#comment-15316544
 ] 

Jitendra Nath Pandey commented on HDFS-9924:


[~aw], yes, this is a new feature, but not a very large one. The work is 
already complete and the functionality works in its current form. The only 
point of discussion was enhancing the current API to support callbacks as 
well, which was accepted as a useful enhancement that many people need. 
Also, the earlier objection was raised only for the 2.8 release, but the 
patches were arbitrarily reverted from all the branches.

I propose the following next steps to make quick progress here:
1) The design should be enhanced to support callbacks in the API. A jira should 
be created as a subtask here for the change.
2) The jira may be marked as a blocker for the 2.8 release. However, it is at 
the release manager's discretion to decide what is a blocker. If we get close 
to 2.8 and it is still not resolved, the feature can be disabled or reverted, 
again depending on the release manager's decision. The usual practice is that 
the release manager prods the developers and they try to expedite the fixes 
and improvements for the release. As far as I understand, this is not going to 
be a complicated improvement to add.

Let's just expedite fixing (1) above, and put this behind us.





[jira] [Commented] (HDFS-9924) [umbrella] Asynchronous HDFS Access

2016-06-06 Thread Xiaobing Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15316696#comment-15316696
 ] 

Xiaobing Zhou commented on HDFS-9924:
-

{quote}
It's a new feature with a significantly large API surface area. It should 
absolutely require extra scrutiny before going in, new code or not.
{quote}

Allen Wittenauer, this is actually a feature that can be switched on and off. 
If you take a close look at the design and implementation, async calls run on 
an async code path, leaving the code path of sync calls as is to the maximum 
extent. This doesn't cause any serious stability issue. In addition, the 
user-facing APIs are marked as @Unstable to accommodate possible evolution. 
Thanks.





[jira] [Commented] (HDFS-9924) [umbrella] Asynchronous HDFS Access

2016-06-06 Thread Allen Wittenauer (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15316756#comment-15316756
 ] 

Allen Wittenauer commented on HDFS-9924:


I'm not worried about code stability. branch-2 hasn't been rock-solid stable 
since 2.4; that ship left the port years ago and every code dump into it in 
order to make press releases out of features has just made it worse.  I'm much 
more worried about API correctness.  We're going to be stuck with whatever is 
decided here for 5+ years, regardless of whatever the annotation says. We've 
seen it with every other major API in Hadoop and I can't see why this would be 
any different.  Waiting a while to actually let more folks play with it before 
pushing it into a release (including the 3.x release that we're working to cut 
from trunk) just seems like an obvious, common sense thing to do.

Also:  given that Andrew IS the RM for 3.x, him removing it from trunk is 
pretty much allowed.




[jira] [Commented] (HDFS-9924) [umbrella] Asynchronous HDFS Access

2016-06-06 Thread Arpit Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15316788#comment-15316788
 ] 

Arpit Agarwal commented on HDFS-9924:
-

bq. Some of the things getting added with no resistance are just amazing to me. 
(e.g., "let's destroy the NN box by writing metrics to a log outside the 
metrics system using config parameters that don't match anything else in HDFS" 
is my current favorite.)
Never let facts stand in the way of a righteous rant, Allen. This is not the 
right place for a discussion we've had ad nauseam elsewhere, but I challenge 
you to explain how the NN box is being destroyed.




[jira] [Commented] (HDFS-9924) [umbrella] Asynchronous HDFS Access

2016-06-06 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15316850#comment-15316850
 ] 

stack commented on HDFS-9924:
-

I'd suggest we work out a coherent, global filesystem async API/strategy before 
we start committing implementations (piecemeal) otherwise we will frustrate our 
users.




[jira] [Commented] (HDFS-9924) [umbrella] Asynchronous HDFS Access

2016-06-06 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15317629#comment-15317629
 ] 

Colin Patrick McCabe commented on HDFS-9924:


+1 for a feature branch




[jira] [Commented] (HDFS-9924) [umbrella] Asynchronous HDFS Access

2016-06-06 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15317717#comment-15317717
 ] 

stack commented on HDFS-9924:
-

Ugh. I meant to say, we risk having different 'async' APIs/implementations if 
we piecemeal the implementation ahead of figuring out a general approach for 
async'ing the Filesystem. What is currently committed is inadequate according 
to the discussion so far, since it is missing callbacks. Retrofitting callbacks 
onto Future, as I understand it, will require a different implementation; 
therefore the commits are premature. Reverting in the meantime seems like the 
right thing to do.

bq. I'm much more worried about API correctness. ... Waiting a while to 
actually let more folks play with it before pushing it into a release 
(including the 3.x release that we're working to cut from trunk) just seems 
like an obvious, common sense thing to do.

Above makes sense to me.




[jira] [Commented] (HDFS-9924) [umbrella] Asynchronous HDFS Access

2016-06-06 Thread Jitendra Nath Pandey (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15317858#comment-15317858
 ] 

Jitendra Nath Pandey commented on HDFS-9924:


Features do not get played or experimented with while they sit in a feature 
branch. No one will use a feature if it is not part of a release, or at least 
in a release branch or trunk.
The best way to actually harden an API is to have downstream users use it and 
provide feedback. @Unstable was defined for exactly this purpose. 
  As suggested by [~steve_l] earlier, if it helps, an additional annotation 
@Experimental can be created to more strongly highlight the 
unstable/experimental nature of the API. 
  The current implementation is not a user-facing API; therefore it will have 
to be modified when the final API is decided in HADOOP-12910.

bq. ... figuring a general approach for async'ing the Filesystem.
There is a proposal on HADOOP-12910; let's have a discussion around that to 
ensure the right API gets committed. 

bq. ... Retrofitting callback on Future,
I agree it is not a good idea. CompletableFuture in JDK 8 seems like a good 
choice. I am leaning towards supporting CompletableFuture in trunk and, if it's 
too hard to mimic in branch-2, not exposing a user-facing API in branch-2 at 
all. Users in branch-2 can try out the current unstable/experimental version in 
{{DistributedFileSystem}}. Since CompletableFuture implements Future, the 
change will still be compatible in trunk, although we don't guarantee that. 




[jira] [Commented] (HDFS-9924) [umbrella] Asynchronous HDFS Access

2016-06-06 Thread Jitendra Nath Pandey (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15317887#comment-15317887
 ] 

Jitendra Nath Pandey commented on HDFS-9924:


I think {{AsyncDistributedFileSystem}} should be annotated as Private, similar 
to {{DistributedFileSystem}} to further highlight that it is not external 
facing.




[jira] [Commented] (HDFS-9924) [umbrella] Asynchronous HDFS Access

2016-06-07 Thread Tsz Wo Nicholas Sze (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15318054#comment-15318054
 ] 

Tsz Wo Nicholas Sze commented on HDFS-9924:
---

> I think AsyncDistributedFileSystem should be annotated as Private, ...

Yes, we should, @Private or @LimitedPrivate.

It seems that there is some serious confusion here.  FileSystem is a 
user-facing public API; DistributedFileSystem is not.  DistributedFileSystem 
has always been an internal API and never a public API.  Note that FileSystem 
was annotated as @Public @Stable (in 2010) but DistributedFileSystem was 
annotated as @LimitedPrivate @Unstable (in 2012).  Therefore, adding or 
changing APIs in DistributedFileSystem does not affect any user-facing public 
API at all.

What have we done so far?  We have added some methods to DistributedFileSystem 
and a new internal @Unstable class AsyncDistributedFileSystem.  The FileSystem 
API remains unchanged.  So there is no change in any user facing public API at 
all.

Please let me know if you disagree.  Thanks.
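
To make the audience/stability distinction above concrete, here is a small 
self-contained Java sketch. The annotations below are local stand-ins modelled 
on Hadoop's InterfaceAudience and InterfaceStability annotations, and the two 
classes are hypothetical placeholders rather than the real FileSystem and 
DistributedFileSystem; the audience list passed to @LimitedPrivate is 
illustrative only.

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;

final class AudienceAnnotationSketch {

  // Local stand-ins for the Hadoop classification annotations.
  @Retention(RetentionPolicy.RUNTIME) @interface Public {}
  @Retention(RetentionPolicy.RUNTIME) @interface LimitedPrivate { String[] value(); }
  @Retention(RetentionPolicy.RUNTIME) @interface Stable {}
  @Retention(RetentionPolicy.RUNTIME) @interface Unstable {}

  /** Placeholder for the user-facing, stable base class. */
  @Public @Stable
  abstract static class FileSystemLike {}

  /** Placeholder for the internal implementation class (audience list is illustrative). */
  @LimitedPrivate({"MapReduce", "HBase"}) @Unstable
  static class DistributedFileSystemLike extends FileSystemLike {}

  /** A class counts as user facing here only if it is itself marked @Public. */
  static boolean isUserFacing(Class<?> c) {
    return c.isAnnotationPresent(Public.class);
  }

  public static void main(String[] args) {
    System.out.println(isUserFacing(FileSystemLike.class));
    System.out.println(isUserFacing(DistributedFileSystemLike.class));
  }
}
```

Because the stand-in annotations are not marked @Inherited, the subclass does 
not pick up @Public from its parent, which mirrors the argument that methods 
added to the internal class leave the public API untouched.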




[jira] [Commented] (HDFS-9924) [umbrella] Asynchronous HDFS Access

2016-06-08 Thread Andrew Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15321393#comment-15321393
 ] 

Andrew Wang commented on HDFS-9924:
---

The API discussion is still unresolved, though it seems many people are 
advocating for an API that is different from what is currently committed.

So, do we still see a benefit from having the current code committed to the 
release branches vs. a feature branch? The code is also likely to change a lot 
if the API changes. It's also only in DistributedFileSystem, so it's not used 
by any current code paths. Finally, considering that we have multiple 
downstream users saying they want a different API, it doesn't seem likely 
they'll do much testing of the current DFS-level API anyway. 

So for these reasons (though not well-communicated earlier), my preference is 
to move this to a feature branch. Other contributors have also voiced similar 
requests for a feature branch. As I said on the common-dev discussion, I'm 
happy to review a future merge vote if there is a concern about getting three 
+1's.

Somewhat separately, Vinod expressed a preference on common-dev for leaving new 
features out of 2.8, since it's late in the release cycle.

Hoping we can reach consensus by end of week (Friday). I'm also happy to again 
do the requisite git work.




[jira] [Commented] (HDFS-9924) [umbrella] Asynchronous HDFS Access

2016-06-08 Thread Ming Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15321910#comment-15321910
 ] 

Ming Ma commented on HDFS-9924:
---

Based on the discussion so far, it seems like the existing async API needs to 
be changed. But I am not sure the new async API has to be done in a feature 
branch. How about reverting the existing async API and then developing the new 
async API directly in trunk, and optionally branch-2 (not branch-2.8), 
assuming we go through a more thorough discussion this time with broader 
consensus?

* The Java 8 future API is more flexible: it supports different kinds of 
future composition and makes it easy to support dependent operations. In fact, 
the discussion in HADOOP-12910 proposes this for trunk.

* HADOOP-12910 also discussed the API design for branch-2, but not for 
branch-2.8. It appears the async requirement comes from Hive. If we leave the 
API as it is in 2.8, does it mean Hive needs to be hard-coded to use 
AsyncDistributedFileSystem? Alternatively, we can have Hive use multiple 
threads to interact with FileSystem, just like how MAPREDUCE does split 
calculation; this 
approach doesn't require any async support in 2.8. Either way, Hive needs to 
change how it uses HDFS when upgraded from 2.8 to 2.9 if we plan to put the new 
async API to 2.9.
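
As a rough illustration of the "dependent operations" point in the first 
bullet, the Java 8 sketch below chains one asynchronous step onto another with 
CompletableFuture.thenCompose. The mkdirs and rename methods are hypothetical 
stand-ins that complete on the common pool; they are not FileSystem or 
DistributedFileSystem methods.

```java
import java.util.concurrent.CompletableFuture;

final class DependentOpsSketch {

  /** Hypothetical async mkdirs: "succeeds" for any non-empty path. */
  static CompletableFuture<Boolean> mkdirs(String dir) {
    return CompletableFuture.supplyAsync(() -> !dir.isEmpty());
  }

  /** Hypothetical async rename: "succeeds" when source and destination differ. */
  static CompletableFuture<Boolean> rename(String src, String dst) {
    return CompletableFuture.supplyAsync(() -> !src.equals(dst));
  }

  /** The rename is only issued once the directory creation has completed. */
  static CompletableFuture<Boolean> mkdirsThenRename(String dir, String src) {
    return mkdirs(dir).thenCompose(created -> {
      if (created) {
        return rename(src, dir + "/part-0");
      }
      // Skip the dependent step when the first one failed.
      return CompletableFuture.completedFuture(Boolean.FALSE);
    });
  }

  public static void main(String[] args) {
    System.out.println(mkdirsThenRename("/warehouse/t", "/tmp/part-0").join());
  }
}
```

Nothing blocks until join() is called at the very end; with a plain Future the 
caller would have to wait on the first result before it could issue the second 
call.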




[jira] [Commented] (HDFS-9924) [umbrella] Asynchronous HDFS Access

2016-06-08 Thread Jitendra Nath Pandey (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15321921#comment-15321921
 ] 

Jitendra Nath Pandey commented on HDFS-9924:


   The current implementation is not that far from the options being discussed 
in HADOOP-12910. CompletableFuture and ListenableFuture are both subtypes of 
java Future, so the API stays compatible. The current AsyncGetFuture extends 
AbstractFuture, which implements ListenableFuture. Therefore, the current 
implementation may not be completely different from what we want in trunk. 
   
  I think ListenableFuture is a good choice for having a consistent API in 
branch-2 and trunk, and it addresses the needs of most of the downstream users. 
Please weigh in with your opinion on this.
  I would still recommend that, for now, we add ListenableFuture to 
AsyncDistributedFileSystem only, and not to the user-facing API.




[jira] [Commented] (HDFS-9924) [umbrella] Asynchronous HDFS Access

2016-06-08 Thread Andrew Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15321932#comment-15321932
 ] 

Andrew Wang commented on HDFS-9924:
---

Thanks for the comment Ming. Sounds like you're in favor of reverting what's 
currently there. Though, I'm still trying to determine the benefit of doing 
even future development in trunk vs. a feature branch. We're going to be doing 
3.0 alpha releases directly from trunk, thus it's no longer an "integration" 
branch as we historically have treated it. Also for this feature, there's not 
much need for integration since it's separate new code not used by other parts 
of Hadoop.

Since we're still doing 3.0 alphas at this point, I'm open to including a trial 
async API to make it easier for downstream testing, but maybe for a later 
alpha. Until we internally agree on what the API should look like, it's 
probably not ready for downstream testing, and thus not ready for a release 
branch like trunk/branch-2/etc.




[jira] [Commented] (HDFS-9924) [umbrella] Asynchronous HDFS Access

2016-06-09 Thread Zhe Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15322086#comment-15322086
 ] 

Zhe Zhang commented on HDFS-9924:
-

My 2 cents:

I think the async feature should be done in a feature branch. As [~andrew.wang] 
said above and [~cmccabe] mentioned on the dev@ mailing list, the overhead of 
using a feature branch (especially in this case where most changes are new 
code) is pretty minimal.

The only valid concern against a feature branch is that without an actual 
release, we will get less testing coverage from downstream projects. But the 
experience from EC is that if an advanced user wants to test a feature, he/she 
will just spend a little more time to build from the source code of the feature 
branch. This overhead is less significant compared with the degraded user 
experience if we release a set of APIs in alpha0 and change them in a later 
release, say alpha1. I think the async feature is attractive enough for at 
least a few downstream projects to try it from a feature branch.




[jira] [Commented] (HDFS-9924) [umbrella] Asynchronous HDFS Access

2016-06-09 Thread Tsz Wo Nicholas Sze (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15322280#comment-15322280
 ] 

Tsz Wo Nicholas Sze commented on HDFS-9924:
---

> Based on the discussion so far, it seems like the existing async API needs to 
> be changed. ...

The existing async API returns Future.  From the discussion in 
HADOOP-12910, we are going to use ListenableFuture or CompletableFuture.  Both 
of them are subtypes of Future.  Changing from Future to 
ListenableFuture or CompletableFuture is a backward compatible change.

> ... Either way, Hive needs to change how it uses HDFS when upgraded from 2.8 
> to 2.9 if we plan to put the new async API to 2.9.

Provided that the new async API is backward compatible with the current API 
(which returns Future) as mentioned above, Hive does NOT need to change.
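
The compatibility argument can be illustrated with a small Java sketch. ApiV1 
and ApiV2 are hypothetical stand-ins for "the API before and after the return 
type is narrowed"; neither is a Hadoop class. The sketch only demonstrates 
source-level compatibility: a caller written against Future keeps compiling 
and working. Callers that were already compiled against the old signature 
would still need a recompile, because the JVM links methods by their full 
descriptor, which includes the return type.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

final class ReturnTypeCompatSketch {

  /** Hypothetical first version: returns a plain Future. */
  static final class ApiV1 {
    Future<Boolean> rename(String src, String dst) {
      return CompletableFuture.completedFuture(!src.equals(dst));
    }
  }

  /** Hypothetical later version: same method, narrower return type. */
  static final class ApiV2 {
    CompletableFuture<Boolean> rename(String src, String dst) {
      return CompletableFuture.completedFuture(!src.equals(dst));
    }
  }

  /** Caller code written against the first version; it only knows Future. */
  static boolean oldStyleCaller(Future<Boolean> f) {
    try {
      return f.get();
    } catch (InterruptedException | ExecutionException e) {
      throw new IllegalStateException(e);
    }
  }

  static boolean demo() {
    boolean viaV1 = oldStyleCaller(new ApiV1().rename("/a", "/b"));
    boolean viaV2 = oldStyleCaller(new ApiV2().rename("/a", "/b")); // unchanged caller
    // New callers may additionally register a callback on the narrower type.
    StringBuilder log = new StringBuilder();
    new ApiV2().rename("/a", "/b").thenAccept(ok -> log.append("done:").append(ok));
    return viaV1 && viaV2 && log.toString().equals("done:true");
  }

  public static void main(String[] args) {
    System.out.println(demo());
  }
}
```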




[jira] [Commented] (HDFS-9924) [umbrella] Asynchronous HDFS Access

2016-06-09 Thread Tsz Wo Nicholas Sze (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15322305#comment-15322305
 ] 

Tsz Wo Nicholas Sze commented on HDFS-9924:
---

It seems that the disagreement is about the API (i.e. what to return?  Future, 
ListenableFuture, CompletableFuture or Deferred?).  This JIRA and the current 
implementation originally aim to support a basic async access to HDFS without 
callback support and without chaining support.  Therefore, we chose to return 
Future in the current API.  From the discussion in HADOOP-12910, it is very 
likely that we will support a sub-interface of Future as the return type.  
Then, the current API is fine as the first step.  When we change to return 
XxxFuture in the future, it is a backward compatible change.

If we do need a branch, the branch should be for HADOOP-12910 but not this JIRA.




[jira] [Commented] (HDFS-9924) [umbrella] Asynchronous HDFS Access

2016-06-09 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15322864#comment-15322864
 ] 

stack commented on HDFS-9924:
-

bq. This JIRA and the current implementation originally aim to support a basic 
async access to HDFS without callback support and without chaining support.

The JIRA is about HDFS async access. There is no exception in the summary nor 
description to rule out the basic async callback primitive. You could rule it 
out by fiat -- can you even call it an 'async' API if it doesn't do callback? 
-- but why not do it right from the get-go? Do it only once, too.

bq. When we change to return XxxFuture in the future, it is a backward 
compatible change.

You and [~jnp] have said this a few times but for downstreamers, a Future-only 
API is not worth engaging with. It means each of us has to build parking 
structures to keep the unfinished Futures in, polling to look for 
completions to react to. This is a performance-killer. Been there. Done that.

I like the [~mingma] summary/suggestion with the [~andrew.wang] caveat; revert 
and dev in a feature branch against trunk. I know of a few downstreamers that 
are interested, myself included, and would be up for helping out. Thanks.
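
The "parking structure" concern can be sketched in Java as below. This is an 
illustrative comparison under stated assumptions, not code from any 
downstream project: drainByPolling shows what a caller has to build around a 
Future-only API, while collectByCallback shows the same work expressed with 
CompletableFuture callbacks. The squared numbers are placeholder results.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

final class PollVersusCallbackSketch {

  /** Future-only style: park the pending futures and keep polling isDone(). */
  static List<Integer> drainByPolling(List<? extends Future<Integer>> parked) {
    List<Future<Integer>> pending = new ArrayList<>(parked);
    List<Integer> results = new ArrayList<>();
    while (!pending.isEmpty()) {
      Iterator<Future<Integer>> it = pending.iterator();
      while (it.hasNext()) {
        Future<Integer> f = it.next();
        if (f.isDone()) {
          it.remove();
          try {
            results.add(f.get());
          } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
          }
        }
      }
      Thread.yield(); // real code would need a sleep or back-off policy here
    }
    return results;
  }

  /** Callback style: each completion pushes its own result; nothing is polled. */
  static List<Integer> collectByCallback(List<CompletableFuture<Integer>> futures) {
    ConcurrentLinkedQueue<Integer> results = new ConcurrentLinkedQueue<>();
    List<CompletableFuture<Void>> handled = new ArrayList<>();
    for (CompletableFuture<Integer> f : futures) {
      handled.add(f.thenAccept(results::add));
    }
    // Only the demo waits here, so that it can return the collected results.
    CompletableFuture.allOf(handled.toArray(new CompletableFuture[0])).join();
    return new ArrayList<>(results);
  }

  /** Returns the sum of results gathered by each style. */
  static int[] demo() {
    List<CompletableFuture<Integer>> a = new ArrayList<>();
    List<CompletableFuture<Integer>> b = new ArrayList<>();
    for (int i = 0; i < 5; i++) {
      final int n = i;
      a.add(CompletableFuture.supplyAsync(() -> n * n));
      b.add(CompletableFuture.supplyAsync(() -> n * n));
    }
    int polled = drainByPolling(a).stream().mapToInt(Integer::intValue).sum();
    int called = collectByCallback(b).stream().mapToInt(Integer::intValue).sum();
    return new int[] {polled, called};
  }

  public static void main(String[] args) {
    int[] sums = demo();
    System.out.println(sums[0] + " " + sums[1]);
  }
}
```

Both styles compute the same results; the difference is that the polling 
version spends a thread scanning the parked list, whereas the callback version 
reacts as each operation completes.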




[jira] [Commented] (HDFS-9924) [umbrella] Asynchronous HDFS Access

2016-06-09 Thread Andrew Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15323312#comment-15323312
 ] 

Andrew Wang commented on HDFS-9924:
---

Thanks everyone for the input. It sounds like stack and Zhe would also like a 
branch. Jitendra and Nicholas think that since the TBD callback API will extend 
Future, it'll be compatible in trunk. Based on current downstream input though, 
downstream users don't seem interested in a Future-only API since it doesn't 
support callbacks.

So, are there any technical reasons why the current Future-based code should be 
included in release branches vs. a feature branch? I discussed this in my 
earlier comment, and would be interested in hearing counter arguments.




[jira] [Commented] (HDFS-9924) [umbrella] Asynchronous HDFS Access

2016-06-09 Thread Vaibhav Gumashta (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15323474#comment-15323474
 ] 

Vaibhav Gumashta commented on HDFS-9924:


From Hive's perspective, I do think this is a useful feature that we can 
leverage. I feel it will be easier to use and test if it remains in a released 
branch rather than a feature branch.




[jira] [Commented] (HDFS-9924) [umbrella] Asynchronous HDFS Access

2016-06-09 Thread Kai Zheng (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15323575#comment-15323575
 ] 

Kai Zheng commented on HDFS-9924:
-

bq. From Hive's perspective, I do think this is a useful feature that we can 
leverage.
Sounds great! This is good to hear; on my side we are investigating a 
prototype to benchmark the performance gain. I don't mind if it's in a feature 
branch, though I know it will eventually need a formal release to commit the 
work.

About the API, I have a small concern (maybe invalid): if it's going to be too 
Java-style or fancy, it may be difficult to implement in the C/C++ part, 
right? IMO, these async operations would be of great interest to native 
applications for performance.

> [umbrella] Asynchronous HDFS Access
> ---
>
> Key: HDFS-9924
> URL: https://issues.apache.org/jira/browse/HDFS-9924
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: fs
>Reporter: Tsz Wo Nicholas Sze
>Assignee: Xiaobing Zhou
> Attachments: AsyncHdfs20160510.pdf
>
>
> This is an umbrella JIRA for supporting Asynchronous HDFS Access.
> Currently, all the API methods are blocking calls -- the caller is blocked 
> until the method returns.  It is very slow if a client makes a large number 
> of independent calls in a single thread since each call has to wait until the 
> previous call is finished.  It is inefficient if a client needs to create a 
> large number of threads to invoke the calls.
> We propose adding a new API to support asynchronous calls, i.e. the caller is 
> not blocked.  The methods in the new API immediately return a Java Future 
> object.  The return value can be obtained by the usual Future.get() method.
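
As a rough illustration of the call pattern described in the issue text above 
(not the actual HDFS API), the sketch below issues a batch of independent 
renames first and only afterwards collects the results with Future.get(). The 
AsyncFs interface, the class name, and the executor-backed stand-in are 
assumptions made for this example; lambdas are used for brevity and would have 
to be written as anonymous classes on jdk7.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class AsyncRenameSketch {
    /** Hypothetical async API: the call returns at once, the result arrives later. */
    interface AsyncFs {
        Future<Boolean> rename(String src, String dst);
    }

    /** Issues every rename before waiting on any of them; returns how many succeeded. */
    static int renameAll(AsyncFs fs, List<String> sources, String suffix) {
        List<Future<Boolean>> pending = new ArrayList<>();
        for (String src : sources) {
            pending.add(fs.rename(src, src + suffix)); // the caller is not blocked here
        }
        int ok = 0;
        for (Future<Boolean> f : pending) {
            try {
                if (f.get()) {
                    ok++; // get() blocks only until this particular call has finished
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            } catch (ExecutionException e) {
                // The call failed; it is simply not counted as renamed.
            }
        }
        return ok;
    }

    /** Runs renameAll against a stand-in AsyncFs backed by one worker thread. */
    static int demo(int n) {
        ExecutorService rpc = Executors.newSingleThreadExecutor();
        try {
            AsyncFs fs = (src, dst) -> {
                // Stand-in for the real RPC: "succeeds" whenever the names differ.
                Callable<Boolean> call = () -> !src.equals(dst);
                return rpc.submit(call);
            };
            List<String> sources = new ArrayList<>();
            for (int i = 0; i < n; i++) {
                sources.add("/warehouse/part-" + i);
            }
            return renameAll(fs, sources, ".renamed");
        } finally {
            rpc.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("renamed: " + demo(5));
    }
}
```

The point of the pattern is that the single calling thread never waits between 
submissions; whether the real API looks like this is up to the patches under 
this umbrella.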






[jira] [Commented] (HDFS-9924) [umbrella] Asynchronous HDFS Access

2016-06-09 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15323574#comment-15323574
 ] 

Ashutosh Chauhan commented on HDFS-9924:


Hive is currently using the hadoop-2.x line and is expected to continue using 
it for the foreseeable future. So, if this feature is not available in 2.x, it 
won't be useful for Hive (and thus Hive users). If that implies it has to be 
done on the 2.x line using the jdk7 Future, we can live with that. That will 
still be an improvement over the current state of the art.




[jira] [Commented] (HDFS-9924) [umbrella] Asynchronous HDFS Access

2016-06-09 Thread Andrew Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15323603#comment-15323603
 ] 

Andrew Wang commented on HDFS-9924:
---

[~vgumashta] and [~ashutoshc] thanks for joining the discussion. Since we have 
you here, could you comment on the alternative of using a threadpool for this 
usecase in Hive? It seems far more maintainable than us committing a private 
API you'd need to hardcode against, especially an API that we are already 
planning to change. Threads in Java are pretty cheap, and the calls are likely 
network/NN bound anyway.

Our other downstream users are pushing for a callback API, so if the 
Future-based API is solving a problem unique to Hive, I'm curious what was 
already tried on the Hive side. Even if a threadpool is less optimal, it might 
tide you over until the callback API is ready (which would make everyone happy).




[jira] [Commented] (HDFS-9924) [umbrella] Asynchronous HDFS Access

2016-06-09 Thread Vaibhav Gumashta (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15323608#comment-15323608
 ] 

Vaibhav Gumashta commented on HDFS-9924:


[~andrew.wang] Specifically for HiveServer2 and Metastore servers, which serve 
many concurrent users, it will be a resource hit to use a threadpool to get 
around the current limitation of a blocking call.




[jira] [Commented] (HDFS-9924) [umbrella] Asynchronous HDFS Access

2016-06-09 Thread Andrew Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15323866#comment-15323866
 ] 

Andrew Wang commented on HDFS-9924:
---

I understand there is a cost to using threads, but my point is that threads are 
quite cheap. If it's truly RPC bound, a pool of just 10 threads would mean a 
10x improvement, at the cost of a few MB for stacks. Even 100 threads might 
barely register considering heap sizes are typically measured in GBs.




[jira] [Commented] (HDFS-9924) [umbrella] Asynchronous HDFS Access

2016-06-13 Thread Andrew Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327975#comment-15327975
 ] 

Andrew Wang commented on HDFS-9924:
---

I'd like to keep pushing on this, since as stated previously, we have more than 
a few committers who've already asked for this to be on a feature branch. Given 
that the alternative of using a threadpool seems quite acceptable for users who 
simply want some parallelism and not callbacks, I'm still searching for a 
technical argument for a Future-based API.

If there are no further comments by EOD tomorrow, I'd like to go ahead and move 
the commits to the feature branch.




[jira] [Commented] (HDFS-9924) [umbrella] Asynchronous HDFS Access

2016-06-13 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328239#comment-15328239
 ] 

Ashutosh Chauhan commented on HDFS-9924:


Pasting from HADOOP-12910

I agree with Steve Loughran's assessment. A jdk7 Future-based API will be of 
immediate help for Hive. A threadpool-based approach complicates the Hive code 
base, hurts maintainability, and is not preferable. We already have shims to 
deal with multiple versions of hadoop, so if the 3.x API uses jdk8 constructs, 
it will be fairly trivial for us to integrate with.




[jira] [Commented] (HDFS-9924) [umbrella] Asynchronous HDFS Access

2016-06-13 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328290#comment-15328290
 ] 

stack commented on HDFS-9924:
-

[~ashutoshc] Can you make a bit of a better argument than citing the mighty 
[~steve_l] please? Dealing with a mess of returned Futures will also complicate 
the Hive codebase, no? Can you explain why a half-an-async HDFS API would be 
easier for you to deal with? Thanks.




[jira] [Commented] (HDFS-9924) [umbrella] Asynchronous HDFS Access

2016-06-13 Thread Andrew Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328291#comment-15328291
 ] 

Andrew Wang commented on HDFS-9924:
---

[~ashutoshc] I still don't find this argument compelling, for the following 
reasons:

* A threadpool in Hive would work across all Hadoop versions, and without 
requiring a new release of Hadoop with this API.
* A threadpool is just a few extra MBs of memory.
* A threadpool isn't that much code, particularly with the use of Executors.
* We'd have to carry around a much larger amount of code in HDFS, and an API 
which we already know needs to be changed to fit our other downstream usecases. 
It pushes a larger maintenance burden onto HDFS compared to a threadpool in 
Hive.
* Building on the previous, Hive is the only downstream user asking for this 
feature. Other downstreams are saying that they really need callbacks for this 
to be useful.

If you still feel differently, please help me understand why the above points 
are not applicable. Thanks.
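
For reference, the kind of client-side pool being discussed could look roughly 
like the sketch below. It is a minimal illustration, not Hive code: 
blockingRename is a hypothetical stand-in for a blocking NameNode call, and the 
class name and the ".done" suffix are made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ThreadPoolRenameSketch {
    /** Hypothetical stand-in for a blocking RPC; sleeps to mimic network latency. */
    static boolean blockingRename(String src, String dst) {
        try {
            Thread.sleep(20);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return true;
    }

    /** Runs one blocking rename per source on a fixed-size pool; returns the success count. */
    static int renameAll(List<String> sources, int poolSize) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            List<Callable<Boolean>> tasks = new ArrayList<>();
            for (final String src : sources) {
                tasks.add(() -> blockingRename(src, src + ".done"));
            }
            int ok = 0;
            // invokeAll returns once every task has completed.
            for (Future<Boolean> f : pool.invokeAll(tasks)) {
                try {
                    if (f.get()) {
                        ok++;
                    }
                } catch (ExecutionException e) {
                    // The task threw; it is not counted as a success.
                }
            }
            return ok;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return -1;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(renameAll(Arrays.asList("/t/a", "/t/b", "/t/c"), 10));
    }
}
```

With a pool of 10, at most 10 calls are in flight at a time, which is where the 
"10x if RPC bound" estimate comes from; the pool size is the knob the rest of 
the thread argues about.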




[jira] [Commented] (HDFS-9924) [umbrella] Asynchronous HDFS Access

2016-06-13 Thread Tsz Wo Nicholas Sze (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328642#comment-15328642
 ] 

Tsz Wo Nicholas Sze commented on HDFS-9924:
---

> If there are no further comments by EOD tomorrow, I'd like to go ahead and 
> move the commits to the feature branch.

Please don't move anything to the feature branch.  We are still discussing it.  
There are different opinions.  Again, if we need a feature branch, the branch 
should be for HADOOP-12910, not this JIRA.

> A threadpool is just a few extra MBs of memory.

It depends on how many threads are in the pool.  It could be GBs of memory.  
You may try Colin's code above, in his comment on creating 20,000 threads, to 
see if it is a few extra MBs of memory.
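
For a sense of scale, the arithmetic behind both positions can be sketched as 
below. It assumes a 1 MiB stack per thread, which is a common 64-bit HotSpot 
default but is configurable with -Xss; the figure is reserved address space, so 
it is an upper bound on what the stacks actually touch. The class and method 
names are made up for the example.

```java
final class ThreadStackEstimate {
    /** Upper bound on stack reservation in MiB for the given thread count. */
    static long reservedMiB(int threads, int stackKiB) {
        return (long) threads * stackKiB / 1024;
    }

    public static void main(String[] args) {
        int stackKiB = 1024; // assumption: 1 MiB per thread stack
        for (int n : new int[] {10, 100, 20_000}) {
            System.out.println(n + " threads -> about "
                + reservedMiB(n, stackKiB) + " MiB reserved for stacks");
        }
    }
}
```

Under that assumption 10 threads reserve about 10 MiB and 20,000 threads about 
20,000 MiB, so both the "few MBs" and the "GBs" figures hold for their 
respective pool sizes.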




[jira] [Commented] (HDFS-9924) [umbrella] Asynchronous HDFS Access

2016-06-13 Thread Andrew Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328734#comment-15328734
 ] 

Andrew Wang commented on HDFS-9924:
---

What are the technical arguments for keeping this code in release branches? I'm 
happy to call the branch HADOOP-12910 if that's what's being asked for.

bq. It depends on how many threads in the pool. It could be GBs of memory.

I was never talking about creating thousands of threads. As I said in my above 
comments, if Hive is truly RPC bound, they can get 10x improvement with 10 
threads, which is a few MB for thread stacks.




[jira] [Commented] (HDFS-9924) [umbrella] Asynchronous HDFS Access

2016-06-13 Thread Tsz Wo Nicholas Sze (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328805#comment-15328805
 ] 

Tsz Wo Nicholas Sze commented on HDFS-9924:
---

> What are the technical arguments for keeping this code in release branches? 
> ...

This is the usual development procedure.

> I was never talking about creating thousands of threads. As I said in my 
> above comments, if Hive is truly RPC bound, they can get 10x improvement with 
> 10 threads, ...

Why use 10 threads and not 20,000 threads?  Why settle on 10x and not 
20,000x?




[jira] [Commented] (HDFS-9924) [umbrella] Asynchronous HDFS Access

2016-06-13 Thread Andrew Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328819#comment-15328819
 ] 

Andrew Wang commented on HDFS-9924:
---

bq. This is the usual development procedure.

Yet, we have 5 or more committers who are asking that this code be developed on 
a feature branch. So the consensus is pointing toward a feature branch.

I'll let Vaibhav and Ashutosh comment on the suitable number of threads for 
Hive. Let's just say though that the right number is far less than 20k, given 
that the NN typically doesn't have that many handler threads. If Hive is also 
currently struggling such that it needs to do 20,000x its current NN RPC rate, 
my take is that it should really be looking at how to reduce the # of RPCs 
instead.




[jira] [Commented] (HDFS-9924) [umbrella] Asynchronous HDFS Access

2016-06-13 Thread Tsz Wo Nicholas Sze (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328830#comment-15328830
 ] 

Tsz Wo Nicholas Sze commented on HDFS-9924:
---

> Yet, we have 5 or more committers who are asking that this code be developed 
> on a feature branch. So the consensus is pointing toward a feature branch.

Please don't jump to conclusions.  There are also committers, probably more 
than 5, saying that we don't need a branch.  Thanks.




[jira] [Commented] (HDFS-9924) [umbrella] Asynchronous HDFS Access

2016-06-13 Thread Tsz Wo Nicholas Sze (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328835#comment-15328835
 ] 

Tsz Wo Nicholas Sze commented on HDFS-9924:
---

> I'll let Vaibhav and Ashutosh comment on the suitable number of threads for 
> Hive. Let's just say though that the right number is far less than 20k, ...

Then, why choose 10 and not 20 or 100?  More generally, how do you decide the 
number of threads to use?




[jira] [Commented] (HDFS-9924) [umbrella] Asynchronous HDFS Access

2016-06-13 Thread Andrew Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328850#comment-15328850
 ] 

Andrew Wang commented on HDFS-9924:
---

Per our bylaws, code is integrated based on consensus, and here we have not one 
but multiple committers saying they do not agree with this code being present 
in release branches in its current form. So, if you'd like this code to remain 
in release branches, it's on you to convince us otherwise.

And per my previous comment, I'll let Vaibhav and Ashutosh comment on the Hive 
usecase. I don't find tuning questions to be very relevant though until we've 
at least evaluated the simple solution of a fixed size pool of 10 threads.




[jira] [Commented] (HDFS-9924) [umbrella] Asynchronous HDFS Access

2016-06-13 Thread Tsz Wo Nicholas Sze (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328861#comment-15328861
 ] 

Tsz Wo Nicholas Sze commented on HDFS-9924:
---

> Per our bylaws, code is integrated based on consensus, ...

Which part of the bylaws are you talking about?




[jira] [Commented] (HDFS-9924) [umbrella] Asynchronous HDFS Access

2016-06-13 Thread Arpit Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328931#comment-15328931
 ] 

Arpit Agarwal commented on HDFS-9924:
-

Hi [~andrew.wang], the discussion has diverged from the merits of the API to 
branching.

A feature branch is not an end goal in itself - it should aid feature 
development. We have generally used feature branches to avoid destabilizing 
release branches and that doesn't appear to be a risk here. So branching just 
defers the API discussion to branch merge time.

Moving on to the merits of CompletableFuture vs Future - everyone seems to 
agree that CompletableFuture is a good choice for 3.x which addresses your 
original concerns. Everyone also agrees that CompletableFuture is obviously out 
for 2.x and Guava's ListenableFuture is a bad idea. The remaining point of 
disagreement appears to be whether we should expose a Future-based API for 2.x 
or do nothing. Since some downstream developers have expressed an interest in 
trying out a 2.x Future-based API even if it's tagged as Unstable/Experimental, 
is there a compelling reason to deny it? If Future turns out to be of no use to 
anyone we can evolve the API in a later 2.x release or just revert it 
completely while the way forward (3.x) remains unaffected.




[jira] [Commented] (HDFS-9924) [umbrella] Asynchronous HDFS Access

2016-06-13 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328974#comment-15328974
 ] 

stack commented on HDFS-9924:
-

Your summary and characterization of where the discussion stands is not correct, 
[~arpiagariu]. The discussion is still ongoing (CompletableFuture is a 
significant undertaking, a locally copied ListenableFuture or something like it 
is a possible candidate, etc.).

bq. Since some downstream developers have expressed an interest in trying out a 
2.x Future-based API even if it's tagged as Unstable/Experimental, is there a 
compelling reason to deny it?

I'd hope that it takes more than 'interest' to get code committed to HDFS.

bq. If Future turns out to be of no use to anyone we can evolve the API in a 
later 2.x release or just revert it completely while the way forward (3.x) 
remains unaffected.

If a technical argument on why Future will fix a codebase's scaling problem 
can't be produced, we can just skip the above evolutions and reverts altogether.





[jira] [Commented] (HDFS-9924) [umbrella] Asynchronous HDFS Access

2016-06-14 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15329291#comment-15329291
 ] 

Steve Loughran commented on HDFS-9924:
--

Stack, I said it. You don't need any other opinions :)




[jira] [Commented] (HDFS-9924) [umbrella] Asynchronous HDFS Access

2016-06-14 Thread Arpit Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15330056#comment-15330056
 ] 

Arpit Agarwal commented on HDFS-9924:
-

There are multiple comments from both sides indicating that CompletableFuture 
is the ideal option for 3.x.

bq. I'd hope that it takes more than 'interest' to get code committed to HDFS.
You mean just like we recently added 'avoid local nodes' because another 
downstream component wanted to try it? :)

bq. If a technical argument on why Future will fix a codebase's scaling 
problem can't be produced,
[~stack] what kind of argument are you looking for? The Hive engineers think 
they can make it work for them and there was a compromise proposed to introduce 
the API as unstable. So what is the compelling argument against this approach?




[jira] [Commented] (HDFS-9924) [umbrella] Asynchronous HDFS Access

2016-06-14 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15330173#comment-15330173
 ] 

stack commented on HDFS-9924:
-

bq. There are multiple comments from both sides indicating that 
CompletableFuture is the ideal option for 3.x.

[~arpiagariu] Please leave off concluding a discussion that is still ongoing 
(CF is not 'ideal' and is not a given). It doesn't help, sir.

bq. You mean just like we recently added 'avoid local nodes' because another 
downstream component wanted to try it? 

You misrepresent, again. HBase ran for years with a workaround while waiting on 
the behavior to show up in HDFS; i.e. the hbase project did not have an 
'interest' in 'avoid local nodes'; they required this behavior of the 
filesystem and ran with a suboptimal hack until it showed up.

In this case all we have is 'interest' and requests for technical justification 
go unanswered.

bq. The Hive engineers think they can make it work for them and there was a 
compromise proposed to introduce the API as unstable.

I'm interested in how Hive will do async w/ only a Future and in how this 
suboptimal API in particular will solve their issue (is it described 
anywhere?). In my experience, a bunch of rigging (threads) for polling, rather 
than notification, is required when all you have is a Future to work with.
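
To make the polling-versus-notification distinction concrete, here is a minimal Java sketch. It assumes Java 8 (for CompletableFuture), and the class and method names (FutureVsCallback, waitByAsking, notifyWhenDone) are hypothetical illustrations, not part of any HDFS API.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical illustration; none of these names are HDFS APIs.
class FutureVsCallback {

  // Plain Future: the caller must ask for the result, either by blocking in
  // get() as done here, or by polling isDone() from some thread.
  static String waitByAsking(Future<String> f) {
    try {
      return f.get();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    } catch (ExecutionException e) {
      throw new IllegalStateException(e.getCause());
    }
  }

  // CompletableFuture (Java 8+): the caller registers a continuation that is
  // invoked when the result arrives, so no thread has to wait for it.
  static AtomicReference<String> notifyWhenDone(CompletableFuture<String> f) {
    AtomicReference<String> seen = new AtomicReference<>();
    f.thenAccept(seen::set);
    return seen;
  }
}
```

With notifyWhenDone, the reference stays empty until something calls complete(...) on the future, at which point the callback fills it in. With only a Future, some thread has to call get() or isDone() to learn that the call finished.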







[jira] [Commented] (HDFS-9924) [umbrella] Asynchronous HDFS Access

2016-06-14 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15330655#comment-15330655
 ] 

Ashutosh Chauhan commented on HDFS-9924:


We can use a simple Future by making many calls in a loop, collecting the 
returned Futures in an array, and then calling Future#get on each one in a 
second loop. This simple usage solves our problem of making synchronous calls 
and waiting for the full round-trip latency of each call. A thread pool has 
high overhead: each thread needs about 1MB of memory, so 1000 threads need 
about 1GB, which is non-trivial. Additionally, callbacks are not needed for the 
Hive use case.
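
The issue-everything-then-collect pattern described above can be sketched as follows. This is only an illustration under stated assumptions: AsyncRenamer is a hypothetical stand-in interface (not the HDFS API), and the simulated client completes each Future immediately on the calling thread so that the sketch runs without a cluster.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.FutureTask;

// Hypothetical stand-in for an asynchronous file system client.
interface AsyncRenamer {
  Future<Boolean> rename(String src, String dst);
}

class BatchRename {

  // Issues every rename first, then waits for the results in a second loop.
  // Returns the number of renames that reported success.
  static int renameAll(AsyncRenamer fs, String[] srcs, String[] dsts) {
    List<Future<Boolean>> pending = new ArrayList<>();
    for (int i = 0; i < srcs.length; i++) {
      pending.add(fs.rename(srcs[i], dsts[i])); // returns without waiting
    }
    int succeeded = 0;
    for (Future<Boolean> f : pending) {
      try {
        if (f.get()) { // blocks only until this particular call is done
          succeeded++;
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IllegalStateException(e);
      } catch (ExecutionException e) {
        throw new IllegalStateException(e.getCause());
      }
    }
    return succeeded;
  }

  // Simulated client (assumption for the demo): a rename "succeeds" when the
  // source and destination differ. A real client would complete the Future
  // when the RPC response arrives.
  static AsyncRenamer simulated() {
    return (src, dst) -> {
      FutureTask<Boolean> task = new FutureTask<>(() -> !src.equals(dst));
      task.run(); // complete immediately on the calling thread
      return task;
    };
  }
}
```

Because all calls are issued before the first get(), the total wait is bounded by the slowest outstanding call rather than by the sum of all round trips, and no per-call thread is needed on the caller's side.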




[jira] [Commented] (HDFS-9924) [umbrella] Asynchronous HDFS Access

2016-06-14 Thread Xiaobing Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15330715#comment-15330715
 ] 

Xiaobing Zhou commented on HDFS-9924:
-

bq. In my experience, a bunch of rigging (threads) for polling, rather than 
notification, is required when all you have is a Future to work with.

This is not true. There's no need to do threaded polling. You can look at 
TestAsyncDFSRename#testConcurrentAsyncRename to see how simply it can be used, 
e.g.: 
{code}
// Maps each call index to the Future returned by that async rename.
Map<Integer, Future<Void>> retFutures = new HashMap<Integer, Future<Void>>();
// Bounds of the batch of calls whose results are fetched next.
int start = 0, end = 0;
for (int i = 0; i < NUM_TESTS; i++) {
  for (;;) {
    try {
      Future<Void> retFuture = adfs.rename(srcs[i], dsts[i], Rename.OVERWRITE);
      retFutures.put(i, retFuture);
      break;
    } catch (AsyncCallLimitExceededException e) {
      /**
       * reached limit of async calls, fetch results of finished async calls
       * to let follow-on calls go
       */
      start = end;
      end = i;
      waitForReturnValues(retFutures, start, end);
    }
  }
}
// Fetch the results of the calls that are still outstanding.
waitForReturnValues(retFutures, end, NUM_TESTS);

void waitForReturnValues(final Map<Integer, Future<Void>> retFutures,
    final int start, final int end)
    throws InterruptedException, ExecutionException {
  for (int i = start; i < end; i++) {
    retFutures.get(i).get();
  }
}
{code}




[jira] [Commented] (HDFS-9924) [umbrella] Asynchronous HDFS Access

2016-06-14 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15330867#comment-15330867
 ] 

stack commented on HDFS-9924:
-

I see. Thank you. I see what you want now.

Do you need just renames, or more than rename? You want to do thousands of 
concurrent renames this way? Is that even going to work? Are you going to knock 
over the NN? Or won't you just have a bunch of outstanding calls blocked on 
remote NN locks? Won't you want to constrict how many ongoing calls there are?
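
One way a caller could constrict the number of ongoing calls with nothing more than a Future is a sliding window: before issuing a new call, wait for the oldest outstanding one once the window is full. The sketch below is a hypothetical illustration of that idea; BoundedWindow and AsyncOp are made-up names, not HDFS APIs, and the limit is enforced purely on the client side.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

// Hypothetical illustration; none of these names are HDFS APIs.
class BoundedWindow {

  // Hypothetical asynchronous operation: starts call number `index` and
  // returns a Future for its result.
  interface AsyncOp<T> {
    Future<T> start(int index);
  }

  // Starts `total` calls while never holding more than `maxOutstanding`
  // un-awaited Futures. Returns the peak number of Futures held at once.
  static <T> int runAll(AsyncOp<T> op, int total, int maxOutstanding) {
    if (maxOutstanding < 1) {
      throw new IllegalArgumentException("maxOutstanding must be >= 1");
    }
    Deque<Future<T>> window = new ArrayDeque<>();
    int peak = 0;
    for (int i = 0; i < total; i++) {
      if (window.size() == maxOutstanding) {
        await(window.removeFirst()); // make room: wait for the oldest call
      }
      window.addLast(op.start(i));
      peak = Math.max(peak, window.size());
    }
    while (!window.isEmpty()) {
      await(window.removeFirst()); // drain the remaining calls
    }
    return peak;
  }

  private static <T> T await(Future<T> f) {
    try {
      return f.get();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    } catch (ExecutionException e) {
      throw new IllegalStateException(e.getCause());
    }
  }
}
```

Waiting for the oldest call is the simplest policy; it can stall behind one slow call while later ones have already finished, which is one of the places where a callback-style API would allow a tighter window.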



