[jira] [Commented] (HDFS-7966) New Data Transfer Protocol via HTTP/2
[ https://issues.apache.org/jira/browse/HDFS-7966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14741935#comment-14741935 ]

Duo Zhang commented on HDFS-7966:
---------------------------------

No, the test case uses multiple connections... But yes, this is not typical real-world usage. Let me try to deploy HBase on top of HDFS and run YCSB to collect some performance data.

Thanks.

> New Data Transfer Protocol via HTTP/2
> -------------------------------------
>
>                 Key: HDFS-7966
>                 URL: https://issues.apache.org/jira/browse/HDFS-7966
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>            Reporter: Haohui Mai
>            Assignee: Qianqian Shi
>              Labels: gsoc, gsoc2015, mentor
>         Attachments: GSoC2015_Proposal.pdf, TestHttp2LargeReadPerformance.svg, TestHttp2Performance.svg, TestHttp2ReadBlockInsideEventLoop.svg
>
>
> The current Data Transfer Protocol (DTP) implements a rich set of features that span across multiple layers, including:
> * Connection pooling and authentication (session layer)
> * Encryption (presentation layer)
> * Data writing pipeline (application layer)
> All these features are HDFS-specific and defined by implementation. As a result it requires non-trivial amount of work to implement HDFS clients and servers.
> This jira explores to delegate the responsibilities of the session and presentation layers to the HTTP/2 protocol. Particularly, HTTP/2 handles connection multiplexing, QoS, authentication and encryption, reducing the scope of DTP to the application layer only. By leveraging the existing HTTP/2 library, it should simplify the implementation of both HDFS clients and servers.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (HDFS-7966) New Data Transfer Protocol via HTTP/2
[ https://issues.apache.org/jira/browse/HDFS-7966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14741145#comment-14741145 ]

Haohui Mai commented on HDFS-7966:
----------------------------------

I don't understand the performance numbers. What does the 4% mean in the data? Did you repeat the experiment multiple times to get a 95% confidence interval? Can you please explain them a little more? A chart would definitely help.

It looks to me like the test case has only a single connection performing reads for a single block? In production there will be tens of thousands of concurrent reads. Does the thread pool help? What does the current implementation look like in this scenario?
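As a rough illustration of the kind of interval asked about here, the sketch below computes a 95% confidence interval for the mean of five repeated runs, using the timings from the 1GB large-read comment elsewhere in this thread. It assumes the runs are independent and roughly normally distributed. The function name `confidence_interval_95` and the hard-coded Student's t value are illustrative choices, not part of the HDFS-7966 code.

```python
import math
import statistics

# Timings in ms copied from the 1GB large-read comment in this thread.
HTTP2_MS = [9953, 9967, 9954, 9985, 9976]
TCP_MS = [9383, 9375, 9377, 9373, 9376]

# Two-sided 95% critical value of Student's t for 4 degrees of freedom
# (5 runs - 1). Hard-coded to keep the sketch dependency-free.
T_CRIT_DF4 = 2.776


def confidence_interval_95(samples):
    """Return (mean, half_width) of a 95% CI; only valid for 5 samples."""
    if len(samples) != 5:
        raise ValueError("T_CRIT_DF4 only applies to exactly 5 samples")
    sample_mean = statistics.mean(samples)
    # Standard error of the mean, using the sample standard deviation.
    std_err = statistics.stdev(samples) / math.sqrt(len(samples))
    return sample_mean, T_CRIT_DF4 * std_err


for name, samples in (("http2", HTTP2_MS), ("tcp", TCP_MS)):
    sample_mean, half_width = confidence_interval_95(samples)
    print(f"{name}: {sample_mean:.1f} ms +/- {half_width:.1f} ms")
```

With only five runs per mode the intervals are coarse, but for this particular data set they do not overlap, which suggests that the gap in the large-read test is larger than the run-to-run noise.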
[jira] [Commented] (HDFS-7966) New Data Transfer Protocol via HTTP/2
[ https://issues.apache.org/jira/browse/HDFS-7966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14736658#comment-14736658 ]

Duo Zhang commented on HDFS-7966:
---------------------------------

Netty-4.1.0Beta6 is out so I'm back.

I have added a simple {{asyncRead}} method (not fully asynchronous, since this is only a POC) to {{DFSInputStream}} and written a performance test for it. Here are the test results (each test was run twice):

{noformat}
./bin/hadoop org.apache.hadoop.hdfs.web.http2.PerformanceTest async /test 100 5 4096
// 100 here means the max concurrency, which is used to prevent OOM.
*** time based on http2 230946

./bin/hadoop org.apache.hadoop.hdfs.web.http2.PerformanceTest async /test 100 5 4096
*** time based on http2 231066

./bin/hadoop org.apache.hadoop.hdfs.web.http2.PerformanceTest tcp /test 100 5 4096 pread
*** time based on tcp 231410

./bin/hadoop org.apache.hadoop.hdfs.web.http2.PerformanceTest tcp /test 100 5 4096 pread
*** time based on tcp 231038

./bin/hadoop org.apache.hadoop.hdfs.web.http2.PerformanceTest http2 /test 100 5 4096 pread
*** time based on http2 236069

./bin/hadoop org.apache.hadoop.hdfs.web.http2.PerformanceTest http2 /test 100 5 4096 pread
*** time based on http2 231773
{noformat}

The performance difference is ~±4% and async is a little better than tcp.

Thanks.
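One possible way to read these six timings, offered only as a sketch: compare each mode's mean against the tcp mean, and compare that gap with the spread between the two runs of the same mode. The helper names `percent_diff` and `run_spread` are hypothetical, and two runs per mode are too few for any statistical claim.

```python
from statistics import mean

# Elapsed times copied from the comment above; two runs per mode.
RUNS = {
    "async": [230946, 231066],
    "tcp": [231410, 231038],
    "http2": [236069, 231773],
}


def percent_diff(value, baseline):
    """Signed difference of value relative to baseline, in percent."""
    return (value - baseline) / baseline * 100.0


def run_spread(samples):
    """Gap between the slowest and fastest run, relative to the fastest."""
    return percent_diff(max(samples), min(samples))


tcp_mean = mean(RUNS["tcp"])
for mode, samples in RUNS.items():
    vs_tcp = percent_diff(mean(samples), tcp_mean)
    print(f"{mode}: {vs_tcp:+.2f}% vs tcp mean, spread {run_spread(samples):.2f}%")
```

Computed this way, the spread between the two http2 runs is larger than the gap between any two mode means, so these runs alone do not separate the three modes.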
[jira] [Commented] (HDFS-7966) New Data Transfer Protocol via HTTP/2
[ https://issues.apache.org/jira/browse/HDFS-7966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14652579#comment-14652579 ]

Haohui Mai commented on HDFS-7966:
----------------------------------

bq. What's the upside of this new implementation?

Performance is definitely one important factor. One of the motivations is to improve the efficiency of the DN when there are hundreds of thousands of reads, by reducing the overhead of context switches. [~Apache9], do you have any performance numbers for this scenario?

HTTP/2-based DTP also serves as a building block for the next level of innovation. To quote the description in the jira:

{quote}
This jira explores to delegate the responsibilities of the session and presentation layers to the HTTP/2 protocol. Particularly, HTTP/2 handles connection multiplexing, QoS, authentication and encryption, reducing the scope of DTP to the application layer only. By leveraging the existing HTTP/2 library, it should simplify the implementation of both HDFS clients and servers.
{quote}

bq. If it were the same performance but had other redeeming qualities (e.g. less code) then it's still worth consideration.

This is designed to be a new code path so that it is compatible with older releases. You can still rely on the old DTP protocol, depending on the application scenario.
[jira] [Commented] (HDFS-7966) New Data Transfer Protocol via HTTP/2
[ https://issues.apache.org/jira/browse/HDFS-7966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14652698#comment-14652698 ]

Duo Zhang commented on HDFS-7966:
---------------------------------

I do not have enough machines to test that scenario...

What I see when I create lots of threads to read from the datanode concurrently is that HTTP/2 starts the requests almost at the same time, but TCP starts the requests one by one (or maybe in batches, where the batch size is the CPU count). So there won't be a situation where the DN really handles lots of concurrent reads from clients, and the context switching may be smaller than in the HTTP/2 implementation, since we also have a ThreadPool besides the EventLoopGroup in an HTTP/2 connection.

And what makes things worse is that our client is not event-driven, so we cannot reduce the thread count of the client...

Let me see if I can construct a scenario where HTTP/2 is faster than TCP...

Thanks.
[jira] [Commented] (HDFS-7966) New Data Transfer Protocol via HTTP/2
[ https://issues.apache.org/jira/browse/HDFS-7966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14652657#comment-14652657 ]

Andrew Wang commented on HDFS-7966:
-----------------------------------

Agree there potentially are performance advantages, but it looks like all the benchmarks thus far show worse performance. I'd be very happy to see positive results, since erasure coding will lead to a lot more remote reads and thus possibly hitting this code path.

There has to be some upside though for this to be merged. The existing DTP already implements a number of the features mentioned, so not sure how much we gain there. And if perf isn't as good or better, then we're increasing our maintenance burden for something that won't get used.
[jira] [Commented] (HDFS-7966) New Data Transfer Protocol via HTTP/2
[ https://issues.apache.org/jira/browse/HDFS-7966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14652549#comment-14652549 ]

Andrew Wang commented on HDFS-7966:
-----------------------------------

I guess my question here is similar to what [~stack] and [~tlipcon] posed at the beginning. What's the upside of this new implementation? It seems to be between 10% and 30% slower than the current implementation, which is not good. If it were the same performance but had other redeeming qualities (e.g. less code) then it's still worth consideration.
[jira] [Commented] (HDFS-7966) New Data Transfer Protocol via HTTP/2
[ https://issues.apache.org/jira/browse/HDFS-7966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14651021#comment-14651021 ]

Duo Zhang commented on HDFS-7966:
---------------------------------

I modified {{Http2ConnectionPool}} to allow creating multiple HTTP/2 connections to one datanode (the default is 10). Here are the test results:

{noformat}
./bin/hadoop org.apache.hadoop.hdfs.web.http2.PerformanceTest tcp /test 10 10 1024 pread
*** time based on tcp 38343

./bin/hadoop org.apache.hadoop.hdfs.web.http2.PerformanceTest http2 /test 10 10 1024 pread
*** time based on http2 45799

./bin/hadoop org.apache.hadoop.hdfs.web.http2.PerformanceTest tcp /test 100 1 1024 pread
*** time based on tcp 20206

./bin/hadoop org.apache.hadoop.hdfs.web.http2.PerformanceTest http2 /test 100 1 1024 pread
*** time based on http2 21980

./bin/hadoop org.apache.hadoop.hdfs.web.http2.PerformanceTest tcp /test 500 2000 1024 pread
*** time based on tcp 20146

./bin/hadoop org.apache.hadoop.hdfs.web.http2.PerformanceTest http2 /test 500 2000 1024 pread
*** time based on http2 22461
{noformat}

HTTP/2 is 19%, 9% and 11% slower than TCP, respectively.

Notice that {{DFSClient}} is not event-driven, so we have more threads on the client side when using HTTP/2. Given that, I think the performance here is acceptable? We could introduce a new event-driven {{FileSystem}} (maybe like HDFS-8707?) later to improve client performance.

The performance testing is almost done here. Next I will begin to pick code from the POC branch into the HDFS-7966 branch.

Thanks.
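To put the 19%, 9% and 11% figures next to the earlier single-connection small-read results posted in this thread, the sketch below recomputes the HTTP/2 slowdown for the three thread configurations that appear in both comments. The dictionary keys simply repeat the two numeric arguments as printed in the commands, and `slowdown_percent` is a hypothetical helper, not part of `PerformanceTest`.

```python
# (tcp_time, http2_time) per configuration, copied from the two comments.
# Keys are the thread-count and read-count arguments as printed.
SINGLE_CONNECTION = {
    "10 10": (40688, 82819),
    "100 1": (21612, 69658),
    "500 2000": (19931, 151727),
}
POOLED_CONNECTIONS = {
    "10 10": (38343, 45799),
    "100 1": (20206, 21980),
    "500 2000": (20146, 22461),
}


def slowdown_percent(tcp_time, http2_time):
    """How much slower HTTP/2 is than TCP, in percent of the TCP time."""
    return (http2_time / tcp_time - 1.0) * 100.0


for config in SINGLE_CONNECTION:
    before = slowdown_percent(*SINGLE_CONNECTION[config])
    after = slowdown_percent(*POOLED_CONNECTIONS[config])
    print(f"{config}: {before:.1f}% slower with one connection, "
          f"{after:.1f}% slower with the connection pool")
```

Under this reading, most of the multi-threaded gap in the earlier run came from funnelling every request through a single connection rather than from HTTP/2 framing itself.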
[jira] [Commented] (HDFS-7966) New Data Transfer Protocol via HTTP/2
[ https://issues.apache.org/jira/browse/HDFS-7966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14650242#comment-14650242 ]

Duo Zhang commented on HDFS-7966:
---------------------------------

OK, I'm back. Here are the test results for HTTP/2 with the context-switch overhead removed. See the 'noswitch' part in
https://github.com/Apache9/hadoop/blob/HDFS-7966-POC/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/web/http2/PerformanceTest.java

I also removed the thread pool in ReadBlockHandler.

{noformat}
./bin/hadoop org.apache.hadoop.hdfs.web.http2.PerformanceTest tcp /test 1 100 1024 pread
*** time based on tcp 260776

./bin/hadoop org.apache.hadoop.hdfs.web.http2.PerformanceTest http2 /test 1 100 1024 pread
*** time based on http2 301257

./bin/hadoop org.apache.hadoop.hdfs.web.http2.PerformanceTest noswitch /test 100 1024
*** time based on http2 264012
{noformat}

(264012 - 260776) / 260776 = 0.012

So if I remove the context switches, HTTP/2 is only 1.2% slower than TCP. Of course it is not acceptable to write code like this in real production. It is only used to prove that context switching is the primary overhead.

And in fact, although HTTP/2 is about 30% slower than TCP in this case, it is still fast enough I think? It is only 0.32ms per read, so maybe it is acceptable?

Thanks.
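The claim that context switching is the primary overhead can be checked with a little arithmetic on the three timings above. The sketch below is only an illustration: it treats the difference between the regular http2 run and the noswitch run as the part of the overhead attributable to context switching, which assumes the three runs are otherwise comparable. The variable names are hypothetical.

```python
# Timings copied from the three runs in the comment above.
TCP_TIME = 260776
HTTP2_TIME = 301257
NOSWITCH_TIME = 264012

# Extra time HTTP/2 needs compared with TCP in this run.
total_overhead = HTTP2_TIME - TCP_TIME
# Part of that extra time that disappears in the noswitch variant.
removed_by_noswitch = HTTP2_TIME - NOSWITCH_TIME

http2_slowdown = total_overhead / TCP_TIME
noswitch_slowdown = (NOSWITCH_TIME - TCP_TIME) / TCP_TIME
switch_share = removed_by_noswitch / total_overhead

print(f"http2 vs tcp:    {http2_slowdown:.1%} slower")
print(f"noswitch vs tcp: {noswitch_slowdown:.1%} slower")
print(f"share of overhead removed by noswitch: {switch_share:.0%}")
```

For this particular run the regular http2 timing works out to roughly 15.5% slower than tcp, and about 92% of that gap goes away in the noswitch variant.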
[jira] [Commented] (HDFS-7966) New Data Transfer Protocol via HTTP/2
[ https://issues.apache.org/jira/browse/HDFS-7966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14633089#comment-14633089 ]

Duo Zhang commented on HDFS-7966:
---------------------------------

I wrote a single-threaded test case that does all the test work inside the event loop.

https://github.com/Apache9/hadoop/blob/HDFS-7966-POC/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/web/dtp/TestHttp2ReadBlockInsideEventLoop.java

And on the server side, I removed the thread pool in {{ReadBlockHandler}}.

The result is:

{noformat}
*** time based on tcp 17734ms
*** time based on http2 20019ms

*** time based on tcp 18878ms
*** time based on http2 21422ms

*** time based on tcp 17562ms
*** time based on http2 20568ms

*** time based on tcp 18726ms
*** time based on http2 20251ms

*** time based on tcp 18632ms
*** time based on http2 21227ms
{noformat}

The average time of the original tcp is 18306.4ms, and of HTTP/2 is 20697.4ms.
20697.4 / 18306.4 = 1.13, so HTTP/2 is 13% slower than tcp. In the previous test it was 30% slower, so I think context switching may be one of the reasons why HTTP/2 is much slower than tcp. I will run this test on a real cluster to get more data.

As for having only one {{EventLoop}} per datanode, I think that is a problem on a small cluster. So I think we should allow creating multiple HTTP/2 connections to one datanode. I will modify {{Http2ConnectionPool}} and do the test again.

Thanks.
[jira] [Commented] (HDFS-7966) New Data Transfer Protocol via HTTP/2
[ https://issues.apache.org/jira/browse/HDFS-7966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14627827#comment-14627827 ]

Duo Zhang commented on HDFS-7966:
---------------------------------

Small read test using {{PerformanceTest}}. The unit is milliseconds.

{noformat}
./bin/hadoop org.apache.hadoop.hdfs.web.http2.PerformanceTest tcp /test 1(thread number) 100(read count per thread) 1024(bytes per read) pread(use pread)
{noformat}

{noformat}
./bin/hadoop org.apache.hadoop.hdfs.web.http2.PerformanceTest tcp /test 1 100 1024 pread
*** time based on tcp 242730

./bin/hadoop org.apache.hadoop.hdfs.web.http2.PerformanceTest http2 /test 1 100 1024 pread
*** time based on http2 324491

./bin/hadoop org.apache.hadoop.hdfs.web.http2.PerformanceTest tcp /test 10 10 1024 pread
*** time based on tcp 40688

./bin/hadoop org.apache.hadoop.hdfs.web.http2.PerformanceTest http2 /test 10 10 1024 pread
*** time based on http2 82819

./bin/hadoop org.apache.hadoop.hdfs.web.http2.PerformanceTest tcp /test 100 1 1024 pread
*** time based on tcp 21612

./bin/hadoop org.apache.hadoop.hdfs.web.http2.PerformanceTest http2 /test 100 1 1024 pread
*** time based on http2 69658

./bin/hadoop org.apache.hadoop.hdfs.web.http2.PerformanceTest tcp /test 500 2000 1024 pread
*** time based on tcp 19931

./bin/hadoop org.apache.hadoop.hdfs.web.http2.PerformanceTest http2 /test 500 2000 1024 pread
*** time based on http2 151727

./bin/hadoop org.apache.hadoop.hdfs.web.http2.PerformanceTest http2 /test 1000 1000 1024 pread
*** time based on http2 251735
{noformat}

For the single-threaded test, 324491/242730=1.34, so http2 is about 30% slower than tcp. I will try to find the overhead later.

For the multi-threaded tests, http2 is much slower than tcp. And tcp failed the 1000-thread test.

I think the problem is that I only use one connection in http2, so there is only one EventLoop (which means only one thread) that sends or receives data. For tcp, the number of threads is the same as the number of connections. The {{%CPU}} of the datanode when using http2 is always around 100%, no matter whether the thread number is 10, 100 or 1000. But when using tcp, the {{%CPU}} can go higher than 1500% as the number of threads increases.

Next I will write a new test which can use multiple http2 connections.

Thanks.
[jira] [Commented] (HDFS-7966) New Data Transfer Protocol via HTTP/2
[ https://issues.apache.org/jira/browse/HDFS-7966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14627512#comment-14627512 ]

Duo Zhang commented on HDFS-7966:
---------------------------------

I have to say sorry... The performance result above is useless, since the flow control part of my code did not work at that time... I found this out when I tried to transfer a 512MB block and got an OOM...

I have rewritten the flow control part, and set up a cluster with 1 NN and DN to evaluate the performance.

There is a netty bug (https://github.com/netty/netty/pull/3929), so I need to modify my code when running different tests.

The performance test code is here:
https://github.com/Apache9/hadoop/blob/HDFS-7966-POC/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/web/http2/PerformanceTest.java

First I ran a large read test on 1 file with a 1GB block. Each test ran 5 times, with the commands

{noformat}
./bin/hadoop org.apache.hadoop.hdfs.web.http2.PerformanceTest http2 /test 1 1 1073741824

./bin/hadoop org.apache.hadoop.hdfs.web.http2.PerformanceTest tcp /test 1 1 1073741824
{noformat}

Note that I set {{dfs.datanode.transferTo.allowed}} to {{false}}, since the http2 implementation cannot use transferTo (I'm currently working on implementing {{FileRegion}} support in netty-http2-codec, see https://github.com/netty/netty/issues/3927).

The result is:

{noformat}
*** time based on http2 9953
*** time based on http2 9967
*** time based on http2 9954
*** time based on http2 9985
*** time based on http2 9976

*** time based on tcp 9383
*** time based on tcp 9375
*** time based on tcp 9377
*** time based on tcp 9373
*** time based on tcp 9376
{noformat}

The average latency of http2 is 9967ms, and for tcp it is 9376.8ms.
9967/9376.8=1.063, so http2 is about 6% slower than tcp. I think this is an acceptable result?

Let me test small reads later and post the results here.

Thanks.
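The same timings can also be expressed as throughput, which may be easier to compare with disk or network limits. The sketch below assumes that each reported time covers exactly one 1073741824-byte read (one thread, one read, as in the commands above) and that the times are in milliseconds. `throughput_mib_per_s` is a hypothetical helper, not part of `PerformanceTest`.

```python
BLOCK_BYTES = 1073741824  # bytes per read, from the commands above

# Five runs per mode, in ms, copied from the result block above.
HTTP2_MS = [9953, 9967, 9954, 9985, 9976]
TCP_MS = [9383, 9375, 9377, 9373, 9376]


def average(samples):
    """Arithmetic mean of a non-empty list of timings."""
    return sum(samples) / len(samples)


def throughput_mib_per_s(total_bytes, elapsed_ms):
    """Bytes transferred per elapsed time, converted to MiB per second."""
    mib = total_bytes / (1024 * 1024)
    seconds = elapsed_ms / 1000.0
    return mib / seconds


http2_rate = throughput_mib_per_s(BLOCK_BYTES, average(HTTP2_MS))
tcp_rate = throughput_mib_per_s(BLOCK_BYTES, average(TCP_MS))
print(f"http2: {http2_rate:.1f} MiB/s")
print(f"tcp:   {tcp_rate:.1f} MiB/s")
print(f"http2/tcp time ratio: {average(HTTP2_MS) / average(TCP_MS):.3f}")
```

Under these assumptions both modes move a little over 100 MiB/s, and the time ratio matches the 1.063 quoted above.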
[jira] [Commented] (HDFS-7966) New Data Transfer Protocol via HTTP/2
[ https://issues.apache.org/jira/browse/HDFS-7966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14623778#comment-14623778 ]

Duo Zhang commented on HDFS-7966:
---------------------------------

And there is some good news: in {{TestHttp2RandomReadPerformance}}, the HTTP/2 implementation beats the old implementation. We create 4000 connections to one datanode, and use one thread to pick a connection and read a small chunk of data, sequentially.

The result is:

{noformat}
*** time based on http2 124ms
*** time based on tcp 274ms
{noformat}

And 5000 connections will cause an OOM (cannot create thread) in the tcp test. I think this is reasonable, since an NIO-based framework uses far fewer threads than OIO.

I'm busy these days, so I only have some time at weekends. I will run these tests on a cluster ASAP.

Thanks.
[jira] [Commented] (HDFS-7966) New Data Transfer Protocol via HTTP/2
[ https://issues.apache.org/jira/browse/HDFS-7966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14617005#comment-14617005 ]

Haohui Mai commented on HDFS-7966:
----------------------------------

Thanks for the benchmark. Some questions:

* Do you have an idea whether the overhead is coming from the server side or the client side implementation? Does it help when you're using {{curl}} as the client?
* How does the size of read affect performance? How does HTTP/2 perform for large reads (e.g., a full block size)?
[jira] [Commented] (HDFS-7966) New Data Transfer Protocol via HTTP/2
[ https://issues.apache.org/jira/browse/HDFS-7966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14617985#comment-14617985 ]

Duo Zhang commented on HDFS-7966:

I think this is the worst scenario for testing an NIO framework. NIO is considered to use fewer threads than OIO, but in this test NIO needs at least 4 threads and OIO needs only 2. You can see in the flame graph that context switching costs a lot (ThreadPoolExecutor related operations, EventLoop.execute, selector.wakeup, etc.). The buffer pooling here is also redundant: in OIO, there is one buffer for the server and one buffer for the client. Finally, I think testing through localhost makes things worse, since network speed and latency are no longer the bottleneck.

I plan to test these things next:

1. Read a large block (256MB or more).
2. Simulate the scenario where the datanode caches a lot of connections from different machines and only a few of them read at the same time.
3. Run all tests on a real cluster (which means reading data from another machine).

Thanks.
[jira] [Commented] (HDFS-7966) New Data Transfer Protocol via HTTP/2
[ https://issues.apache.org/jira/browse/HDFS-7966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14591434#comment-14591434 ]

Duo Zhang commented on HDFS-7966:

https://github.com/Apache9/hadoop/tree/HDFS-7966-POC

Will implement read block on this branch and collect some performance results first. Thanks.
[jira] [Commented] (HDFS-7966) New Data Transfer Protocol via HTTP/2
[ https://issues.apache.org/jira/browse/HDFS-7966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14588688#comment-14588688 ]

stack commented on HDFS-7966:

bq. The code path is completely separated so there should be few risks in terms of destabilizing the trunk.

My concern is not so much destabilization. My concern is a bunch of new code that may never get used.

Reviewing the patches so far, it seems like you fellas are working it out as you go. Nothing wrong with that. It just seems like something better done in a branch than in mainline.

The justification for this work is a little nebulous. It has it that [DTP on HTTP/2] ...should simplify the implementation of both HDFS clients and servers. Apart from the fact that DTP is but a severe subset, the 'easy' part, of what an alternative client/server would have to implement, what if HTTP/2 complicates rather than simplifies new clients and servers? Better to figure this out, along with 'fit criteria' that prove it simplifies, on a branch I'd say.

Also, why would folks move to using this new transport? Will it be more performant than current DTP? (I'd guess not... given HTTP/2 does a bunch of 'extras' and going by the PoC done over in HBase.) When complete, we might have a bunch of new code that is slower than what is currently there and that folks are wary of trying given it is 'new'. This state of affairs could go on such that the code is never exercised.

I am suggesting a branch because there you can work out the implementation, the perf characteristics, and answers to questions such as the samples posed above, and come merge time you will have a more solid story to tell.
[jira] [Commented] (HDFS-7966) New Data Transfer Protocol via HTTP/2
[ https://issues.apache.org/jira/browse/HDFS-7966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14588913#comment-14588913 ]

Todd Lipcon commented on HDFS-7966:

+1 to what Stack said. We already have many access mechanisms to HDFS. Adding another, unless it has some clear value, just adds more code to maintain, more dependencies, etc.

There should be some clear criteria for merging this into HDFS proper -- either that there's a strong performance argument or proof that it's substantially easier to implement this vs the existing protocol (e.g., if the new server/client support all the features of the existing client at the same performance but end up being substantially fewer lines of code to maintain).
[jira] [Commented] (HDFS-7966) New Data Transfer Protocol via HTTP/2
[ https://issues.apache.org/jira/browse/HDFS-7966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14589122#comment-14589122 ]

Haohui Mai commented on HDFS-7966:

bq. Should we get Duo Zhang set up w/ commit rights on the branch?

This is a good idea. I'll start the discussion and get things going.
[jira] [Commented] (HDFS-7966) New Data Transfer Protocol via HTTP/2
[ https://issues.apache.org/jira/browse/HDFS-7966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14589024#comment-14589024 ]

Haohui Mai commented on HDFS-7966:

Thanks for the input. I think that the arguments for a feature branch are valid. I'll create a feature branch shortly.
[jira] [Commented] (HDFS-7966) New Data Transfer Protocol via HTTP/2
[ https://issues.apache.org/jira/browse/HDFS-7966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14589040#comment-14589040 ]

stack commented on HDFS-7966:

Let me help if I can. Should we get [~Apache9] set up w/ commit rights on the branch?
[jira] [Commented] (HDFS-7966) New Data Transfer Protocol via HTTP/2
[ https://issues.apache.org/jira/browse/HDFS-7966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14584160#comment-14584160 ]

Haohui Mai commented on HDFS-7966:

bq. Do you think this should go into a branch first?

I think the development can be done in mainline. The code path is completely separated so there should be few risks in terms of destabilizing the trunk.
[jira] [Commented] (HDFS-7966) New Data Transfer Protocol via HTTP/2
[ https://issues.apache.org/jira/browse/HDFS-7966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14578453#comment-14578453 ]

stack commented on HDFS-7966:

bq. Hopefully we can get things committed soon and continue to make progress.

Do you think this should go into a branch first [~wheat9]? It is shaping up as a nice set of sizeable, related patches. Thanks.
[jira] [Commented] (HDFS-7966) New Data Transfer Protocol via HTTP/2
[ https://issues.apache.org/jira/browse/HDFS-7966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14578303#comment-14578303 ]

Haohui Mai commented on HDFS-7966:

Sorry for the late reply. I got caught up in multiple things.

bq. 2 - If you're using your own payload encoding then tagging flush points in a streaming RPC seems pretty trivial.

Thanks very much for the information. Based on it, it looks like it is possible to use a standard GRPC client to speak the new DTP protocol. That would save a lot of effort on implementing a client for the DTP protocol. I would really appreciate any pointers to the client / server code.

bq. Generally interested in progress if any. No harm if none. Thanks.

Currently [~Apache9] is making progress on HDFS-8515 and HDFS-8471. I have been working closely with [~Apache9] on this. Hopefully we can get things committed soon and continue to make progress.
[jira] [Commented] (HDFS-7966) New Data Transfer Protocol via HTTP/2
[ https://issues.apache.org/jira/browse/HDFS-7966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14556920#comment-14556920 ]

stack commented on HDFS-7966:

bq. What I was saying is that I'm unsure whether a standard grpc client would be able to understand this variant.

Sounds like it might ([~louiscryan] might be able to help out here if issues according to above).

Generally interested in progress if any. No harm if none. Thanks.
[jira] [Commented] (HDFS-7966) New Data Transfer Protocol via HTTP/2
[ https://issues.apache.org/jira/browse/HDFS-7966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14542307#comment-14542307 ]

Louis Ryan commented on HDFS-7966:

Hey, I'm the one working on the grpc impl for HBase so I can help answer questions on that.

1 - grpc is a general purpose, content-type-agnostic transport layer so you can happily use it without using proto.
2 - If you're using your own payload encoding then tagging flush points in a streaming RPC seems pretty trivial.

If you'd like to try grpc let me know and I can provide pointers / answer questions.
[jira] [Commented] (HDFS-7966) New Data Transfer Protocol via HTTP/2
[ https://issues.apache.org/jira/browse/HDFS-7966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14540331#comment-14540331 ]

stack commented on HDFS-7966:

For your consideration:

On 1., are you sure this is the case? There is a hacked up grpc poc over in hbaseland that implements hbase's metadata-in-pb-and-data-follows-behind. Here is client write: https://github.com/louiscryan/hbase/commit/e2dae9e8e3f648125f1e13e0ff88944935ad39a3#diff-f5daefd77cbc7bbca73187505c2a9c13R179

On 2., does the write have to be streaming? Can't it be a sized write with metadata to say h*() when all received?

Let me ask my migration question another way: We thinking this will be an incompatible change?

Thanks.
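The idea in point 2 of the comment, a sized write whose metadata says whether to flush once all bytes are received, can be sketched as follows. This is only an illustration under stated assumptions: the 5-byte header layout, the flag values, and the function names are hypothetical and are not the DataTransferProtocol or grpc wire format.

```python
import struct
from typing import List, Tuple

# Hypothetical flag values carried in the per-write metadata.
FLAG_NONE = 0
FLAG_HFLUSH = 1
FLAG_HSYNC = 2

# Hypothetical header: 4-byte big-endian payload length, then a 1-byte flag.
HEADER = struct.Struct(">IB")


def encode_write(data: bytes, flag: int = FLAG_NONE) -> bytes:
    """Prefix the payload with its size and the requested flush action."""
    return HEADER.pack(len(data), flag) + data


def decode_writes(stream: bytes) -> List[Tuple[bytes, int]]:
    """Split a byte stream back into (payload, flag) pairs."""
    writes = []
    pos = 0
    while pos < len(stream):
        length, flag = HEADER.unpack_from(stream, pos)
        pos += HEADER.size
        writes.append((stream[pos:pos + length], flag))
        pos += length
    return writes
```

Because every write announces its size up front, the receiver knows when the payload is complete and can act on the flag at that point, without needing a separate streaming control message.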
[jira] [Commented] (HDFS-7966) New Data Transfer Protocol via HTTP/2
[ https://issues.apache.org/jira/browse/HDFS-7966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14540286#comment-14540286 ]

Haohui Mai commented on HDFS-7966:

Let me try to answer the questions.

There are two reasons why the proposal needs to diverge from grpc: (1) grpc requires the response to fit into a protobuf message, which a large read request (> 2GB) cannot do, and (2) the write cannot be a single streaming rpc, as the protocol needs to implement hflush() and hsync() as well.

Note that evolving the read path is relatively straightforward, as the implementation only needs to provide another implementation of {{BlockReader}}. The write path, however, might require implementing a new {{DFSOutputStream}}.

There should be no new port required -- the plan is to listen on the HTTP/HTTPS port that is available on the DN today.
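Reason (1) in the comment above is about a read that is too large for one message. A minimal sketch of the general workaround, splitting one large read into bounded pieces, is shown here. The function name and the chunk size are assumptions made for illustration; the thread does not specify how the proposal frames large reads.

```python
from typing import Iterator, Tuple


def split_read(offset: int, length: int, max_chunk: int) -> Iterator[Tuple[int, int]]:
    """Yield (offset, size) pairs that cover [offset, offset + length).

    Every size is at most max_chunk, so each piece fits in a bounded message.
    """
    if max_chunk <= 0:
        raise ValueError("max_chunk must be positive")
    end = offset + length
    while offset < end:
        size = min(max_chunk, end - offset)
        yield offset, size
        offset += size


# A 25-byte read starting at offset 10, in pieces of at most 10 bytes.
print(list(split_read(10, 25, 10)))  # [(10, 10), (20, 10), (30, 5)]
```

The same function applied to a 3 GiB read with a 64 MiB limit yields 48 pieces, each of which stays far below the size limit of a single message.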
[jira] [Commented] (HDFS-7966) New Data Transfer Protocol via HTTP/2
[ https://issues.apache.org/jira/browse/HDFS-7966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14540405#comment-14540405 ]

Haohui Mai commented on HDFS-7966:

Thanks for the info. I think that the proposal handles reads the same way as the link you posted when streaming large read requests. What I was saying is that I'm unsure whether a standard grpc client would be able to understand this variant.

For (2) I think it makes sense and I believe that this is what the current DataTransferProtocol is doing. We can definitely adopt this idea.

bq. Let me ask my migration question another way: We thinking this will be an incompatible change?

I'm probably missing something -- the new DTP will operate on the HTTPS port with a new URL, so it should be backward-compatible.
[jira] [Commented] (HDFS-7966) New Data Transfer Protocol via HTTP/2
[ https://issues.apache.org/jira/browse/HDFS-7966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14540424#comment-14540424 ]

stack commented on HDFS-7966:

bq. I'm probably missing something ..

You answered. New URL. So you have to ask for it. Ok. Thanks.
[jira] [Commented] (HDFS-7966) New Data Transfer Protocol via HTTP/2
[ https://issues.apache.org/jira/browse/HDFS-7966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14537805#comment-14537805 ]

zhangduo commented on HDFS-7966:

The latest patch of HDFS-5270 proves that the logic of FsDataset and the writing pipeline can be compatible with a thread pool implementation. But HDFS-5270 does not address the basic issue: one thread per connection (maybe two?). This makes client connection pooling, which is very important for HBase, impossible in a large cluster.

So I think it is time to pick up this issue. Thanks.
[jira] [Commented] (HDFS-7966) New Data Transfer Protocol via HTTP/2
[ https://issues.apache.org/jira/browse/HDFS-7966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14502137#comment-14502137 ]

stack commented on HDFS-7966:

Yes. Do you have pointers [~wheat9] on where to comment on the GSoC site? Thanks.
[jira] [Commented] (HDFS-7966) New Data Transfer Protocol via HTTP/2
[ https://issues.apache.org/jira/browse/HDFS-7966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14497003#comment-14497003 ] Haohui Mai commented on HDFS-7966: There is one proposal from Qianqian Shi in the GSoC system. I don't know the details of the process, but it looks like it is being reviewed. [~stack] can you please help with it? Thanks.
[jira] [Commented] (HDFS-7966) New Data Transfer Protocol via HTTP/2
[ https://issues.apache.org/jira/browse/HDFS-7966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14493599#comment-14493599 ] stack commented on HDFS-7966: Any progress on this issue? (A student associated?) Thanks.
[jira] [Commented] (HDFS-7966) New Data Transfer Protocol via HTTP/2
[ https://issues.apache.org/jira/browse/HDFS-7966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14379298#comment-14379298 ] Haohui Mai commented on HDFS-7966: Hi students, please take a look at the GSoC 2015 FAQ and submit a proposal. Thanks.
[jira] [Commented] (HDFS-7966) New Data Transfer Protocol via HTTP/2
[ https://issues.apache.org/jira/browse/HDFS-7966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14375371#comment-14375371 ] SIVA NAGARAJU commented on HDFS-7966: I AM ALSO INTERESTED IN BEING PART OF IT. I KNOW HDFS AND MAPREDUCE PROGRAMMING.
[jira] [Commented] (HDFS-7966) New Data Transfer Protocol via HTTP/2
[ https://issues.apache.org/jira/browse/HDFS-7966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14372698#comment-14372698 ] Nisala Mendis commented on HDFS-7966: Hi all, I am Nisala Nirmana, a postgraduate of the University of Colombo, Sri Lanka, specialized in computer and network security. I am also partly working as a research engineer at Dialog communications at the University of Moratuwa, Sri Lanka. I am interested in this project as I am familiar with some of the technologies. I would really appreciate it if some mentors could give me more insight into the project. Regards, Nisala
[jira] [Commented] (HDFS-7966) New Data Transfer Protocol via HTTP/2
[ https://issues.apache.org/jira/browse/HDFS-7966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14372233#comment-14372233 ] Qianqian Shi commented on HDFS-7966: It sounds like an awesome project. I'm interested in being part of it. New Data Transfer Protocol via HTTP/2 - Key: HDFS-7966 URL: https://issues.apache.org/jira/browse/HDFS-7966 Project: Hadoop HDFS Issue Type: New Feature Reporter: Haohui Mai Assignee: Haohui Mai Labels: gsoc2015, mentor