[Impala-ASF-CR] IMPALA-4135: Thrift threaded server times-out connections during high load
Henry Robinson has posted comments on this change. Change subject: IMPALA-4135: Thrift threaded server times-out connections during high load .. Patch Set 9: (1 comment) http://gerrit.cloudera.org:8080/#/c/4519/9/be/src/common/global-flags.cc File be/src/common/global-flags.cc: Line 116: "to a thread pool to finish setup, reducing the chances that connections time out " > what's the rationale for making this optional and leaving the old logic in While we're confident in this implementation, it's not a bad idea to have a get-out-of-jail flag until we've seen it be successful in real-world deployments. This only controls which kind of object gets created, so keeping it doesn't add much complexity over removing it. -- To view, visit http://gerrit.cloudera.org:8080/4519 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ie50e728974ef31a9d49132a0b3f7cde2a4f3356d Gerrit-PatchSet: 9 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Thomas Tauber-Marshall Gerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Internal Jenkins Gerrit-Reviewer: Juan Yu Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-4135: Thrift threaded server times-out connections during high load
Marcel Kornacker has posted comments on this change. Change subject: IMPALA-4135: Thrift threaded server times-out connections during high load .. Patch Set 9: (2 comments) http://gerrit.cloudera.org:8080/#/c/4519/9/be/src/common/global-flags.cc File be/src/common/global-flags.cc: Line 116: "to a thread pool to finish setup, reducing the chances that connections time out " what's the rationale for making this optional and leaving the old logic in place? http://gerrit.cloudera.org:8080/#/c/4519/9/be/src/rpc/TAcceptQueueServer.h File be/src/rpc/TAcceptQueueServer.h: Line 22: #ifndef _THRIFT_SERVER_TACCEPTQUEUESERVER_H_ change file name to reflect our naming conventions -- To view, visit http://gerrit.cloudera.org:8080/4519 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ie50e728974ef31a9d49132a0b3f7cde2a4f3356d Gerrit-PatchSet: 9 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Thomas Tauber-Marshall Gerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Internal Jenkins Gerrit-Reviewer: Juan Yu Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-4135: Thrift threaded server times-out connections during high load
Henry Robinson has submitted this change and it was merged. Change subject: IMPALA-4135: Thrift threaded server times-out connections during high load .. IMPALA-4135: Thrift threaded server times-out connections during high load During times of high load, Thrift's TThreadedServer can't keep up with the rate of new socket connections, causing some to time out. This patch creates TAcceptQueueServer, which is a modified version of TThreadedServer that calls accept() and then hands the returned TTransport off to a thread pool to handle setting up the connection. This ensures that accept() is called as quickly as possible, preventing connections from timing out while waiting. It also adds a metric, connection-setup-queue-size, to monitor the number of accepted connections waiting to be processed. A flag, --accepted_cnxn_queue_depth, controls the size of the accepted connection buffer. Testing: - New test added to thrift-server-test. (Disabled by default, due to high ulimit -n requirement) - Locally with the repro shown in IMPALA-4135. - On the 16-node with a real repro query. - Ran the stress test for a while. Change-Id: Ie50e728974ef31a9d49132a0b3f7cde2a4f3356d Reviewed-on: http://gerrit.cloudera.org:8080/4519 Tested-by: Internal Jenkins Reviewed-by: Henry Robinson --- M be/src/common/global-flags.cc M be/src/rpc/CMakeLists.txt A be/src/rpc/TAcceptQueueServer.cpp A be/src/rpc/TAcceptQueueServer.h M be/src/rpc/thrift-server-test.cc M be/src/rpc/thrift-server.cc M be/src/rpc/thrift-server.h M common/thrift/metrics.json 8 files changed, 545 insertions(+), 14 deletions(-) Approvals: Henry Robinson: Looks good to me, approved Internal Jenkins: Verified -- To view, visit http://gerrit.cloudera.org:8080/4519 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: merged Gerrit-Change-Id: Ie50e728974ef31a9d49132a0b3f7cde2a4f3356d Gerrit-PatchSet: 9 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Thomas Tauber-Marshall Gerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Internal Jenkins Gerrit-Reviewer: Juan Yu Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Thomas Tauber-Marshall
[Impala-ASF-CR] IMPALA-4135: Thrift threaded server times-out connections during high load
Henry Robinson has posted comments on this change. Change subject: IMPALA-4135: Thrift threaded server times-out connections during high load .. Patch Set 8: Code-Review+2 Rebase (plus add a flag and disable the test), carry +2. -- To view, visit http://gerrit.cloudera.org:8080/4519 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ie50e728974ef31a9d49132a0b3f7cde2a4f3356d Gerrit-PatchSet: 8 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Thomas Tauber-Marshall Gerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Internal Jenkins Gerrit-Reviewer: Juan Yu Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-4135: Thrift threaded server times-out connections during high load
Internal Jenkins has posted comments on this change. Change subject: IMPALA-4135: Thrift threaded server times-out connections during high load .. Patch Set 8: Verified+1 -- To view, visit http://gerrit.cloudera.org:8080/4519 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ie50e728974ef31a9d49132a0b3f7cde2a4f3356d Gerrit-PatchSet: 8 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Thomas Tauber-Marshall Gerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Internal Jenkins Gerrit-Reviewer: Juan Yu Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-4135: Thrift threaded server times-out connections during high load
Henry Robinson has uploaded a new patch set (#8). Change subject: IMPALA-4135: Thrift threaded server times-out connections during high load .. IMPALA-4135: Thrift threaded server times-out connections during high load During times of high load, Thrift's TThreadedServer can't keep up with the rate of new socket connections, causing some to time out. This patch creates TAcceptQueueServer, which is a modified version of TThreadedServer that calls accept() and then hands the returned TTransport off to a thread pool to handle setting up the connection. This ensures that accept() is called as quickly as possible, preventing connections from timing out while waiting. It also adds a metric, connection-setup-queue-size, to monitor the number of accepted connections waiting to be processed. A flag, --accepted_cnxn_queue_depth, controls the size of the accepted connection buffer. Testing: - New test added to thrift-server-test. (Disabled by default, due to high ulimit -n requirement) - Locally with the repro shown in IMPALA-4135. - On the 16-node with a real repro query. - Ran the stress test for a while. Change-Id: Ie50e728974ef31a9d49132a0b3f7cde2a4f3356d --- M be/src/common/global-flags.cc M be/src/rpc/CMakeLists.txt A be/src/rpc/TAcceptQueueServer.cpp A be/src/rpc/TAcceptQueueServer.h M be/src/rpc/thrift-server-test.cc M be/src/rpc/thrift-server.cc M be/src/rpc/thrift-server.h M common/thrift/metrics.json 8 files changed, 545 insertions(+), 14 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/19/4519/8 -- To view, visit http://gerrit.cloudera.org:8080/4519 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ie50e728974ef31a9d49132a0b3f7cde2a4f3356d Gerrit-PatchSet: 8 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Thomas Tauber-Marshall Gerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Juan Yu Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Thomas Tauber-Marshall
[Impala-ASF-CR] IMPALA-4135: Thrift threaded server times-out connections during high load
Thomas Tauber-Marshall has posted comments on this change. Change subject: IMPALA-4135: Thrift threaded server times-out connections during high load .. Patch Set 7: (1 comment) http://gerrit.cloudera.org:8080/#/c/4519/6//COMMIT_MSG Commit Message: PS6, Line 25: a whil > a while Done -- To view, visit http://gerrit.cloudera.org:8080/4519 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ie50e728974ef31a9d49132a0b3f7cde2a4f3356d Gerrit-PatchSet: 7 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Thomas Tauber-Marshall Gerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Juan Yu Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-4135: Thrift threaded server times-out connections during high load
Hello Henry Robinson, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/4519 to look at the new patch set (#7). Change subject: IMPALA-4135: Thrift threaded server times-out connections during high load .. IMPALA-4135: Thrift threaded server times-out connections during high load During times of high load, Thrift's TThreadedServer can't keep up with the rate of new socket connections, causing some to time out. This patch creates TAcceptQueueServer, which is a modified version of TThreadedServer that calls accept() and then hands the returned TTransport off to a thread pool to handle setting up the connection. This ensures that accept() is called as quickly as possible, preventing connections from timing out while waiting. It also adds a metric, connection-setup-queue-size, to monitor the number of accepted connections waiting to be processed. Testing: - New test added to thrift-server-test. - Locally with the repro shown in IMPALA-4135. - On the 16-node with a real repro query. - Ran the stress test for a while. Change-Id: Ie50e728974ef31a9d49132a0b3f7cde2a4f3356d --- M be/src/common/global-flags.cc M be/src/rpc/CMakeLists.txt A be/src/rpc/TAcceptQueueServer.cpp A be/src/rpc/TAcceptQueueServer.h M be/src/rpc/thrift-server-test.cc M be/src/rpc/thrift-server.cc M be/src/rpc/thrift-server.h M common/thrift/metrics.json 8 files changed, 525 insertions(+), 2 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/19/4519/7 -- To view, visit http://gerrit.cloudera.org:8080/4519 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ie50e728974ef31a9d49132a0b3f7cde2a4f3356d Gerrit-PatchSet: 7 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Thomas Tauber-Marshall Gerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Juan Yu Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Thomas Tauber-Marshall
[Impala-ASF-CR] IMPALA-4135: Thrift threaded server times-out connections during high load
Henry Robinson has posted comments on this change. Change subject: IMPALA-4135: Thrift threaded server times-out connections during high load .. Patch Set 6: Code-Review+2 (1 comment) http://gerrit.cloudera.org:8080/#/c/4519/6//COMMIT_MSG Commit Message: PS6, Line 25: awhile a while -- To view, visit http://gerrit.cloudera.org:8080/4519 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ie50e728974ef31a9d49132a0b3f7cde2a4f3356d Gerrit-PatchSet: 6 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Thomas Tauber-Marshall Gerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Juan Yu Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-4135: Thrift threaded server times-out connections during high load
Thomas Tauber-Marshall has posted comments on this change. Change subject: IMPALA-4135: Thrift threaded server times-out connections during high load .. Patch Set 6: (3 comments) http://gerrit.cloudera.org:8080/#/c/4519/5//COMMIT_MSG Commit Message: PS5, Line 18: It also adds a metric, connection-setup-queue-size, to monitor the number : of accepted connections waiting to be processe > no longer true. Done http://gerrit.cloudera.org:8080/#/c/4519/5/be/src/rpc/TAcceptQueueServer.h File be/src/rpc/TAcceptQueueServer.h: PS5, Line 20: // significant changes noted inline below. : > strange line break in comment again Done PS5, Line 45: d then imme > "separate thread, asynchronously. " Done -- To view, visit http://gerrit.cloudera.org:8080/4519 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ie50e728974ef31a9d49132a0b3f7cde2a4f3356d Gerrit-PatchSet: 6 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Thomas Tauber-Marshall Gerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Juan Yu Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-4135: Thrift threaded server times-out connections during high load
Thomas Tauber-Marshall has uploaded a new patch set (#6). Change subject: IMPALA-4135: Thrift threaded server times-out connections during high load .. IMPALA-4135: Thrift threaded server times-out connections during high load During times of high load, Thrift's TThreadedServer can't keep up with the rate of new socket connections, causing some to time out. This patch creates TAcceptQueueServer, which is a modified version of TThreadedServer that calls accept() and then hands the returned TTransport off to a thread pool to handle setting up the connection. This ensures that accept() is called as quickly as possible, preventing connections from timing out while waiting. It also adds a metric, connection-setup-queue-size, to monitor the number of accepted connections waiting to be processed. Testing: - New test added to thrift-server-test. - Locally with the repro shown in IMPALA-4135. - On the 16-node with a real repro query. - Ran the stress test for awhile. Change-Id: Ie50e728974ef31a9d49132a0b3f7cde2a4f3356d --- M be/src/common/global-flags.cc M be/src/rpc/CMakeLists.txt A be/src/rpc/TAcceptQueueServer.cpp A be/src/rpc/TAcceptQueueServer.h M be/src/rpc/thrift-server-test.cc M be/src/rpc/thrift-server.cc M be/src/rpc/thrift-server.h M common/thrift/metrics.json 8 files changed, 525 insertions(+), 2 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/19/4519/6 -- To view, visit http://gerrit.cloudera.org:8080/4519 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ie50e728974ef31a9d49132a0b3f7cde2a4f3356d Gerrit-PatchSet: 6 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Thomas Tauber-Marshall Gerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Juan Yu Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Thomas Tauber-Marshall
[Impala-ASF-CR] IMPALA-4135: Thrift threaded server times-out connections during high load
Henry Robinson has posted comments on this change. Change subject: IMPALA-4135: Thrift threaded server times-out connections during high load .. Patch Set 5: (3 comments) This looks good to me. One last thing: we should think about how we're going to monitor this. Could you add metrics to this server to measure the queue size? You can add a "InitMetrics(MetricGroup*)" call to TAcceptQueueServer, called from ThriftServer. The most important metric would be a queue length one, incremented when the accept thread adds a connection to the queue, and decremented on the other side. http://gerrit.cloudera.org:8080/#/c/4519/5//COMMIT_MSG Commit Message: PS5, Line 18: This patch has been tested locally with the repro shown in IMPALA-4135, but : it still needs to be tested in a real cluster. no longer true. Add a comment about what testing you did. http://gerrit.cloudera.org:8080/#/c/4519/5/be/src/rpc/TAcceptQueueServer.h File be/src/rpc/TAcceptQueueServer.h: PS5, Line 20: // with the : // significant changes noted inline below. strange line break in comment again PS5, Line 45: thread pool "separate thread, asynchronously. " -- To view, visit http://gerrit.cloudera.org:8080/4519 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ie50e728974ef31a9d49132a0b3f7cde2a4f3356d Gerrit-PatchSet: 5 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Thomas Tauber-Marshall Gerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Juan Yu Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-4135: Thrift threaded server times-out connections during high load
Thomas Tauber-Marshall has posted comments on this change. Change subject: IMPALA-4135: Thrift threaded server times-out connections during high load .. Patch Set 5: The patch passed an exhaustive run on Jenkins: http://sandbox.jenkins.cloudera.com/job/impala-umbrella-build-and-test/4911/ I have also now, with Mostafa's help, come up with a query that reliably reproduces the issue on the 16 node cluster and succeeds on the 16 node with this patch deployed: select /* +straight_join */ count(*) from tpch_300_parquet.orders A join /* +shuffle */ tpch_300_parquet.orders B on A.o_orderkey = B.o_orderkey join /* +shuffle */ tpch_300_parquet.orders C on c.o_orderkey = B.o_orderkey join /* +shuffle */ tpch_300_parquet.orders D on d.o_orderkey = B.o_orderkey join /* +shuffle */ tpch_300_parquet.orders E on e.o_orderkey = B.o_orderkey join /* +shuffle */ tpch_300_parquet.orders F on f.o_orderkey = B.o_orderkey join /* +shuffle */ tpch_300_parquet.orders G on g.o_orderkey = B.o_orderkey join /* +shuffle */ tpch_300_parquet.orders H on h.o_orderkey = B.o_orderkey join /* +shuffle */ tpch_300_parquet.orders I on i.o_orderkey = B.o_orderkey join /* +shuffle */ tpch_300_parquet.orders J on j.o_orderkey = B.o_orderkey join /* +shuffle */ tpch_300_parquet.orders K on k.o_orderkey = B.o_orderkey join /* +shuffle */ tpch_300_parquet.orders L on l.o_orderkey = B.o_orderkey join /* +shuffle */ tpch_300_parquet.orders M on m.o_orderkey = B.o_orderkey join /* +shuffle */ tpch_300_parquet.orders N on n.o_orderkey = B.o_orderkey join /* +shuffle */ tpch_300_parquet.orders O on o.o_orderkey = B.o_orderkey join /* +shuffle */ tpch_300_parquet.orders P on p.o_orderkey = B.o_orderkey join /* +shuffle */ tpch_300_parquet.orders Q on q.o_orderkey = B.o_orderkey join /* +shuffle */ tpch_300_parquet.orders R on r.o_orderkey = B.o_orderkey join /* +shuffle */ tpch_300_parquet.orders S on s.o_orderkey = B.o_orderkey join /* +shuffle */ tpch_300_parquet.orders T on t.o_orderkey = B.o_orderkey join /* +shuffle */ tpch_300_parquet.orders U on u.o_orderkey = B.o_orderkey join /* +shuffle */ tpch_300_parquet.orders V on v.o_orderkey = B.o_orderkey where a.o_orderkey < 10; Currently working on testing it with Kerberos on. -- To view, visit http://gerrit.cloudera.org:8080/4519 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ie50e728974ef31a9d49132a0b3f7cde2a4f3356d Gerrit-PatchSet: 5 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Thomas Tauber-Marshall Gerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Juan Yu Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-4135: Thrift threaded server times-out connections during high load
Thomas Tauber-Marshall has posted comments on this change. Change subject: IMPALA-4135: Thrift threaded server times-out connections during high load .. Patch Set 5: (7 comments) http://gerrit.cloudera.org:8080/#/c/4519/4/be/src/rpc/TAcceptQueueServer.cpp File be/src/rpc/TAcceptQueueServer.cpp: PS4, Line 136: SetupCon > doAccept() isn't a great name (I know it was from my patch!) because this i Done PS4, Line 204: rift code > connection_handler_pool, or something similar. Done PS4, Line 205: texpr int CONNECTION_SE > This can be defined in this method as a constexpr. Done http://gerrit.cloudera.org:8080/#/c/4519/4/be/src/rpc/TAcceptQueueServer.h File be/src/rpc/TAcceptQueueServer.h: PS4, Line 96: New - > strange line breaks in this comment. Done PS4, Line 97: // up the connection and starting a thread to > move into .cc file Done http://gerrit.cloudera.org:8080/#/c/4519/4/be/src/rpc/thrift-server-test.cc File be/src/rpc/thrift-server-test.cc: PS4, Line 173: ManyConcurrent > Call this ManyConcurrentConnection or something. Done Line 175: // waiting to be accepted. (IMPALA-4135) > Mention that this does not always fail. How long does it take to run? It takes about 2 seconds on my machine. -- To view, visit http://gerrit.cloudera.org:8080/4519 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ie50e728974ef31a9d49132a0b3f7cde2a4f3356d Gerrit-PatchSet: 5 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Thomas Tauber-Marshall Gerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Juan Yu Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-4135: Thrift threaded server times-out connections during high load
Thomas Tauber-Marshall has uploaded a new patch set (#5). Change subject: IMPALA-4135: Thrift threaded server times-out connections during high load .. IMPALA-4135: Thrift threaded server times-out connections during high load During times of high load, Thrift's TThreadedServer can't keep up with the rate of new socket connections, causing some to time out. This patch creates TAcceptQueueServer, which is a modified version of TThreadedServer that calls accept() and then hands the returned TTransport off to a thread pool to handle setting up the connection. This ensures that accept() is called as quickly as possible, preventing connections from timing out while waiting. This patch has been tested locally with the repro shown in IMPALA-4135, but it still needs to be tested in a real cluster. Change-Id: Ie50e728974ef31a9d49132a0b3f7cde2a4f3356d --- M be/src/common/global-flags.cc M be/src/rpc/CMakeLists.txt A be/src/rpc/TAcceptQueueServer.cpp A be/src/rpc/TAcceptQueueServer.h M be/src/rpc/thrift-server-test.cc M be/src/rpc/thrift-server.cc 6 files changed, 459 insertions(+), 2 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/19/4519/5 -- To view, visit http://gerrit.cloudera.org:8080/4519 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ie50e728974ef31a9d49132a0b3f7cde2a4f3356d Gerrit-PatchSet: 5 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Thomas Tauber-Marshall Gerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Juan Yu Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Thomas Tauber-Marshall
[Impala-ASF-CR] IMPALA-4135: Thrift threaded server times-out connections during high load
Henry Robinson has posted comments on this change. Change subject: IMPALA-4135: Thrift threaded server times-out connections during high load .. Patch Set 4: (7 comments) Looking pretty good. http://gerrit.cloudera.org:8080/#/c/4519/4/be/src/rpc/TAcceptQueueServer.cpp File be/src/rpc/TAcceptQueueServer.cpp: PS4, Line 136: doAccept doAccept() isn't a great name (I know it was from my patch!) because this isn't actually doing the work of accept, but instead establishing a connection. Perhaps call this SetupConnection() ? You can use Impala's convention for method capitalisation; again makes it easier to see what was added. PS4, Line 204: acceptPool connection_handler_pool, or something similar. PS4, Line 205: ACCEPT_THREAD_POOL_SIZE This can be defined in this method as a constexpr. http://gerrit.cloudera.org:8080/#/c/4519/4/be/src/rpc/TAcceptQueueServer.h File be/src/rpc/TAcceptQueueServer.h: PS4, Line 96: thread strange line breaks in this comment. PS4, Line 97: static const int ACCEPT_THREAD_POOL_SIZE = 1; move into .cc file http://gerrit.cloudera.org:8080/#/c/4519/4/be/src/rpc/thrift-server-test.cc File be/src/rpc/thrift-server-test.cc: PS4, Line 173: ConnectTimeout Call this ManyConcurrentConnection or something. Line 175: // waiting to be accepted.(IMPALA-4135) Mention that this does not always fail. How long does it take to run? -- To view, visit http://gerrit.cloudera.org:8080/4519 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ie50e728974ef31a9d49132a0b3f7cde2a4f3356d Gerrit-PatchSet: 4 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Thomas Tauber-Marshall Gerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Juan Yu Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-4135: Thrift threaded server times-out connections during high load
Thomas Tauber-Marshall has posted comments on this change. Change subject: IMPALA-4135: Thrift threaded server times-out connections during high load .. Patch Set 4: (10 comments) http://gerrit.cloudera.org:8080/#/c/4519/3/be/src/rpc/TAcceptQueueServer.cpp File be/src/rpc/TAcceptQueueServer.cpp: PS3, Line 204: > this is an important enough parameter that I'd make a constant for it, and Done PS3, Line 216: > If this returns false it's because the queue is shut down. Better to say th Done PS3, Line 218: break; > set stop_ to true to cleanup? Done http://gerrit.cloudera.org:8080/#/c/4519/3/be/src/rpc/TAcceptQueueServer.h File be/src/rpc/TAcceptQueueServer.h: Line 46: * > add a reference to IMPALA-4135 here. Done PS3, Line 105: > Is this used anywhere? Done http://gerrit.cloudera.org:8080/#/c/4519/3/be/src/rpc/thrift-server-test.cc File be/src/rpc/thrift-server-test.cc: Line 174: // Test that a large number of concurrent connections will all succeed and not time out > Add a comment about what you're testing and the associated JIRA. Done PS3, Line 178: _OK(server->Start( > Does this work in a threaded context - does the test fail if the thread hit Yes, although the test runs to completion before reporting the error. PS3, Line 180: ThreadPool pool( > I'm a bit concerned about failing with ulimit errors, particularly because Agreed, this test has a lot of potential for flakiness or other pain. Not sure how else to do an automatic test of this issue, though. As written, it doesn't even actually fail on my machine without the fix, unless I add an artificial slowdown. PS3, Line 173: TEST(ConcurrencyTest, ConnectTimeout) { : // Test that a large number of concurrent connections will all succeed and not time out : // waiting to be accepted.(IMPALA-4135) : int port = GetServerPort(); : ThriftServer* server = new ThriftServer("DummyServer", MakeProcessor(), port); : ASSERT_OK(server->Start()); : : ThreadPool pool( : "group", "test", 256, 1, [port](int tid, const int64_t& item) { : > This relies on a running impala internal service - you probably had an Impa Done http://gerrit.cloudera.org:8080/#/c/4519/1/be/src/server/TAcceptQueueThreadedServer.cpp File be/src/server/TAcceptQueueThreadedServer.cpp: PS1, Line 233: > Oops, sorry about that. No problem. -- To view, visit http://gerrit.cloudera.org:8080/4519 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ie50e728974ef31a9d49132a0b3f7cde2a4f3356d Gerrit-PatchSet: 4 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Thomas Tauber-Marshall Gerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Juan Yu Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-4135: Thrift threaded server times-out connections during high load
Thomas Tauber-Marshall has uploaded a new patch set (#4). Change subject: IMPALA-4135: Thrift threaded server times-out connections during high load .. IMPALA-4135: Thrift threaded server times-out connections during high load During times of high load, Thrift's TThreadedServer can't keep up with the rate of new socket connections, causing some to time out. This patch creates TAcceptQueueServer, which is a modified version of TThreadedServer that calls accept() and then hands the returned TTransport off to a thread pool to handle setting up the connection. This ensures that accept() is called as quickly as possible, preventing connections from timing out while waiting. This patch has been tested locally with the repro shown in IMPALA-4135, but it still needs to be tested in a real cluster. Change-Id: Ie50e728974ef31a9d49132a0b3f7cde2a4f3356d --- M be/src/common/global-flags.cc M be/src/rpc/CMakeLists.txt A be/src/rpc/TAcceptQueueServer.cpp A be/src/rpc/TAcceptQueueServer.h M be/src/rpc/thrift-server-test.cc M be/src/rpc/thrift-server.cc 6 files changed, 457 insertions(+), 2 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/19/4519/4 -- To view, visit http://gerrit.cloudera.org:8080/4519 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ie50e728974ef31a9d49132a0b3f7cde2a4f3356d Gerrit-PatchSet: 4 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Thomas Tauber-Marshall Gerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Juan Yu Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Thomas Tauber-Marshall
[Impala-ASF-CR] IMPALA-4135: Thrift threaded server times-out connections during high load
Juan Yu has posted comments on this change. Change subject: IMPALA-4135: Thrift threaded server times-out connections during high load .. Patch Set 1: (2 comments) http://gerrit.cloudera.org:8080/#/c/4519/3/be/src/rpc/TAcceptQueueServer.cpp File be/src/rpc/TAcceptQueueServer.cpp: PS3, Line 218: set stop_ to true to cleanup? http://gerrit.cloudera.org:8080/#/c/4519/1/be/src/server/TAcceptQueueThreadedServer.cpp File be/src/server/TAcceptQueueThreadedServer.cpp: PS1, Line 233: if (stop_) { > Its not in the while loop. Oops, sorry about that. -- To view, visit http://gerrit.cloudera.org:8080/4519 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ie50e728974ef31a9d49132a0b3f7cde2a4f3356d Gerrit-PatchSet: 1 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Thomas Tauber-Marshall Gerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Juan Yu Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-4135: Thrift threaded server times-out connections during high load
Henry Robinson has posted comments on this change. Change subject: IMPALA-4135: Thrift threaded server times-out connections during high load .. Patch Set 3: (8 comments) Some comments before I head off to Strata. I would run clang-format over the new thrift files. That way they'll be more readable to Impala developers, while keeping the structure and idioms of the thrift code. http://gerrit.cloudera.org:8080/#/c/4519/3/be/src/rpc/TAcceptQueueServer.cpp File be/src/rpc/TAcceptQueueServer.cpp: PS3, Line 204: 1 this is an important enough parameter that I'd make a constant for it, and put your comments about only using thread on that. We don't want someone changing the number of threads without understanding the thread-safety implications. PS3, Line 216: acceptPool.Offer returned false. If this returns false it's because the queue is shut down. Better to say that clearly in the error message so users have an idea what this might mean. Also you can say that it's unexpected - we never shut down our servers so the queue should never be shut down. http://gerrit.cloudera.org:8080/#/c/4519/3/be/src/rpc/TAcceptQueueServer.h File be/src/rpc/TAcceptQueueServer.h: Line 46: * connections will time out while in the OS accept queue. add a reference to IMPALA-4135 here. PS3, Line 105: std::mutex big_lock_; Is this used anywhere? http://gerrit.cloudera.org:8080/#/c/4519/3/be/src/rpc/thrift-server-test.cc File be/src/rpc/thrift-server-test.cc: Line 174: ThreadPool pool("group", "test", 256, 1, [](int tid, const int64_t& item) { Add a comment about what you're testing and the associated JIRA. PS3, Line 178: ASSERT_OK(status); Does this work in a threaded context - does the test fail if the thread hits an ASSERT? PS3, Line 180: for (int i = 0; i < 1024 * 16; ++i) pool.Offer(i); I'm a bit concerned about failing with ulimit errors, particularly because both client and server should be in the same process. Have you ever seen this repro with fewer concurrent connections - say 8K? PS3, Line 173: TEST(ConcurrencyTest, ConnectTimeout) { : ThreadPool pool("group", "test", 256, 1, [](int tid, const int64_t& item) { : using Client = ThriftClient; : Client* client = new Client("127.0.0.1", 22000, "", NULL, false); : Status status = client->Open(); : ASSERT_OK(status); : }); : for (int i = 0; i < 1024 * 16; ++i) pool.Offer(i); : pool.DrainAndShutdown(); : } This relies on a running impala internal service - you probably had an Impala process running when you ran this? That won't be true in our builds. Use the same server code that all the other tests use to start a server in this process. -- To view, visit http://gerrit.cloudera.org:8080/4519 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ie50e728974ef31a9d49132a0b3f7cde2a4f3356d Gerrit-PatchSet: 3 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Thomas Tauber-Marshall Gerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Juan Yu Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-4135: Thrift threaded server times-out connections during high load
Thomas Tauber-Marshall has uploaded a new patch set (#3). Change subject: IMPALA-4135: Thrift threaded server times-out connections during high load .. IMPALA-4135: Thrift threaded server times-out connections during high load During times of high load, Thrift's TThreadedServer can't keep up with the rate of new socket connections, causing some to time out. This patch creates TAcceptQueueServer, which is a modified version of TThreadedServer that calls accept() and then hands the returned TTransport off to a thread pool to handle setting up the connection. This ensures that accept() is called as quickly as possible, preventing connections from timing out while waiting. This patch has been tested locally with the repro shown in IMPALA-4135, but it still needs to be tested in a real cluster. Change-Id: Ie50e728974ef31a9d49132a0b3f7cde2a4f3356d --- M be/src/common/global-flags.cc M be/src/rpc/CMakeLists.txt A be/src/rpc/TAcceptQueueServer.cpp A be/src/rpc/TAcceptQueueServer.h M be/src/rpc/thrift-server-test.cc M be/src/rpc/thrift-server.cc 6 files changed, 449 insertions(+), 2 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/19/4519/3 -- To view, visit http://gerrit.cloudera.org:8080/4519 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ie50e728974ef31a9d49132a0b3f7cde2a4f3356d Gerrit-PatchSet: 3 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Thomas Tauber-Marshall Gerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Juan Yu Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Thomas Tauber-Marshall
[Impala-ASF-CR] IMPALA-4135: Thrift threaded server times-out connections during high load
Thomas Tauber-Marshall has posted comments on this change. Change subject: IMPALA-4135: Thrift threaded server times-out connections during high load .. Patch Set 1: (12 comments) The main problem with subclassing TThreadedServer is that TThreadedServer::Task is defined in the thrift cpp file, so we can't access it. We could just copy that part, but between that, the copy of serve() that we have to make, and boiler plate around the constructors that would be necessary, subclasses doesn't really save any copied code other than init() and stop(), so it seems cleaner to me to keep the relationships between the classes simple by just copying everything rather than subclassing. I'm not sure what to do about formatting - currently, TAcceptQueueServer.h/cpp are written in Thrift's style (eg. wrapped lines indented 2 spaces rather than 4), though I may have gotten some of it wrong. Of course, I can switch it to Impala style if desired. For the cluster repro - I've crafted a query that consistently gets the "timed out" error on the 16 node cluster without the fix. However, with the fix, after running somewhat longer, it fails with a 'memory limit exceeded' error. Obviously it would be preferable to have a query that both repros and runs to completion on a fixed cluster, so I'm working on that, but it does seem like the patch works. http://gerrit.cloudera.org:8080/#/c/4519/1//COMMIT_MSG Commit Message: PS1, Line 18: This patch has been tested locally with the repro shown in IMPALA-4135 > Can you add the test, modified if necessary? Done http://gerrit.cloudera.org:8080/#/c/4519/1/be/src/rpc/thrift-server.cc File be/src/rpc/thrift-server.cc: Line 419: server_.reset(new TAcceptQueueThreadedServer(processor_, server_socket, > let's make the choice of server configurable by a flag that we can turn off Done http://gerrit.cloudera.org:8080/#/c/4519/1/be/src/server/CMakeLists.txt File be/src/server/CMakeLists.txt: PS1, Line 24: add_library(ThriftAcceptQueueThreadedServer : TAcceptQueueThreadedServer.cpp : ) : add_dependencies(ThriftAcceptQueueThreadedServer thrift-deps) > Was there a reason you wanted this as a separate library? I think it's fine Done http://gerrit.cloudera.org:8080/#/c/4519/1/be/src/server/TAcceptQueueThreadedServer.cpp File be/src/server/TAcceptQueueThreadedServer.cpp: Line 144: inputTransportFactory_->getTransport(client); > do you see much slowdown in synchronizing around these few lines that creat Most of this is fast and synchronizing it doesn't affect performance much. Line 164: shared_ptr thread = shared_ptr(threadFactory_->newThread(runnable)); > Is this definitely thread-safe? I think this might be the bottleneck (or ma The bottleneck is 'thread->start()' below. I'm fairly confident that's thread safe, as essentially all it does is create a new boost::thread. Line 200: ThreadPool> acceptPool("server-accept", "accept-worker", 1, 1, > How did you arrive at 1 as the default number of worker threads? What happe I saw the same behavior as you, where using more than 1 thread here was actually slower than just 1. I don't know why that's happening (Michael's fix for IMPALA-4026 currently under review doesn't seem to fix it), but given that just using 1 thread was enough to solve the problem for the repro in the JIRA, and given that only using one prevents us from having to worry about thread safety, it seems like the best choice, assuming that it also works in a real cluster repro. I could also make it configurable. PS1, Line 211: Offer > This can return false, right? Done PS1, Line 233: if (stop_) { > shouldn't this be out of the while loop? to guarantee a proper cleanup. Its not in the while loop. Line 236: acceptPool.DrainAndShutdown(); > I'm not sure we ever actually stop the thrift servers, so it might be moot Done PS1, Line 252: stop_ = false > why set stop_ back to false? I guess so that it can be restarted? (this was copied from the thrift code, and I think its better to leave it as close to the original behavior as possible) http://gerrit.cloudera.org:8080/#/c/4519/1/be/src/server/TAcceptQueueThreadedServer.h File be/src/server/TAcceptQueueThreadedServer.h: PS1, Line 1: // This file was copied from apache::thrift::server::TThreadedServer.cpp v0.9.0, with the : // significant changes noted inline below. : /* : * Licensed to the Apache Software Foundation (ASF) under one : * or more contributor license agreements. See the NOTICE file : * distributed with this work for additional information : * regarding copyright ownership. The ASF licenses this file : * to you under the Apache License, Version 2.0 (the : * "License"); you may not use this file except in compliance : * with the License. You may obta
[Impala-ASF-CR] IMPALA-4135: Thrift threaded server times-out connections during high load
Thomas Tauber-Marshall has uploaded a new patch set (#2). Change subject: IMPALA-4135: Thrift threaded server times-out connections during high load .. IMPALA-4135: Thrift threaded server times-out connections during high load During times of high load, Thrift's TThreadedServer can't keep up with the rate of new socket connections, causing some to time out. This patch creates TAcceptQueueServer, which is a modified version of TThreadedServer that calls accept() and then hands the returned TTransport off to a thread pool to handle setting up the connection. This ensures that accept() is called as quickly as possible, preventing connections from timing out while waiting. This patch has been tested locally with the repro shown in IMPALA-4135, but it still needs to be tested in a real cluster. Change-Id: Ie50e728974ef31a9d49132a0b3f7cde2a4f3356d --- M be/src/common/global-flags.cc M be/src/rpc/CMakeLists.txt A be/src/rpc/TAcceptQueueServer.cpp A be/src/rpc/TAcceptQueueServer.h M be/src/rpc/thrift-server-test.cc M be/src/rpc/thrift-server.cc 6 files changed, 451 insertions(+), 2 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/19/4519/2 -- To view, visit http://gerrit.cloudera.org:8080/4519 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ie50e728974ef31a9d49132a0b3f7cde2a4f3356d Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Thomas Tauber-Marshall Gerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Juan Yu Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Sailesh Mukil
[Impala-ASF-CR] IMPALA-4135: Thrift threaded server times-out connections during high load
Juan Yu has posted comments on this change. Change subject: IMPALA-4135: Thrift threaded server times-out connections during high load .. Patch Set 1: (1 comment) http://gerrit.cloudera.org:8080/#/c/4519/1/be/src/server/TAcceptQueueThreadedServer.cpp File be/src/server/TAcceptQueueThreadedServer.cpp: PS1, Line 211: Offer This can return false, right? shouldn't we handle the case it returns false? -- To view, visit http://gerrit.cloudera.org:8080/4519 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ie50e728974ef31a9d49132a0b3f7cde2a4f3356d Gerrit-PatchSet: 1 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Thomas Tauber-Marshall Gerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Juan Yu Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Sailesh Mukil Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-4135: Thrift threaded server times-out connections during high load
Juan Yu has posted comments on this change. Change subject: IMPALA-4135: Thrift threaded server times-out connections during high load .. Patch Set 1: (2 comments) code style, we use 4 space indent for continued line http://gerrit.cloudera.org:8080/#/c/4519/1/be/src/server/TAcceptQueueThreadedServer.cpp File be/src/server/TAcceptQueueThreadedServer.cpp: PS1, Line 233: if (stop_) { shouldn't this be out of the while loop? to guarantee a proper cleanup. PS1, Line 252: stop_ = false why set stop_ back to false? -- To view, visit http://gerrit.cloudera.org:8080/4519 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ie50e728974ef31a9d49132a0b3f7cde2a4f3356d Gerrit-PatchSet: 1 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Thomas Tauber-Marshall Gerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Juan Yu Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Sailesh Mukil Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-4135: Thrift threaded server times-out connections during high load
Henry Robinson has posted comments on this change. Change subject: IMPALA-4135: Thrift threaded server times-out connections during high load .. Patch Set 1: (7 comments) I think Sailesh's suggestion is a good one. Is there any state in TThreadedServer that can't be accessed by a subclass? I agree that a separate library seems unnecessary. There doesn't seem any need for a separate top-level dir; I expected a new directory under rpc/. But the simplest is probably just to put the file in rpc/ and be done with it. http://gerrit.cloudera.org:8080/#/c/4519/1/be/src/rpc/thrift-server.cc File be/src/rpc/thrift-server.cc: Line 419: server_.reset(new TAcceptQueueThreadedServer(processor_, server_socket, let's make the choice of server configurable by a flag that we can turn off if needed. http://gerrit.cloudera.org:8080/#/c/4519/1/be/src/server/TAcceptQueueThreadedServer.cpp File be/src/server/TAcceptQueueThreadedServer.cpp: Line 144: inputTransportFactory_->getTransport(client); do you see much slowdown in synchronizing around these few lines that create protocols, transports and processors? If not, let's do that to make sure we rule out any possibility of race conditions. Line 164: shared_ptr thread = shared_ptr(threadFactory_->newThread(runnable)); Is this definitely thread-safe? I think this might be the bottleneck (or maybe it's start()?) so it's going to be useful to keep unsynchronized. Line 200: ThreadPool> acceptPool("server-accept", "accept-worker", 1, 1, How did you arrive at 1 as the default number of worker threads? What happens as that number increases? If it *has* to be one, to keep doAccept() thread-safe, add a comment here to that effect. Line 236: acceptPool.DrainAndShutdown(); I'm not sure we ever actually stop the thrift servers, so it might be moot - but shouldn't you signal to the worker threads that they should shut down and therefore not create any new threads? http://gerrit.cloudera.org:8080/#/c/4519/1/be/src/server/TAcceptQueueThreadedServer.h File be/src/server/TAcceptQueueThreadedServer.h: PS1, Line 1: // This file was copied from apache::thrift::server::TThreadedServer.cpp v0.9.0, with the : // significant changes noted inline below. : /* : * Licensed to the Apache Software Foundation (ASF) under one : * or more contributor license agreements. See the NOTICE file : * distributed with this work for additional information : * regarding copyright ownership. The ASF licenses this file : * to you under the Apache License, Version 2.0 (the : * "License"); you may not use this file except in compliance : * with the License. You may obtain a copy of the License at : * : * http://www.apache.org/licenses/LICENSE-2.0 : * : * Unless required by applicable law or agreed to in writing, : * software distributed under the License is distributed on an : * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY : * KIND, either express or implied. See the License for the : * specific language governing permissions and limitations : * under the License. : */ > I think this is fine, but if you haven't already, can you ask Jim or Henry AFAIK it's completely fine - copying from another ASF project is legit. That said, I'd keep the license header as the first thing in the file (and if you subclass, so much the better). PS1, Line 44: and effectively creating a new, : * internally managed, accept queue. not exactly - the connections have been accepted, the next state is transport set-up. -- To view, visit http://gerrit.cloudera.org:8080/4519 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ie50e728974ef31a9d49132a0b3f7cde2a4f3356d Gerrit-PatchSet: 1 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Thomas Tauber-Marshall Gerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Sailesh Mukil Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-4135: Thrift threaded server times-out connections during high load
Matthew Jacobs has posted comments on this change. Change subject: IMPALA-4135: Thrift threaded server times-out connections during high load .. Patch Set 1: (3 comments) Thanks! Not a full review yet. I still need to spend time reading TAcceptQueueThreadedServer.cpp carefully. http://gerrit.cloudera.org:8080/#/c/4519/1//COMMIT_MSG Commit Message: PS1, Line 18: This patch has been tested locally with the repro shown in IMPALA-4135 Can you add the test, modified if necessary? http://gerrit.cloudera.org:8080/#/c/4519/1/be/src/server/CMakeLists.txt File be/src/server/CMakeLists.txt: PS1, Line 24: add_library(ThriftAcceptQueueThreadedServer : TAcceptQueueThreadedServer.cpp : ) : add_dependencies(ThriftAcceptQueueThreadedServer thrift-deps) Was there a reason you wanted this as a separate library? I think it's fine to put this code in rpc/ since that's where we have our thrift-server. http://gerrit.cloudera.org:8080/#/c/4519/1/be/src/server/TAcceptQueueThreadedServer.h File be/src/server/TAcceptQueueThreadedServer.h: PS1, Line 1: // This file was copied from apache::thrift::server::TThreadedServer.cpp v0.9.0, with the : // significant changes noted inline below. : /* : * Licensed to the Apache Software Foundation (ASF) under one : * or more contributor license agreements. See the NOTICE file : * distributed with this work for additional information : * regarding copyright ownership. The ASF licenses this file : * to you under the Apache License, Version 2.0 (the : * "License"); you may not use this file except in compliance : * with the License. You may obtain a copy of the License at : * : * http://www.apache.org/licenses/LICENSE-2.0 : * : * Unless required by applicable law or agreed to in writing, : * software distributed under the License is distributed on an : * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY : * KIND, either express or implied. See the License for the : * specific language governing permissions and limitations : * under the License. : */ I think this is fine, but if you haven't already, can you ask Jim or Henry if there's anything else we have to do for legal? -- To view, visit http://gerrit.cloudera.org:8080/4519 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ie50e728974ef31a9d49132a0b3f7cde2a4f3356d Gerrit-PatchSet: 1 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Thomas Tauber-Marshall Gerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Sailesh Mukil Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-4135: Thrift threaded server times-out connections during high load
Sailesh Mukil has posted comments on this change. Change subject: IMPALA-4135: Thrift threaded server times-out connections during high load .. Patch Set 1: Have you considered making TAcceptQueueThreadedServer inherit from TThreadedServer and just overloading the serve() functions, and adding the doAccept()? The rest of the code looks mostly the same. -- To view, visit http://gerrit.cloudera.org:8080/4519 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ie50e728974ef31a9d49132a0b3f7cde2a4f3356d Gerrit-PatchSet: 1 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Thomas Tauber-Marshall Gerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Sailesh Mukil Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-4135: Thrift threaded server times-out connections during high load
Thomas Tauber-Marshall has uploaded a new change for review. http://gerrit.cloudera.org:8080/4519 Change subject: IMPALA-4135: Thrift threaded server times-out connections during high load .. IMPALA-4135: Thrift threaded server times-out connections during high load During times of high load, Thrift's TThreadedServer can't keep up with the rate of new socket connections, causing some to time out. This patch creates TAcceptQueueThreadedServer, which is a modified version of TThreadedServer that calls accept() and then hands the returned TTransport off to a thread pool to handle setting up the connection. This ensures that accept() is called as quickly as possible, preventing connections from timing out while waiting. This patch has been tested locally with the repro shown in IMPALA-4135, but it still needs to be tested in a real cluster. Change-Id: Ie50e728974ef31a9d49132a0b3f7cde2a4f3356d --- M be/CMakeLists.txt M be/src/rpc/thrift-server.cc A be/src/server/CMakeLists.txt A be/src/server/TAcceptQueueThreadedServer.cpp A be/src/server/TAcceptQueueThreadedServer.h 5 files changed, 449 insertions(+), 2 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/19/4519/1 -- To view, visit http://gerrit.cloudera.org:8080/4519 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newchange Gerrit-Change-Id: Ie50e728974ef31a9d49132a0b3f7cde2a4f3356d Gerrit-PatchSet: 1 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Thomas Tauber-Marshall