Mostafa Mokhtar created IMPALA-5557:
---------------------------------------

             Summary: Increase and stagger rpc_default_keepalive_time_ms
                 Key: IMPALA-5557
                 URL: https://issues.apache.org/jira/browse/IMPALA-5557
             Project: IMPALA
          Issue Type: Sub-task
          Components: Distributed Exec
            Reporter: Mostafa Mokhtar


Queries were still hitting the KDC hard and eventually failed. This is the sequence of events:
# Start running a query
# Impala backends send thousands of TGS_REQ requests to the KDC
# The query appears to make some progress
# Some connections succeed; others fail with "Timeout exceeded waiting to connect"
# Then I noticed in the logs that idle connections get killed after 65 seconds:
reactor.cc:281] Timing out connection server connection from 10.17.229.14:41804 - it has been idle for 65.0002s
# The query takes about 2 minutes and fails
# By then most connections have been released, since they were idle for > 65 seconds
# New queries go through the same process again and eventually fail

In order to run reliably on a large cluster, I believe we need to:
# Change the lifetime of idle connections, possibly extending it to the ticket lifetime?
# Create new connections before existing ones expire, in a staggered fashion, to avoid KDC-related failures?

The relevant flag is rpc_default_keepalive_time_ms.
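
To illustrate the staggering idea in the second proposal, here is a minimal sketch in Python. It is not Impala or Kudu code: the function name `staggered_refresh_times`, the `window_fraction` parameter, and the choice of a uniform random spread are all assumptions made for illustration. The 65000 ms value in the usage line simply mirrors the idle timeout seen in the log above. The sketch picks, for each connection, a reconnect time inside the last part of the connection lifetime, so that replacements are created before expiry and do not all hit the KDC at once.

```python
import random


def staggered_refresh_times(num_connections, lifetime_ms,
                            window_fraction=0.25, seed=None):
    """Pick a reconnect time (ms after creation) for each connection.

    Hypothetical helper: times are drawn uniformly from the last
    `window_fraction` of the lifetime, so every connection is replaced
    before it expires and the reconnects are spread out instead of
    arriving at the KDC in one burst.
    """
    if num_connections < 0:
        raise ValueError("num_connections must be non-negative")
    if lifetime_ms <= 0:
        raise ValueError("lifetime_ms must be positive")
    if not 0.0 < window_fraction <= 1.0:
        raise ValueError("window_fraction must be in (0, 1]")

    rng = random.Random(seed)
    window_start = lifetime_ms * (1.0 - window_fraction)
    # One independent draw per connection, returned in schedule order.
    return sorted(rng.uniform(window_start, lifetime_ms)
                  for _ in range(num_connections))


# Usage: 1000 connections with a 65 s lifetime, refreshed during the
# final quarter of that lifetime.
schedule = staggered_refresh_times(1000, 65000, window_fraction=0.25, seed=1)
```

With these assumed parameters the reconnects are spread over roughly 16 seconds rather than arriving together. A longer lifetime (for example one tied to the ticket lifetime, as suggested above) would widen that window further.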