Mostafa Mokhtar created IMPALA-5557:
---------------------------------------

             Summary: Increase and stagger rpc_default_keepalive_time_ms
                 Key: IMPALA-5557
                 URL: https://issues.apache.org/jira/browse/IMPALA-5557
             Project: IMPALA
          Issue Type: Sub-task
          Components: Distributed Exec
            Reporter: Mostafa Mokhtar


Queries were still hitting the KDC hard and eventually failed. This is the 
sequence of events:
# Start running a query
# Impala backends send thousands of TGS_REQ requests to the KDC
# Query would appear to make some progress 
# Some connections succeed; others fail with "Timeout exceeded waiting to connect"
# Then I noticed in the logs that idle connections get killed after 65 seconds
reactor.cc:281] Timing out connection server connection from 10.17.229.14:41804 
- it has been idle for 65.0002s
# Query takes about 2 minutes and fails
# By then most connections are released since they were idle for > 65 seconds
# New queries go through the same process again and eventually fail 

In order to reliably run on a large cluster I believe we need to:
# Change the lifetime of idle connections, possibly extend it to ticket 
lifetime?
# Create new connections before existing ones expire, in a staggered fashion, to 
avoid KDC-related failures?
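
A minimal sketch of what "staggered" could mean here, not how Impala or the RPC layer actually does it. The helper names (staggered_keepalive_ms, refresh_deadline_ms), the 50% jitter, and the 5 second lead time are all hypothetical; only the 65 second idle timeout comes from the log line above. Each connection gets its own idle timeout, stretched by a random amount, so that connections do not all expire (and later re-authenticate against the KDC) at the same moment.

```python
import random

# Idle timeout observed in the log line above (65 s). Everything else in this
# sketch (function names, jitter fraction, lead time) is an assumption.
OBSERVED_KEEPALIVE_MS = 65_000


def staggered_keepalive_ms(base_ms, jitter_fraction, rng):
    """Return base_ms stretched by a random factor in [1, 1 + jitter_fraction]."""
    if base_ms <= 0:
        raise ValueError("base_ms must be positive")
    if not 0.0 <= jitter_fraction <= 1.0:
        raise ValueError("jitter_fraction must be in [0, 1]")
    return int(base_ms * (1.0 + rng.uniform(0.0, jitter_fraction)))


def refresh_deadline_ms(keepalive_ms, lead_ms):
    """Idle time after which a replacement connection should be opened,
    lead_ms before the existing one would be timed out (never negative)."""
    return max(0, keepalive_ms - lead_ms)


if __name__ == "__main__":
    rng = random.Random(5557)  # fixed seed so repeated runs behave the same
    timeouts = [
        staggered_keepalive_ms(OBSERVED_KEEPALIVE_MS, 0.5, rng)
        for _ in range(1000)
    ]
    # With 50% jitter the per-connection timeouts spread over roughly
    # 65 s to 97.5 s instead of all landing on exactly 65 s.
    print("earliest timeout (ms):", min(timeouts))
    print("latest timeout (ms):", max(timeouts))
    print("refresh after idle (ms):", refresh_deadline_ms(timeouts[0], 5_000))
```

The same jitter could be applied on top of a much larger base value (for example something close to the ticket lifetime) if the first proposal is adopted as well.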

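The relevant flag is rpc_default_keepalive_time_ms, which controls how long an 
idle connection is kept before it is timed out (65 seconds in the log above).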



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
