Congling Xia created KYLIN-4432:
-----------------------------------

             Summary: duplicated queries with sytax error take unexpect long 
time when lazy query enabled
                 Key: KYLIN-4432
                 URL: https://issues.apache.org/jira/browse/KYLIN-4432
             Project: Kylin
          Issue Type: Bug
            Reporter: Congling Xia


Hi! Our Kylin server becomes unresponsive recently. All query threads were 
being TIMED_WAITING so that no more thread in the thread pool of Tomcat could 
response new requests. The server logs are full of "Duplicate SQL request"s, 
and many requests retrying for so long time:
{code:java}
$ cat kylin.log.25 | grep 14db5fa0-78e0-4ca6-5d04-92995b2b8367-3133991 | head -3
2020-03-23 12:15:32,279 INFO [Query 
14db5fa0-78e0-4ca6-5d04-92995b2b8367-3133991] service.QueryService:571 : 
Duplicated SQL request is running, waiting...
2020-03-23 12:15:32,379 INFO [Query 
14db5fa0-78e0-4ca6-5d04-92995b2b8367-3133991] service.QueryService:571 : 
Duplicated SQL request is running, waiting...
2020-03-23 12:15:32,479 INFO [Query 
14db5fa0-78e0-4ca6-5d04-92995b2b8367-3133991] service.QueryService:571 : 
Duplicated SQL request is running, waiting...

$ cat kylin.log.25 | grep 14db5fa0-78e0-4ca6-5d04-92995b2b8367-3133991 | tail -3
2020-03-23 12:26:56,920 INFO [Query 
14db5fa0-78e0-4ca6-5d04-92995b2b8367-3133991] service.QueryService:571 : 
Duplicated SQL request is running, waiting...
2020-03-23 12:26:57,020 INFO [Query 
14db5fa0-78e0-4ca6-5d04-92995b2b8367-3133991] service.QueryService:571 : 
Duplicated SQL request is running, waiting...
2020-03-23 12:26:57,121 INFO [Query 
14db5fa0-78e0-4ca6-5d04-92995b2b8367-3133991] service.QueryService:571 : 
Duplicated SQL request is running, waiting...
{code}
This can be re-produced easily:
 * enable lazy query (KYLIN-2897) by setting environment variable 
kylin.query.cache-enabled=true and kylin.query.lazy-query-enabled=true
 * send the same query with syntax error many times in several minutes

The first query will quickly response with error message, and the following 
queries will executing for long time and will repeatly output "Duplicated SQL 
request is running, waiting...".

The code in org.apache.kylin.rest.service.QueryService#searchQueryInCache 
indicates duplicated query will wait at most one minute with default setting of 
LazyQueryWaitingTimeoutMilliSeconds but in fact it waits quite longer than that.
{code:java}
while (response.isRunning()) {
    // Wait at most one minute
    if (System.currentTimeMillis() - response.getLazyQueryStartTime() >= 
getConfig()
            .getLazyQueryWaitingTimeoutMilliSeconds()) {
        cache.evict(sqlRequest.getCacheKey());
        return null;
    }
    logger.info("Duplicated SQL request is running, waiting...");
    try {
        Thread.sleep(100L);
    } catch (InterruptedException e) {
    }
    wrapper = cache.get(sqlRequest.getCacheKey());
    if (wrapper == null) {
        return null;
    }
    response = (SQLResponse) wrapper.get();
    if (response == null) {
        return null;
    }
}
{code}
After some work-around, I find that the dummy response status remains unchanged 
after execution with exception. So the 2nd query will wait 
LazyQueryWaitingTimeoutMilliSeconds, evict the dummy response created by the 
1st query from cache, and put a new dummy response into the cache. And the 3rd 
query may use the new dummy response's start time to compute timeout, and will 
wait for another LazyQueryWaitingTimeoutMilliSeconds.

Suppose we send _n_ identical bad queries to the server quickly. The last query 
may wait for nearly _(n-1)_ times of LazyQueryWaitingTimeoutMilliSeconds. 
That's why the Web container's thread pool was exhausted – user program uses 
bad queries to check availability of Kylin service and retries many times in a 
short time.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to