GWphua opened a new issue, #17475:
URL: https://github.com/apache/druid/issues/17475

   ### Affected Version
   
   Environments are run on Kubernetes clusters:
   Production env: 0.22.1
   Test env: 27
   
   ### Description
   Cluster Info: 2 Historicals, max heap 8G, direct memory size 16G. Limit 
4cpu, 24G memory.
   
   According to the 
[docs](https://druid.apache.org/docs/latest/configuration/#historical-query-configs),
 Historical servers can configure a query timeout which stops the queries. The 
default value is 300_000ms, which translates to 5min. 
   
   This problem is first encountered when my Historical servers are being 
overloaded with slow queries, and I want to set a timeout of 90s on Historicals 
to alleviate this problem. However, setting 
`druid.server.http.defaultQueryTimeout=90000` does not seem to work.
   
   Grafana dashboard reports still record Historical latencies of up to 8min, 
while Brokers show latencies of 5min, which shows that the default 5min 
`druid.server.http.defaultQueryTimeout` configuration is valid for the Brokers. 
(Maybe Historicals need more time to process query cancellations? Read from 
[Cancel a 
Query](https://druid.apache.org/docs/latest/api-reference/sql-api#cancel-a-query)
 that Druid does a "best-effort cancellation", but may need more details on how 
this 'best-effort' is conducted, and whether query cancellation trigger a stop 
Historical processes.)
   
   ![Grafana 
Dashboard](https://github.com/user-attachments/assets/08f3ccbd-fca9-481e-ada3-ea02ea000f70)
   
   Adding a `timeout` under [query 
context](https://druid.apache.org/docs/latest/querying/query-context) does help 
with limiting the latency. Here's a try of adding a timeout in the query 
context from 500ms to 5s, and back: 
   ![Changing Query 
Context](https://github.com/user-attachments/assets/7c83805c-80dc-4259-82fe-e31b2078a39c)
   
   To further investigate, I set up a test cluster and modified the 
`druid.server.http.defaultQueryTimeout` in the `historical/runtime.properties` 
file to a very low value (5ms) to trigger a timeout condition. However, the 
query continued to execute as normal (Historical latencies of up to 10s) 
despite this change. 
   
   I added the following lines to `historical/runtime.properties`, and rebuilt 
the Druid image to run on my Kubernetes setup:
   
   ```yaml
   druid.server.http.defaultQueryTimeout=90000
   druid.server.http.maxQueryTimeout=90000
   ```
   
   Despite these configurations, my Historical servers began experiencing 
issues loading segments, resulting in an inability to query Druid's provided 
`trips_xaa` datasource. The error message indicated that the configured timeout 
had reverted to the default of 5 minutes.
   
   ![Timeout 
Error](https://github.com/user-attachments/assets/229e620f-8a28-4b3f-bfa4-6cf801343488)
   
   In an effort to troubleshoot, I thoroughly checked the configuration 
spellings and reviewed ServerConfig.java but found no discrepancies. This 
raises the possibility that the documentation may misrepresent the ability to 
configure timeouts for Historical servers, suggesting it may only be applicable 
to Broker instances.
   
   ### Request for Clarification
   If Druid is indeed lacking the capability to adjust timeout settings for 
Historical servers or if there has been a change in the configuration naming, I 
kindly request an update to the documentation for clarity.
   
   Thank you for your attention to this matter!
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to