GWphua opened a new issue, #18197: URL: https://github.com/apache/druid/issues/18197
### Affected Version

Since Jan 2024 (v28): https://druid.struct.ai/2Kbbf7, possibly since the introduction of the MM-less extension. The investigation below was conducted on a v31 image.

### Description

This problem occurs when Kubernetes jobs are used in place of MiddleManagers. Clusters with MiddleManagers deployed do not face this problem. (Result for a cluster with MMs below)

When performing INSERT INTO queries on Druid v31, the web console returns a "Request failed with status code 404" message. (Result for an MM-less cluster below)

#### Stack Trace

```
2025-07-01T03:18:16,150 WARN [qtp1510570627-108] org.apache.druid.indexing.overlord.http.OverlordResource - Failed to stream task reports for task query-be62b874-c86c-4079-8a71-a0867f6953c2
org.jboss.netty.channel.ChannelException: Faulty channel in resource pool
	at org.apache.druid.java.util.http.client.NettyHttpClient.go(NettyHttpClient.java:134) ~[druid-processing-31.0.1-SNAPSHOT.jar:31.0.1-SNAPSHOT]
	at org.apache.druid.java.util.http.client.CredentialedHttpClient.go(CredentialedHttpClient.java:48) ~[druid-processing-31.0.1-SNAPSHOT.jar:31.0.1-SNAPSHOT]
	at org.apache.druid.java.util.http.client.AbstractHttpClient.go(AbstractHttpClient.java:33) ~[druid-processing-31.0.1-SNAPSHOT.jar:31.0.1-SNAPSHOT]
	at org.apache.druid.k8s.overlord.KubernetesTaskRunner.streamTaskReports(KubernetesTaskRunner.java:298) ~[?:?]
	at org.apache.druid.indexing.common.tasklogs.TaskRunnerTaskLogStreamer.streamTaskReports(TaskRunnerTaskLogStreamer.java:59) ~[druid-indexing-service-31.0.1-SNAPSHOT.jar:31.0.1-SNAPSHOT]
	at org.apache.druid.indexing.common.tasklogs.SwitchingTaskLogStreamer.streamTaskReports(SwitchingTaskLogStreamer.java:94) ~[druid-indexing-service-31.0.1-SNAPSHOT.jar:31.0.1-SNAPSHOT]
	at org.apache.druid.indexing.overlord.http.OverlordResource.doGetReports(OverlordResource.java:810) ~[druid-indexing-service-31.0.1-SNAPSHOT.jar:31.0.1-SNAPSHOT]
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:?]
...
Caused by: java.net.ConnectException: Connection refused: /XXXX:8100
...
```

### Root Cause Analysis

1. The query succeeds, and the segments are pushed to the Historicals.
2. On the dashboard, the execution details are queried using the API `druid/v2/sql/statement/{queryId}`.
3. The following error is found in the webpage console: `{"error":"druidException","errorCode":"notFound","persona":"USER","category":"NOT_FOUND","errorMessage":"Query [query-a72b5075-0dae-48bf-b1b8-7f66c2e070e6] was not found. The query details are no longer present or might not be of the type [query_controller]. Verify that the id is correct.","context":{}}`

### Code Dive

#### Looking into the Dashboard API Call

1. The dashboard makes a GET call to `druid/v2/sql/statement/{queryId}?detail=true`.
2. The API is provided by `SqlStatementResource#doGetStatus`, which proceeds through the following steps:
   1. Authenticate and authorise the request.
   2. Get the statement status for the query through `SqlStatementResource#getStatementStatus`.
   3. Return the statement status as an OK `Response` if it exists; otherwise build and return a non-OK `Response`.

#### Get Statement Status

1. Contact the Overlord for the task status with `taskId` = `queryId` at `/druid/indexer/v1/tasks/{taskId}/status`.
   - Calls `OverlordResource#getTaskStatus`.
2.
   Contact the Overlord for the task report with `taskId` = `queryId` at `/druid/indexer/v1/tasks/{taskId}/reports`.
   - Calls `OverlordResource#doGetReports`, which calls `SwitchingTaskLogStreamer#streamTaskReports` in the K8s context.
   - `SwitchingTaskLogStreamer#streamTaskReports` first queries the task report from the Kubernetes peon pod, before trying to find the task report in deep storage.
   - The stack trace shows that there is a problem reaching the Kubernetes peon pod after task shutdown, causing the Overlord to return a 'Faulty channel in resource pool' error.
   - When a task finishes, `AbstractTask#cleanUp` tries to write the task status and report to deep storage. From debugging experience, there is a problem in this area.

#### Kubernetes Task Reports Not Uploaded Into Deep Storage During Task Cleanup

1. Kubernetes tasks fail to push their report to deep storage after completing, due to a missing report file.
2. During task execution, the report file exists at the path `./var/tmp/attempt/{attemptId}/report.json`.
3. Through examination, we note that `AbstractTask#cleanUp` tries to read from `./var/tmp/{queryId}/attempt/{attemptId}/report.json` instead.
4. This problem does not happen on the MM architecture.

#### Finding Task Report + Status for Peon

1. Under `CliPeon`, `TaskReportFileWriter` is bound to `SingleFileTaskReportFileWriter`.
2. During my own debugging session, I added a log line that prints the directory path the task runs in.
3. `TaskReportFileWriter` writes the files to the configured `taskDirPath`.
4. Below are the logs from the MM-less architecture. Note that the missing `queryId` is possibly a cause of this bug.

   ```log
   2025-07-03T09:57:39,395 INFO [main] org.apache.druid.cli.CliPeon - Running peon in k8s mode
   2025-07-03T09:57:39,395 INFO [main] org.apache.druid.cli.CliPeon - Running peon task in taskDirPath[/opt/druid/var/tmp/] attemptId[1]
   ```

5. Note that this taskDirPath differs from the expected `/opt/druid/var/tmp/{queryId}`.
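The directory mismatch described in the list above can be sketched as follows. This is an illustrative reconstruction, not the actual Druid code: the method names `writtenReportPath` and `expectedReportPath` are hypothetical, standing in for the write done via `SingleFileTaskReportFileWriter` and the read done during `AbstractTask#cleanUp`.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical sketch of the report-path mismatch; names are illustrative,
// not actual Druid APIs.
public class ReportPathMismatch
{
  // Writer side: the peon writes report.json under the taskDirPath it was
  // started with, plus the attempt subdirectory.
  static Path writtenReportPath(String taskDirPath, String attemptId)
  {
    return Paths.get(taskDirPath, "attempt", attemptId, "report.json");
  }

  // Reader side: cleanup expects the task id to be part of the path.
  static Path expectedReportPath(String baseDir, String taskId, String attemptId)
  {
    return Paths.get(baseDir, taskId, "attempt", attemptId, "report.json");
  }

  public static void main(String[] args)
  {
    String attemptId = "1";
    // MM-less peon: taskDirPath lacks the task id ("/opt/druid/var/tmp/").
    Path written = writtenReportPath("/opt/druid/var/tmp", attemptId);
    // Cleanup looks under {baseDir}/{taskId}/attempt/{attemptId} instead.
    Path expected = expectedReportPath("/opt/druid/var/tmp", "query-abc", attemptId);
    System.out.println(written);   // /opt/druid/var/tmp/attempt/1/report.json
    System.out.println(expected);  // /opt/druid/var/tmp/query-abc/attempt/1/report.json
    System.out.println(written.equals(expected)); // false -> report "missing" at cleanup
  }
}
```

Because the two paths never agree, cleanup cannot find the report file and nothing is pushed to deep storage, which matches the behaviour observed above.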
For comparison, the logs from the MM architecture show a taskDirPath that includes the task id:

```log
2025-07-03T09:50:57,522 INFO [main] org.apache.druid.cli.CliPeon - Running peon task in taskDirPath[var/druid/task/slot0/query-ae23e2ef-283a-4546-a295-68995bd90fe1] attemptId[1]
```

The mismatch in the task directory is possibly the root cause of this issue. I am aiming to have a fix written by early next week.
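One possible direction for such a fix can be sketched as below, under the assumption that the peon's task directory should always end with the task id, matching what the MM architecture already does. The `TaskDirResolver` class and its method are hypothetical and are not existing Druid code.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical sketch only: normalise the task directory so that it always
// includes the task id, appending it when the launcher passed only a base
// directory (the observed MM-less behaviour).
public class TaskDirResolver
{
  static Path resolveTaskDir(String configuredTaskDirPath, String taskId)
  {
    Path configured = Paths.get(configuredTaskDirPath);
    // MM architecture: the slot directory already ends with the task id.
    if (configured.getFileName() != null && configured.getFileName().toString().equals(taskId)) {
      return configured;
    }
    // MM-less architecture: append the missing task id.
    return configured.resolve(taskId);
  }

  public static void main(String[] args)
  {
    // MM-less: "/opt/druid/var/tmp/" lacks the task id, so it is appended.
    System.out.println(resolveTaskDir("/opt/druid/var/tmp", "query-abc"));
    // MM: the slot directory already includes the task id, so it is kept.
    System.out.println(resolveTaskDir("var/druid/task/slot0/query-abc", "query-abc"));
  }
}
```

With a resolver like this applied consistently on both the write and cleanup paths, the report would be written and read from the same `{taskDirPath}/{taskId}/attempt/{attemptId}` location.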
