pchang388 commented on issue #12701: URL: https://github.com/apache/druid/issues/12701#issuecomment-1185848032
Due to these current issues, we aren't enabling all of our supervisors, since the task completion % seems to correlate with the number of running tasks/load that we have in the Druid cluster. But with the `chatRetries` increase workaround, I did try turning on some additional supervisors, which increases the number of running tasks in our cluster. What happened after that was serious performance degradation on the Overlord (possibly more services, but the Overlord was the most noticeable, especially since we use the Overlord console/UI - REDACT.com/unified-console.html#ingestion - to see how things are running). Task rollout, and task completion management from the rollout, were seriously delayed during this time, and tasks were either failing or very delayed (not respecting the `taskDuration` of 1 hour and `completionTimeout` of 30 or 45 min). Examples:

1. Some tasks failing with:
   * `"errorMsg": "Task [index_kafka_REDACT_941fd57f52aebbb_gbbmjhmp] returned empty offsets after pause"`
2. A few with:
   * `"errorMsg": "The worker that this task is assigned did not start it in timeout[PT10M]. See overlord logs for more..."`
3. A few with:
   * `"errorMsg": "Task [index_kafka_REDACT_b4b8fdbe7d46f26_mbljmdld] failed to return status, killing task"`
   * `"errorMsg": "Task [index_kafka_REDACT_ff20e3161a9445e_bkjimalf] failed to stop in a timely manner, killing task"`
   * `"errorMsg": "Task [index_kafka_REDACT_b5157008402d2aa_ogjhbpod] failed to return start time, killing task"`
   * 1-2 with the usual error: `"errorMsg": "An exception occured while waiting for task [index_kafka_REDACT_091c74b39f9c912_hckphlkm] to pause: [..."`
4.
A few long-running tasks that did not seem to be tracked/managed properly by the Overlord (super long running) - as if it couldn't keep up with everything going on and lost track of these tasks, but eventually got to them:
   * `SUCCESS - Duration: 2:16:05`
   * `SUCCESS - Duration: 2:17:05`
   * `SUCCESS - Duration: 2:16:06`

I posted our Overlord config earlier for reference, but it sounds like we may need some additional tuning there, especially at our scale. That said, I'm sure others ingest more than we do and may not have seen these issues, so it may be specific to our setup or dependencies (object storage and/or metadata DB). I also noticed that when we try to run more of our supervisors, TCP connections made to the Overlord spike heavily, but resource usage (16 CPU and 128G RAM servers) does not, which seems unexpected to me, since it should consume more to keep up:

* TCP connections made to the Overlord before it started having issues managing tasks; we failed over to the 2nd replica ~2-3 AM on the chart
* CPU/MEM Node Exporter stats for the Overlord with all tasks enabled - very high CPU load average as well
* JVM heap usage for both Overlords
* Also for reference, Peon heap usage with all tasks enabled (their full config is given in a previous comment, but 3G heap in the current config)
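For anyone following along, the settings discussed above (`taskDuration`, `completionTimeout`, and the `chatRetries` workaround) live in the Kafka supervisor spec's `ioConfig` and `tuningConfig`. A minimal sketch with placeholder topic/datasource names and illustrative values (the actual values in our cluster are in the earlier comments) might look like:

```json
{
  "type": "kafka",
  "spec": {
    "dataSchema": { "dataSource": "REDACT" },
    "ioConfig": {
      "topic": "REDACT",
      "taskDuration": "PT1H",
      "completionTimeout": "PT45M"
    },
    "tuningConfig": {
      "type": "kafka",
      "chatRetries": 12,
      "httpTimeout": "PT30S"
    }
  }
}
```

Bumping `chatRetries` (and optionally `httpTimeout`) above their defaults is what masked the pause/offset errors for us; it doesn't address the underlying Overlord load issue.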
