pchang388 commented on issue #12701:
URL: https://github.com/apache/druid/issues/12701#issuecomment-1185848032

   Due to these current issues, we aren't enabling all our supervisors, since the task completion failure rate seems to correlate with the number of running tasks/load that we have in the Druid cluster.
   
   But with the chatRetries increase workaround in place, I did try to turn on some additional supervisors, which increased the number of running tasks in our cluster. What followed was serious performance degradation on the Overlord (possibly more services, but the Overlord was the most noticeable, especially since we use the Overlord console/UI - REDACT.com/unified-console.html#ingestion - to see how things are running).
   
   Task rollout and task completion management were seriously delayed during this time, and tasks were either failing or very delayed (not respecting the taskDuration of 1 hour or the completionTimeout of 30 or 45 minutes). Examples:
   
   1. Some tasks failing with: `"errorMsg": "Task 
[index_kafka_REDACT_941fd57f52aebbb_gbbmjhmp] returned empty offsets after 
pause"`
   2. A few with: `"errorMsg": "The worker that this task is assigned did not 
start it in timeout[PT10M]. See overlord logs for more..."`
   3. A few with: 
   * `"errorMsg": "Task [index_kafka_REDACT_b4b8fdbe7d46f26_mbljmdld] failed to 
return status, killing task"`
   * `"errorMsg": "Task [index_kafka_REDACT_ff20e3161a9445e_bkjimalf] failed to 
stop in a timely manner, killing task"`
   * `"errorMsg": "Task [index_kafka_REDACT_b5157008402d2aa_ogjhbpod] failed to 
return start time, killing task"`
   * 1-2 with the usual error: ` "errorMsg": "An exception occured while 
waiting for task [index_kafka_REDACT_091c74b39f9c912_hckphlkm] to pause: [..."`
   4. A few long-running tasks that did not seem to be tracked/managed properly by the Overlord (they ran far past their configured duration) - as if it couldn't keep up with everything going on and lost track of these tasks, though it eventually got to them:
   * `SUCCESS - Duration: 2:16:05`
   * `SUCCESS - Duration: 2:17:05`
   * `SUCCESS - Duration: 2:16:06`
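   For context, a minimal sketch of the supervisor settings mentioned above (taskDuration, completionTimeout, and the chatRetries workaround). The datasource/topic names are placeholders, the chatRetries value is illustrative, and field placement follows my reading of the Druid Kafka supervisor docs - not our exact spec:

   ```python
   import json

   # Hedged sketch, not our production spec: illustrates where taskDuration,
   # completionTimeout, and chatRetries live in a Kafka supervisor spec.
   supervisor_spec = {
       "type": "kafka",
       "dataSchema": {"dataSource": "REDACT"},   # placeholder datasource
       "ioConfig": {
           "topic": "REDACT",                    # placeholder topic
           "taskDuration": "PT1H",               # 1 hour, as described above
           "completionTimeout": "PT30M",         # 30 min (45 min on some supervisors)
       },
       "tuningConfig": {
           "type": "kafka",
           "chatRetries": 16,                    # raised above the default as a workaround
       },
   }

   print(json.dumps(supervisor_spec, indent=2))
   ```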
   
   I posted our Overlord config earlier for reference, and it sounds like we may need some additional tuning there, especially at our scale. But I think others ingest more than us without seeing these issues, so it may be specific to our setup or dependencies (object storage and/or metadata DB). I also noticed that when we try to run more of our supervisors, TCP connections made to the Overlord spike heavily, but resource usage (on 16 CPU / 128G RAM servers) does not, which seems unexpected to me since it should consume more to keep up:
   
   * TCP connections made to the Overlord before it started having issues managing tasks; we failed over to the 2nd replica around 2-3 AM on the chart:
   
![image](https://user-images.githubusercontent.com/51681873/179297846-470b64bf-e3d0-4d36-8530-aa2a9046393d.png)
   
   * CPU/MEM Node Exporter stats for the Overlord while all tasks were enabled - very high CPU load average as well:
   
![image](https://user-images.githubusercontent.com/51681873/179297980-db56eb5b-eac3-4e0a-918e-806b7ccbc72e.png)
   
   * JVM heap usage for both Overlords:
   
![image](https://user-images.githubusercontent.com/51681873/179298124-741b8b80-9a2e-454f-8655-4a333c43d08b.png)
   
   * Also for reference, Peon heap usage while all tasks were enabled (their full config is given in a previous comment, but the current config gives a 3G heap):
   
![image](https://user-images.githubusercontent.com/51681873/179298348-c0e27c4d-cc45-4b5f-8818-deb70fd6a670.png)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
