pchang388 commented on issue #12701: URL: https://github.com/apache/druid/issues/12701#issuecomment-1190863443
Hi @abhishekagarwal87 and @AmatyaAvadhanula, I reverted back to the 0.23.0 official release for MM/Peons. I reran the sjk profiler on the Overlord and Peons for a smaller period of time (5 minutes) as requested. One of the peons, had 21 messages for `Still waiting for task [index_kafka_REDACT_36184a251c6f5d5_chhibjdl] to change its status to [PAUSED]; will try again in...` Ran sjk profiler using flags as shown below, example for peon: ``` nohup /opt/druid/sjk-profiling/java-1.8.0-openjdk-1.8.0.332.b09-2.el8_6.x86_64/bin/java -jar sjk.jar stcap -p 1142097 -o druid-1142097-peon.std -t 3600s -i 50ms > profile_peon_1142097.log 2>&1 & ``` Under a normal configuration/settings, this task would have failed since it exhausts the default chatRetries count (https://druid.apache.org/docs/latest/development/extensions-core/kafka-supervisor-reference.html): <html> <body> <!--StartFragment--> chatRetries | Integer | The number of times HTTP requests to indexing tasks will be retried before considering tasks unresponsive. | no (default == 8) -- | -- | -- | -- <!--EndFragment--> </body> </html> The task was only allowed to succeeded because we increased chatRetries from 8 => 25 as a workaround to get tasks to complete for now. If you'd like, I can turn that back down to 8 and let some tasks fail instead but my thought was since we are looking specifically pause request window, the outcome (SUCCESS/FAILED) of the tasks is not that important. Please let me know if you prefer to let it fail (we still have some failing but the chatRetries increase helps a lot but still 10-15% of tasks failing intermittently) and also capture the stack dump of the failure window and I can run it for a longer interval. It's difficult to find out which one will fail and try to get that specific stack dump but I'm working on an ansible playbook that will grep/grab the PID and do it for all Peons, so it may be easier the next time. Here is the Overlord logs showing the 21 tries, which seems very high, and the time to pause was actually ~194 seconds or ~3.23 minutes: ``` 2022-07-20T22:36:01,054 INFO [IndexTaskClient-REDACT-2] org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskClient - Still waiting for task [index_kafka_REDACT_36184a251c6f5d5_chhibjdl] to change its status to [PAUSED]; will try again in [PT2S] 2022-07-20T22:36:03,063 INFO [IndexTaskClient-REDACT-2] org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskClient - Still waiting for task [index_kafka_REDACT_36184a251c6f5d5_chhibjdl] to change its status to [PAUSED]; will try again in [PT4S] 2022-07-20T22:36:07,072 INFO [IndexTaskClient-REDACT-2] org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskClient - Still waiting for task [index_kafka_REDACT_36184a251c6f5d5_chhibjdl] to change its status to [PAUSED]; will try again in [PT8S] 2022-07-20T22:36:15,077 INFO [IndexTaskClient-REDACT-2] org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskClient - Still waiting for task [index_kafka_REDACT_36184a251c6f5d5_chhibjdl] to change its status to [PAUSED]; will try again in [PT10S] 2022-07-20T22:36:25,082 INFO [IndexTaskClient-REDACT-2] org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskClient - Still waiting for task [index_kafka_REDACT_36184a251c6f5d5_chhibjdl] to change its status to [PAUSED]; will try again in [PT10S] 2022-07-20T22:36:35,087 INFO [IndexTaskClient-REDACT-2] org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskClient - Still waiting for task [index_kafka_REDACT_36184a251c6f5d5_chhibjdl] to change its status to [PAUSED]; will try again in [PT10S] 2022-07-20T22:36:45,094 INFO [IndexTaskClient-REDACT-2] org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskClient - Still waiting for task [index_kafka_REDACT_36184a251c6f5d5_chhibjdl] to change its status to [PAUSED]; will try again in [PT10S] 2022-07-20T22:36:55,102 INFO [IndexTaskClient-REDACT-2] org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskClient - Still waiting for task [index_kafka_REDACT_36184a251c6f5d5_chhibjdl] to change its status to [PAUSED]; will try again in [PT10S] 2022-07-20T22:37:05,114 INFO [IndexTaskClient-REDACT-2] org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskClient - Still waiting for task [index_kafka_REDACT_36184a251c6f5d5_chhibjdl] to change its status to [PAUSED]; will try again in [PT10S] 2022-07-20T22:37:15,121 INFO [IndexTaskClient-REDACT-2] org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskClient - Still waiting for task [index_kafka_REDACT_36184a251c6f5d5_chhibjdl] to change its status to [PAUSED]; will try again in [PT10S] 2022-07-20T22:37:25,127 INFO [IndexTaskClient-REDACT-2] org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskClient - Still waiting for task [index_kafka_REDACT_36184a251c6f5d5_chhibjdl] to change its status to [PAUSED]; will try again in [PT10S] 2022-07-20T22:37:35,134 INFO [IndexTaskClient-REDACT-2] org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskClient - Still waiting for task [index_kafka_REDACT_36184a251c6f5d5_chhibjdl] to change its status to [PAUSED]; will try again in [PT10S] 2022-07-20T22:37:45,144 INFO [IndexTaskClient-REDACT-2] org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskClient - Still waiting for task [index_kafka_REDACT_36184a251c6f5d5_chhibjdl] to change its status to [PAUSED]; will try again in [PT10S] 2022-07-20T22:37:55,152 INFO [IndexTaskClient-REDACT-2] org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskClient - Still waiting for task [index_kafka_REDACT_36184a251c6f5d5_chhibjdl] to change its status to [PAUSED]; will try again in [PT10S] 2022-07-20T22:38:05,160 INFO [IndexTaskClient-REDACT-2] org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskClient - Still waiting for task [index_kafka_REDACT_36184a251c6f5d5_chhibjdl] to change its status to [PAUSED]; will try again in [PT10S] 2022-07-20T22:38:15,167 INFO [IndexTaskClient-REDACT-2] org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskClient - Still waiting for task [index_kafka_REDACT_36184a251c6f5d5_chhibjdl] to change its status to [PAUSED]; will try again in [PT10S] 2022-07-20T22:38:25,176 INFO [IndexTaskClient-REDACT-2] org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskClient - Still waiting for task [index_kafka_REDACT_36184a251c6f5d5_chhibjdl] to change its status to [PAUSED]; will try again in [PT10S] 2022-07-20T22:38:35,181 INFO [IndexTaskClient-REDACT-2] org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskClient - Still waiting for task [index_kafka_REDACT_36184a251c6f5d5_chhibjdl] to change its status to [PAUSED]; will try again in [PT10S] 2022-07-20T22:38:45,188 INFO [IndexTaskClient-REDACT-2] org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskClient - Still waiting for task [index_kafka_REDACT_36184a251c6f5d5_chhibjdl] to change its status to [PAUSED]; will try again in [PT10S] 2022-07-20T22:38:55,195 INFO [IndexTaskClient-REDACT-2] org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskClient - Still waiting for task [index_kafka_REDACT_36184a251c6f5d5_chhibjdl] to change its status to [PAUSED]; will try again in [PT10S] 2022-07-20T22:39:05,204 INFO [IndexTaskClient-REDACT-2] org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskClient - Still waiting for task [index_kafka_REDACT_36184a251c6f5d5_chhibjdl] to change its status to [PAUSED]; will try again in [PT10S] ``` I started the sjk profiler (on Overlord and Peon for this specific task) as soon as the first Overlord log showed up: ~`2022-07-20T22:36:01,054` and it stopped after 5 minutes at: ~ `2022-07-20T22:41:30` Attached is a zip of the Overlord and Peon Flame HTML files, please let me know if you prefer to also see the stack dump file (.std extension) and I can include that in a different comment as well. [index_kafka_REDACT_36184a251c6f5d5_chhibjdl_flame_graphs.zip](https://github.com/apache/druid/files/9154988/index_kafka_REDACT_36184a251c6f5d5_chhibjdl_flame_graphs.zip) Once again, thank you guys for continuing to assist me and I hope we learn something from this. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
