pchang388 commented on issue #12701:
URL: https://github.com/apache/druid/issues/12701#issuecomment-1190863443

   Hi @abhishekagarwal87  and @AmatyaAvadhanula,
   
   I reverted back to the 0.23.0 official release for MM/Peons. 
   
   I reran the sjk profiler on the Overlord and Peons for a smaller period of 
time (5 minutes) as requested. One of the peons, had 21 messages for `Still 
waiting for task [index_kafka_REDACT_36184a251c6f5d5_chhibjdl] to change its 
status to [PAUSED]; will try again in...`
   
   Ran sjk profiler using flags as shown below, example for peon:
   ```
   nohup 
/opt/druid/sjk-profiling/java-1.8.0-openjdk-1.8.0.332.b09-2.el8_6.x86_64/bin/java
 -jar sjk.jar stcap -p 1142097 -o druid-1142097-peon.std -t 3600s -i 50ms > 
profile_peon_1142097.log 2>&1 &
   ```
   
   Under a normal configuration/settings, this task would have failed since it 
exhausts the default chatRetries count 
(https://druid.apache.org/docs/latest/development/extensions-core/kafka-supervisor-reference.html):
 
   
   <html>
   <body>
   <!--StartFragment-->
   
   chatRetries | Integer | The number of times HTTP requests to indexing tasks 
will be retried before considering tasks unresponsive. | no (default == 8)
   -- | -- | -- | --
   
   
   <!--EndFragment-->
   </body>
   </html>
   
   The task was only allowed to succeeded because we increased chatRetries from 
8 => 25 as a workaround to get tasks to complete for now. If you'd like, I can 
turn that back down to 8 and let some tasks fail instead but my thought was 
since we are looking specifically pause request window, the outcome 
(SUCCESS/FAILED) of the tasks is not that important. Please let me know if you 
prefer to let it fail (we still have some failing but the chatRetries increase 
helps a lot but still 10-15% of tasks failing intermittently) and also capture 
the stack dump of the failure window and I can run it for a longer interval. 
It's difficult to find out which one will fail and try to get that specific 
stack dump but I'm working on an ansible playbook that will grep/grab the PID 
and do it for all Peons, so it may be easier the next time.
   
   Here is the Overlord logs showing the 21 tries, which seems very high, and 
the time to pause was actually ~194 seconds or ~3.23 minutes:
   ```
   2022-07-20T22:36:01,054 INFO [IndexTaskClient-REDACT-2] 
org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskClient - Still 
waiting for task [index_kafka_REDACT_36184a251c6f5d5_chhibjdl] to change its 
status to [PAUSED]; will try again in [PT2S]
   2022-07-20T22:36:03,063 INFO [IndexTaskClient-REDACT-2] 
org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskClient - Still 
waiting for task [index_kafka_REDACT_36184a251c6f5d5_chhibjdl] to change its 
status to [PAUSED]; will try again in [PT4S]
   2022-07-20T22:36:07,072 INFO [IndexTaskClient-REDACT-2] 
org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskClient - Still 
waiting for task [index_kafka_REDACT_36184a251c6f5d5_chhibjdl] to change its 
status to [PAUSED]; will try again in [PT8S]
   2022-07-20T22:36:15,077 INFO [IndexTaskClient-REDACT-2] 
org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskClient - Still 
waiting for task [index_kafka_REDACT_36184a251c6f5d5_chhibjdl] to change its 
status to [PAUSED]; will try again in [PT10S]
   2022-07-20T22:36:25,082 INFO [IndexTaskClient-REDACT-2] 
org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskClient - Still 
waiting for task [index_kafka_REDACT_36184a251c6f5d5_chhibjdl] to change its 
status to [PAUSED]; will try again in [PT10S]
   2022-07-20T22:36:35,087 INFO [IndexTaskClient-REDACT-2] 
org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskClient - Still 
waiting for task [index_kafka_REDACT_36184a251c6f5d5_chhibjdl] to change its 
status to [PAUSED]; will try again in [PT10S]
   2022-07-20T22:36:45,094 INFO [IndexTaskClient-REDACT-2] 
org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskClient - Still 
waiting for task [index_kafka_REDACT_36184a251c6f5d5_chhibjdl] to change its 
status to [PAUSED]; will try again in [PT10S]
   2022-07-20T22:36:55,102 INFO [IndexTaskClient-REDACT-2] 
org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskClient - Still 
waiting for task [index_kafka_REDACT_36184a251c6f5d5_chhibjdl] to change its 
status to [PAUSED]; will try again in [PT10S]
   2022-07-20T22:37:05,114 INFO [IndexTaskClient-REDACT-2] 
org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskClient - Still 
waiting for task [index_kafka_REDACT_36184a251c6f5d5_chhibjdl] to change its 
status to [PAUSED]; will try again in [PT10S]
   2022-07-20T22:37:15,121 INFO [IndexTaskClient-REDACT-2] 
org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskClient - Still 
waiting for task [index_kafka_REDACT_36184a251c6f5d5_chhibjdl] to change its 
status to [PAUSED]; will try again in [PT10S]
   2022-07-20T22:37:25,127 INFO [IndexTaskClient-REDACT-2] 
org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskClient - Still 
waiting for task [index_kafka_REDACT_36184a251c6f5d5_chhibjdl] to change its 
status to [PAUSED]; will try again in [PT10S]
   2022-07-20T22:37:35,134 INFO [IndexTaskClient-REDACT-2] 
org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskClient - Still 
waiting for task [index_kafka_REDACT_36184a251c6f5d5_chhibjdl] to change its 
status to [PAUSED]; will try again in [PT10S]
   2022-07-20T22:37:45,144 INFO [IndexTaskClient-REDACT-2] 
org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskClient - Still 
waiting for task [index_kafka_REDACT_36184a251c6f5d5_chhibjdl] to change its 
status to [PAUSED]; will try again in [PT10S]
   2022-07-20T22:37:55,152 INFO [IndexTaskClient-REDACT-2] 
org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskClient - Still 
waiting for task [index_kafka_REDACT_36184a251c6f5d5_chhibjdl] to change its 
status to [PAUSED]; will try again in [PT10S]
   2022-07-20T22:38:05,160 INFO [IndexTaskClient-REDACT-2] 
org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskClient - Still 
waiting for task [index_kafka_REDACT_36184a251c6f5d5_chhibjdl] to change its 
status to [PAUSED]; will try again in [PT10S]
   2022-07-20T22:38:15,167 INFO [IndexTaskClient-REDACT-2] 
org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskClient - Still 
waiting for task [index_kafka_REDACT_36184a251c6f5d5_chhibjdl] to change its 
status to [PAUSED]; will try again in [PT10S]
   2022-07-20T22:38:25,176 INFO [IndexTaskClient-REDACT-2] 
org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskClient - Still 
waiting for task [index_kafka_REDACT_36184a251c6f5d5_chhibjdl] to change its 
status to [PAUSED]; will try again in [PT10S]
   2022-07-20T22:38:35,181 INFO [IndexTaskClient-REDACT-2] 
org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskClient - Still 
waiting for task [index_kafka_REDACT_36184a251c6f5d5_chhibjdl] to change its 
status to [PAUSED]; will try again in [PT10S]
   2022-07-20T22:38:45,188 INFO [IndexTaskClient-REDACT-2] 
org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskClient - Still 
waiting for task [index_kafka_REDACT_36184a251c6f5d5_chhibjdl] to change its 
status to [PAUSED]; will try again in [PT10S]
   2022-07-20T22:38:55,195 INFO [IndexTaskClient-REDACT-2] 
org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskClient - Still 
waiting for task [index_kafka_REDACT_36184a251c6f5d5_chhibjdl] to change its 
status to [PAUSED]; will try again in [PT10S]
   2022-07-20T22:39:05,204 INFO [IndexTaskClient-REDACT-2] 
org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskClient - Still 
waiting for task [index_kafka_REDACT_36184a251c6f5d5_chhibjdl] to change its 
status to [PAUSED]; will try again in [PT10S]
   ```
   
   I started the sjk profiler (on Overlord and Peon for this specific task) as 
soon as the first Overlord log showed up: ~`2022-07-20T22:36:01,054` and it 
stopped after 5 minutes at: ~ `2022-07-20T22:41:30`
   
   Attached is a zip of the Overlord and Peon Flame HTML files, please let me 
know if you prefer to also see the stack dump file (.std extension) and I can 
include that in a different comment as well. 
   
   
[index_kafka_REDACT_36184a251c6f5d5_chhibjdl_flame_graphs.zip](https://github.com/apache/druid/files/9154988/index_kafka_REDACT_36184a251c6f5d5_chhibjdl_flame_graphs.zip)
   
   Once again, thank you guys for continuing to assist me and I hope we learn 
something from this.
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to