pchang388 commented on issue #12701: URL: https://github.com/apache/druid/issues/12701#issuecomment-1185869951
Since the Peon seems unable to pause within a reasonable timeframe, and at times appears unresponsive/hung, I took a look at metrics for the common actions it performs during a task lifecycle. According to the docs:

> An indexing task starts running and building a new segment. It must determine the identifier of the segment before it starts building it. For a task that is appending (like a Kafka task, or an index task in append mode) this is done by calling an "allocate" API on the Overlord to potentially add a new partition to an existing set of segments. For a task that is overwriting (like a Hadoop task, or an index task not in append mode) this is done by locking an interval and creating a new version number and new set of segments. When the indexing task has finished reading data for the segment, it pushes it to deep storage and then publishes it by writing a record into the metadata store.

So during the `READING` phase, the task talks to the Overlord via the allocate API to add new partitions to an existing set of segments (we did see a few tasks fail because they didn't pause while still in the `READING` phase; I gave an example earlier). During the `PUBLISH` phase it pushes segments to the object store and writes to the metadata DB. Looking at the general state of both:

1. SQL read/write/update performance in our metadata DB (we are using YugabyteDB, a distributed/HA Postgres, in Kubernetes due to VM capacity constraints on our side). Not much data yet since I only recently enabled Prometheus scraping for it:
    * Select and delete latencies appear to be quite high. Depending on the application, select statements should usually return quickly (< 1 second), especially for user-visible processes, and ours seem quite slow, though I'm unsure how much of an effect this has on tasks. Write performance seems to be okay. (See the `pg_stat_statements` sketch below for how I plan to dig into this.)
2. Object storage pushes and persists by the Peon: some of the larger segments take longer than expected, especially if multipart upload is involved, though I'm unsure whether Druid uses it (see the upload baseline sketch below):
   ```
   Segment[REDACT_2022-07-14T18:00:00.000Z_2022-07-14T19:00:00.000Z_2022-07-14T19:24:01.698Z_14] of 274,471,389 bytes built from 27 incremental persist(s) in 42,830ms; pushed to deep storage in 47,408ms
   Segment[REDACT_2022-07-14T17:00:00.000Z_2022-07-14T18:00:00.000Z_2022-07-14T17:47:19.425Z_32] of 42,782,021 bytes built from 12 incremental persist(s) in 4,177ms; pushed to deep storage in 5,958ms
   Segment[REDACT_2022-07-14T19:00:00.000Z_2022-07-14T20:00:00.000Z_2022-07-14T20:41:44.206Z_4] of 224,815,291 bytes built from 22 incremental persist(s) in 33,123ms; pushed to deep storage in 40,514ms
   ```

I hope this background information provides more detail on our setup/configuration and makes it easier to spot the potential bottleneck (the Overlord looks like a candidate; see the probe sketch below). I really appreciate the help @abhishekagarwal87 and @AmatyaAvadhanula. My next step is to get flame graphs for the Peons to see what the threads are doing. Please let me know if you have any further suggestions, things to try, or other information I should provide.
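As a quick first data point on whether the Overlord itself is slow to respond (rather than just the allocate action), here is a rough probe sketch. It only assumes the standard `GET /druid/indexer/v1/runningTasks` Overlord endpoint; the host/port are placeholders for our environment:

```python
# Rough probe: repeatedly time a cheap Overlord API call to see whether the
# Overlord is slow/unresponsive in general, not just for segment allocation.
# The host/port below are placeholders for our environment.
import time

import requests

OVERLORD = "http://overlord:8090"  # placeholder

for _ in range(10):
    start = time.monotonic()
    resp = requests.get(f"{OVERLORD}/druid/indexer/v1/runningTasks", timeout=30)
    elapsed = time.monotonic() - start
    print(f"HTTP {resp.status_code}: {len(resp.json())} running tasks in {elapsed:.2f}s")
    time.sleep(5)
```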
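To quantify the slow selects/deletes on the metadata tables, something like the following could pull per-statement timings straight from the database. This is a sketch assuming the `pg_stat_statements` extension is enabled on the YugabyteDB side; the column names assume PostgreSQL 13+ naming (`mean_exec_time`; older versions use `mean_time`), and the connection parameters are placeholders:

```python
# Sketch: list the slowest statements touching the Druid metadata tables
# (druid_segments, druid_pendingsegments, druid_tasks, ...) via
# pg_stat_statements. Connection parameters are placeholders.
import psycopg2

conn = psycopg2.connect(host="metadata-db", port=5433,
                        dbname="druid", user="druid", password="changeme")
with conn, conn.cursor() as cur:
    cur.execute("""
        SELECT query, calls, round(mean_exec_time::numeric, 2) AS mean_ms
        FROM pg_stat_statements
        WHERE query ILIKE '%druid_%'
        ORDER BY mean_exec_time DESC
        LIMIT 20
    """)
    for query, calls, mean_ms in cur.fetchall():
        print(f"{mean_ms:>10} ms avg  x{calls:<8} {query[:120]}")
conn.close()
```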
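And to separate Druid overhead from raw object-store throughput on the segment pushes (the 274 MB segment above took ~47s), I could time a baseline upload of a similarly sized file from the same network, outside Druid. This sketch uses boto3 with its default multipart settings; the endpoint, bucket, and file path are placeholders, and it doesn't claim to match whatever upload path Druid actually uses:

```python
# Sketch: time a multipart upload of a ~270 MB file to the same bucket, to
# compare raw object-store throughput against Druid's ~47s push time.
# Endpoint, bucket, and file path are placeholders.
import time

import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3", endpoint_url="https://object-store.example")
cfg = TransferConfig(multipart_threshold=8 * 1024 * 1024,  # boto3 defaults
                     multipart_chunksize=8 * 1024 * 1024)

start = time.monotonic()
s3.upload_file("/tmp/segment_sample.bin", "druid-deep-storage",
               "baseline/segment_sample.bin", Config=cfg)
print(f"uploaded in {time.monotonic() - start:.1f}s")
```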
