pchang388 commented on issue #12701:
URL: https://github.com/apache/druid/issues/12701#issuecomment-1196012087

   Hi @abhishekagarwal87,
   
   I've been busy the last few days but will post another update tomorrow. 
Thank you for that information! We also suspected the metadata DB was the 
issue, but it was not apparent from the regular metrics (read/write latency, 
etc.). As I mentioned before, those were a little high, which led us to look 
at alternatives, but I did not think they were out of bounds.
   
   The metric you provided, along with your evaluation of the flame graph, was 
essential in confirming the metadata DB as the likely culprit. We are in the 
process of migrating the metadata DB from Yugabyte to open-source Postgres HA 
(a PgPool implementation). We have already done this in our dev and stg 
environments and are seeing a much better result for that metric:
   
   ```sql
   -- dev
   -- new postgresql-ha instance
   SELECT AVG("value")
   FROM druid_metrics
   WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '12' HOUR
   AND metric = 'task/action/run/time'
   AND taskType = 'index_kafka'
   AND "value" != 0
   
   -- 55.15453915453915 = AVG
   
   -- old yugabyte instance
   SELECT AVG("value")
   FROM druid_metrics
   WHERE __time BETWEEN '2022-07-18T00:00:00' AND '2022-07-19T00:00:00'
   AND metric = 'task/action/run/time'
   AND taskType = 'index_kafka'
   AND "value" != 0
   
   -- 10968.234235101569 = AVG
   -- 2855.919290736778 = AVG
   -- 1996.148148148148 = AVG
   
   
   -- stg
   -- new postgresql-ha instance
   SELECT AVG("value")
   FROM druid_metrics
   WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR
   AND metric = 'task/action/run/time'
   AND taskType = 'index_kafka'
   AND "value" != 0
   
   -- 152.95 = AVG
   
   -- old yugabyte instance
   -- SELECT AVG("value")
   -- FROM druid_metrics
   -- WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '12' DAY
   -- AND metric = 'task/action/run/time'
   -- AND taskType = 'index_kafka'
   -- AND "value" != 0
   
   -- 27759.859503034062 = AVG
   ```
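   
   As a side note for anyone running the same comparison: AVG can be skewed by 
a handful of very slow task actions, so a tail percentile is a useful 
complement. A hedged sketch of such a query (assuming the 
`druid-datasketches` extension is loaded, which provides `APPROX_QUANTILE_DS`; 
column names follow the queries above):
   
   ```sql
   -- Hypothetical follow-up query: p95 of task action run time
   -- over the last 12 hours, for the same metric and task type.
   -- APPROX_QUANTILE_DS requires the druid-datasketches extension.
   SELECT APPROX_QUANTILE_DS("value", 0.95) AS p95_run_time_ms
   FROM druid_metrics
   WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '12' HOUR
   AND metric = 'task/action/run/time'
   AND taskType = 'index_kafka'
   AND "value" != 0
   ```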
   
   I will do the prod migration and let you know if we see the same 
improvement there. I also have a theory about why the Yugabyte/Overlord 
interactions were causing locks and long wait times for DB operations; I will 
post that in my follow-up tomorrow so that others with similar issues can 
reference it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
