pchang388 commented on issue #12701:
URL: https://github.com/apache/druid/issues/12701#issuecomment-1196012087
Hi @abhishekagarwal87,
I've been busy the last few days but will provide another update tomorrow.
And thank you for that information! We also suspected the metadata DB was the
issue, but it was not apparent from regular metrics like read/write latency.
Yes, those were a little high, as I mentioned before, which led us to seek
alternatives, but I did not think they were out of bounds.
The metric you provided, along with your evaluation of the flame graph, was
essential in confirming the metadata DB as the likely issue. We are in the
process of changing the metadata DB from Yugabyte to open-source PostgreSQL HA
(the PgPool implementation). We have already done so in our dev and stg
environments and are seeing a large improvement in that metric:
```sql
-- ## dev
-- new postgresql-ha instance
SELECT AVG("value")
FROM druid_metrics
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '12' HOUR
AND metric = 'task/action/run/time'
AND taskType = 'index_kafka'
AND "value" != 0
-- 55.15453915453915 = AVG

-- old yugabyte instance
SELECT AVG("value")
FROM druid_metrics
WHERE __time BETWEEN '2022-07-18T00:00:00' AND '2022-07-19T00:00:00'
AND metric = 'task/action/run/time'
AND taskType = 'index_kafka'
AND "value" != 0
-- 10968.234235101569 = AVG
-- 2855.919290736778 = AVG
-- 1996.148148148148 = AVG

-- ## stg
-- new postgresql-ha instance
SELECT AVG("value")
FROM druid_metrics
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR
AND metric = 'task/action/run/time'
AND taskType = 'index_kafka'
AND "value" != 0
-- 152.95 = AVG

-- old yugabyte instance
-- SELECT AVG("value")
-- FROM druid_metrics
-- WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '12' DAY
-- AND metric = 'task/action/run/time'
-- AND taskType = 'index_kafka'
-- AND "value" != 0
-- 27759.859503034062 = AVG
```
I will do the prod migration and let you know whether we see the same
improvement there. I also have an idea of why the Yugabyte/Overlord
interactions were causing locks or long wait times for DB operations; I will
post that in my follow-up tomorrow as well, for reference by anyone who runs
into similar issues.
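As a side note, for the prod comparison it may be easier to look at the metric bucketed by hour rather than a single overall average, so the before/after change at the cutover point is visible in one result set. A minimal sketch in Druid SQL, assuming the same `druid_metrics` table and metric names as the queries above (the 2-day lookback is a placeholder to adjust around the actual migration time):

```sql
-- Hourly average of task/action/run/time around the cutover window.
SELECT
  TIME_FLOOR(__time, 'PT1H') AS hour_bucket,
  AVG("value") AS avg_run_time_ms
FROM druid_metrics
WHERE metric = 'task/action/run/time'
  AND taskType = 'index_kafka'
  AND "value" != 0
  AND __time >= CURRENT_TIMESTAMP - INTERVAL '2' DAY
GROUP BY 1
ORDER BY 1
```

The drop in `avg_run_time_ms` should line up with the hour the metadata DB was switched over.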
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]