pchang388 commented on issue #12701:
URL: https://github.com/apache/druid/issues/12701#issuecomment-1198705979
So I did migrate to PG-HA in prod, and the results are shown below in this comment. But first, I want to say thank you @abhishekagarwal87 and @AmatyaAvadhanula — I really appreciate you both helping me with this and for having a lot of patience!

After the migration from Yugabyte to PG-HA, ingestion tasks are stable and there is no more Overlord slowness or degraded behavior. What made this difficult to identify from our side was that all the metrics for the Yugabyte DB (latency, operation metrics, etc.) pointed at somewhat high latency for reads and deletes, but not high enough, in my opinion, to explain the task failures and the degree of Overlord slowness. Also, the number of errors for PSQL-related queries/actions in our Overlord logs was only around 1-5 in a 7-day period (with info/debug level logging, although it's possible I missed some with my search terms).
Here are the differences for the metric:
| Metric | Description | Dimensions | Normal value |
| --- | --- | --- | --- |
| `task/action/run/time` | Milliseconds taken to execute a task action. | dataSource, taskId, taskType | Varies from subsecond to a few seconds, based on action type. |
```sql
-- prod environment

-- new postgresql-ha instance
SELECT AVG("value")
FROM druid_metrics
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '12' HOUR
  AND metric = 'task/action/run/time'
  AND "value" != 0
  AND service = 'druid/overlord'
  AND taskType = 'index_kafka'
-- 229.71428571428572 = AVG

-- old yugabyte instance
SELECT AVG("value")
FROM druid_metrics
-- WHERE __time BETWEEN '2022-07-00T00:00:00' AND '2022-07-15T00:00:00'
WHERE metric = 'task/action/run/time'
  AND "value" != 0
  AND service = 'druid/overlord'
  AND taskType = 'index_kafka'
-- 212280.36231884058 = AVG
```
So the issue was the metadata DB (in our case, Yugabyte), as @abhishekagarwal87 pointed out. The symptoms of this type of issue include:
1. Peon tasks failing because they do not respond to pause requests, even with a low/moderate number of tasks running.
2. When you ramp up the number of tasks, the Overlord responds slowly to UI features and starts to throw many different errors, including:
* "errorMsg": "Task [index_kafka_REDACT_b4b8fdbe7d46f26_mbljmdld] failed to
return status, killing task"
* "errorMsg": "Task [index_kafka_REDACT_ff20e3161a9445e_bkjimalf] failed to
stop in a timely manner, killing task"
* "errorMsg": "The worker that this task is assigned did not start it in
timeout[PT10M]. See overlord logs for more..."
* "errorMsg": "Task [index_kafka_REDACT_941fd57f52aebbb_gbbmjhmp] returned
empty offsets after pause"
3. The `task/action/run/time` metric for the Druid Overlord being unreasonably high (it appears it should be in the hundreds of milliseconds or less).

If you see these symptoms, consider testing a different metadata DB to see whether that relieves the issue.
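A quick way to gauge how often the errors in symptom 2 occur is to search the Overlord logs for those messages. This is only a generic sketch — the log path and exact message wording vary by deployment, so a sample log file is created here just to demonstrate the pattern:

```shell
# Build a small sample log for demonstration; in practice, point grep
# at your real Overlord log file (path depends on your deployment).
printf '%s\n' \
  '{"errorMsg": "Task [index_kafka_example] failed to return status, killing task"}' \
  '{"errorMsg": "Task [index_kafka_example] failed to stop in a timely manner, killing task"}' \
  > /tmp/overlord-sample.log

# Count occurrences of the symptom messages
grep -cE 'failed to return status|failed to stop in a timely manner|returned empty offsets after pause' \
  /tmp/overlord-sample.log
```

In our case this count was near zero even while tasks were failing, which is part of why the metadata DB was hard to implicate.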
Now the question becomes: why did Yugabyte perform this way without showing noticeably bad latency in its own metrics? I am not a DB person/expert, and what follows is speculation based on my assumptions.

I was curious and ran benchmark tests using pgbench and ysql_bench (Yugabyte's version of pgbench). What I noticed was that our Yugabyte instance reported considerably lower latency but had a high percentage of errors when handling transactions.
```
## PostgreSQL-HA
number of clients: 10
number of threads: 1
number of transactions per client: 50
number of transactions actually processed: 500/500
latency average = 287.625 ms
initial connection time = 2393.702 ms
tps = 34.767502 (without initial connection time)
## Yugabyte
number of clients: 10
number of threads: 1
batch size: 1024
number of transactions per client: 50
number of transactions actually processed: 359/500
number of errors: 141 (28.200%)
maximum number of tries: 1
latency average = 26.926 ms (including errors)
tps = 266.659221 (including connections establishing)
tps = 275.665696 (excluding connections establishing)
```
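For reference, the numbers above came from pgbench's default TPC-B-like script. An invocation matching those parameters might look like the following (host, user, and database names are placeholders, not our actual setup); ysql_bench accepts largely the same flags against the YSQL endpoint:

```shell
# Initialize the pgbench tables once per target database
pgbench -i -h pg-ha.example.internal -U druid bench_db

# 10 clients, 1 thread, 50 transactions per client (500 total),
# matching the run shown above
pgbench -c 10 -j 1 -t 50 -h pg-ha.example.internal -U druid bench_db
```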
These errors include things like:
```
client 5 continues a failed transaction in command 8 (SQL) of script 0;
ERROR: current transaction is aborted, commands ignored until end of
transaction block
client 5 continues a failed transaction in command 9 (SQL) of script 0;
ERROR: current transaction is aborted, commands ignored until end of
transaction block
client 5 ends the failed transaction (try 1/1)
client 6 got a failure in command 8 (SQL) of script 0; ERROR: Operation
failed. Try again: Value write after transaction start: { physical:
1658850089065181 } >= { physical: 1658850089061610 }: kConflict
client 9 got a failure in command 10 (SQL) of script 0; ERROR: Operation
expired: Transaction c25d12c8-b7b8-4f73-a89f-f0593cf7766c expired or aborted by
a conflict: 40001
client 9 ends the failed transaction (try 1/1)
client 6 continues a failed transaction in command 9 (SQL) of script 0;
ERROR: current transaction is aborted, commands ignored until end of
transaction block
client 6 ends the failed transaction (try 1/1)
client 2 got a failure in command 8 (SQL) of script 0; ERROR: Operation
failed. Try again: Value write after transaction start: { physical:
1658850089071381 } >= { physical: 1658850089067433 }: kConflict
```
As stated, I'm not a DB expert, but these are transactions that fail and require a retry, and according to the Yugabyte docs:
"""
_Yugabyte DB by default runs in snapshot isolation level, preventing phantom
and non-repeatable reads. In an environment where multiple transactions
conflict with each other, it's possible the data has changed since it was the
last read and before committing in a read/write transaction. This causes the
current transaction to read the stale data or read data from the past. In such
cases, the read operation needs to be restarted at a higher hybrid time.
Snapshot isolation here uses optimistic locking, allowing concurrent
transactions to proceed but may need to fail a transaction leading to a
rollback if two transactions attempt to modify the same data at the same time._
…
_[Optimistic
locking](https://en.wikipedia.org/wiki/Optimistic_concurrency_control) delays
the checking of whether a transaction meets the isolation and other integrity
rules until its end, without blocking any of the operations performed as a part
of the transaction. In scenarios where there are two concurrent transactions
that conflict with each other (meaning a commit of the changes made by both
these transactions would violate integrity constraints), one of these
transactions is aborted. An aborted transaction could immediately be restarted
and re-executed, or surfaced as an error to the end user._
_In scenarios where only a few transactions conflict with each other,
optimistic concurrency control is a good strategy. This is generally the case
in high-volume systems. For example, most web applications have short-lived
connections to the database._
_YugabyteDB opts for optimistic concurrency in the case of simple
transactions. This is achieved by assigning a random priority to each of the
transactions. In the case of a conflict, the transaction with a lower priority
is aborted. Some transactions that get aborted due to a conflict are internally
retried while others result in an error to the end application._
"""
Based on that information, it appears the Overlord threads may hang on DB operations because:
1. Yugabyte raises a specific error which it says needs to be retried, but it's possible the Overlord does not know how to handle it.
2. Yugabyte handles the retry internally but never actually completes it, or hits an internal error during the retry, fails silently, and never responds to the Overlord.
3. Yugabyte states that its optimistic locking strategy works well for applications whose transactions rarely conflict. It's possible the Overlord issues different queries against the same data from its different threads, which would make Yugabyte a bad fit for Druid.
It's possible that Yugabyte can be configured to match the needs of Druid, but since I am not a DB expert, I opted for something more established and simple. Out of the box, Yugabyte did not work well with Druid and caused the issues we'd been seeing. I hope this helps future users who may be experiencing the same issue to resolve their problems.