pchang388 commented on issue #12701:
URL: https://github.com/apache/druid/issues/12701#issuecomment-1198705979

   So I did migrate to PG-HA in prod, and the results are shown below in this comment. But first, I want to say thank you @abhishekagarwal87 and @AmatyaAvadhanula. I really do appreciate you both helping me with this and having a lot of patience!
   
   After the migration from Yugabyte to PG-HA, ingestion tasks are stable and there is no more Overlord slowness/degraded behavior. What made this difficult to identify from our side was that all the Yugabyte DB metrics (latency, operation metrics, etc.) showed somewhat high latency for reads and deletes, but not enough, in my opinion, to explain the task failures or the degree of Overlord slowness. Also, our Overlord logs contained only ~1-5 errors for PSQL-related queries/actions over a 7-day period (with info/debug level logging, although it's possible I missed some with my search terms).
   
   Here are the differences for the metric:

   | Metric | Description | Dimensions | Normal value |
   | -- | -- | -- | -- |
   | `task/action/run/time` | Milliseconds taken to execute a task action. | dataSource, taskId, taskType | Varies from subsecond to a few seconds, based on action type. |
   
   ```sql
   -- prod environment
   -- new postgresql-ha instance
   SELECT AVG("value")
   FROM druid_metrics
   WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '12' HOUR
   AND metric = 'task/action/run/time'
   AND "value" != 0
   AND service = 'druid/overlord'
   AND taskType = 'index_kafka'

   -- 229.71428571428572 = AVG

   -- old yugabyte instance
   SELECT AVG("value")
   FROM druid_metrics
   -- WHERE __time BETWEEN '2022-07-00T00:00:00' AND '2022-07-15T00:00:00'
   WHERE metric = 'task/action/run/time'
   AND "value" != 0
   AND service = 'druid/overlord'
   AND taskType = 'index_kafka'

   -- 212280.36231884058 = AVG
   ```
   
   So the issue was the metadata DB (in our case Yugabyte), as @abhishekagarwal87 pointed out. The symptoms of this type of issue include:
   
   1. Peon tasks failing with "not responding to pause" even when you have a low/moderate number of tasks running.
   2. When you ramp up the number of tasks, the Overlord responds slowly to UI requests and starts to throw many different errors, including:
   * "errorMsg": "Task [index_kafka_REDACT_b4b8fdbe7d46f26_mbljmdld] failed to 
return status, killing task"
   * "errorMsg": "Task [index_kafka_REDACT_ff20e3161a9445e_bkjimalf] failed to 
stop in a timely manner, killing task"
   * "errorMsg": "The worker that this task is assigned did not start it in 
timeout[PT10M]. See overlord logs for more..."
   * "errorMsg": "Task [index_kafka_REDACT_941fd57f52aebbb_gbbmjhmp] returned 
empty offsets after pause"
   3. The `task/action/run/time` metric for the Druid Overlord being unreasonably high (it appears it should be in the hundreds of milliseconds or less).
   
   If you see these symptoms, you should consider testing a different metadata DB to see whether that resolves the issue.
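
   As a rough illustration of symptom 3 (the 5000 ms threshold below is my own assumption, not an official limit; the docs only say "subsecond to a few seconds"), a quick sanity check over emitted metric samples might look like:

```python
# Hypothetical health check for task/action/run/time samples, mirroring
# the SQL above (zero values excluded). The 5000 ms cutoff is my own
# assumption based on the "subsecond to a few seconds" guidance.

def action_time_looks_healthy(values_ms, threshold_ms=5000):
    """Return True if the average task action run time (ms) is plausible."""
    samples = [v for v in values_ms if v != 0]  # same filter as "value" != 0
    if not samples:
        return True  # no data, nothing to flag
    return sum(samples) / len(samples) <= threshold_ms

# The two averages observed above:
print(action_time_looks_healthy([229.7]))     # PG-HA -> True
print(action_time_looks_healthy([212280.4]))  # Yugabyte -> False
```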
   
   Now the question becomes: why did Yugabyte perform this way without showing noticeably bad latency in its metrics? I am not a DB person/expert, and what follows is all speculation based on my assumptions.
   
   I was curious, so I ran benchmark tests using pgbench and ysql_bench (Yugabyte's version of pgbench). What I noticed was that our Yugabyte instance was considerably faster but had a high percentage of errors when handling transactions.
   
   ```
   ## PostgreSQL-HA
   number of clients: 10
   number of threads: 1
   number of transactions per client: 50
   number of transactions actually processed: 500/500
   latency average = 287.625 ms
   initial connection time = 2393.702 ms
   tps = 34.767502 (without initial connection time)
   
   ## Yugabyte
   number of clients: 10
   number of threads: 1
   batch size: 1024
   number of transactions per client: 50
   number of transactions actually processed: 359/500
   number of errors: 141 (28.200%)
   maximum number of tries: 1
   latency average = 26.926 ms (including errors)
   tps = 266.659221 (including connections establishing)
   tps = 275.665696 (excluding connections establishing)
   ```
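
   To put those two runs side by side, here is a small sketch (numbers taken straight from the output above) of one way to discount the raw tps figure by the fraction of transactions that actually succeeded. The headline tps looks great, but 28.2% of transactions never completed:

```python
# Compare the two benchmark runs above. "Effective" tps here is my own
# framing: reported tps scaled by the fraction of attempted transactions
# that were actually processed.

def effective_tps(tps, processed, attempted):
    return tps * (processed / attempted)

pg_tps = effective_tps(34.767502, 500, 500)     # PG-HA: 500/500 processed
yb_tps = effective_tps(275.665696, 359, 500)    # Yugabyte: 359/500 processed

error_pct = 141 / 500 * 100                     # 141 errors out of 500

print(f"PG-HA effective tps:    {pg_tps:.1f}")
print(f"Yugabyte effective tps: {yb_tps:.1f}")
print(f"Yugabyte error rate:    {error_pct:.1f}%")
```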
   
   These errors include things like:
   ```
   client 5 continues a failed transaction in command 8 (SQL) of script 0; 
ERROR:  current transaction is aborted, commands ignored until end of 
transaction block
   
   client 5 continues a failed transaction in command 9 (SQL) of script 0; 
ERROR:  current transaction is aborted, commands ignored until end of 
transaction block
   
   client 5 ends the failed transaction (try 1/1)
   client 6 got a failure in command 8 (SQL) of script 0; ERROR:  Operation 
failed. Try again: Value write after transaction start: { physical: 
1658850089065181 } >= { physical: 1658850089061610 }: kConflict
   
   client 9 got a failure in command 10 (SQL) of script 0; ERROR:  Operation 
expired: Transaction c25d12c8-b7b8-4f73-a89f-f0593cf7766c expired or aborted by 
a conflict: 40001
   
   client 9 ends the failed transaction (try 1/1)
   client 6 continues a failed transaction in command 9 (SQL) of script 0; 
ERROR:  current transaction is aborted, commands ignored until end of 
transaction block
   
   client 6 ends the failed transaction (try 1/1)
   client 2 got a failure in command 8 (SQL) of script 0; ERROR:  Operation 
failed. Try again: Value write after transaction start: { physical: 
1658850089071381 } >= { physical: 1658850089067433 }: kConflict
   ```
   
   As stated, I'm not a DB guy, but these are transactions that fail and require a retry, and according to the Yugabyte docs:
   """
   _Yugabyte DB by default runs in snapshot isolation level, preventing phantom 
and non-repeatable reads. In an environment where multiple transactions 
conflict with each other, it's possible the data has changed since it was the 
last read and before committing in a read/write transaction. This causes the 
current transaction to read the stale data or read data from the past. In such 
cases, the read operation needs to be restarted at a higher hybrid time.
   Snapshot isolation here uses optimistic locking, allowing concurrent 
transactions to proceed but may need to fail a transaction leading to a 
rollback if two transactions attempt to modify the same data at the same time._
   …
   _[Optimistic 
locking](https://en.wikipedia.org/wiki/Optimistic_concurrency_control) delays 
the checking of whether a transaction meets the isolation and other integrity 
rules until its end, without blocking any of the operations performed as a part 
of the transaction. In scenarios where there are two concurrent transactions 
that conflict with each other (meaning a commit of the changes made by both 
these transactions would violate integrity constraints), one of these 
transactions is aborted. An aborted transaction could immediately be restarted 
and re-executed, or surfaced as an error to the end user._
   
   _In scenarios where only a few transactions conflict with each other, 
optimistic concurrency control is a good strategy. This is generally the case 
in high-volume systems. For example, most web applications have short-lived 
connections to the database._
   
   _YugabyteDB opts for optimistic concurrency in the case of simple 
transactions. This is achieved by assigning a random priority to each of the 
transactions. In the case of a conflict, the transaction with a lower priority 
is aborted. Some transactions that get aborted due to a conflict are internally 
retried while others result in an error to the end application._
   """
   
   Based on that information, it appears the Overlord threads may hang on DB operations due to:
   1. A Yugabyte-specific error that, per the docs, needs to be retried, but which the Overlord possibly does not know how to handle.
   2. Yugabyte handling the retry internally but never actually completing it, or hitting an internal error during the retry, silently failing, and never responding to the Overlord.
   3. Yugabyte stating that its optimistic locking strategy is a good fit only for applications whose transactions rarely conflict. It's possible the Overlord issues different queries on the same data from its different threads, which would make Yugabyte a bad fit for Druid.
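
   If point 1 is the cause, the application-side fix would be a bounded retry when the database reports a serialization conflict (SQLSTATE 40001, the code visible in the errors above). A rough sketch, using a stand-in `txn` callable rather than any real Druid or JDBC API:

```python
import time

# SQLSTATE reported by the conflict errors above.
SERIALIZATION_FAILURE = "40001"

class DbError(Exception):
    """Stand-in for a driver exception carrying a SQLSTATE code."""
    def __init__(self, sqlstate, msg=""):
        super().__init__(msg)
        self.sqlstate = sqlstate

def run_with_retry(txn, max_attempts=3, backoff_s=0.0):
    """Run `txn` (a callable performing one transaction), retrying only
    on serialization conflicts, up to `max_attempts` times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return txn()
        except DbError as e:
            if e.sqlstate != SERIALIZATION_FAILURE or attempt == max_attempts:
                raise  # non-retryable error, or retries exhausted
            time.sleep(backoff_s * attempt)  # simple linear backoff

# Example: a transaction that conflicts twice, then succeeds.
attempts = []
def flaky_txn():
    attempts.append(1)
    if len(attempts) < 3:
        raise DbError(SERIALIZATION_FAILURE, "aborted by a conflict")
    return "committed"

print(run_with_retry(flaky_txn))  # prints "committed" on the third attempt
```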
   
   It's possible that Yugabyte can be configured to match the needs of Druid, but since I am not a DB guy, I opted for something more established and simple. Out of the box, though, Yugabyte did not work well with Druid and caused the issues we'd been seeing. I hope this helps future users who may be experiencing the same issue and helps them resolve their problems.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

