Michael Caplan via discuss <[email protected]> writes:
> I'm trying to troubleshoot a replication performance issue between two
> versions of MariaDB.
> Replica A (and primary) is running MariaDB 10.3.31. Replica B is running
> 10.3.39.
> * B is not feeding off of the primary, but off of A -- that is, until we sunset A.
> * A is handling moderate read-only traffic
> * B is just running as a replica with no other responsibilities
>
> I have external monitoring to alert when either hits 500+
> Seconds_Behind_Master. Since standing B up, I am seeing it fall behind a few
> times a day. The interesting thing is that A is not falling behind the
> primary.
> Upon further examination, a scheduled job that executes hundreds of similar
> delete operations off the same table (that has approx. 3 million records) is
> where B is largely getting hung up. Something like this:
>
> DELETE FROM table WHERE x = 800000 AND y = 40000
>
> The table has an index on just x.
I assume that the table is using InnoDB. What is the approximate size of the
table on disk, and what is the size of the InnoDB buffer pool?
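For example, something along these lines (with the real schema and table names
substituted for the 'your_db' and 'table' placeholders) will show both:

  SELECT table_name,
         ROUND((data_length + index_length) / 1024 / 1024) AS total_mb
    FROM information_schema.tables
   WHERE table_schema = 'your_db' AND table_name = 'table';

  SHOW GLOBAL VARIABLES LIKE 'innodb_buffer_pool_size';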
> Running ANALYZE FORMAT=JSON SELECT * FROM table WHERE x = 800000 AND y =
> 40000 on each replica shows a striking difference in total time spent:
>
> REPLICA B:
> {
>   "query_block": {
>     ...
>     "key": "table_index",
>     "rows": 795555,
>     "r_rows": 31,
>     "r_total_time_ms": 490.08,
>     ...
>
> REPLICA A:
> {
>   "query_block": {
>     ...
>     "key": "table_index",
>     "rows": 1146482,
>     "r_rows": 31,
>     "r_total_time_ms": 106.34,
>     ...
So the queries execute the same on either server (same query plan).
IIUC, the query needs to examine a large number of rows, ~1M, to find the
few matching rows, 31 here.
> As a short term "fix", I added a secondary index: create index table_index2
> on table(x, y); This significantly speeds up this operation and, for now,
> seems to deal with much of the replication lag experienced on B:
> "query_block": {
> "key": "table_index2,
> "rows": 31,
> "r_rows": 31,
> "r_total_time_ms": 0.2037,
Here, the query only needs to touch the few matching rows, 31 again. But the
query is still taking more than 40% of the time of the query that examined
almost a million rows.
This all suggests that the bottleneck is disk I/O. It would be surprising to
see a query spend 0.2 *seconds* to read 31 rows through an index entirely
from memory. Unless there are some huge blobs being selected in each row?
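(If in doubt, a quick way to check for large blob/text columns is something
like the query below against information_schema; 'your_db' and 'table' are
again placeholders for the real names:

  SELECT column_name, data_type
    FROM information_schema.columns
   WHERE table_schema = 'your_db'
     AND table_name = 'table'
     AND data_type IN ('blob', 'mediumblob', 'longblob',
                       'text', 'mediumtext', 'longtext');
)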
What happens if the query is re-run immediately afterwards? Does the fast
query (using table_index2) then complete much faster? That would be expected
if the first run is bottlenecked on I/O, since the second run would run
completely from cache.
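For example, on B, run the same ANALYZE twice back-to-back and compare the
r_total_time_ms of the two runs:

  ANALYZE FORMAT=JSON SELECT * FROM table WHERE x = 800000 AND y = 40000;
  ANALYZE FORMAT=JSON SELECT * FROM table WHERE x = 800000 AND y = 40000;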
If I/O is the bottleneck, due to "cold" data not in the buffer pool, that
could explain why B is lagging behind while A is not. Since A is running
read traffic, this traffic may be keeping the table in the buffer pool,
allowing the queries to run a little faster, whereas on B the replication lag
will spike while the table is being read into the buffer pool.
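One way to get an indication of this is to watch the InnoDB buffer pool miss
counters on B while the delete job is running, for example:

  SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%';

If Innodb_buffer_pool_reads (pages that had to be read from disk) climbs
quickly relative to Innodb_buffer_pool_read_requests while the lag builds up,
that points to cold data being pulled in from disk.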
This is rather speculative, of course; it is hard to determine the root
performance bottleneck from only this kind of data. If the table really has
only 3 million rows, and 1 million of those are scanned in 0.5 seconds by a
single query, it's a bit hard to imagine that I/O for that table alone could
cause 500+ seconds of lag. Still, with a 31-row index scan taking 0.2
seconds, I would start by examining I/O as the likely bottleneck.
Optimistic parallel replication might be an effective way to eliminate the
lag, as it would allow these queries to run in parallel on slave B. If the
bottleneck is I/O, this could help even if there are many conflicts between
the parallel queries, as it will populate the buffer pool faster.
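Something like this on B would enable it (the SQL thread must be stopped
while changing these settings, and the thread count here is just a starting
point to experiment with):

  STOP SLAVE SQL_THREAD;
  SET GLOBAL slave_parallel_mode = 'optimistic';
  SET GLOBAL slave_parallel_threads = 8;
  START SLAVE SQL_THREAD;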
Hope this helps,
- Kristian.