Hi Kyle!

Galera operates in such a way that every transaction is replicated to every
node in the cluster and committed into the storage engine before success is
returned to the client. This means that when the client receives the
acknowledgement, there is at least one node (the one on which the transaction
was originally executed) which has persisted the transaction. So
theoretically, in the scenario you described, at least one of the nodes should
be able to recover all the acknowledged transactions when the cluster is
restarted after a full crash. The fact that an alternate timeline appears after
the cluster restart suggests that not all of the committed transactions were
recovered by the storage engine on restart.

The jepsen-galera.cnf has the following InnoDB settings:

# Performance related settings
innodb_autoinc_lock_mode = 2
innodb_flush_log_at_trx_commit = 0

The documentation at
https://mariadb.com/docs/server/server-usage/storage-engines/innodb/innodb-system-variables#innodb_flush_log_at_trx_commit
states for value 0: "Nothing is done on commit; rather the log buffer is
written and flushed to the InnoDB redo log once a second. This gives better 
performance, but a server crash can erase the last second of transactions."

I suspect that this is the root cause of the data loss which produces the
alternate timeline. I ran the test with `innodb_flush_log_at_trx_commit = 1`
(full durability) a few times and didn't observe any test failures.
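
As a rough sketch, the durable variant of that section would look something
like the following. The `[mysqld]` group header is an assumption on my part,
since only the two settings are quoted above; adjust it to whatever group
jepsen-galera.cnf actually uses.

```ini
# Assumed option group; use the group already present in jepsen-galera.cnf.
[mysqld]
# Unchanged from the current configuration.
innodb_autoinc_lock_mode = 2
# 1 = write and flush the InnoDB redo log at every commit (full durability),
# instead of roughly once a second as with 0.
innodb_flush_log_at_trx_commit = 1
```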

Another thing which caught my attention was having binlogs enabled (log-bin in 
resources/my.cnf) but log_slave_updates 
(https://mariadb.com/docs/server/ha-and-performance/standard-replication/replication-and-binary-log-system-variables#log_slave_updates)
 not enabled. This again may be a cause for some already acknowledged 
transactions to be lost during crash recovery. Also, if binlogs are enabled, 
the safest setting is to have `sync_binlog=1` 
(https://mariadb.com/docs/server/ha-and-performance/standard-replication/replication-and-binary-log-system-variables#sync_binlog).
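
For illustration, the binlog-related settings I have in mind would be roughly
as follows. Again, the `[mysqld]` group header is an assumption; `log-bin` is
the option already present in resources/my.cnf, and the other two lines are
the suggested additions.

```ini
# Assumed option group; use the group already present in resources/my.cnf.
[mysqld]
# Already enabled in resources/my.cnf.
log-bin
# Also write transactions applied through replication to this node's binlog.
log_slave_updates = ON
# Sync the binary log to disk at every commit.
sync_binlog = 1
```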

- Teemu

Kyle Kingsbury wrote:
> Dear MariaDB & Galera folks,
> 
> I've been trying out MariaDB with Galera Cluster recently, and I keep 
> seeing what looks like write loss when nodes crash and restart. I was 
> wondering if anyone from the MariaDB or Galera teams might be interested 
> in taking a look at my cluster configuration and some logs, and helping 
> figure out what's going on? I've spent a lot of time reading the docs 
> and trying to set things up correctly, but it's definitely possible I'm 
> just Holding The Database Wrong (TM)!
> 
> My workload performs randomly generated transactions which consist of a 
> series of read or append operations. Each operation reads or updates a 
> single row by primary key. Each row contains a TEXT field with a list of 
> comma-separated integers. The only writes in this workload append a 
> unique integer to one of these lists. In a Snapshot Isolated or 
> Repeatable Read system like MariaDB, all versions of a single row's 
> value should be prefixes of the longest such value.
> 
> This is true in single-node MariaDB, but is not true with Galera 
> replication. Instead, it appears that the effects of a few dozen 
> committed writes can be lost, then replaced by what appears to be an 
> "alternate timeline" of different writes. The attached image shows a
> series of reads of key 112. Time flows top to bottom, and the list of
> integers is shown after `txn`. The timeline ending in `...53, 56, 57,
> 58, 71` is destroyed around 50 seconds into the test, and replaced by 
> `... 158, 159, ...`. This coincided with a crash and restart of all 
> three nodes in the cluster.
> 
> This happens with MariaDB 12.1.2 and Galera 26.4.13 on Debian 12, using 
> the official MariaDB repositories for both MariaDB and Galera. The test 
> suite to reproduce this is at https://github.com/jepsen-io/mysql; use 
> commit 3500f8c80bd0f419d7f21a7b89eaf65f8651a7af, and try something like:
> 
> lein run test-all --nodes n1,n2,n3 -w append --concurrency 6n --nemesis 
> kill --time-limit 300 --test-count 5 --isolation repeatable-read 
> --expected-consistency-model snapshot-isolation
> 
> For an example failing case, including config files and the 
> error/general logs on each node, see:
> 
> https://s3.amazonaws.com/jepsen.io/analyses/mariadb-galera-12.1.2/20260105T1...
> 
> If anyone has ideas about what might be going on here, I'd love to hear 
> from you. :-)
> 
> Cheers,
> 
> --Kyle