Hello,

In a 300-node cluster with 5 schedulers in the quorum, the replica log writes fail due to timeout (native_log_write_timeout: 3secs), especially when 50+ tasks are flapping. The next leader takes 2+ minutes to complete the log replay and become active. During that time the service is inaccessible to users and they get 503 errors, because Aurora isn't listening on the port yet. Why is the replay slow? No snapshot had been taken in the last few hours, because the crash happened within the configured snapshot interval (default: 1 hour).

We have bumped the log write timeout and, in parallel, are investigating the reason for the timeouts (whether it's due to bad hardware, etc.). In the meantime, we want to reduce the service disruption to users by bringing down the replay time. I'd like to know:

a) Is reducing the snapshot interval (dlog_snapshot_interval) to 30 mins the right thing to do?
b) Is a snapshot event I/O intensive?
c) It takes 0-6 seconds to snapshot 10k events since the last snapshot. Does the scheduler block user requests while a snapshot is in progress?

Thank you,

--
Regards,
Bhuvan Arumugam
www.livecipher.com
