Hello,

In a 300-node cluster with 5 schedulers in the quorum, replicated log
writes fail due to timeout (native_log_write_timeout: 3 secs),
especially when 50+ tasks are flapping. The next leader takes 2+
minutes to complete the log replay and become active. Until then the
service is inaccessible to users, as Aurora isn't yet listening on the
port, and users face 503 errors. Why? No snapshot was taken during the
last few hours, because each crash happened within the configured
snapshot interval (default: 1 hour).

We bumped the log write timeout and, in parallel, are investigating the
reason for the timeouts, e.g. whether it's due to bad hardware. In the
meantime, we want to reduce service disruption for users by bringing
down the replay time. I'd like to know:

a) Is reducing the snapshot interval (dlog_snapshot_interval) to 30 mins
the right thing to do?
b) Is taking a snapshot I/O intensive?
c) It takes 0-6 seconds to snapshot 10k events since the last snapshot.
Does the scheduler block user requests while a snapshot is in progress?
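
For (a), here is the rough back-of-envelope estimate behind the 30 min
figure. It assumes replay time scales linearly with the snapshot
interval, which is my assumption and not a measured property of Aurora;
the function name is purely illustrative.

```python
def worst_case_replay_secs(observed_replay_secs, observed_interval_mins,
                           new_interval_mins):
    """Estimate worst-case replay time for a new snapshot interval.

    Assumption (not verified): replay time grows linearly with the
    amount of log written since the last snapshot, and that amount
    grows linearly with the snapshot interval.
    """
    return observed_replay_secs * new_interval_mins / observed_interval_mins


# Observed: roughly 120 s of replay with the default 60 min interval.
estimate = worst_case_replay_secs(120, 60, 30)
print(f"estimated worst-case replay: {estimate:.0f} s")
```

If that assumption roughly holds, a 30 min interval would bring the
worst-case replay down from about 2 mins to about 1 min.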

Thank you,
-- 
Regards,
Bhuvan Arumugam
www.livecipher.com
