Hi Bhuvan,

We have never had to change native_log_write_timeout from its default value, but we have definitely seen scheduler failovers related to snapshotting. It is indeed an IO-intensive operation that can and will block all other activity, especially when it overlaps with backup creation. During snapshot creation an exclusive write lock is held, which makes all other mutation operations impossible. Reads may still be served, though.

I would suggest a more thorough investigation to make sure it was truly a native_log_write_timeout that caused your failover. Identifying the root cause is crucial here, as we have seen two major causes of failovers: excessive GC activity leading to ZK timeouts, and slow disk IO blocking writes in the underlying native log storage. Below are a few leads:

Excessive GC:
- consider using snapshot de-duplication [1] if you are not already using it. This has helped us significantly reduce GC activity and stored snapshot size.
- consider fine-tuning your GC performance. It's not an easy task, but there are plenty of online resources to help (e.g. [2]).

Excessive IO:
- consider changing your underlying system IO scheduler. By just switching from cfq to deadline we have virtually eliminated our failovers due to excessive IO. See AURORA-1211 for details.

Thanks,
Maxim

[1] - https://github.com/apache/aurora/blob/master/docs/scheduler-storage.md
[2] - http://www.cubrid.org/blog/dev-platform/how-to-tune-java-garbage-collection/

On Tue, Jun 2, 2015 at 9:33 AM, Bhuvan Arumugam <[email protected]> wrote:
> Hello,
>
> In a 300-node cluster with 5 schedulers in the quorum, the replica log
> writes fail due to timeout (native_log_write_timeout: 3secs)
> especially when 50+ tasks are flapping. The next leader takes around
> 2mins+ to complete the log replay and become active. The service is
> inaccessible to users, as aurora isn't yet listening on the port.
> Users face 503 errors. Why? The snapshot wasn't taken during the last few
> hours because the crash happened within the configured snapshot interval
> (default: 1 hour).
>
> We bumped the log write timeout and are in parallel investigating the
> reason for the timeout, whether it's due to bad hardware, etc. In the
> meantime, we want to reduce service disruption to the users by
> bringing down the replay time. I'd like to know,
>
> a) is reducing the snapshot interval (dlog_snapshot_interval) to 30 mins
> the right thing to do?
> b) is the snapshot event i/o intensive?
> c) it takes 0-6 seconds to snapshot 10k events, from the last snapshot.
> does the scheduler block user requests when a snapshot is in progress?
>
> Thank you,
> --
> Regards,
> Bhuvan Arumugam
> www.livecipher.com
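
P.S. On the cfq-to-deadline suggestion above: a minimal sketch for checking which IO scheduler a disk is currently using before changing anything. It assumes a Linux host that exposes /sys/block/<device>/queue/scheduler, where the active scheduler is the name shown in square brackets. The device name "sda" and both helper names are placeholders of mine, not anything that ships with Aurora.

```python
import re
from pathlib import Path


def parse_active_scheduler(text):
    """Return the scheduler name shown in [brackets], or None if none is marked."""
    match = re.search(r"\[([^\]]+)\]", text)
    return match.group(1) if match else None


def active_scheduler(device="sda", sysfs_root="/sys/block"):
    """Read the active IO scheduler for a block device.

    "sda" is only a placeholder; substitute the device that backs the
    scheduler's native log directory.
    """
    path = Path(sysfs_root) / device / "queue" / "scheduler"
    return parse_active_scheduler(path.read_text())


if __name__ == "__main__":
    # Sample sysfs content as it would look on a host still using cfq.
    print(parse_active_scheduler("noop deadline [cfq]"))
```

If this reports cfq, writing "deadline" into the same sysfs file as root switches the scheduler at runtime. That change does not survive a reboot, so it would also need to go into whatever boot-time configuration your distribution uses.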
