> On Apr 30, 2018, at 5:29 PM, Dave Cottlehuber <[email protected]> wrote:
>
> On Fri, 27 Apr 2018, at 19:50, David Alan Hjelle wrote:
>> Does anyone have recommendations for setting `net_ticktime` to a lower
>> value, such as 10, instead of the default 60? In particular, this would
>> be for a 3-node cluster on bare metal connected to a single switch.
>>
>> Background & why I’m asking:
>>
>> As far as I can tell, if a node fails in certain ways, the rest of the
>> cluster can take up to 60 seconds (the default `net_ticktime`) to be
>> aware of it—which can pause certain operations (such as a database
>> creation) depending on the size of the cluster and quorum settings, etc.
>> (For instance, in a 3-node cluster with a single failure, reads and
>> writes continue to work, but creating a database waits until the
>> `_membership` is up-to-date.)
>>
>> It looks like I can lower the Erlang `net_ticktime` setting to make this
>> happen more quickly. The Erlang docs indicate that one should be
>> cautious in changing this parameter, as it could lead to Couch thinking
>> there were partitions when there were none—so I’m curious if anyone has
>> any practical experience?
>>
>> Thanks!
>
> Hi David,
>
> I would avoid changing net_ticktime at all, with the same level of concern as
> a human would on hearing that a faster pacemaker might improve the
> reliability of their heart....
>
> Are you seeing node down failures? Are you creating/removing DBs with such
> frequency that this is more than a theoretical constraint?
>
> When you change the tick time you also need to consider that if a 1/4 tick
> mark is missed, the runtime will start queueing inter-node traffic until it
> decides whether the node is down or not.
>
> I'm interested to know if anybody else has ever tweaked this for couchdb. I
> know that it's been fiddled with for riak, also around managing scheduler
> collapse, but broadly I'd only fiddle with this if I were seeing real-world
> production problems.
>
> BTW https://www.rabbitmq.com/nettick.html has the nicest explanation, and
> you'd also want to consider http://erlang.org/doc/man/erl.html#+zdbbl as
> well; see
> http://erlang.org/doc/man/erlang.html#system_info_dist_buf_busy_limit
>
> A+
> Dave
Thank you so much for your thoughtful and detailed response—fantastic!

> I would avoid changing net_ticktime at all, with the same level of concern as
> a human would on hearing that a faster pacemaker might improve the
> reliability of their heart....

> Are you seeing node down failures? Are you creating/removing DBs with such
> frequency that this is more than a theoretical constraint?

Purely theoretical at this point. I’m working on upgrading our existing
Couch 1.x server to a 2.x cluster, and I have been running a lot of tests to
make sure I understand how things will behave in production. I was surprised
at the delay I described, traced it to `net_ticktime`, and found that tweaking
`net_ticktime` could change the timing—but even after reading up as much as I
could, I still wasn’t clear on the pros and cons. For our use case, I’m pretty
content to leave it as-is, but I’m always curious which knobs I can turn to
improve our current set-up. (For anyone wondering where these knobs live, see
the sketches below.)

> I'm interested to know if anybody else has ever tweaked this for couchdb.

As I mentioned, I’m definitely curious, too.

Thanks again!
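P.S. In case it helps anyone who finds this thread later: both knobs Dave
mentions are Erlang VM settings, which CouchDB reads from its `vm.args` file
at startup. A minimal sketch, assuming a stock CouchDB 2.x layout
(`etc/vm.args`); the values are illustrative, not a recommendation:

```
# etc/vm.args -- illustrative values only, not a recommendation

# Heartbeat knob: with a tick time of T seconds, each node sends a
# tick every T/4 seconds and declares a silent peer down after
# roughly 0.75 * T to 1.25 * T seconds (45-75 s at the default 60).
# Lowering T to 10 makes failure detection faster, but also makes
# false "node down" verdicts more likely on a busy network.
-kernel net_ticktime 10

# Distribution buffer busy limit (+zdbbl), in kilobytes.
# Raising it from the default 1024 gives inter-node traffic more
# headroom before the distribution port is marked busy and senders
# start blocking.
+zdbbl 32768
```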
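And since I’ve been poking at a live cluster anyway: the current values can
also be checked (and `net_ticktime` even changed) at runtime from a remote
shell. A sketch, assuming the stock `couchdb@localhost` node name; the node
name and cookie are whatever your own `vm.args` defines:

```erlang
%% Attach a remote shell first, e.g.:
%%   erl -name remsh@localhost -remsh couchdb@localhost -setcookie <cookie>

%% Current tick time in seconds (default 60):
net_kernel:get_net_ticktime().
%% => 60

%% Distribution buffer busy limit in bytes (+zdbbl default is 1024 KB):
erlang:system_info(dist_buf_busy_limit).
%% => 1048576

%% net_ticktime can be changed on a running node; the kernel uses a
%% transition period so all connected nodes converge on the new value:
net_kernel:set_net_ticktime(10).
%% => change_initiated
```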
