> On Apr 30, 2018, at 5:29 PM, Dave Cottlehuber <[email protected]> wrote:
> 
> On Fri, 27 Apr 2018, at 19:50, David Alan Hjelle wrote:
>> Does anyone have recommendations for setting `net_ticktime` to a lower 
>> value, such as 10, instead of the default 60? In particular, this would 
>> be for a 3-node cluster on bare metal connected to a single switch.
>> 
>> Background & why I’m asking:
>> 
>> As far as I can tell, if a node fails in certain ways, the rest of the 
>> cluster can take up to 60 seconds (the default `net_ticktime`) to become 
>> aware of it, which can pause certain operations (such as database 
>> creation) depending on the size of the cluster, quorum settings, etc. 
>> (For instance, in a 3-node cluster with a single failure, reads and 
>> writes continue to work, but creating a database waits until the 
>> cluster's _membership is up to date.)
>> 
>> It looks like I can lower the Erlang `net_ticktime` setting to make this 
>> happen more quickly. The Erlang docs indicate that one should be 
>> cautious in changing this parameter, as it could lead to Couch thinking 
>> there were partitions when there were none. So I'm curious: does anyone 
>> have practical experience with this?
>> 
>> Thanks!
> 
> Hi David,
> 
> I would avoid changing net_ticktime at all, with the same level of concern 
> a human would have on hearing that a faster pacemaker might improve the 
> reliability of their heart...
> 
> Are you seeing node down failures? Are you creating/removing DBs with such 
> frequency that this is more than a theoretical constraint?
> 
> When you change the tick time, you also need to consider that ticks are 
> sent every 1/4 of net_ticktime; if ticks go missing, the runtime will keep 
> queueing inter-node traffic until it decides whether or not the node is down.
> 
> I'm interested to know if anybody else has ever tweaked this for CouchDB. 
> I know that it's been fiddled with for Riak, also around managing scheduler 
> collapse, but broadly I'd only fiddle with this if I were seeing real-world 
> production problems.
> 
> BTW https://www.rabbitmq.com/nettick.html has the nicest explanation, and 
> you'd also want to consider http://erlang.org/doc/man/erl.html#+zdbbl; see 
> http://erlang.org/doc/man/erlang.html#system_info_dist_buf_busy_limit
> 
> A+
> Dave

Thank you so much for your thoughtful and detailed response—fantastic!

> I would avoid changing net_ticktime at all, with the same level of concern 
> a human would have on hearing that a faster pacemaker might improve the 
> reliability of their heart...

> Are you seeing node down failures? Are you creating/removing DBs with such 
> frequency that this is more than a theoretical constraint?

Purely theoretical at this point. I’m working on upgrading our existing Couch 
1.x server to a 2.x cluster, and I have been running a lot of tests to make 
sure I understand how things will behave in production. I was surprised at 
the delay I described, traced it to net_ticktime, and found that tweaking 
net_ticktime did change the timing. Even after reading up as much as I could, 
though, I was still not clear on what the pros and cons would be. For our use 
case, I’m pretty content to leave it as-is, but I’m always curious what knobs 
I can turn to improve our current set-up.
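
For reference, here’s a minimal sketch of what the tweak looked like in my 
tests, assuming a stock CouchDB 2.x layout where the Erlang VM flags live in 
etc/vm.args (the 10 is just the example value from my original question, not 
a recommendation):

    # etc/vm.args
    # net_ticktime is a kernel application parameter, so it is passed
    # with -kernel. Ticks go out every net_ticktime/4 seconds, and a
    # silent peer is declared down after roughly net_ticktime +/- 25%
    # (i.e. 45-75s at the default of 60).
    -kernel net_ticktime 10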

> I'm interested to know if anybody else has ever tweaked this for couchdb.

As I mentioned, I’m definitely curious, too.
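
If anyone does want to compare notes, both knobs are easy to read back from a 
live node. A quick sketch, assuming a remote shell attached to the running VM 
(net_kernel:get_net_ticktime/0 and erlang:system_info/1 are standard OTP 
calls):

    %% Current tick time, in seconds (also reports an ongoing change).
    net_kernel:get_net_ticktime().
    %% Distribution buffer busy limit, in bytes. It is set at boot via
    %% the +zdbbl flag, which takes a value in kilobytes.
    erlang:system_info(dist_buf_busy_limit).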

Thanks again!
