My experience with slow fsyncs is that it's almost always due to contention for disk IO. I see that you tuned the snap* sizes down, which is reasonable. Have you checked what ZK activity is happening during this period? Perhaps some client is hammering the cluster around that time of day; have you ruled that out?
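If it helps narrow things down, here is a rough sketch of a probe that times fsync calls on the same disk, independent of ZooKeeper. It is only an illustration: the function names, the scratch file name, and the directory in the usage note are placeholders I made up, and the 1 second threshold is just my assumption of a sensible value (I believe it matches ZooKeeper's default fsync warning threshold). Point it at the directory that actually holds the txn log (dataLogDir if you set one, otherwise dataDir).

```python
import os
import time


def time_fsync(directory, payload_size=4096):
    """Write payload_size bytes to a scratch file in directory and
    return how many seconds the fsync of that file took."""
    # "fsync_probe.tmp" is a made-up scratch file name, not a ZooKeeper file.
    path = os.path.join(directory, "fsync_probe.tmp")
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        os.write(fd, b"\0" * payload_size)
        start = time.monotonic()
        os.fsync(fd)  # block until the data has been flushed to the device
        return time.monotonic() - start
    finally:
        os.close(fd)
        os.remove(path)


def probe(directory, samples=60, interval=1.0, warn_after=1.0):
    """Time one fsync every `interval` seconds, print the slow ones,
    and return a list of (timestamp, seconds) pairs."""
    results = []
    for _ in range(samples):
        elapsed = time_fsync(directory)
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        if elapsed >= warn_after:
            print(f"{stamp} slow fsync: {elapsed * 1000:.0f} ms")
        results.append((stamp, elapsed))
        time.sleep(interval)
    return results
```

Running something like `probe("/path/to/zookeeper/datalog", samples=3600)` starting a little before 17:30 would show whether the disk itself stalls at that time. If the probe also sees multi-second fsyncs, the problem is below ZooKeeper (another process, a cron job, the storage layer); if it stays fast, look harder at what ZooKeeper and its clients are doing.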
I searched the mail archives and there are other folks reporting this issue, so you might take a look. I found this one in particular that you might check out:
https://lists.apache.org/thread/qjrlprmt7pdy63ztvjtvkd0f5zgw5dgk

Patrick

On Thu, Apr 18, 2024 at 3:31 AM Xu Bill <xuzhili1...@hotmail.com> wrote:

> Hello,
>
> I have a pretty weird issue with ZooKeeper.
> Every day around 17:30, my ZooKeeper logs a warning message that says
> "fsync-ing the write ahead log in SyncThread:0 took 36919ms which will
> adversely effect operation latency.File size is 16777232 bytes.". This
> causes my clients connected to ZooKeeper to time out. I have to restart
> my clients every day.
>
> Though I don't think the txn log file is too big to be handled
> quickly,
> I still tried changing parameters to limit the size of the txn log. Below
> is my configuration.
> preAllocSize=16M
> snapCount=30000
> snapSizeLimitInKb=32M
>
> Even with this configuration, I still get the warnings.
>
> I also tried monitoring the IO stats on the data disk that holds the
> ZooKeeper data dir.
> But the stats were the same as usual.
>
> Can anybody give suggestions on how to solve or investigate this
> issue?
> I am using ZooKeeper 3.7.2.
> The IO stats were tps=122, reading=20.1k/s, writing=2M/s, when the warning
> was happening.
>
> Best regards,
> Bill
>