ZK does pretty much entirely sequential I/O. One thing it does that might be very, very bad for SSDs is pre-allocating disk extents in the log by writing a bunch of zeros. This avoids directory updates as the log is written, but it doubles the write load on the SSD.
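
For reference, the pattern being described looks roughly like this - a minimal Java sketch, loosely modeled on ZooKeeper's transaction log padding (the class and method names and the 4 KB zero block are illustrative; 64 MB matches ZooKeeper's default preAllocSize):

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;

    class LogPadding {
        // Illustrative chunk size; 64 MB is ZooKeeper's default preAllocSize.
        static final long PREALLOC_SIZE = 64L * 1024 * 1024;

        // Before appending a record, make sure the file already extends past
        // the write position. Extending up front by writing a zero block means
        // later appends don't change the file size, so each fdatasync can skip
        // the metadata update - at the cost of extra bytes written to the SSD.
        static void pad(FileChannel log) throws IOException {
            long pos = log.position();
            if (pos + 4096 >= log.size()) {
                long newSize = ((pos / PREALLOC_SIZE) + 1) * PREALLOC_SIZE;
                ByteBuffer zeros = ByteBuffer.allocate(4096); // zeroed block
                log.write(zeros, newSize - zeros.capacity()); // extends the file
            }
        }
    }
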
On Thu, Jan 19, 2012 at 5:31 PM, Manosiz Bhattacharyya <manos...@gmail.com> wrote:

> I do not think that there is a problem with the queue size. I guess the problem is more with latency when the Fusion I/O goes in for a GC. We are enabling stats on the Zookeeper and the Fusion I/O to be more precise. Does Zookeeper typically do only sequential I/O, or does it do some random too? We could then move the logs to a disk.
>
> Thanks,
> Manosiz.
>
> On Wed, Jan 18, 2012 at 10:18 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>
> > If you aren't pushing much data through ZK, there is almost no way that the request queue can fill up without the log or snapshot disks being slow. See what happens if you put the log onto a real disk or (heaven help us) onto a tmpfs partition.
> >
> > On Thu, Jan 19, 2012 at 2:18 AM, Manosiz Bhattacharyya <manos...@gmail.com> wrote:
> >
> > > I will do as you mention.
> > >
> > > We are using the async APIs throughout. Also, we do not write much data into Zookeeper. We just use it for leadership elections and health monitoring, which is why we typically see the timeouts on idle Zookeeper connections.
> > >
> > > The reason we want the sessions to stay alive is the leadership election algorithm that we use from the Zookeeper recipe. If the connection is broken for the leader node, the ephemeral node that guaranteed its leadership is lost, and reconnecting will create a new node which does not guarantee leadership. We then have to elect a new leader - which requires significant work. The bigger the timeout, the longer the cluster stays without a master for a particular service, since the old master cannot keep working once it knows its session is gone and, with it, its ephemeral node. As we are trying to build a highly available service (not internet scale, but at the scale of a storage system with ms latencies typically), we thought about reducing the timeout but keeping the session open. Also note that the node that is typically the master does not write into Zookeeper very often.
> > >
> > > Thanks,
> > > Manosiz.
> > >
> > > On Wed, Jan 18, 2012 at 5:49 PM, Patrick Hunt <ph...@apache.org> wrote:
> > >
> > > > On Wed, Jan 18, 2012 at 4:47 PM, Manosiz Bhattacharyya <manos...@gmail.com> wrote:
> > > > > Thanks Patrick for your answer,
> > > >
> > > > No problem.
> > > >
> > > > > Actually we are in a virtualized environment; we have a FIO disk for transactional logs. It does have some latency sometimes during FIO garbage collection. We know this could be the potential issue, but we were trying to work around that.
> > > >
> > > > Ah, I see. I saw something very similar to this recently with SSDs used for the datadir. The fdatasync latency was sometimes > 10 seconds. I suspect it happened as a result of disk GC activity.
> > > >
> > > > I was able to identify the problem by running something like this:
> > > >
> > > > sudo strace -r -T -f -p 8066 -e trace=fsync,fdatasync -o trace.txt
> > > >
> > > > and then graphing the results (log scale). You should try running this against your servers to confirm that it is indeed the problem.
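
A quick way to pull the slow syncs out of such a trace without graphing it - a sketch in Java, assuming strace's -T option appends each call's duration to the line as "<seconds>"; the file name and the 50 ms threshold are arbitrary:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class SlowSyncs {
        public static void main(String[] args) throws IOException {
            // strace -T appends each call's duration, e.g. "... = 0 <0.000123>".
            Pattern duration = Pattern.compile("<(\\d+\\.\\d+)>\\s*$");
            double thresholdSecs = 0.050; // flag syncs slower than 50 ms
            try (BufferedReader in = new BufferedReader(new FileReader("trace.txt"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    if (!line.contains("fsync") && !line.contains("fdatasync")) continue;
                    Matcher m = duration.matcher(line);
                    if (m.find() && Double.parseDouble(m.group(1)) > thresholdSecs) {
                        System.out.println(line); // a suspiciously slow sync
                    }
                }
            }
        }
    }
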
> > > > > We were trying to qualify the requests into two types - either HBs (heartbeats) or normal requests. Isn't it better to reject normal requests if the queue size fills up to a certain threshold, but keep the session alive? That way flow control can be achieved by the user's session retrying the operation, while session health is maintained.
> > > >
> > > > What good is a session (connection) that's not usable? You're better off disconnecting and re-establishing with a server that can process your requests in a timely fashion.
> > > >
> > > > ZK looks at availability from a service perspective, not from an individual session/connection perspective. The whole is more important than the parts. There already is very sophisticated flow control going on - e.g. a server stops reading requests from its sessions when the number of outstanding requests exceeds some threshold; once the server catches up it starts reading again. Again - check out your "stat" results for insight into this (i.e. "outstanding requests").
> > > >
> > > > Patrick
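
On the session-expiry tradeoff discussed above, the usual client-side pattern is to treat Disconnected and Expired very differently, since the ephemeral election node survives the former but not the latter. A minimal sketch against the ZooKeeper Java client (the reelect() hook is hypothetical and stands in for whatever election recipe is in use):

    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;

    public class SessionWatcher implements Watcher {
        @Override
        public void process(WatchedEvent event) {
            switch (event.getState()) {
                case Disconnected:
                    // The session may still be alive on the ensemble and the
                    // client library will retry other servers; the ephemeral
                    // node is intact, so leadership is not yet lost.
                    break;
                case SyncConnected:
                    // Reconnected within the session timeout: same session,
                    // same ephemerals, leadership preserved.
                    break;
                case Expired:
                    // The ephemeral node is gone and this handle is dead.
                    // Build a fresh ZooKeeper handle and re-enter election.
                    reelect();
                    break;
                default:
                    break;
            }
        }

        // Hypothetical hook: create a new ZooKeeper handle and re-run the
        // election recipe (e.g. recreate the ephemeral sequential node).
        private void reelect() { }
    }
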