[ https://issues.apache.org/jira/browse/ZOOKEEPER-885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12921251#action_12921251 ]
Alexandre Hardy commented on ZOOKEEPER-885:
-------------------------------------------

Hi Patrick,

{quote}
1) you are applying the load using dd to all three servers at the same time, is that correct? (not just to 1 server)
{quote}

Correct. If {{dd}} is run on only one machine then the likelihood of disconnects is reduced. Unfortunately our typical scenario would involve load on all three machines.

{quote}
2) /dev/mapper indicates some sort of lvm setup, can you give more detail on that? (fyi http://ubuntuforums.org/showthread.php?t=646340)
{quote}

Yes, we have an LVM setup on a single spindle. The nimbula-test logical volume is 10G in size and (obviously) shares the same spindle as the root and log (/var/log) partitions.

{quote}
3) you mentioned that this:
echo 5 > /proc/sys/vm/dirty_ratio
echo 5 > /proc/sys/vm/dirty_background_ratio
resulting in "stability in this test", can you tell us what this was set to initially? Checkout this article: http://lwn.net/Articles/216853/
{quote}

The initial value for {{/proc/sys/vm/dirty_ratio}} is 20, and the initial value for {{/proc/sys/vm/dirty_background_ratio}} is 10. These machines have 1G of RAM, and thus are less susceptible to the problems mentioned in http://lwn.net/Articles/216853/ (as I see it).

I have run a more complete benchmark using random IO instead of sequential {{dd}} IO, testing session timeouts and the effect of the {{dirty_ratio}} settings. I will attach that separately. {{dirty_ratio}} seems to help with the {{dd}} test but has much less influence in the random IO test.

{quote}
I notice you are running a "bigmem" kernel. What's the total memory size? How large of a heap have you assigned to the ZK server? (jvm)
{quote}

We have 1G of RAM on each machine in this test system and a 100M heap for each zookeeper server.

{quote}
4) Can you verify whether or not the JVM is swapping? Any chance that the server JVM is swapping, which is causing the server to pause, which then causes the clients to time out? This seems to me like it would fit the scenario - esp given that when you turn the "dirty_ratio" down you see stability increase (the time it would take to complete the flush would decrease, meaning that the server can respond before the client times out).
{quote}

I'm not entirely sure of all the JVM internals, but all swap space on the Linux system was disabled, so no swapping by the Linux kernel could happen. I'm not sure whether the JVM does any swapping of its own?

I concur with your analysis. What puzzles me is why the system would even get into a state where the zookeeper server has to wait so long for a disk flush. In the case of {{dd if=/dev/urandom}} the IO rate is quite low, and there should (I think) be more than enough IOPS available for zookeeper to flush data to disk in time. Even if the IO scheduling results in this scenario, it is still not clear to me why zookeeper would fail to respond to a ping. My only conclusion at this stage is that responding to a ping requires information to be flushed to disk. Is this correct?

Referring to your private e-mail:
{quote}
> The weird thing here is that there should be no delay for these pings.
{quote}
This would indicate to me that the ping response should not be dependent on any disk IO.

Thanks for all the effort in looking into this!
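For context, the watcher clients used in this test do nothing more than hold a session and keep a watch on a single znode; the attached WatcherTest.java is the actual test, and the sketch below is only an illustrative reconstruction of that kind of client. The connect string, znode path and session timeout are assumed values, not taken from the attachment. It simply logs the SyncConnected/Disconnected/Expired transitions that this issue is about:

{noformat}
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class SingleNodeWatcher implements Watcher {

    // All three values below are assumptions for illustration, not taken
    // from the attached WatcherTest.java.
    private static final String CONNECT =
            "server1:2181,server2:2181,server3:2181";
    private static final String PATH = "/watched-node";
    private static final int SESSION_TIMEOUT_MS = 10000;

    private final CountDownLatch expired = new CountDownLatch(1);
    private volatile ZooKeeper zk;

    public void run() throws Exception {
        zk = new ZooKeeper(CONNECT, SESSION_TIMEOUT_MS, this);
        rewatch();
        expired.await();   // block until the cluster expires the session
        zk.close();
    }

    @Override
    public void process(WatchedEvent event) {
        long now = System.currentTimeMillis();
        if (event.getType() == Event.EventType.None) {
            // Connection-state change (SyncConnected, Disconnected, Expired, ...)
            System.out.println(now + " state: " + event.getState());
            if (event.getState() == Event.KeeperState.Expired) {
                expired.countDown();
            } else if (event.getState() == Event.KeeperState.SyncConnected) {
                rewatch();
            }
        } else {
            // Znode event; watches are one-shot, so re-register.
            System.out.println(now + " event: " + event.getType()
                    + " on " + event.getPath());
            rewatch();
        }
    }

    private void rewatch() {
        ZooKeeper handle = zk;
        if (handle == null) {
            return;   // a callback may arrive before the constructor returns
        }
        try {
            handle.exists(PATH, this);
        } catch (Exception e) {
            System.out.println("failed to set watch: " + e);
        }
    }

    public static void main(String[] args) throws Exception {
        new SingleNodeWatcher().run();
    }
}
{noformat}

Under the {{dd}} load, a client like this first logs a Disconnected event when the server stops answering its pings within the session timeout, and Expired only if the cluster itself decides the session is gone; the timestamps in the log are what show how long the server was unresponsive.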
> Zookeeper drops connections under moderate IO load
> --------------------------------------------------
>
>                 Key: ZOOKEEPER-885
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-885
>             Project: Zookeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.2.2, 3.3.1
>         Environment: Debian (Lenny)
> 1Gb RAM
> swap disabled
> 100Mb heap for zookeeper
>            Reporter: Alexandre Hardy
>            Priority: Critical
>         Attachments: tracezklogs.tar.gz, tracezklogs.tar.gz, WatcherTest.java, zklogs.tar.gz
>
>
> A zookeeper server under minimal load, with a number of clients watching exactly one node, will fail to maintain the connection when the machine is subjected to moderate IO load.
> In a specific test example we had three zookeeper servers running on dedicated machines with 45 clients connected, watching exactly one node. The clients would disconnect after moderate load was added to each of the zookeeper servers with the command:
> {noformat}
> dd if=/dev/urandom of=/dev/mapper/nimbula-test
> {noformat}
> The {{dd}} command transferred data at a rate of about 4Mb/s.
> The same thing happens with
> {noformat}
> dd if=/dev/zero of=/dev/mapper/nimbula-test
> {noformat}
> It seems strange that such a moderate load should cause instability in the connection.
> Very few other processes were running; the machines were set up to test the connection instability we have experienced. Clients performed no other read or mutation operations.
> Although the documentation states that minimal competing IO load should be present on the zookeeper server, it seems reasonable that moderate IO should not cause problems in this case.

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.