Re: Logger hierarchies in ZK?
On Tue, Jul 20, 2010 at 6:07 PM, Ted Dunning ted.dunn...@gmail.com wrote: It is pretty easy to keep configuration files in general in ZK and reload them on change. Very handy some days!

We recently open-sourced a tool to handle stuff like config reloads, triggering actions, etc: http://github.com/twitter/twitcher

The short version is that a single daemon sets your watches and triggers local actions when stuff happens. If your app doesn't speak ZK, this might be a good solution.

--travis

On Tue, Jul 20, 2010 at 5:38 PM, ewhau...@gmail.com wrote: Has anyone experimented with storing logger hierarchies in ZK? I'm looking for a mechanism to dynamically change logger settings across a cluster of daemons. An app that connects to all servers via JMX would solve the problem, but we have a number of subsystems that do not run on the JVM, so JMX is not a complete solution. Thanks.
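As a rough illustration of the config-reload pattern Ted describes (not code from twitcher itself), here is a minimal sketch using the zkpython binding that appears later in this digest. The znode path and the apply_config() hook are hypothetical, and the watch is re-armed on every read because ZK watches are one-shot:

    import zookeeper

    CONFIG_ZNODE = '/myapp/config'  # hypothetical znode holding the config blob

    def config_watcher(zh, event, state, path):
        # One-shot watch fired; reload, which also re-arms the watch.
        if event == zookeeper.CHANGED_EVENT:
            load_config(zh)

    def load_config(zh):
        # get() returns (data, stat) and re-registers the watch atomically
        # with the read, so no update slips through unobserved.
        data, stat = zookeeper.get(zh, CONFIG_ZNODE, config_watcher)
        apply_config(data)  # hypothetical application-specific reload hook

    zh = zookeeper.init('localhost:2181')  # assumes a local, connected ensemble
    load_config(zh)

A daemon like twitcher presumably wraps this same watch-and-trigger loop on behalf of applications that don't speak ZK natively.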
Re: zookeeper crash
Hey all - I believe we just suffered an outage from this issue. The short version is that while restarting quorum members with the GC flags recommended in the Troubleshooting wiki page, a follower logged messages similar to those in the following jiras:

2010-07-06 23:14:01,438 - FATAL [QuorumPeer:/0:0:0:0:0:0:0:0:2181:follo...@71] - Leader epoch 20 is less than our epoch 21
2010-07-06 23:14:01,438 - WARN [QuorumPeer:/0:0:0:0:0:0:0:0:2181:follo...@82] - Exception when following the leader java.io.IOException: Error: Epoch of leader is lower

https://issues.apache.org/jira/browse/ZOOKEEPER-335
https://issues.apache.org/jira/browse/ZOOKEEPER-790

Reading through the jiras, it's unclear whether the issue is well understood at this point (given that there's a patch available) or still being investigated. If it's still being investigated, let me know and I can attach the relevant log lines to the appropriate jira. Or if the patch appears good, I can make a new release and help test. Let me know :)

--travis

On Wed, Jun 16, 2010 at 3:25 PM, Flavio Junqueira f...@yahoo-inc.com wrote: I would recommend opening a separate jira issue. I'm not convinced the issues are the same, so I'd rather keep them separate and link the issues if it is the case. -Flavio

On Jun 17, 2010, at 12:16 AM, Patrick Hunt wrote: We are unable to reproduce this issue. If you can provide the server logs (all servers) and attach them to the jira, it would be very helpful. Some detail on the approximate time of the issue, so we can correlate with the logs, would help too (a summary of what you did/do to cause it, etc. -- anything that might help us nail this one down). https://issues.apache.org/jira/browse/ZOOKEEPER-335 Some detail on ZK version, OS, Java version, HW info, etc. would also be of use to us. Patrick

On 06/16/2010 02:49 PM, Vishal K wrote: Hi, We are running into this bug very often (almost a 60-75% hit rate) while testing our newly developed application over ZK. This is almost a blocker for us. Would the fix be simpler if backward compatibility were not an issue? Considering that this bug is rarely reported, I am wondering why we are running into this problem so often. Also, on a side note, I am curious why the systest that comes with ZooKeeper did not detect this bug. Can anyone please give an overview of the problem? Thanks. -Vishal

On Wed, Jun 2, 2010 at 8:17 PM, Charity Majors char...@shopkick.com wrote: Sure thing. We got paged this morning because backend services were not able to write to the database. Each server discovers the DB master using zookeeper, so when zookeeper goes down, they assume they no longer know who the DB master is and stop working.

When we realized there were no problems with the database, we logged in to the zookeeper nodes. We weren't able to connect to zookeeper using zkCli.sh from any of the three nodes, so we decided to restart them all, starting with node one. However, after restarting node one, the cluster started responding normally again. (The timestamps on the zookeeper processes on nodes two and three *are* dated today, but none of us restarted them. We checked shell histories and sudo logs, and they seem to back us up.)

We tried getting node one to come back up and join the cluster, but that's when we realized we weren't getting any logs, because log4j.properties was in the wrong location. Sorry -- I REALLY wish I had those logs for you. We put log4j back in place, and that's when we saw the spew I pasted in my first message. I'll tack this on to ZK-335.

On Jun 2, 2010, at 4:17 PM, Benjamin Reed wrote: charity, do you mind going through your scenario again to give a timeline for the failure? I'm a bit confused as to what happened. ben

On 06/02/2010 01:32 PM, Charity Majors wrote: Thanks. That worked for me. I'm a little confused about why it threw the entire cluster into an unusable state, though. I said before that we restarted all three nodes, but tracing back, we actually didn't. The zookeeper cluster was refusing all connections until we restarted node one. But once node one had been dropped from the cluster, the other two nodes formed a quorum and started responding to queries on their own. Is that expected as well? I didn't see it in ZOOKEEPER-335, so thought I'd mention it.

On Jun 2, 2010, at 11:49 AM, Patrick Hunt wrote: Hi Charity, unfortunately this is a known issue, not specific to 3.3, that we are working to address. See this thread for some background: http://zookeeper-user.578899.n2.nabble.com/odd-error-message-td4933761.html I've raised the JIRA level to blocker to ensure we address this asap. As Ted suggested, you can remove the datadir -- only on the affected server -- and then restart it. That should resolve the issue (the server will d/l a snapshot of the current db from the leader). Patrick

On 06/02/2010 11:11 AM, Charity Majors wrote: I upgraded my zookeeper cluster
Zookeeper outage recap questions
Hey zookeepers - We just experienced a total zookeeper outage; here's a quick post-mortem of the issue, and some questions about preventing it going forward.

Quick overview of the setup:

- RHEL5 2.6.18 kernel
- Zookeeper 3.3.0
- ulimit raised to 65k files
- 3 cluster members
- 4-5k connections in steady-state
- Primarily C and python clients, plus some java

In chronological order, the issue manifested itself as an alert about RW tests failing. Logs were full of "too many open files" errors, and the output of netstat showed lots of CLOSE_WAIT and SYN_RECV sockets. CPU was at 100%. Application logs showed lots of connection timeouts. This suggests an event happened that caused applications to dogpile on Zookeeper, and eventually the CLOSE_WAIT timeout caused file handles to run out -- basically game over.

I looked through lots of logs (clients + servers) and did not see a clear indication of what happened. Graphs show a sudden decrease in network traffic when the outage began; zookeeper goes cpu-bound and runs out of file descriptors. Clients are primarily a couple thousand C clients using default connection parameters, and a couple thousand python clients using default connection parameters.

Digging through Jira, we see two issues that probably contributed to this outage:

https://issues.apache.org/jira/browse/ZOOKEEPER-662
https://issues.apache.org/jira/browse/ZOOKEEPER-517

Both are tagged for the 3.4.0 release. Anyone know if that's still the case, and when 3.4.0 is roughly scheduled to ship?

Thanks!
Travis
Re: Zookeeper outage recap questions
I've moved this thread to: https://issues.apache.org/jira/browse/ZOOKEEPER-801

--travis

On Thu, Jul 1, 2010 at 12:37 AM, Patrick Hunt ph...@apache.org wrote: Hi Travis, as Flavio suggested, it would be great to get the logs. A few questions:

1) How did you eventually recover -- restart the zk servers?
2) Was the cluster losing quorum during this time? Leader re-election?
3) Any chance this could have been initially triggered by a long GC pause on one of the servers? (Is gc logging turned on, or any sort of heap monitoring?) Has the GC been tuned on the servers, for example CMS and incremental?
4) What are the clients using for timeout on the sessions?

3.4 is probably not for a few months yet, but we are planning a 3.3.2 in a few weeks to fix a couple of critical issues (which don't seem related to what you saw). If we can identify the problem here, we should be able to include it in any fix release we do.

Fixing something like 517 might help, but it's not clear how we got into this state in the first place; fixing 517 might not have any effect if the root cause is not addressed. 662 has only ever been reported once afaik, and we weren't able to identify the root cause for that one.

One thing we might also consider is modifying the zk client lib to back off connection attempts if they keep failing (timing out, say). Today the clients are pretty aggressive on reconnection attempts. Having some sort of backoff (exponential?) would provide more breathing room to the server in this situation.

Patrick
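To make Patrick's backoff suggestion concrete, here is a minimal sketch of what application-level exponential backoff with jitter might look like around zkpython calls; with_backoff() and the delay ceiling are illustrative, not an existing client-library feature:

    import random
    import time
    import zookeeper

    def with_backoff(op, max_delay=60):
        # Retry a ZK operation, backing off exponentially (with jitter) so a
        # herd of reconnecting clients doesn't dogpile the servers.
        delay = 1.0
        while True:
            try:
                return op()
            except (zookeeper.ConnectionLossException,
                    zookeeper.OperationTimeoutException):
                time.sleep(delay + random.random())
                delay = min(delay * 2, max_delay)

    # Example usage: children = with_backoff(lambda: zookeeper.get_children(zh, '/'))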
Re: ZKClient
On Tue, May 4, 2010 at 3:45 PM, Ted Dunning ted.dunn...@gmail.com wrote: Travis, Attachments are stripped from this mailing list. Can you file a JIRA and put your attachment on that instead? Here is a link to get you started: https://issues.apache.org/jira/browse/ZOOKEEPER

Whoops. Filed: https://issues.apache.org/jira/browse/ZOOKEEPER-765

--travis

On Tue, May 4, 2010 at 3:43 PM, Travis Crawford traviscrawf...@gmail.com wrote: Attached is a skeleton application I extracted from a script I use -- perhaps we could add this as a recipe? If there are issues I'm more than happy to fix them, or add more comments, whatever. It took a while to figure this out and I'd love to save others that time in the future. --travis

On Tue, May 4, 2010 at 3:16 PM, Mahadev Konar maha...@yahoo-inc.com wrote: Hi Adam, I don't think zk is very, very hard to get right. There are examples in src/recipes which implement locks/queues/others. There is ZOOKEEPER-22 to make it even easier for applications to use. Regarding re-registration of watches, you can definitely write code and submit it as part of a well-documented contrib module which lays out its assumptions/design. It could very well be useful for others. It's just that folks haven't had much time to focus on these areas as yet. Thanks, mahadev

On 5/4/10 2:58 PM, Adam Rosien a...@rosien.net wrote: I use zkclient in my work at kaChing and I have mixed feelings about it. On one hand it makes easy things easy, which is great, but on the other hand I have very little idea what assumptions it makes under the hood. I also dislike some of the design choices, such as unchecked exceptions, but that's neither here nor there. It would take some extensive documentation work by the authors to really enumerate the model and assumptions, but the project doesn't seem to be active (either from it being adequate for its current users, or just inactive). I'm not sure I could derive the assumptions myself.

I'm a bit frustrated that zk is very, very hard to really get right. At a project level, can't we create structures to avoid most of these errors? Can there be a standard model with detailed assumptions and implementations of all the recipes? How can we start this? Is there something that makes this too hard? I feel like a recipe page is a big fail; wouldn't an example app that uses locks and barriers be that much more compelling? For the common FAQ items, like "you need to re-register the watch", can't we just create code that implements this pattern? My goal is to live up to the motto: a good API is impossible to use incorrectly. .. Adam

On Tue, May 4, 2010 at 2:21 PM, Ted Dunning ted.dunn...@gmail.com wrote: In general, writing this sort of layer on top of ZK is very, very hard to get really right for general use. In a simple use-case you can probably nail it, but distributed systems are a Zoo, to coin a phrase. The problem is that you are fundamentally changing the metaphors in use, so assumptions can come unglued or be introduced pretty easily. One example of this is the fact that ZK watches *don't* fire for every change, but when you write listener-oriented code, you kind of expect that they will. That makes it really, really easy to introduce that assumption in the heads of programmers using an event-listener library on top of ZK. Another example is that the way the atomic get-content/set-watch call works in ZK is easy to violate in an event-driven architecture, because the thread that watches ZK probably resets the watch. If you assume that the listener will read the data, then you have introduced a timing mismatch between the read of the data and the resetting of the watch. That might be OK or it might not be. The point is that these changes are subtle and tricky to get exactly right.

On Tue, May 4, 2010 at 1:48 PM, Jonathan Holloway jonathan.hollo...@gmail.com wrote: Is there any reason why this isn't part of the Zookeeper trunk already?
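To make Ted's two points concrete, here is a minimal zkpython sketch (the path and the queue handoff are illustrative). The re-read and the watch reset happen in a single get() call, and the watcher hands listeners a snapshot of the data rather than asking a listener thread to read it later:

    import Queue  # Python 2 naming, matching the zkpython era
    import zookeeper

    updates = Queue.Queue()

    def node_watcher(zh, event, state, path):
        if event != zookeeper.CHANGED_EVENT:
            return
        # Watches are one-shot and coalesce: several changes may yield a
        # single event. Re-reading with the same watcher re-arms the watch
        # atomically with the read, so nothing slips between the event and
        # the new watch.
        data, stat = zookeeper.get(zh, path, node_watcher)
        # Hand listeners the snapshot, not the responsibility to read; letting
        # a listener thread read later reintroduces the timing mismatch Ted
        # describes above.
        updates.put(data)

    zh = zookeeper.init('localhost:2181')  # assumes a local, connected ensemble
    data, stat = zookeeper.get(zh, '/demo/node', node_watcher)  # illustrative path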
Re: Misbehaving zk servers
On Thu, Apr 29, 2010 at 9:49 AM, Patrick Hunt ph...@apache.org wrote: Is there any good (simple/fast/bulletproof) way to monitor FD use inside the JVM? If so, we could stop accepting new client connections once we get close to the OS-imposed limit... The test would have to be a bulletproof one, though -- we wouldn't want to end up in some worse situation (where we refuse connections because we mistakenly believe the limit has been reached). Might be good to open a JIRA for this and add some tests. In particular we should verify the server handles this as gracefully as it can when the limit has been reached.

Poking around with jconsole, I found two stats that already measure FDs:

- java.lang.OperatingSystem.MaxFileDescriptorCount
- java.lang.OperatingSystem.OpenFileDescriptorCount

They're described (rather tersely) at: http://java.sun.com/javase/6/docs/jre/api/management/extension/com/sun/management/UnixOperatingSystemMXBean.html

So it sounds like the feature request would be: stop accepting new client connections when OpenFileDescriptorCount exceeds 95% of MaxFileDescriptorCount, and only start accepting new connections again once OpenFileDescriptorCount drops below 90% of MaxFileDescriptorCount. Basically the high/low watermark thing. Thoughts? (A rough sketch of this logic follows at the end of this thread.)

--travis

Patrick

On 04/29/2010 09:34 AM, Mahadev Konar wrote: Hi Travis, How many clients did you have connected to this server? Usually the default is 8K file descriptors. Did you have more clients than that? Also, if clients fail to attach to a server, they will run off to another server. We do not do any blacklisting, because we expect the server to heal, and if it does not, it mostly shuts itself down in most cases. Thanks, mahadev

On 4/29/10 12:08 AM, Travis Crawford traviscrawf...@gmail.com wrote: Hey zookeeper gurus - We recently had a zookeeper outage when one ZK server was started with a low limit after upgrading to 3.3.0. Several days later the outage occurred when that node reached its file descriptor limit and clients started having major issues. Are there any circumstances when a ZK server will get blacklisted from the ensemble? Something similar to how tasktrackers are blacklisted when too many tasks fail. Thanks! Travis
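The real change would live in the Java server's connection-accept path, but as a rough sketch (in Python, with Linux-only /proc counting standing in for the MXBean counters) the proposed high/low watermark logic might look like this:

    import os
    import resource

    HIGH_WATER = 0.95  # stop accepting new connections above this fraction
    LOW_WATER = 0.90   # resume accepting once usage falls back below this

    def fd_fraction():
        # The soft RLIMIT_NOFILE is the per-process cap the server hits first;
        # /proc/self/fd holds one entry per open descriptor (Linux-specific).
        soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
        return len(os.listdir('/proc/self/fd')) / float(soft)

    accepting = True

    def update_accept_state():
        # Hysteresis between the two watermarks avoids flapping at the limit.
        global accepting
        usage = fd_fraction()
        if accepting and usage > HIGH_WATER:
            accepting = False  # refuse new client connections for now
        elif not accepting and usage < LOW_WATER:
            accepting = True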
Re: python client structure
On Wed, Apr 21, 2010 at 12:26 AM, Henry Robinson he...@cloudera.com wrote: Hi Travis - Great to see zkpython getting used. I'm glad you're finding the problems with the documentation - please do file JIRAs with anything you'd like to see improved (and I know there's a lot to improve with zkpython).

Yeah, I was pretty excited to see the python bindings. Thanks! The best place for help I was able to find is ``help(zookeeper)`` - is there somewhere else I should be looking instead? It looks like you wrapped the C client, so I have also been looking in zookeeper.h and trying to infer what's going on.

You are using the asynchronous form of get_children. This means that ZooKeeper can send you two notifications. The first is called when the get_children call completes. The second is the watcher, and it is called when the children of the watched node change. You can omit the watcher if you don't need it, or alternatively use the synchronous form, which is written get_children. That call doesn't return until the operation is complete, so you don't need to worry about a callback.

Ok, sounds like that general structure will work then. Thanks for verifying.

The first argument to any watcher or callback is the handle of the client that placed the callback. Not the return code! We pass that in so that it's easy to make further ZK calls, because the handle is readily available. The second argument for a callback is the return code, and that can be mapped to a string via zerror(rc) if needed (but as you have found, there are numeric return code constants in the module that have readable symbolic names).

Aah, the first argument is the handle! That makes sense.

Does this help at all? Let me know if you have any follow-on questions.

Very much so! If I clean up the sample code below, would you want to put it on a wiki or check it in as an example? It would have been nice if someone had already figured this out when I started messing with things :)

--travis

cheers, Henry

On 20 April 2010 23:33, Travis Crawford traviscrawf...@gmail.com wrote: Hey zookeeper gurus - I'm getting started with Zookeeper and the python client, and am curious if I'm structuring watches correctly. I'd like to watch a znode and do stuff when its children change. Something doesn't feel right about having two methods: one to handle the actual get children call, and one to handle the watch. Does this seem like the right direction? If not, any suggestions on how to better structure things?

    #!/usr/bin/python

    import logging
    import signal
    import threading
    import zookeeper

    from optparse import OptionParser

    logger = logging.getLogger()
    options = None
    args = None


    class ZKTest(threading.Thread):
        zparent = '/home/travis/zktest'

        def __init__(self):
            threading.Thread.__init__(self)
            if options.verbose:
                zookeeper.set_debug_level(zookeeper.LOG_LEVEL_DEBUG)
            self.zh = zookeeper.init(options.servers)
            zookeeper.aget_children(self.zh, self.zparent, self.watcher,
                                    self.handler)

        def __del__(self):
            zookeeper.close(self.zh)

        def handler(self, rc, rc1, children):
            """Handle zookeeper.aget_children() responses.

            Args:
              Arguments are not documented well and I'm not entirely sure
              what to call these. ``rc`` appears to be the response code,
              such as OK. However, the only possible mapping of 0 is OK,
              so in successful cases there appear to be two response codes.
              The example with no children returned an ``rc1`` of -7, which
              maps to OPERATIONTIMEOUT, so that appears to be an error code,
              but it's not clear what was OK in that case. If anyone figures
              this out I would love to know.

            Example args:
              'args': (0, 0, ['a', 'b'])
              'args': (0, -7, [])

            Does not provide a return value.
            """
            logger.debug('Processing response: (%d, %d, %s)' %
                         (rc, rc1, children))
            if zookeeper.OK == rc and zookeeper.OK == rc1:
                logger.debug('Do the actual work here.')
            else:
                logger.debug('Error getting children! Retrying.')
                zookeeper.aget_children(self.zh, self.zparent, self.watcher,
                                        self.handler)

        def watcher(self, rc, event, state, path):
            """Handle zookeeper.aget_children() watches.

            This code is called when a child znode changes and triggers a
            child watch. It is not called to handle the aget_children call
            itself.

            Numeric arguments map to constants. See ``DATA`` in
            ``help(zookeeper)`` for more information.

            Args:
              rc     Return code.
              event  Event that caused the watch (often called ``type``
                     elsewhere).
              state  Connection state.
              path   Znode that triggered this watch.

            Does not provide a return value.
            """
            logger.debug('Child watch: (%d, %d, %d, %s)' %
                         (rc, event, state, path))
            zookeeper.aget_children(self.zh, self.zparent, self.watcher,
                                    self.handler)

        def run(self):
            while True:
                pass
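For comparison, a minimal sketch of the synchronous form Henry mentions, reusing the znode from Travis's script: get_children() blocks until the operation completes, so there is no completion handler, and the watcher still fires later when the children change:

    import zookeeper

    def child_watcher(zh, event, state, path):
        if event == zookeeper.CHILD_EVENT:
            # Re-read synchronously; passing the watcher re-arms the watch.
            children = zookeeper.get_children(zh, path, child_watcher)
            print('children now: %s' % children)

    zh = zookeeper.init('localhost:2181')  # assumes a local, connected ensemble
    children = zookeeper.get_children(zh, '/home/travis/zktest', child_watcher)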
Re: Recovery issue - how to debug?
On Mon, Apr 19, 2010 at 2:15 PM, Ted Dunning ted.dunn...@gmail.com wrote: Can you attach the screen shot to the JIRA issue? The mailing list strips these things.

Oops. Updated jira: https://issues.apache.org/jira/browse/ZOOKEEPER-744

--travis

On Mon, Apr 19, 2010 at 1:18 PM, Travis Crawford traviscrawf...@gmail.com wrote: Filed: https://issues.apache.org/jira/browse/ZOOKEEPER-744 Attached is a screenshot of some JMX output in Ganglia - it's currently implemented using a -javaagent tool I happened to find. Having a simple non-java way to fetch monitoring stats and publish them to an external monitoring system would be awesome, and probably reusable by others.
Re: monitoring zookeeper
Hey Kishore - Thanks for the info. I found an interesting library called jmxetric (http://code.google.com/p/jmxetric) that reads MBeans and publishes their contents to Ganglia, and it's working pretty well. A simplified config looks like:

    <jmxetric-config>
      <jvm process="Zookeeper"/>
      <sample delay="60">
        <mbean name="org.apache.ZooKeeperService:name0=ReplicatedServer_id3,name1=replica.3,name2=Leader" pname="ZK">
          <attribute name="AvgRequestLatency" type="double"/>
          <attribute name="MaxRequestLatency" type="double"/>
          <attribute name="MinRequestLatency" type="double"/>
          <attribute name="OutstandingRequests" type="double"/>
          <attribute name="PacketsReceived" type="double"/>
          <attribute name="PacketsSent" type="double"/>
        </mbean>
      </sample>
    </jmxetric-config>

It doesn't solve the nested-property issue, unfortunately, so I may have to flatten some statistics as you have. I'm interested in checking out your code if you don't mind.

At a higher level, I'm interested in setting up the sort of monitoring one would expect of a critical datacenter service. To start with, I'd like to collect the data necessary to:

- page when there's no leader
- page when only the minimum number of replicas needed for quorum are present
- email when replicas are missing but still above the quorum minimum. For example, send an email when 1/5 are down, and page when 2/5 are down. Also page if there's no leader for some other reason.

The operational metrics like latencies, connections, and requests would be useful in troubleshooting issues as well as capacity planning.

--travis

On Wed, Apr 14, 2010 at 4:50 PM, kishore g g.kish...@gmail.com wrote: Hi Travis, We do monitor zookeeper using JMX. We have simple code which does the following:

- Parse the JMX output and convert it into key-value format. Nested properties are flattened.
- Emit the key-values using the LWES (http://www.lwes.org/) APIs at a regular (configurable) interval.
- The keys to be emitted can be configured via a config file.

We have our own internal reporting framework which displays these metrics. In order to differentiate between leader and follower we use separate keys:

ReplicatedServer_idXXX_replica.XXX_Follower.AvgRequestLatency=rsf_mrl
ReplicatedServer_idXXX_replica.XXX_Leader.AvgRequestLatency=rsl_mrl

If the server is the leader then rsf_mrl will be empty, and vice versa. I can provide the code to do this and you can probably change it to meet your needs and enhance it to work for Ganglia. Let me know if this helps you. thanks, Kishore G

On Wed, Apr 14, 2010 at 11:12 AM, Travis Crawford traviscrawf...@gmail.com wrote: Hey zookeeper gurus - Are there any recommended ways to monitor zookeeper ensembles? I'm familiar with the four-letter words and that stats are published via JMX - I'm more interested in what people are doing with those stats. I'd like to publish the JMX stats to Ganglia, and this works well for the built-in stats. However, the zookeeper-specific names appear to be dynamic, which causes issues when deciding what to publish. For example, the current mode (leader/follower) appears to only be accessible from the bean names, instead of from, say, a "mode" stat.
org.apache.ZooKeeperService:name0=ReplicatedServer_id1,name1=replica.1,name2=Follower
org.apache.ZooKeeperService:name0=ReplicatedServer_id2,name1=replica.2,name2=Leader

The only way I've found to learn whether replicas are up-to-date is looking at ``synced?`` buried in followerInfo:

$ java -jar cmdline-jmxclient-0.10.5.jar - localhost:8081 org.apache.ZooKeeperService:name0=ReplicatedServer_id2,name1=replica.2,name2=Leader followerInfo
04/14/2010 18:06:06 + org.archive.jmx.Client followerInfo:
FollowerHandler Socket[addr=/10.0.0.10,port=48104,localport=2888] tickOfLastAck:29793 synced?:true queuedPacketLength:0
FollowerHandler Socket[addr=/10.0.0.11,port=59599,localport=2888] tickOfLastAck:29793 synced?:true queuedPacketLength:0

I don't mind writing a tool to parse the JMX output and publish to Ganglia if needed, but it seems like a problem that may have already been solved, so I'm curious what others are doing. The tool would basically take the zookeeper stats, normalize the names, and publish them to a timeseries database. Is anyone already monitoring ZK in a way others might find useful?

Thanks!
Travis
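For what it's worth, the mode can also be pulled without JMX via the four-letter words mentioned above. A minimal sketch, assuming the 'stat' reply contains a 'Mode:' line (true of the 3.3 servers discussed here, though the output format isn't a stable interface):

    import socket

    def four_letter_word(host, port, cmd):
        # Send a four-letter command ('ruok', 'stat', ...) and read the reply.
        sock = socket.create_connection((host, port), timeout=5)
        try:
            sock.sendall(cmd)
            chunks = []
            while True:
                data = sock.recv(4096)
                if not data:
                    break
                chunks.append(data)
            return ''.join(chunks)
        finally:
            sock.close()

    def server_mode(host, port=2181):
        # Returns 'leader', 'follower', or 'standalone', parsed from 'stat'.
        for line in four_letter_word(host, port, 'stat').splitlines():
            if line.startswith('Mode:'):
                return line.split(':', 1)[1].strip()
        return 'unknown'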