Do you have any monitoring of mem, GC, disk, etc. that might give some additional insight? Perhaps the disks were loaded and that was slowing things? Or swapping/GC of the JVM? You might be able to tune to resolve some of that.
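As a rough sketch of the JVM-side check meant here: if the server was started with remote JMX enabled, a tiny client can pull the heap and GC counters. zkhost:9999 is just a placeholder for your own JMX host/port, and jstat or GC logging on the box will give you much the same numbers with less ceremony.

    import java.lang.management.GarbageCollectorMXBean;
    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryMXBean;
    import javax.management.MBeanServerConnection;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class ZkJvmStats {
        public static void main(String[] args) throws Exception {
            // zkhost:9999 is a placeholder -- assumes the ZK server JVM was
            // started with -Dcom.sun.management.jmxremote.port=9999 (plus
            // whatever auth/ssl settings fit your environment).
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://zkhost:9999/jmxrmi");
            JMXConnector jmxc = JMXConnectorFactory.connect(url);
            try {
                MBeanServerConnection conn = jmxc.getMBeanServerConnection();
                // Current heap usage vs. the configured max heap.
                MemoryMXBean mem = ManagementFactory.newPlatformMXBeanProxy(
                        conn, ManagementFactory.MEMORY_MXBEAN_NAME, MemoryMXBean.class);
                System.out.println("heap used/max: "
                        + mem.getHeapMemoryUsage().getUsed() + " / "
                        + mem.getHeapMemoryUsage().getMax());
                // Per-collector counts and total pause time accumulated so far.
                for (GarbageCollectorMXBean gc : ManagementFactory.getPlatformMXBeans(
                        conn, GarbageCollectorMXBean.class)) {
                    System.out.println(gc.getName() + ": " + gc.getCollectionCount()
                            + " collections, " + gc.getCollectionTime() + " ms");
                }
            } finally {
                jmxc.close();
            }
        }
    }

Sampling that a few times a minute (alongside iostat for the disks) should show whether the sync window lines up with GC pauses or disk saturation.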
One thing you can try is copying the snapshot file to an empty datadir
on a separate machine and try starting a 2 node cluster (where the
second node starts with an empty datadir).

Patrick

On Tue, Jul 31, 2012 at 3:34 PM, Jordan Zimmerman <[email protected]> wrote:
>> Seems you are down to 4gb now. That still seems way too high for
>> "coordination" operations… ?
>
> A big problem currently is detritus nodes. People use lock recipes for
> various movie IDs and they leave garbage parent nodes around in the
> thousands. I've written some gc tasks to clean them up but it's been a slow
> process to get everyone to use it. I know there is a Jira to help with this
> but I don't know the status.
>
> -JZ
>
> On Jul 31, 2012, at 3:17 PM, Patrick Hunt <[email protected]> wrote:
>
>> On Tue, Jul 31, 2012 at 3:14 PM, Jordan Zimmerman
>> <[email protected]> wrote:
>>> There were a lot of creations but I removed those nodes last night. How long
>>> does it take to clear out of the snapshot?
>>
>> The snapshot is a copy of whatever is in the znode tree at the time
>> the snapshot is taken. (so instantaneous the next time a snapshot is
>> taken). You can see the dates and the epoch number if that gives you
>> any insight (epoch is the upper 32 bits of the filename)
>>
>> Seems you are down to 4gb now. That still seems way too high for
>> "coordination" operations... ?
>>
>> Patrick
>>
>>> On Jul 31, 2012, at 2:52 PM, Patrick Hunt <[email protected]> wrote:
>>>
>>>> You have an 11gig snapshot file. That's very large. Did someone
>>>> unexpectedly overload the server with znode creations?
>>>>
>>>> When a follower comes up the leader needs to serialize the znodes to
>>>> the snapshot file, stream it to the follower, who saves it locally
>>>> then deserializes it. (11g/15min is avg about 12meg/second for this
>>>> process)
>>>>
>>>> Often times this is exacerbated by the max heap and GC interactions.
>>>>
>>>> Patrick
>>>>
>>>> On Tue, Jul 31, 2012 at 2:23 PM, Jordan Zimmerman
>>>> <[email protected]> wrote:
>>>>> BTW - this is 3.3.5
>>>>>
>>>>> On Jul 31, 2012, at 2:22 PM, Jordan Zimmerman
>>>>> <[email protected]> wrote:
>>>>>
>>>>>> We've had a few outages of our ZK cluster recently. When trying to bring
>>>>>> the cluster back up it's been taking 10-15 minutes for the followers to
>>>>>> sync with the Leader. Any idea what might cause this? Here's an ls of
>>>>>> the data dir:
>>>>>>
>>>>>> -rw-r--r-- 1 zookeeperserverprod nac 67108880 Jul 31 20:39 log.3900a4bc75
>>>>>> -rw-r--r-- 1 zookeeperserverprod nac 67108880 Jul 31 20:40 log.3900a634ee
>>>>>> -rw-r--r-- 1 zookeeperserverprod nac 67108880 Jul 31 21:21 log.3a00000001
>>>>>> -rw-r--r-- 1 zookeeperserverprod nac 67108880 Jul 31 21:22 log.3a000139a2
>>>>>> -rw-r--r-- 1 zookeeperserverprod nac 9279729723 Jul 31 20:42 snapshot.3900a634ec
>>>>>> -rw-r--r-- 1 zookeeperserverprod nac 11126306780 Jul 31 21:09 snapshot.3900a6b149
>>>>>> -rw-r--r-- 1 zookeeperserverprod nac 4153727423 Jul 31 21:22 snapshot.3a000139a0
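For the detritus-node cleanup Jordan describes in the quoted thread, a "gc" task amounts to something like the sketch below. This is only an illustration, not Jordan's actual code: the /locks parent, the one-hour age cutoff, and the "childless parent is garbage" heuristic are all assumptions, and the NotEmptyException handling is what keeps the sweep from deleting a parent that a client has just reused.

    import java.util.List;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    public class LockParentGc {
        public static void main(String[] args) throws Exception {
            ZooKeeper zk = new ZooKeeper("zkhost:2181", 30000, new Watcher() {
                public void process(WatchedEvent event) { /* ignore */ }
            });
            // Assumption: lock parents live directly under /locks, and a parent
            // with no children that hasn't changed in an hour is garbage.
            long cutoff = System.currentTimeMillis() - 60L * 60 * 1000;
            try {
                List<String> parents = zk.getChildren("/locks", false);
                for (String parent : parents) {
                    String path = "/locks/" + parent;
                    Stat stat = zk.exists(path, false);
                    if (stat == null) continue;               // already gone
                    if (stat.getNumChildren() > 0) continue;  // lock currently held
                    if (stat.getMtime() > cutoff) continue;   // too recent, skip it
                    try {
                        zk.delete(path, -1);
                    } catch (KeeperException.NotEmptyException e) {
                        // a client re-acquired the lock between the check and
                        // the delete; leave the parent alone
                    } catch (KeeperException.NoNodeException e) {
                        // another cleaner beat us to it
                    }
                }
            } finally {
                zk.close();
            }
        }
    }

Run periodically against each lock parent tree, something along these lines keeps the znode count (and with it the snapshot size) from creeping back up between manual cleanups.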
