Hi all,

I'm looking into enabling ResourceManager High Availability in our YARN cluster (we currently run a single ResourceManager). We plan to use ZooKeeper for fencing and state storage, and my main concern is ZK scalability and performance. We run a fairly active YARN cluster:
- ~550 NodeManagers
- ~5,000 applications/day
- ~15,000 active containers at any given time
- usually ~100 applications running at any given time

Since we can't really load-test our setup before turning HA on in production, I was hoping someone who had run a cluster at similar scale could give advice on their ZK environment; specifically:

- What ZK heap size did you need?
- How many nodes in your ensemble?
- What kind of disks? Are spinning disks OK, or do you use SSDs?
- Did you need any special configurations around timeouts, etc?

Basically I'm looking either for any horror stories, or hoping that someone can say that RM HA will be A-OK at this throughput.

Thanks,
Ben