Hi all,

I'm looking into turning on High Availability ResourceManagers in our YARN
cluster (we run a single ResourceManager at the moment).  We plan on using
ZooKeeper for fencing and state storage, and my main concern is around ZK
scalability and performance.  We run a fairly active YARN cluster:

- ~550 NodeManagers
- ~5,000 applications/day
- ~15,000 active containers at any given time
- usually ~100 applications running at any given time

Since we can't really load-test our setup before turning HA on in
production, I was hoping someone who has run a cluster at a similar scale
could share advice on their ZK environment.  Specifically:

- What ZK heap size did you need?
- How many nodes in your ensemble?
- What kind of disks?  Are spinning disks OK, or do you use SSDs?
- Did you need any special configuration around timeouts, etc.?

Basically I'm looking either for any horror stories, or hoping that someone
can say that RM HA will be A-OK at this throughput.


Thanks,

Ben
