Hi All,

We have just switched over to HA namenodes with ZK failover, using HDP-2.3.0.0-2557 (HDFS 2.7.1.2.3). I'm looking for suggestions on what to investigate to make this more stable.
Before we went to HA, our namenode was reasonably stable. Now the namenodes are crashing multiple times a day and frequently fail to fail over correctly, to the point where I can't even use haadmin -transitionToActive to force a failover. Instead, I have to restart the namenodes.

We're running them on AWS instances with 31.01 GB of RAM and 8 cores. In addition to the namenode, the same machine hosts a journalnode, a zkfailovercontroller, and the Ambari metrics collector. (The third journalnode lives with the YARN resource manager.)

Right now the namenodes are configured with a maximum heap of 25 GB. Does that sound credible? What else should we be paying attention to in order to make HDFS stable again?

With thanks,
Marcin
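
P.S. To make the heap question concrete, here is a rough sketch of the memory budget on one namenode host. Only the 31.01 GB total and the 25 GB namenode maximum heap are real figures from our setup. The values for the co-located daemons and the OS are hypothetical placeholders, not measurements, and the helper name headroom_gb is just for illustration. The sketch also ignores JVM overhead beyond the heap (metaspace, thread stacks, GC structures), so the real headroom is likely smaller than what it reports.

```python
# Rough memory budget for one namenode host.
# Real figures: TOTAL_RAM_GB and NAMENODE_MAX_HEAP_GB.
# Everything in ASSUMED_OTHER_GB is a hypothetical placeholder that should be
# replaced with the values actually configured on the machine.

TOTAL_RAM_GB = 31.01          # reported instance memory
NAMENODE_MAX_HEAP_GB = 25.0   # current namenode maximum heap

# Hypothetical placeholders (assumptions, not measured values).
ASSUMED_OTHER_GB = {
    "journalnode": 1.0,
    "zkfailovercontroller": 1.0,
    "ambari_metrics_collector": 2.0,
    "os_and_page_cache": 2.0,
}


def headroom_gb(total_gb, namenode_heap_gb, other_gb):
    """Return the RAM left after the namenode heap and all other consumers."""
    return total_gb - namenode_heap_gb - sum(other_gb.values())


if __name__ == "__main__":
    left = headroom_gb(TOTAL_RAM_GB, NAMENODE_MAX_HEAP_GB, ASSUMED_OTHER_GB)
    print(f"Estimated headroom: {left:.2f} GB")
    if left < 1.0:
        print("Very little slack: the host may swap or hit the OOM killer.")
```

If the placeholders are anywhere near reality, there is almost no slack left on the box once the namenode heap fills, which is part of why I'm asking whether 25 GB is a sensible setting here.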