Check the ZKFC logs first. Also try tuning the HDFS HA and ZooKeeper timeouts, and it's better to have a dedicated disk for the JournalNode service (similar to ZooKeeper).
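For the timeout tuning, a minimal sketch of the properties usually involved is below. The values shown are illustrative assumptions, not recommendations for this cluster; check the defaults in core-default.xml / hdfs-default.xml for your exact HDP release before changing anything.

```xml
<!-- core-site.xml: ZKFC's ZooKeeper session timeout. If NameNode GC
     pauses exceed this, ZK expires the session and a failover is
     triggered. Value below is an illustrative assumption. -->
<property>
  <name>ha.zookeeper.session-timeout.ms</name>
  <value>30000</value>
</property>

<!-- hdfs-site.xml: how long the NameNode waits for a quorum of
     JournalNodes to ack an edit-log write before aborting. Slow or
     shared JournalNode disks are a common reason to raise this. -->
<property>
  <name>dfs.qjournal.write-txns.timeout.ms</name>
  <value>60000</value>
</property>
```

Raising timeouts only papers over the underlying pauses, though, so it's worth finding out from the logs whether GC pauses or slow JournalNode I/O are the actual trigger.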
On Sat, Dec 19, 2015 at 9:29 AM, Sandeep Nemuri <nhsande...@gmail.com> wrote:

> What do the logs say?
>
> On Sat, Dec 19, 2015 at 10:08 PM, Marcin Tustin <mtus...@handybook.com> wrote:
>
>> Hi All,
>>
>> We have just switched over to HA namenodes with ZK failover, using
>> HDP-2.3.0.0-2557 (HDFS 2.7.1.2.3). I'm looking for suggestions as to
>> what to investigate to make this more stable.
>>
>> Before we went to HA, our namenode was reasonably stable. Now the
>> namenodes are crashing multiple times a day and frequently failing to
>> fail over correctly, to the point where I can't even use haadmin
>> -transitionToActive to force a failover. I find that instead I have to
>> restart the namenodes.
>>
>> We're running them on AWS instances with 31.01 GB of RAM and 8 cores.
>> In addition to the namenode, we host a journalnode, a
>> zkfailovercontroller, and the Ambari metrics collector on the same
>> machine. (The third journalnode lives with the YARN resource manager.)
>>
>> Right now the namenodes are configured with a maximum heap of 25 GB.
>>
>> Does that sound credible? What else should we be paying attention to
>> to make HDFS stable again?
>>
>> With thanks,
>> Marcin
>>
>> Want to work at Handy? Check out our culture deck and open roles
>> <http://www.handy.com/careers>
>> Latest news <http://www.handy.com/press> at Handy
>> Handy just raised $50m
>> <http://venturebeat.com/2015/11/02/on-demand-home-service-handy-raises-50m-in-round-led-by-fidelity/>
>> led by Fidelity
>
> --
> Regards,
> Sandeep Nemuri
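On the heap question quoted above: a 25 GB heap on a 31 GB box that also hosts a journalnode, a ZKFC, and the Ambari metrics collector leaves very little headroom, and long full-GC pauses on a heap that size can outlast the ZooKeeper session timeout and trigger exactly the spurious failovers described. A hedged sketch of a more conservative hadoop-env.sh setting follows; the heap size and GC flags are illustrative assumptions for a Java 7/8-era JVM, not tested against this cluster, and GC logging is included so pause lengths can be correlated with failover events.

```shell
# hadoop-env.sh -- a sketch, assuming the co-located daemons stay on this host.
# 16g is an assumed starting point, not a sizing recommendation.
export HADOOP_NAMENODE_OPTS="-Xms16g -Xmx16g \
  -XX:+UseConcMarkSweepGC \
  -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly \
  -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
  -Xloggc:/var/log/hadoop/namenode-gc.log \
  ${HADOOP_NAMENODE_OPTS}"
```

If the GC log shows pauses approaching the ZK session timeout, that points at heap pressure rather than the HA machinery itself.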