Do you use Puppet?

On Fri, Nov 2, 2012 at 1:13 PM, Dan Brodsky <[email protected]> wrote:
> Ram,
>
> I wanted to follow up with you since you helped me with your below comment.
>
> It turns out that the ZK configuration files somehow got changed (reverted
> to their default values?), and I'm not sure who/when/how. The zoo.cfg files
> didn't have the list of quorum peers, and the myid files that told each ZK
> peer their ordinal value had been deleted. So, effectively, I had three ZK
> standalone servers, instead of one quorum.
>
> Problem fixed, Hbase is happy again.
>
> Cheers,
>
> Dan
>
>
>
> On Wed, Oct 17, 2012 at 9:12 AM, Ramkrishna.S.Vasudevan <
> [email protected]> wrote:
>
> > Can you try like start any of the regionservers that are not connecting at
> > all. May be start 2 of them.
> > Observer master logs. See whether it says
> > 'Waiting for RegionServers to checkin'?.
> >
> > Just to confirm your ZK ip and port is correct thro out the cluster? If
> > multitenant cluster then you may be the other regionservers are connecting
> > to someother ZK cluster?
> > Wild guess :)
> >
> > Regards
> > Ram
> > > -----Original Message-----
> > > From: Dan Brodsky [mailto:[email protected]]
> > > Sent: Wednesday, October 17, 2012 6:31 PM
> > > To: [email protected]
> > > Subject: Regionservers not connecting to master
> > >
> > > Good morning,
> > >
> > > I have a 10 node Hadoop/Hbase cluster, plus a namenode VM, plus three
> > > Zookeeper quorum peers (one on the namenode, one on a dedicated ZK
> > > peer VM, and one on a third box). All 10 HDFS datanodes are also Hbase
> > > regionservers.
> > >
> > > Several weeks ago, we had six HDFS datanodes go offline suddenly (with
> > > no meaningful error messages), and since then, I have been unable to
> > > get all 10 regionservers to connect to the Hbase master. I've tried
> > > bringing the cluster down and rebooting all the boxes, but no joy. The
> > > machines are all running, and hbase-regionserver appears to start
> > > normally on each one.
> > >
> > > Right now, my master status page (http://namenode:60010) shows 3
> > > regionservers online. There are also dozens of regions in transition
> > > listed on the status page (in the PENDING_OPEN state), but each of
> > > those are on one of the regionservers already online.
> > >
> > > The 7 other regionservers' log files show a successful connection to
> > > one ZK peer, followed by a regular trail of these messages:
> > >
> > > 2012-10-17 12:36:08,394 DEBUG
> > > org.apache.hadoop.hbase.io.hfile.LruBlockCache: LRU Stats: total=8.17
> > > MB, free=987.67 MB, max=995.84 MB, blocks=0, accesses=0, hits=0,
> > > hitRatio=0cachingAccesses=0, cachingHits=0,
> > > cachingHitsRatio=0evictions=0, evicted=0, evictedPerRun=NaN
> > >
> > > If I had to wager a guess, it seems like the 7 offline regionservers
> > > are not connecting to other ZK peers, but there isn't anything in the
> > > ZK logs to indicate why.
> > >
> > > Thoughts?
> > >
> > > Dan
> > >
> >

--
Kevin O'Dell
Customer Operations Engineer, Cloudera
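
For anyone who finds this thread with the same symptom: the root cause Dan describes (zoo.cfg without its quorum peer list, myid files deleted) can be checked for mechanically. Below is a minimal sketch in Python, not something posted in the thread. It assumes the standard ZooKeeper conventions of `server.N=host:port:port` lines in zoo.cfg and a `myid` file inside `dataDir`; the function names and the path in the usage note are hypothetical and would need adjusting per install.

```python
import re
from pathlib import Path

# Matches quorum peer lines such as "server.1=zk1:2888:3888".
SERVER_LINE = re.compile(r"^server\.(\d+)\s*=\s*(\S+)$")


def parse_zoo_cfg(text):
    """Return (data_dir, {server_id: address}) parsed from zoo.cfg contents."""
    data_dir = None
    servers = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        match = SERVER_LINE.match(line)
        if match:
            servers[int(match.group(1))] = match.group(2)
            continue
        key, _, value = line.partition("=")
        if key.strip() == "dataDir":
            data_dir = value.strip()
    return data_dir, servers


def check_quorum_config(cfg_text, myid_text):
    """Return a list of problems; myid_text is None when the file is missing."""
    problems = []
    _, servers = parse_zoo_cfg(cfg_text)
    if not servers:
        problems.append("no server.N entries: peer will run standalone")
    if myid_text is None:
        problems.append("myid file is missing")
    else:
        myid = myid_text.strip()
        if not myid.isdigit():
            problems.append("myid does not contain a number")
        elif servers and int(myid) not in servers:
            problems.append("myid %s has no matching server.N entry" % myid)
    return problems


def check_files(cfg_path):
    """Read zoo.cfg and the myid file under its dataDir, then run the checks."""
    cfg_text = Path(cfg_path).read_text()
    data_dir, _ = parse_zoo_cfg(cfg_text)
    myid_text = None
    if data_dir is not None:
        myid_file = Path(data_dir) / "myid"
        if myid_file.is_file():
            myid_text = myid_file.read_text()
    return check_quorum_config(cfg_text, myid_text)
```

Running something like `check_files("/etc/zookeeper/conf/zoo.cfg")` on each ZK peer (that path is an assumption) should return an empty list on a healthy quorum member. In the situation described above, it would report both the missing server entries and the missing myid file.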
