Hi Folks,

We are running a 60-node MapReduce/HBase HDP cluster (HBase 1.1.2, HDP 2.3.4.0-3485). Phoenix is enabled on this cluster. Each slave has ~120 GB RAM, 12 disks of 2 TB each, and 24 cores; each RegionServer (RS) has a 20 GB heap.

This cluster has been running fine for the last 2 years, but recently, after a few disk failures (we unmounted those disks), it has become unstable. I have run hbck and hdfs fsck; both report no inconsistencies.
Some of our RegionServers keep aborting with the following errors:

1 ==>
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /apps/hbase/data/data/default/DE.TABLE_NAME/35aa0de96715c33e1f0664aa4d9292ba/recovered.edits/0000000003948161445.temp (inode 420864666): File does not exist. [Lease. Holder: DFSClient_NONMAPREDUCE_-64710857_1, pendingcreates: 1]

2 ==>
2018-02-08 03:09:51,653 ERROR [regionserver/hdpslave26.bigdataprod1.com/1.16.6.56:16020] regionserver.HRegionServer: Shutdown / close of WAL failed: org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /apps/hbase/data/oldWALs/hdpslave26.bigdataprod1.com%2C16020%2C1518027416930.default.1518085177903 (inode 420996935): File is not open for writing. Holder DFSClient_NONMAPREDUCE_649736540_1 does not have any open files.

All of the LeaseExpiredExceptions are occurring on files under recovered.edits and oldWALs.

HDFS is around 48% full, and most of the DataNodes have 30-40% of their space free. NameNode heap usage is at 60%. I have tried googling around but can't find anything concrete to fix this problem. So far, 15 of the 60 nodes have gone down in the last 2 days.

Can someone please point out what might be causing these RegionServer failures?

--
Thanks & Regards,
Anil Gupta
