Hi Chien, here is a pastebin with the log: http://pastebin.com/6x7umBvZ

Quick summary:
*RSRpcServices Close 19a02bdebe1cca4eae5509a62fdd217d, moving to null*

What does "moving to null" mean? It means the region is put offline and no
RegionServer will serve it.

HRegionServer Received CLOSE for the region: 19a02bdebe1cca4eae5509a62fdd217d,
which we are already trying to CLOSE, but not completed yet

*Server node05.cluster.pro,60020,1451388046169 returned
org.apache.hadoop.hbase.regionserver.RegionAlreadyInTransitionException:*
org.apache.hadoop.hbase.regionserver.RegionAlreadyInTransitionException: The
region 19a02bdebe1cca4eae5509a62fdd217d was already closing. New CLOSE request
is ignored.
at org.apache.hadoop.hbase.regionserver.HRegionServer.closeRegion(HRegionServer.java:2689)
at org.apache.hadoop.hbase.regionserver.RSRpcServices.closeRegion(RSRpcServices.java:1033)
at org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:20870)
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2035)
at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:107)
at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
at java.lang.Thread.run(Thread.java:745)
for my_weird_table,70a3d6f6,1448462346185.19a02bdebe1cca4eae5509a62fdd217d., try=1 of 10

Then I tried to repair the table:

node03.cluster.pro INFO April 21, 2016 4:07 PM HRegion Closed my_weird_table,70a3d6f6,1448462346185.19a02bdebe1cca4eae5509a62fdd217d.
node04.cluster.pro INFO May 6, 2016 10:09 PM MasterRpcServices Client=hbase//148.251.186.9 assign my_weird_table,70a3d6f6,1448462346185.19a02bdebe1cca4eae5509a62fdd217d.
node05.cluster.pro INFO May 6, 2016 10:09 PM RSRpcServices Close 19a02bdebe1cca4eae5509a62fdd217d, moving to null
node04.cluster.pro INFO May 6, 2016 10:09 PM RegionStates Transition {19a02bdebe1cca4eae5509a62fdd217d state=FAILED_CLOSE, ts=1461242442267, server=node05.cluster.pro,60020,1451388046169} to {19a02bdebe1cca4eae5509a62fdd217d state=OFFLINE, ts=1462561784994, server=node05.cluster.pro,60020,1451388046169}

And it started to work.

2016-05-07 6:09 GMT+02:00 Chien Le <chie...@gmail.com>:

> I would try grepping for the region id (19a02bdebe1cca4eae5509a62fdd217d)
> through the logs of the HBase master and the last known RegionServer to
> host it. Can you share via pastebin?
>
> -Chien
>
> On Fri, May 6, 2016 at 12:18 PM, Serega Sheypak <serega.shey...@gmail.com>
> wrote:
>
>> Hi, I have an HBase cluster running HBase 1.0.0-cdh5.4.4.
>> I periodically get NotServingRegionException and I can't find the reason
>> for it. It happens randomly on different tables.
>>
>> *hbase hbck my_weird_table*
>>
>> *reports:*
>>
>> ERROR: Region { meta =>
>> my_weird_table,70a3d6f6,1448462346185.19a02bdebe1cca4eae5509a62fdd217d.,
>> hdfs =>
>> hdfs://nameservice1/hbase/data/default/my_weird_table/19a02bdebe1cca4eae5509a62fdd217d,
>> deployed => , replicaId => 0 } not deployed on any region server.
>>
>> ERROR: Region { meta =>
>> my_weird_table,4f5c0e14,1447343972523.69fa4ad7a33868e938f25e5cbdb8cd08.,
>> hdfs =>
>> hdfs://nameservice1/hbase/data/default/my_weird_table/69fa4ad7a33868e938f25e5cbdb8cd08,
>> deployed => , replicaId => 0 } not deployed on any region server.
>>
>> ERROR: Region { meta =>
>> my_weird_table,b0a3cf6:,1448475527400.d4c2bda6f776be97e369371fed1ea674.,
>> hdfs =>
>> hdfs://nameservice1/hbase/data/default/my_weird_table/d4c2bda6f776be97e369371fed1ea674,
>> deployed => , replicaId => 0 } not deployed on any region server.
>> 16/05/06 23:03:13 INFO util.HBaseFsck: Handling overlap merges in parallel.
>> set hbasefsck.overlap.merge.parallel to false to run serially.
>>
>> ERROR: There is a hole in the region chain between 4f5c0e14 and 51eb8510.
>> You need to create a new .regioninfo and region dir in hdfs to plug the hole.
>>
>> ERROR: There is a hole in the region chain between 70a3d6f6 and 73332fbb.
>> You need to create a new .regioninfo and region dir in hdfs to plug the hole.
>>
>> ERROR: There is a hole in the region chain between b0a3cf6: and b3333313.
>> You need to create a new .regioninfo and region dir in hdfs to plug the hole.
>>
>> 16/05/06 23:03:13 INFO util.HBaseFsck: Handling overlap merges in parallel.
>> set hbasefsck.overlap.merge.parallel to false to run serially.
>>
>> ERROR: Found inconsistency in table my_weird_table
>>
>> *Summary:*
>> * hbase:meta is okay.*
>> * Number of regions: 1*
>> * Deployed on: node05.cluster.pro,60020,1451388046169*
>> * my_weird_table is okay.*
>> * Number of regions: 98*
>> * Deployed on: node01.cluster.pro,60020,1453774572201 node02.cluster.pro,60020,1458087229508 node04.cluster.pro,60020,1447338864601 node05.cluster.pro,60020,1451388046169*
>> *6 inconsistencies detected.*
>> *Status: INCONSISTENT*
>>
>> Then I ran *hbase hbck -repair my_weird_table*:
>>
>> #### Output omitted for brevity.
>> 16/05/06 23:09:43 INFO util.HBaseFsck: No integrity errors. We are done
>> with this phase. Glorious.
>> Number of live region servers: 5
>> Number of dead region servers: 0
>> Master: node04.cluster.pro,60000,1450130273717
>> Number of backup masters: 1
>> Average load: 167.8
>> Number of requests: 4884
>> Number of regions: 839
>> Number of regions in transition: 23
>>
>> ERROR: Region { meta =>
>> my_weird_table,70a3d6f6,1448462346185.19a02bdebe1cca4eae5509a62fdd217d.,
>> hdfs =>
>> hdfs://nameservice1/hbase/data/default/my_weird_table/19a02bdebe1cca4eae5509a62fdd217d,
>> deployed => , replicaId => 0 } not deployed on any region server.
>> Trying to fix unassigned region...
>> 16/05/06 23:09:45 INFO util.HBaseFsckRepair: Region still in transition,
>> waiting for it to become assigned: {ENCODED => 19a02bdebe1cca4eae5509a62fdd217d,
>> NAME => 'my_weird_table,70a3d6f6,1448462346185.19a02bdebe1cca4eae5509a62fdd217d.',
>> STARTKEY => '70a3d6f6', ENDKEY => '73332fbb'}
>> 16/05/06 23:09:46 INFO util.HBaseFsckRepair: Region still in transition,
>> waiting for it to become assigned: {ENCODED => 19a02bdebe1cca4eae5509a62fdd217d,
>> NAME => 'my_weird_table,70a3d6f6,1448462346185.19a02bdebe1cca4eae5509a62fdd217d.',
>> STARTKEY => '70a3d6f6', ENDKEY => '73332fbb'}
>> 16/05/06 23:09:47 INFO util.HBaseFsckRepair: Region still in transition,
>> waiting for it to become assigned: {ENCODED => 19a02bdebe1cca4eae5509a62fdd217d,
>> NAME => 'my_weird_table,70a3d6f6,1448462346185.19a02bdebe1cca4eae5509a62fdd217d.',
>> STARTKEY => '70a3d6f6', ENDKEY => '73332fbb'}
>> ERROR: Region { meta =>
>> my_weird_table,4f5c0e14,1447343972523.69fa4ad7a33868e938f25e5cbdb8cd08.,
>> hdfs =>
>> hdfs://nameservice1/hbase/data/default/my_weird_table/69fa4ad7a33868e938f25e5cbdb8cd08,
>> deployed => , replicaId => 0 } not deployed on any region server.
>> Trying to fix unassigned region...
>> 16/05/06 23:09:48 INFO util.HBaseFsckRepair: Region still in transition,
>> waiting for it to become assigned: {ENCODED => 69fa4ad7a33868e938f25e5cbdb8cd08,
>> NAME => 'my_weird_table,4f5c0e14,1447343972523.69fa4ad7a33868e938f25e5cbdb8cd08.',
>> STARTKEY => '4f5c0e14', ENDKEY => '51eb8510'}
>> 16/05/06 23:09:49 INFO util.HBaseFsckRepair: Region still in transition,
>> waiting for it to become assigned: {ENCODED => 69fa4ad7a33868e938f25e5cbdb8cd08,
>> NAME => 'my_weird_table,4f5c0e14,1447343972523.69fa4ad7a33868e938f25e5cbdb8cd08.',
>> STARTKEY => '4f5c0e14', ENDKEY => '51eb8510'}
>> ERROR: Region { meta =>
>> my_weird_table,b0a3cf6:,1448475527400.d4c2bda6f776be97e369371fed1ea674.,
>> hdfs =>
>> hdfs://nameservice1/hbase/data/default/my_weird_table/d4c2bda6f776be97e369371fed1ea674,
>> deployed => , replicaId => 0 } not deployed on any region server.
>> Trying to fix unassigned region...
>> 16/05/06 23:09:50 INFO util.HBaseFsckRepair: Region still in transition,
>> waiting for it to become assigned: {ENCODED => d4c2bda6f776be97e369371fed1ea674,
>> NAME => 'my_weird_table,b0a3cf6:,1448475527400.d4c2bda6f776be97e369371fed1ea674.',
>> STARTKEY => 'b0a3cf6:', ENDKEY => 'b3333313'}
>> 16/05/06 23:09:51 INFO util.HBaseFsckRepair: Region still in transition,
>> waiting for it to become assigned: {ENCODED => d4c2bda6f776be97e369371fed1ea674,
>> NAME => 'my_weird_table,b0a3cf6:,1448475527400.d4c2bda6f776be97e369371fed1ea674.',
>> STARTKEY => 'b0a3cf6:', ENDKEY => 'b3333313'}
>> 16/05/06 23:09:52 INFO util.HBaseFsck: Handling overlap merges in parallel.
>> set hbasefsck.overlap.merge.parallel to false to run serially.
>> ERROR: There is a hole in the region chain between 4f5c0e14 and 51eb8510.
>> You need to create a new .regioninfo and region dir in hdfs to plug the hole.
>> ERROR: There is a hole in the region chain between 70a3d6f6 and 73332fbb.
>> You need to create a new .regioninfo and region dir in hdfs to plug the hole.
>> ERROR: There is a hole in the region chain between b0a3cf6: and b3333313.
>> You need to create a new .regioninfo and region dir in hdfs to plug the hole.
>> 16/05/06 23:09:52 INFO util.HBaseFsck: Handling overlap merges in parallel.
>> set hbasefsck.overlap.merge.parallel to false to run serially.
>> ERROR: Found inconsistency in table my_weird_table
>> 16/05/06 23:09:59 INFO zookeeper.RecoverableZooKeeper: Process
>> identifier=hbase Fsck connecting to ZooKeeper ensemble=
>> node04.cluster.pro:2181,node01.cluster.pro:2181,node05.cluster.pro:2181
>> *16/05/06 23:09:59 INFO zookeeper.ClientCnxn: EventThread shut down*
>>
>> *Summary:*
>> * hbase:meta is okay.*
>> * Number of regions: 1*
>> * Deployed on: node05.cluster.pro,60020,1451388046169*
>> * my_weird_table is okay.*
>> * Number of regions: 98*
>> * Deployed on: node01.cluster.pro,60020,1453774572201 node02.cluster.pro,60020,1458087229508 node04.cluster.pro,60020,1447338864601 node05.cluster.pro,60020,1451388046169*
>> *6 inconsistencies detected.*
>> *Status: INCONSISTENT*
>>
>> 16/05/06 23:10:00 INFO util.HBaseFsck: Sleeping 10000ms before re-checking after fix...
>> Version: 1.0.0-cdh5.4.4
>> 16/05/06 23:10:10 INFO util.HBaseFsck: Loading regioninfos HDFS
>> 16/05/06 23:10:10 INFO util.HBaseFsck: Loading HBase regioninfo from HDFS...
>> 16/05/06 23:10:10 INFO util.HBaseFsck: Checking HBase region split map from HDFS data...
>> 16/05/06 23:10:10 INFO util.HBaseFsck: Handling overlap merges in parallel.
>> set hbasefsck.overlap.merge.parallel to false to run serially.
>> 16/05/06 23:10:10 INFO util.HBaseFsck: Handling overlap merges in parallel.
>> set hbasefsck.overlap.merge.parallel to false to run serially.
>> *16/05/06 23:10:10 INFO util.HBaseFsck: No integrity errors. We are done
>> with this phase.
Glorious.*
>> *Number of live region servers: 5*
>> *Number of dead region servers: 0*
>> *Master: node04.cluster.pro,60000,1450130273717*
>> *Number of backup masters: 1*
>> *Average load: 167.8*
>> *Number of requests: 4884*
>> *Number of regions: 839*
>> *Number of regions in transition: 23*
>>
>> 16/05/06 23:10:10 INFO util.HBaseFsck: Loading regionsinfo from the hbase:meta table
>>
>> Number of empty REGIONINFO_QUALIFIER rows in hbase:meta: 0
>> 16/05/06 23:10:11 INFO util.HBaseFsck: getHTableDescriptors == tableNames => [my_weird_table]
>> Number of Tables: 1
>>
>> *Summary:*
>> * hbase:meta is okay.*
>> 16/05/06 23:10:18 INFO zookeeper.ClientCnxn: EventThread shut down
>> * Number of regions: 1*
>> * Deployed on: node05.cluster.pro,60020,1451388046169*
>> * my_weird_table is okay.*
>> * Number of regions: 101*
>> * Deployed on: node01.cluster.pro,60020,1453774572201 node02.cluster.pro,60020,1458087229508 node03.cluster.pro,60020,1461244112276 node04.cluster.pro,60020,1447338864601 node05.cluster.pro,60020,1451388046169*
>> *0 inconsistencies detected.*
>> *Status: OK*
>>
>> So, why are some regions not served?
>>
>> Why does -repair help? What makes my table broken and partially unavailable?
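Chien's suggestion above (grep the master and last-hosting RegionServer logs for the stuck region id) can be sketched as a small shell snippet. The log directory /tmp/hbase-logs and the sample log lines below are stand-ins for illustration only; real CDH logs typically live under /var/log/hbase, but the grep works the same on those files:

```shell
# Collect every log line mentioning one stuck region, across all node logs,
# in time order. REGION is the encoded region name from hbck's output.
REGION=19a02bdebe1cca4eae5509a62fdd217d

# Stand-in logs so the filter is easy to verify; replace /tmp/hbase-logs
# with your real log directory (assumption: CDH-style layout).
mkdir -p /tmp/hbase-logs
cat > /tmp/hbase-logs/hbase-regionserver-node05.log <<'EOF'
2016-05-06 22:09:41 INFO RSRpcServices Close 19a02bdebe1cca4eae5509a62fdd217d, moving to null
2016-05-06 22:09:41 INFO RSRpcServices Close d4c2bda6f776be97e369371fed1ea674, moving to null
EOF
cat > /tmp/hbase-logs/hbase-master-node04.log <<'EOF'
2016-05-06 22:09:44 INFO RegionStates Transition {19a02bdebe1cca4eae5509a62fdd217d state=FAILED_CLOSE} to {19a02bdebe1cca4eae5509a62fdd217d state=OFFLINE}
EOF

# -h suppresses filenames so the timestamp prefix sorts the merged stream
# chronologically; only lines for $REGION survive.
grep -h "$REGION" /tmp/hbase-logs/*.log | sort
```

Reading the merged stream end to end is usually enough to see the CLOSE → FAILED_CLOSE → OFFLINE → assign sequence that the thread above reconstructs by hand.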