Region permanently offlined 
----------------------------

                 Key: HBASE-2866
                 URL: https://issues.apache.org/jira/browse/HBASE-2866
             Project: HBase
          Issue Type: Bug
            Reporter: Kannan Muthukkaruppan
            Priority: Blocker


After split, master attempts to reassign a region to a region server. 
Occasionally, such a region can get permanently offlined.


Master:
---------
{code}
2010-07-22 01:26:00,914 INFO org.apache.hadoop.hbase.master.ServerManager: 
Processing MSG_REPORT_SPLIT_INCLUDES_DAUGHTERS: 
test1,6512200000,1279784117114.6466481aa931f8c1fa87622735487a72.: Daughters; 
test1,6512200000,1279787158624.6ead25ae677116cc88fc5420bb39d52e., 
test1,6531790000,1279787\
158624.8d5490bfc166c687657cb09203bd7d44. from 
test024.test.xyz.com,60020,1279780567744; 1 of 1                                
                                                                                
                                                                                
     
2010-07-22 01:26:00,935 DEBUG 
org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Creating UNASSIGNED region 
8d5490bfc166c687657cb09203bd7d44 in state = M2ZK_REGION_OFFLINE

2010-07-22 01:26:00,935 DEBUG 
org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Creating UNASSIGNED region 
8d5490bfc166c687657cb09203bd7d44 in state = M2ZK_REGION_OFFLINE

2010-07-22 01:26:00,945 INFO org.apache.hadoop.hbase.master.RegionManager: 
Assigning region 
test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. to 
test024.test.xyz.com,60020,1279780567744

2010-07-22 01:26:00,949 DEBUG 
org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: While updating UNASSIGNED 
region 8d5490bfc166c687657cb09203bd7d44 exists, state = M2ZK_REGION_OFFLINE

2010-07-22 01:26:00,954 DEBUG org.apache.hadoop.hbase.master.RegionManager: 
Created UNASSIGNED zNode 
test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. in state 
M2ZK_REGION_OFFLINE
{code}

-------------------

Region Server:

{code}
2010-07-22 01:26:00,947 INFO 
org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN: 
test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
2010-07-22 01:26:00,947 INFO 
org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN: 
test1,6512200000,1279787158624.6ead25ae677116cc88fc5420bb39d52e.
2010-07-22 01:26:00,947 INFO 
org.apache.hadoop.hbase.regionserver.HRegionServer: Worker: MSG_REGION_OPEN: 
test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
2010-07-22 01:26:00,948 DEBUG 
org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater: Updating ZNode 
/hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44 with [RS2ZK_REGION_OPENING] 
expected version = 0
2010-07-22 01:26:00,952 INFO 
org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, state: 
SyncConnected, type: NodeDataChanged, path: 
/hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
2010-07-22 01:26:00,974 WARN 
org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: 
<msgstorectrl001.test.xyz.com,msgstorectrl021.test.xyz.com,msgstorectrl041.test.xyz.com,msgstorectrl061.test.xyz.com,msgstorectrl081.ash2.facebook\
.com:/hbase,test024.test.xyz.com,60020,1279780567744>Failed to write data to 
ZooKeeper
org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = 
BadVersion for /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:106)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
        at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1038)
        at 
org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.writeZNode(ZooKeeperWrapper.java:1062)
        at 
org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater.updateZKWithEventData(RSZookeeperUpdater.java:161)
        at 
org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater.startRegionOpenEvent(RSZookeeperUpdater.java:115)
        at 
org.apache.hadoop.hbase.regionserver.HRegionServer.openRegion(HRegionServer.java:1428)
        at 
org.apache.hadoop.hbase.regionserver.HRegionServer$Worker.run(HRegionServer.java:1337)
        at java.lang.Thread.run(Thread.java:619)
2010-07-22 01:26:00,975 ERROR 
org.apache.hadoop.hbase.regionserver.HRegionServer: Error opening 
test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
java.io.IOException: org.apache.zookeeper.KeeperException$BadVersionException: 
KeeperErrorCode = BadVersion for 
/hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
        at 
org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.writeZNode(ZooKeeperWrapper.java:1072)
{code}

Meta:
-----

Relevant section of META.

Note that these are the only two entries for the problem region. The first one 
is the parent region (and this problem
region is its splitB).  For the next one, note that there is no "info:server" 
and "info:serverstartcode" columns.

{code}
 test1,6512200000,12797841 column=info:splitB, timestamp=1279787160693, 
value=\x00\x0A6551820000\x00
 17114.6466481aa931f8c1fa8 
\x00\x00\x01)\xf9...@test1,6531790000,1279787158624.8d5490bfc166c687657cb
 7622735487a72.            
09203bd7d44.\x00\x0A6531790000\x00\x00\x00\x05\x05test1\x00\x00\x00\x00\x
                           
00\x02\x00\x00\x00\x07IS_ROOT\x00\x00\x00\x05false\x00\x00\x00\x07IS_META
                           
\x00\x00\x00\x05false\x00\x00\x00\x01\x08\x07actions\x00\x00\x00\x08\x00\
                           
x00\x00\x0BBLOOMFILTER\x00\x00\x00\x04NONE\x00\x00\x00\x11REPLICATION_SCO
                           
PE\x00\x00\x00\x010\x00\x00\x00\x0BCOMPRESSION\x00\x00\x00\x04NONE\x00\x0
                           
0\x00\x08VERSIONS\x00\x00\x00\x013\x00\x00\x00\x03TTL\x00\x00\x00\x0A2147
                           
483647\x00\x00\x00\x09BLOCKSIZE\x00\x00\x00\x0565536\x00\x00\x00\x09IN_ME
                           
MORY\x00\x00\x00\x05false\x00\x00\x00\x0ABLOCKCACHE\x00\x00\x00\x04true\x
                           FE\xA0\xFD\xC5

 ..

 test1,6531790000,12797871 column=info:regioninfo, timestamp=1279787160782, 
value=REGION => {NAME =>
 58624.8d5490bfc166c687657  
'test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.', STAR
 cb09203bd7d44.            TKEY => '6531790000', ENDKEY => '6551820000', 
ENCODED => 8d5490bfc166c687
                           657cb09203bd7d44, TABLE => {{NAME => 'test1', 
FAMILIES => [{NAME => 'acti
                           ons', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => 
'0', VERSIONS => '3', C
                           OMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE 
=> '65536', IN_MEMOR
                           Y => 'false', BLOCKCACHE => 'true'}]}}
{code}


I think Karthik has a handle on the first part (i.e. why the RS ran into the 
version mismatch, and aborted opening the region). He'll add details to the 
JIRA. But what we aren't clear about at this stage is why the base scanner 
didn't kick in and try to reassign the region.

BTW, HBase "hbck" reported this as well (which was good!):

{code}
Number of Tables: 5
Number of live region servers:92
Number of dead region servers:0
.........
ERROR: Region test1,6512200000,1279784117114.6466481aa931f8c1fa87622735487a72. 
is not served by any region server  but is listed in META to be on server null
ERROR: Region test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. 
is not served by any region server  but is listed in META to be on server null
{code}




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to