[ 
https://issues.apache.org/jira/browse/HBASE-2866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891766#action_12891766
 ] 

stack commented on HBASE-2866:
------------------------------

@Kannan

bq. ...So need to look into what was happening with the other region...

Yes. Would help improve hbck tool.

bq. ...And, surprisingly though, ZK UNASSIGNED node shows 4 entries....

What's up w/ that?

So, jgray and karthik, the code in master is still in transition; we have 
not yet hit the end point, and there is still a bunch of change coming.  
Right?  Can we have a fixup for this issue for now?  I'd like to roll a new 
0.89.x but w/ a fix for this (sounds like you have it, Karthik).

@Kannan Restarting the master is how it's addressed currently.  Want to add a 
little tool to the UI?

@Karthik A region is unassigned in .META. if it does not have a server and 
startcode, which is the case for this one.
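
To make that rule concrete, here is a minimal sketch in Python. It is not HBase code: the .META. row is modeled as a plain dict of column name to value, and `is_unassigned` is a hypothetical helper name. It also assumes that a row missing either column counts as unassigned.

```python
def is_unassigned(meta_row):
    """Return True if a .META. row (modeled as a dict) has no server assignment.

    Assumption: a region counts as unassigned when either "info:server" or
    "info:serverstartcode" is absent or empty.
    """
    server = meta_row.get("info:server")
    startcode = meta_row.get("info:serverstartcode")
    return not server or not startcode


# The problem region from this issue: only info:regioninfo is present.
problem_row = {"info:regioninfo": "REGION => {NAME => 'test1,6531790000,...'}"}

# An assigned region, using the server and start code seen in the logs.
assigned_row = {
    "info:regioninfo": "REGION => {...}",
    "info:server": "test024.test.xyz.com:60020",
    "info:serverstartcode": "1279780567744",
}
```

With these inputs, `is_unassigned(problem_row)` is true and `is_unassigned(assigned_row)` is false, which matches what hbck flagged for the problem region.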

> Region permanently offlined 
> ----------------------------
>
>                 Key: HBASE-2866
>                 URL: https://issues.apache.org/jira/browse/HBASE-2866
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Kannan Muthukkaruppan
>            Assignee: Karthik Ranganathan
>            Priority: Blocker
>         Attachments: master.log
>
>
> After split, master attempts to reassign a region to a region server. 
> Occasionally, such a region can get permanently offlined.
> Master:
> ---------
> {code}
> 2010-07-22 01:26:00,914 INFO org.apache.hadoop.hbase.master.ServerManager: 
> Processing MSG_REPORT_SPLIT_INCLUDES_DAUGHTERS: 
> test1,6512200000,1279784117114.6466481aa931f8c1fa87622735487a72.: Daughters; 
> test1,6512200000,1279787158624.6ead25ae677116cc88fc5420bb39d52e., 
> test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. from 
> test024.test.xyz.com,60020,1279780567744; 1 of 1
> 2010-07-22 01:26:00,935 DEBUG 
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Creating UNASSIGNED 
> region 8d5490bfc166c687657cb09203bd7d44 in state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,935 DEBUG 
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Creating UNASSIGNED 
> region 8d5490bfc166c687657cb09203bd7d44 in state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,945 INFO org.apache.hadoop.hbase.master.RegionManager: 
> Assigning region 
> test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. to 
> test024.test.xyz.com,60020,1279780567744
> 2010-07-22 01:26:00,949 DEBUG 
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: While updating UNASSIGNED 
> region 8d5490bfc166c687657cb09203bd7d44 exists, state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,954 DEBUG org.apache.hadoop.hbase.master.RegionManager: 
> Created UNASSIGNED zNode 
> test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. in state 
> M2ZK_REGION_OFFLINE
> {code}
> -------------------
> Region Server:
> {code}
> 2010-07-22 01:26:00,947 INFO 
> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN: 
> test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> 2010-07-22 01:26:00,947 INFO 
> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN: 
> test1,6512200000,1279787158624.6ead25ae677116cc88fc5420bb39d52e.
> 2010-07-22 01:26:00,947 INFO 
> org.apache.hadoop.hbase.regionserver.HRegionServer: Worker: MSG_REGION_OPEN: 
> test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> 2010-07-22 01:26:00,948 DEBUG 
> org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater: Updating ZNode 
> /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44 with 
> [RS2ZK_REGION_OPENING] expected version = 0
> 2010-07-22 01:26:00,952 INFO 
> org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, 
> state: SyncConnected, type: NodeDataChanged, path: 
> /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
> 2010-07-22 01:26:00,974 WARN 
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: 
> <msgstorectrl001.test.xyz.com,msgstorectrl021.test.xyz.com,msgstorectrl041.test.xyz.com,msgstorectrl061.test.xyz.com,msgstorectrl081.ash2.facebook.com:/hbase,test024.test.xyz.com,60020,1279780567744>Failed to write data to ZooKeeper
> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = 
> BadVersion for /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
>         at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:106)
>         at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
>         at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1038)
>         at 
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.writeZNode(ZooKeeperWrapper.java:1062)
>         at 
> org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater.updateZKWithEventData(RSZookeeperUpdater.java:161)
>         at 
> org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater.startRegionOpenEvent(RSZookeeperUpdater.java:115)
>         at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.openRegion(HRegionServer.java:1428)
>         at 
> org.apache.hadoop.hbase.regionserver.HRegionServer$Worker.run(HRegionServer.java:1337)
>         at java.lang.Thread.run(Thread.java:619)
> 2010-07-22 01:26:00,975 ERROR 
> org.apache.hadoop.hbase.regionserver.HRegionServer: Error opening 
> test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> java.io.IOException: 
> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = 
> BadVersion for /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
>         at 
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.writeZNode(ZooKeeperWrapper.java:1072)
> {code}
> Meta:
> -----
> Relevant section of META.
> Note that these are the only two entries for the problem region. The first 
> one is the parent region (and this problem
> region is its splitB).  For the next one, note that there are no 
> "info:server" or "info:serverstartcode" columns.
> {code}
>  test1,6512200000,12797841 column=info:splitB, timestamp=1279787160693, 
> value=\x00\x0A6551820000\x00
>  17114.6466481aa931f8c1fa8 
> \x00\x00\x01)\xf9...@test1,6531790000,1279787158624.8d5490bfc166c687657cb
>  7622735487a72.            
> 09203bd7d44.\x00\x0A6531790000\x00\x00\x00\x05\x05test1\x00\x00\x00\x00\x
>                            
> 00\x02\x00\x00\x00\x07IS_ROOT\x00\x00\x00\x05false\x00\x00\x00\x07IS_META
>                            
> \x00\x00\x00\x05false\x00\x00\x00\x01\x08\x07actions\x00\x00\x00\x08\x00\
>                            
> x00\x00\x0BBLOOMFILTER\x00\x00\x00\x04NONE\x00\x00\x00\x11REPLICATION_SCO
>                            
> PE\x00\x00\x00\x010\x00\x00\x00\x0BCOMPRESSION\x00\x00\x00\x04NONE\x00\x0
>                            
> 0\x00\x08VERSIONS\x00\x00\x00\x013\x00\x00\x00\x03TTL\x00\x00\x00\x0A2147
>                            
> 483647\x00\x00\x00\x09BLOCKSIZE\x00\x00\x00\x0565536\x00\x00\x00\x09IN_ME
>                            
> MORY\x00\x00\x00\x05false\x00\x00\x00\x0ABLOCKCACHE\x00\x00\x00\x04true\x
>                            FE\xA0\xFD\xC5
>  ..
>  test1,6531790000,12797871 column=info:regioninfo, timestamp=1279787160782, 
> value=REGION => {NAME =>
>  58624.8d5490bfc166c687657  
> 'test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.', STAR
>  cb09203bd7d44.            TKEY => '6531790000', ENDKEY => '6551820000', 
> ENCODED => 8d5490bfc166c687
>                            657cb09203bd7d44, TABLE => {{NAME => 'test1', 
> FAMILIES => [{NAME => 'acti
>                            ons', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => 
> '0', VERSIONS => '3', C
>                            OMPRESSION => 'NONE', TTL => '2147483647', 
> BLOCKSIZE => '65536', IN_MEMOR
>                            Y => 'false', BLOCKCACHE => 'true'}]}}
> {code}
> I think Karthik has a handle on the first part (i.e. why the RS ran into the 
> version mismatch, and aborted opening the region). He'll add details to the 
> JIRA. But what we aren't clear about at this stage is why the base scanner 
> didn't kick in and try to reassign the region.
> BTW, HBase "hbck" reported this as well (which was good!):
> {code}
> Number of Tables: 5
> Number of live region servers:92
> Number of dead region servers:0
> .........
> ERROR: Region 
> test1,6512200000,1279784117114.6466481aa931f8c1fa87622735487a72. is not 
> served by any region server  but is listed in META to be on server null
> ERROR: Region 
> test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. is not 
> served by any region server  but is listed in META to be on server null
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
