[ https://issues.apache.org/jira/browse/HBASE-2866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891806#action_12891806 ]
HBase Review Board commented on HBASE-2866:
-------------------------------------------
Message from: "Karthik Ranganathan" <[email protected]>
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://review.hbase.org/r/380/
-----------------------------------------------------------
(Updated 2010-07-23 14:26:01.718168)
Review request for hbase, stack and Kannan Muthukkaruppan.
Changes
-------
Adding hbase group
Summary
-------
Region permanently offlined - if the ZNode is already in the target state, do
not update it again.
This addresses bug HBASE-2866.
http://issues.apache.org/jira/browse/HBASE-2866
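For readers without the diff at hand, here is a minimal sketch of the idea in
the summary. It is not the patch itself. It assumes one plausible reading of
the logs quoted below: the region server read the znode at version 0, the
master then rewrote the znode with the state it already held, and the region
server's conditional write failed with BadVersion. The class and method names
(ZNodeVersionSketch, VersionedNode, updateAlways, updateIfChanged) are
hypothetical, and the node is an in-memory stand-in rather than the ZooKeeper
client. The stand-in exception is unchecked to keep the sketch short; the real
KeeperException is checked. The one ZooKeeper behaviour it relies on is that
every setData bumps the data version, even when the bytes are unchanged, and
that version -1 matches any version.

```java
import java.util.Objects;

/** In-memory sketch of a versioned znode; not the real ZooKeeper client. */
public class ZNodeVersionSketch {

    /** Stand-in for ZooKeeper's BadVersion error (unchecked here for brevity). */
    static class BadVersionException extends RuntimeException {
        BadVersionException(String msg) { super(msg); }
    }

    /** Hypothetical node: a state string plus a version bumped on every write. */
    static class VersionedNode {
        private String state;
        private int version = 0;

        VersionedNode(String initialState) { this.state = initialState; }

        String getState() { return state; }
        int getVersion() { return version; }

        /** Conditional write: expectedVersion must match, or be -1 for "any". */
        void setData(String newState, int expectedVersion) {
            if (expectedVersion != -1 && expectedVersion != version) {
                throw new BadVersionException(
                    "expected version " + expectedVersion + " but node is at " + version);
            }
            state = newState;
            version++; // bumped even if newState equals the old state
        }
    }

    /** Unguarded update: rewrites the node even if it already holds target. */
    static void updateAlways(VersionedNode node, String target) {
        node.setData(target, -1);
    }

    /** Guarded update: skips the write when the node is already in target. */
    static boolean updateIfChanged(VersionedNode node, String target) {
        if (Objects.equals(node.getState(), target)) {
            return false; // nothing written, version untouched
        }
        node.setData(target, -1);
        return true;
    }

    public static void main(String[] args) {
        // Unguarded path: the redundant rewrite moves the version from 0 to 1,
        // so a writer that expects version 0 is rejected.
        VersionedNode unguarded = new VersionedNode("M2ZK_REGION_OFFLINE");
        updateAlways(unguarded, "M2ZK_REGION_OFFLINE");
        try {
            unguarded.setData("RS2ZK_REGION_OPENING", 0);
            System.out.println("unguarded: write accepted");
        } catch (BadVersionException e) {
            System.out.println("unguarded: " + e.getMessage());
        }

        // Guarded path: the redundant rewrite is skipped, the version stays 0,
        // and the same conditional write goes through.
        VersionedNode guarded = new VersionedNode("M2ZK_REGION_OFFLINE");
        updateIfChanged(guarded, "M2ZK_REGION_OFFLINE");
        guarded.setData("RS2ZK_REGION_OPENING", 0);
        System.out.println("guarded: state = " + guarded.getState());
    }
}
```

The guard trades one extra read for leaving the version alone, which is what
lets a writer holding "expected version = 0" succeed.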
Diffs
-----
trunk/src/main/java/org/apache/hadoop/hbase/master/ZKUnassignedWatcher.java
967128
trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ZooKeeperWrapper.java
967128
Diff: http://review.hbase.org/r/380/diff
Testing
-------
Ran the unit tests; all passed except TestRowAtomicity, which is a known
failure.
Thanks,
Karthik
> Region permanently offlined
> ----------------------------
>
> Key: HBASE-2866
> URL: https://issues.apache.org/jira/browse/HBASE-2866
> Project: HBase
> Issue Type: Bug
> Reporter: Kannan Muthukkaruppan
> Assignee: Karthik Ranganathan
> Priority: Blocker
> Attachments: master.log
>
>
> After split, master attempts to reassign a region to a region server.
> Occasionally, such a region can get permanently offlined.
> Master:
> ---------
> {code}
> 2010-07-22 01:26:00,914 INFO org.apache.hadoop.hbase.master.ServerManager:
> Processing MSG_REPORT_SPLIT_INCLUDES_DAUGHTERS:
> test1,6512200000,1279784117114.6466481aa931f8c1fa87622735487a72.: Daughters;
> test1,6512200000,1279787158624.6ead25ae677116cc88fc5420bb39d52e.,
> test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. from
> test024.test.xyz.com,60020,1279780567744; 1 of 1
>
>
>
> 2010-07-22 01:26:00,935 DEBUG
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Creating UNASSIGNED
> region 8d5490bfc166c687657cb09203bd7d44 in state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,935 DEBUG
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Creating UNASSIGNED
> region 8d5490bfc166c687657cb09203bd7d44 in state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,945 INFO org.apache.hadoop.hbase.master.RegionManager:
> Assigning region
> test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. to
> test024.test.xyz.com,60020,1279780567744
> 2010-07-22 01:26:00,949 DEBUG
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: While updating UNASSIGNED
> region 8d5490bfc166c687657cb09203bd7d44 exists, state = M2ZK_REGION_OFFLINE
> 2010-07-22 01:26:00,954 DEBUG org.apache.hadoop.hbase.master.RegionManager:
> Created UNASSIGNED zNode
> test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. in state
> M2ZK_REGION_OFFLINE
> {code}
> -------------------
> Region Server:
> {code}
> 2010-07-22 01:26:00,947 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
> test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> 2010-07-22 01:26:00,947 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
> test1,6512200000,1279787158624.6ead25ae677116cc88fc5420bb39d52e.
> 2010-07-22 01:26:00,947 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: Worker: MSG_REGION_OPEN:
> test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> 2010-07-22 01:26:00,948 DEBUG
> org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater: Updating ZNode
> /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44 with
> [RS2ZK_REGION_OPENING] expected version = 0
> 2010-07-22 01:26:00,952 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event,
> state: SyncConnected, type: NodeDataChanged, path:
> /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
> 2010-07-22 01:26:00,974 WARN
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper:
> <msgstorectrl001.test.xyz.com,msgstorectrl021.test.xyz.com,msgstorectrl041.test.xyz.com,msgstorectrl061.test.xyz.com,msgstorectrl081.ash2.facebook.com:/hbase,test024.test.xyz.com,60020,1279780567744>Failed to write data to
> ZooKeeper
> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode =
> BadVersion for /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
> at
> org.apache.zookeeper.KeeperException.create(KeeperException.java:106)
> at
> org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
> at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1038)
> at
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.writeZNode(ZooKeeperWrapper.java:1062)
> at
> org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater.updateZKWithEventData(RSZookeeperUpdater.java:161)
> at
> org.apache.hadoop.hbase.regionserver.RSZookeeperUpdater.startRegionOpenEvent(RSZookeeperUpdater.java:115)
> at
> org.apache.hadoop.hbase.regionserver.HRegionServer.openRegion(HRegionServer.java:1428)
> at
> org.apache.hadoop.hbase.regionserver.HRegionServer$Worker.run(HRegionServer.java:1337)
> at java.lang.Thread.run(Thread.java:619)
> 2010-07-22 01:26:00,975 ERROR
> org.apache.hadoop.hbase.regionserver.HRegionServer: Error opening
> test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
> java.io.IOException:
> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode =
> BadVersion for /hbase/UNASSIGNED/8d5490bfc166c687657cb09203bd7d44
> at
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.writeZNode(ZooKeeperWrapper.java:1072)
> {code}
> Meta:
> -----
> Relevant section of META.
> Note that these are the only two entries for the problem region. The first
> one is the parent region (the problem region is its splitB). For the second
> one, note that there are no "info:server" or "info:serverstartcode" columns.
> {code}
> test1,6512200000,1279784117114.6466481aa931f8c1fa87622735487a72.
>   column=info:splitB, timestamp=1279787160693, value=
>   \x00\x0A6551820000\x00\x00\x00\x01)\xf9...@
>   test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
>   \x00\x0A6531790000\x00\x00\x00\x05\x05test1\x00\x00\x00\x00\x00\x02
>   \x00\x00\x00\x07IS_ROOT\x00\x00\x00\x05false
>   \x00\x00\x00\x07IS_META\x00\x00\x00\x05false
>   \x00\x00\x00\x01\x08\x07actions\x00\x00\x00\x08
>   \x00\x00\x00\x0BBLOOMFILTER\x00\x00\x00\x04NONE
>   \x00\x00\x00\x11REPLICATION_SCOPE\x00\x00\x00\x010
>   \x00\x00\x00\x0BCOMPRESSION\x00\x00\x00\x04NONE
>   \x00\x00\x00\x08VERSIONS\x00\x00\x00\x013
>   \x00\x00\x00\x03TTL\x00\x00\x00\x0A2147483647
>   \x00\x00\x00\x09BLOCKSIZE\x00\x00\x00\x0565536
>   \x00\x00\x00\x09IN_MEMORY\x00\x00\x00\x05false
>   \x00\x00\x00\x0ABLOCKCACHE\x00\x00\x00\x04true
>   \xFE\xA0\xFD\xC5
> ..
> test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.
>   column=info:regioninfo, timestamp=1279787160782, value=REGION =>
>   {NAME => 'test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44.',
>   STARTKEY => '6531790000', ENDKEY => '6551820000',
>   ENCODED => 8d5490bfc166c687657cb09203bd7d44, TABLE => {{NAME => 'test1',
>   FAMILIES => [{NAME => 'actions', BLOOMFILTER => 'NONE',
>   REPLICATION_SCOPE => '0', VERSIONS => '3', COMPRESSION => 'NONE',
>   TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false',
>   BLOCKCACHE => 'true'}]}}
> {code}
> I think Karthik has a handle on the first part (i.e. why the RS ran into the
> version mismatch, and aborted opening the region). He'll add details to the
> JIRA. But what we aren't clear about at this stage is why the base scanner
> didn't kick in and try to reassign the region.
> BTW, HBase "hbck" reported this as well (which was good!):
> {code}
> Number of Tables: 5
> Number of live region servers:92
> Number of dead region servers:0
> .........
> ERROR: Region
> test1,6512200000,1279784117114.6466481aa931f8c1fa87622735487a72. is not
> served by any region server but is listed in META to be on server null
> ERROR: Region
> test1,6531790000,1279787158624.8d5490bfc166c687657cb09203bd7d44. is not
> served by any region server but is listed in META to be on server null
> {code}
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.