[
https://issues.apache.org/jira/browse/HBASE-1736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845520#action_12845520
]
Aravind Menon commented on HBASE-1736:
--------------------------------------
Hi,
I am looking into this issue with Kannan and Karthik, and was wondering whether
it is still a problem. As I see it, the workflow should be as follows:
1. On a split, the RS closes the parent and adds the children to the META
table. (It should add the children to META before closing the parent, so
that the children are not lost if the RS crashes after closing the parent.)
2. The RS tries to contact the master but fails because the master has lost
its ZK lease.
3. On restart, the master finds the children in the META table and assigns
them to an appropriate RS, so the children are not lost.
As per this flow, the children should never be lost, so this issue should not
arise.
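To make the flow above concrete, here is a minimal sketch in Java. It is only a toy model of the three steps, not HBase code: the Catalog class, split(), masterRestart() and the region and server names are all hypothetical stand-ins that I made up for illustration, and the real META table and assignment logic are far more involved.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of the split flow; none of these names are HBase APIs.
class SplitFlowSketch {

    /** Stand-in for the META table: region name -> hosting server ("" = unassigned). */
    static final class Catalog {
        private final Map<String, String> servers = new LinkedHashMap<>();
        private final List<String> offline = new ArrayList<>();

        void add(String region, String server) { servers.put(region, server); }

        void markOffline(String region) {
            offline.add(region);
            servers.put(region, "");
        }

        String serverOf(String region) { return servers.get(region); }

        /** Regions that are not offline and have no server yet. */
        List<String> unassigned() {
            List<String> result = new ArrayList<>();
            for (Map.Entry<String, String> e : servers.entrySet()) {
                if (e.getValue().isEmpty() && !offline.contains(e.getKey())) {
                    result.add(e.getKey());
                }
            }
            return result;
        }
    }

    /** Steps 1 and 2: returns true only if the master was told about the split. */
    static boolean split(Catalog meta, String parent, String childA, String childB,
                         boolean masterReachable) {
        // Record the children first, so a crash after closing the parent cannot lose them.
        meta.add(childA, "");
        meta.add(childB, "");
        // Only then take the parent offline.
        meta.markOffline(parent);
        // The report to the master fails when the master is down.
        return masterReachable;
    }

    /** Step 3: a restarted master scans the catalog and assigns what it finds. */
    static List<String> masterRestart(Catalog meta, String targetServer) {
        List<String> found = meta.unassigned();
        for (String region : found) {
            meta.add(region, targetServer);
        }
        return found;
    }

    public static void main(String[] args) {
        Catalog meta = new Catalog();
        meta.add("parent", "rs1");
        boolean reported = split(meta, "parent", "childA", "childB", false);
        System.out.println("split reported to master: " + reported);
        System.out.println("assigned on restart: " + masterRestart(meta, "rs2"));
    }
}
```

In this model the children survive a failed report only because they are written to the catalog before the parent is closed. Until the master comes back, though, the parent is offline and the children are not deployed, which may be the window the original report is describing.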
I would appreciate your feedback on this.
Regards,
Aravind
> If RS can't talk to master, pause; more importantly, don't split (Currently
> we do and splits are lost and table is wounded)
> ---------------------------------------------------------------------------------------------------------------------------
>
> Key: HBASE-1736
> URL: https://issues.apache.org/jira/browse/HBASE-1736
> Project: Hadoop HBase
> Issue Type: Bug
> Reporter: stack
> Assignee: stack
> Priority: Critical
> Fix For: 0.21.0
>
>
> What I saw was master shutting itself down because it had lost zk lease.
> Fine. The RS though doesn't look like it can deal with this situation.
> We'll see stuff like this:
> {code}
> ...failed on connection exception: java.net.ConnectException: Connection refused
> at org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:744)
> at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:722)
> at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:328)
> at $Proxy0.regionServerReport(Unknown Source)
> at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:470)
> at java.lang.Thread.run(Unknown Source)
> Caused by: java.net.ConnectException: Connection refused
> at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
> at sun.nio.ch.SocketChannelImpl.finishConnect(Unknown Source)
> at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
> at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:404)
> at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:305)
> at org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:826)
> at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:707)
> ... 4 more
> {code}
> ... all over the regionserver as it tries to send heartbeat to master on this
> broken connection.
> On split, we close the parent and add the children to the catalog, but then
> when we try to tell the master about the split, it fails. This means the
> children never get deployed. Meanwhile, the parent is offline.
> This issue is about going through the regionserver and, anywhere it has a
> connection to the master, making sure that on a fault no damage is done to
> the table and that the regionserver then puts a pause on splitting.