[jira] Commented: (HBASE-1736) If RS can't talk to master, pause; more importantly, don't split (Currently we do and splits are lost and table is wounded)

stack (JIRA) Tue, 16 Mar 2010 12:04:50 -0700

    [ 
https://issues.apache.org/jira/browse/HBASE-1736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846074#action_12846074
 ]


stack commented on HBASE-1736:
------------------------------

This issue is not as bad as it was.  We made a change so that if the RS fails 
to deliver a message to the master, it'll retry rather than drop the message on 
the ground as it used to.
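
To make the retry idea concrete, here is a minimal sketch.  It is not the 
actual HRegionServer code: MasterReporter, MasterLink and every method name 
below are made up for illustration.  The assumption is that outbound messages 
sit in a queue, a message is only removed once the master has accepted it, and 
splitting is held off while anything is still undelivered.

```java
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sketch; none of these names are real HBase classes. */
final class MasterReporter {
    /** Stand-in for the RPC to the master; assumed to throw IOException on failure. */
    interface MasterLink {
        void report(String message) throws IOException;
    }

    private final Deque<String> outbound = new ArrayDeque<>();
    private final MasterLink master;

    MasterReporter(MasterLink master) {
        this.master = master;
    }

    void enqueue(String message) {
        outbound.addLast(message);
    }

    /**
     * Try to deliver queued messages in order. A message that fails stays at
     * the head of the queue so the next call retries it instead of losing it.
     * Returns true only when the queue has been drained.
     */
    boolean flushOutbound() {
        while (!outbound.isEmpty()) {
            String next = outbound.peekFirst();
            try {
                master.report(next);
                outbound.removeFirst(); // remove only after a successful send
            } catch (IOException e) {
                return false; // keep the message; caller pauses and retries later
            }
        }
        return true;
    }

    /** Assumed policy: no new splits while reports are still undelivered. */
    boolean splitsAllowed() {
        return outbound.isEmpty();
    }

    int pending() {
        return outbound.size();
    }
}
```

A caller would invoke flushOutbound() on each heartbeat and check 
splitsAllowed() before starting a split.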

There is also the notion that loss of a split, while a pain, is not the end of 
the world.  We can redo it if we have to.  There are some comments in the code 
in CompactSplitThread#split where we try to reason about what happens to a 
split if we crash at various points along the way.  They could do with review 
I'd say (smile).

On "It should add the children to the META before closing the parent, so that 
children are not lost if RS crashes after closing parent", what are you 
thinking?  The parent would still be online when we add in the daughter regions 
-- pre-close?  We'd then have to deal with clients coming in and finding the 
daughter regions instead of the parent in .META.  (See how 
HRegion#getClosestRowBefore works and then how it's used in 
HCM#locateRegionInMeta).  We might have to add in some code to not hand out 
daughter regions that don't have an info:server (They won't have an info:server 
value in .META. if they have not deployed) but this will still be 
unsatisfactory since most of the time the client will just end up with an 
offlined parent.
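
A rough sketch of the "don't hand out daughters without an info:server" idea, 
under heavy simplification.  This is not HRegion#getClosestRowBefore or 
HCM#locateRegionInMeta; MetaLookupSketch, its row-key layout and the use of an 
empty string for "no server yet" are all assumptions made for illustration.  
It walks backwards from the search key and skips rows that have no server.

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Hypothetical, simplified catalog lookup; not the real .META. code path. */
final class MetaLookupSketch {
    /**
     * Assumed layout: row key "table,startKey,regionId" mapped to the hosting
     * server, or to "" when the region has not been deployed yet.
     */
    private final NavigableMap<String, String> meta = new TreeMap<>();

    void put(String rowKey, String serverOrEmpty) {
        meta.put(rowKey, serverOrEmpty);
    }

    /**
     * Returns the closest catalog row at or before searchKey that has a
     * server assigned, or null if there is none. End keys and table
     * boundaries are not checked in this sketch.
     */
    String locate(String searchKey) {
        for (Map.Entry<String, String> e
                : meta.headMap(searchKey, true).descendingMap().entrySet()) {
            if (!e.getValue().isEmpty()) {
                return e.getKey();
            }
            // No server: an undeployed daughter, keep looking further back.
        }
        return null;
    }
}
```

With a parent row and two undeployed daughter rows in the map, locate() falls 
through the daughters and returns the parent, which is the unsatisfactory 
outcome described above if that parent has already been offlined.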

A related, big issue in here is that closing the parent can currently take 
some time, during which attempts at reaching the parent region are rejected 
with NotServingRegionException.  Parents can take a while to close because the 
flush of memory content happens while the 'closing' flag is set on the region.  
It's as though we need to flush before we go into the close, taking on writes 
while we do so, and then only flush the small amount we accumulated during 
that first flush while under the 'closing' flag (This is still unsatisfactory 
because on a loaded system where flushes are slow, memory might be filled by 
the time the flush completes and again we'll have a slow close because we're 
flushing lots of memory).  There is a 'make splits faster' issue already.  I'm 
just calling attention to it here since it is a little related.
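
The flush-then-close ordering could look something like the sketch below.  
This is not HRegion code: RegionCloseSketch, preFlush() and close() are 
hypothetical names, the memstore is just a list of strings, and a single 
closing thread is assumed.  The point is only the ordering: one flush while 
writes are still accepted, then a second, smaller flush under the 'closing' 
flag.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a two-phase close; not the real HRegion logic. */
final class RegionCloseSketch {
    private final Object lock = new Object();
    private List<String> memstore = new ArrayList<>();
    private final List<String> flushed = new ArrayList<>(); // stand-in for files on disk
    private boolean closing = false;

    /** Returns false once closing is set (the real code would throw NotServingRegionException). */
    boolean put(String edit) {
        synchronized (lock) {
            if (closing) {
                return false;
            }
            memstore.add(edit);
            return true;
        }
    }

    /** Swap the memstore out under the lock, then do the slow write outside it. */
    private int flushSnapshot() {
        List<String> snapshot;
        synchronized (lock) {
            snapshot = memstore;
            memstore = new ArrayList<>();
        }
        flushed.addAll(snapshot); // slow part; only the closing thread touches 'flushed'
        return snapshot.size();
    }

    /** Phase 1: flush the bulk of memory while writes are still accepted. */
    int preFlush() {
        return flushSnapshot();
    }

    /** Phase 2: set the flag, then flush only what arrived during phase 1. */
    int close() {
        synchronized (lock) {
            closing = true;
        }
        return flushSnapshot();
    }

    int flushedCount() {
        return flushed.size();
    }
}
```

As noted above, this only helps if the pre-flush is fast relative to the 
incoming write rate; otherwise the second flush is large again.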




> If RS can't talk to master, pause; more importantly, don't split (Currently 
> we do and splits are lost and table is wounded)
> ---------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-1736
>                 URL: https://issues.apache.org/jira/browse/HBASE-1736
>             Project: Hadoop HBase
>          Issue Type: Bug
>            Reporter: stack
>            Assignee: stack
>            Priority: Critical
>             Fix For: 0.21.0
>
>
> What I saw was master shutting itself down because it had lost zk lease.  
> Fine.   The RS though doesn't look like it can deal with this situation.    
> We'll see stuff like this:
> {code}
> ...failed on connection exception: java.net.ConnectException: Connection 
> refused
>     at 
> org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:744)
>     at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:722)
>     at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:328)
>     at $Proxy0.regionServerReport(Unknown Source)
>     at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:470)
>     at java.lang.Thread.run(Unknown Source)
> Caused by: java.net.ConnectException: Connection refused
>     at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>     at sun.nio.ch.SocketChannelImpl.finishConnect(Unknown Source)
>     at 
> org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
>     at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:404)
>     at 
> org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:305)
>     at 
> org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:826)
>     at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:707)
>     ... 4 more
> {code}
> ... all over the regionserver as it tries to send heartbeat to master on this 
> broken connection.
> On split, we close parent, add children to the catalog but then when we try 
> to tell the master about the split, it fails.  Means the children never get 
> deployed.  Meantime  the parent is offline.
> This issue is about going through the regionserver and anytime it has a 
> connection to master, make sure on fault that no damage is done to the table 
> and then that the regionserver puts a pause on splitting.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
