[ 
https://issues.apache.org/jira/browse/HBASE-3872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13053659#comment-13053659
 ] 

Aaron Kimball commented on HBASE-3872:
--------------------------------------

A further observation: this seems to have occurred when splitting multiple 
regions within the same table (during a day of large bulk loads). The logs show 
the parent region being offlined, then both daughter regions being 
instantiated. The following sequence of log messages appeared both times:

{code}
2011-06-21 21:51:17,594 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Instantiated (redacted-a-daughter).
2011-06-21 21:51:17,630 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Instantiated (redacted-b-daughter).
2011-06-21 21:52:05,412 DEBUG org.apache.hadoop.hbase.regionserver.LogRoller: Hlog roll period 3600000ms elapsed
2011-06-21 21:52:17,666 INFO org.apache.hadoop.hbase.regionserver.CompactSplitThread: Running rollback of failed split of (redacted-parent-region); Call to (redacted-server-address):60020 failed on socket timeout exception: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=(redacted-server-ip):54054 remote=(redacted-server-address):60020]
{code}

I find it noteworthy that the "Hlog roll period elapsed" message occurred 
between the "B" daughter instantiation and the socket timeout in both cases of 
missing regions I am aware of in my table.


> Hole in split transaction rollback; edits to .META. need to be rolled back 
> even if it seems like they didn't make it
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-3872
>                 URL: https://issues.apache.org/jira/browse/HBASE-3872
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 0.90.3
>            Reporter: stack
>            Assignee: stack
>            Priority: Blocker
>             Fix For: 0.90.4
>
>         Attachments: 3872.txt
>
>
> Saw this interesting one on a cluster of ours.  The cluster was configured 
> with too few handlers, so we saw a lot of the phenomenon where actions were 
> queued but, by the time they got into the server and it tried to respond to 
> the client, the client had disconnected because of the 60-second timeout.  
> Well, the meta edits for a split were queued at the regionserver carrying 
> .META. and by the time it went to write back, the client had gone (the first 
> insert of parent offline with daughter regions added as info:splitA and 
> info:splitB).  The client presumed the edits failed and 'successfully' rolled 
> back the transaction (failing to undo .META. edits thinking they didn't go 
> through).
> A few minutes later the .META. scanner on master runs.  It sees 'no 
> references' in daughters -- the daughters had been cleaned up as part of the 
> split transaction rollback -- so it thinks it's safe to delete the parent.
> Two things:
> + Tighten up check in master... need to check daughter region at least exists 
> and possibly the daughter region has an entry in .META.
> + Dependent on the edit that fails, schedule rollback edits though it will 
> seem like they didn't go through.
> This is a pretty critical one.
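
The two suggestions in the description above can be illustrated with a small, self-contained Java sketch. This is a simplified model under stated assumptions, not HBase code: the class SplitCleanupSketch, the EditOutcome enum, and both method names are hypothetical, and plain sets and maps of region names stand in for .META. rows, region directories, and reference counts.

```java
import java.util.Map;
import java.util.Set;

/** Hypothetical, simplified model; none of these names are HBase APIs. */
final class SplitCleanupSketch {

    /** How a remote .META. edit looked from the client's side. */
    enum EditOutcome { ACKED, REJECTED, TIMED_OUT }

    /**
     * Tightened master-side check: the parent may be deleted only when both
     * daughters have a .META. row, exist on disk, and hold no references
     * back to the parent.
     */
    static boolean safeToDeleteParent(String daughterA, String daughterB,
                                      Set<String> metaRows,
                                      Set<String> regionDirs,
                                      Map<String, Integer> referenceCounts) {
        return daughterIsIndependent(daughterA, metaRows, regionDirs, referenceCounts)
            && daughterIsIndependent(daughterB, metaRows, regionDirs, referenceCounts);
    }

    private static boolean daughterIsIndependent(String daughter,
                                                 Set<String> metaRows,
                                                 Set<String> regionDirs,
                                                 Map<String, Integer> referenceCounts) {
        // A daughter that is missing must not be read as "has no references".
        if (!metaRows.contains(daughter) || !regionDirs.contains(daughter)) {
            return false;
        }
        return referenceCounts.getOrDefault(daughter, 0) == 0;
    }

    /**
     * Rollback side: a timeout is ambiguous because the server may still
     * apply the queued edit, so only a definite rejection lets the rollback
     * skip undoing the .META. edit.
     */
    static boolean mustScheduleMetaRollback(EditOutcome outcome) {
        return outcome != EditOutcome.REJECTED;
    }
}
```

In the scenario described above, the daughters were already cleaned up by the rollback, so a check of this shape would return false and the parent would be kept. Likewise, the 60-second socket timeout would map to TIMED_OUT, and the undo of the .META. edit would still be scheduled.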

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
