[ https://issues.apache.org/jira/browse/HBASE-3872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13053659#comment-13053659 ]
Aaron Kimball commented on HBASE-3872:
--------------------------------------

A further observation: this seems to have occurred when splitting multiple regions within the same table (during a day of large bulk loads). The logs show the parent region being offlined, then both daughter regions being instantiated. The following sequence of log messages appeared both times:

{code}
2011-06-21 21:51:17,594 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Instantiated (redacted-a-daughter).
2011-06-21 21:51:17,630 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Instantiated (redacted-b-daughter).
2011-06-21 21:52:05,412 DEBUG org.apache.hadoop.hbase.regionserver.LogRoller: Hlog roll period 3600000ms elapsed
2011-06-21 21:52:17,666 INFO org.apache.hadoop.hbase.regionserver.CompactSplitThread: Running rollback of failed split of (redacted-parent-region); Call to (redacted-server-address):60020 failed on socket timeout exception: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=(redacted-server-ip):54054 remote=(redacted-server-address):60020]
{code}

I find it noteworthy that the "Hlog roll period elapsed" message occurred between the "B" daughter instantiation and the socket timeout in both cases of missing regions I am aware of in my table.

> Hole in split transaction rollback; edits to .META. need to be rolled back even if it seems like they didn't make it
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-3872
>                 URL: https://issues.apache.org/jira/browse/HBASE-3872
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>   Affects Versions: 0.90.3
>            Reporter: stack
>            Assignee: stack
>            Priority: Blocker
>             Fix For: 0.90.4
>
>         Attachments: 3872.txt
>
>
> Saw this interesting one on a cluster of ours.
> The cluster was configured with too few handlers, so we saw a lot of the phenomenon where actions were queued but, by the time they got into the server and it tried to respond to the client, the client had disconnected because of the 60-second timeout.
> Well, the meta edits for a split were queued at the regionserver carrying .META. and by the time it went to write back, the client had gone (the first insert of parent offline with daughter regions added as info:splitA and info:splitB). The client presumed the edits failed and 'successfully' rolled back the transaction (failing to undo the .META. edits, thinking they didn't go through).
> A few minutes later the .META. scanner on master runs. It sees 'no references' in daughters -- the daughters had been cleaned up as part of the split transaction rollback -- so it thinks it's safe to delete the parent.
> Two things:
> + Tighten up the check in master... need to check that the daughter region at least exists and possibly that the daughter region has an entry in .META.
> + Dependent on the edit that fails, schedule rollback edits though it will seem like they didn't go through.
> This is a pretty critical one.
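
For illustration only, the two proposed fixes in the issue description can be sketched roughly as below. This is not the attached patch and not HBase code: the class, enum, and method names are all hypothetical, and regions are modelled as plain strings in sets. It also assumes that an explicit rejection from the server means the .META. edit was not applied, whereas a timeout tells the client nothing either way.

```java
import java.util.Set;

// Hypothetical model of the two fixes; none of these names are HBase APIs.
final class SplitRollbackSketch {

    /** What the splitting regionserver observed for its .META. edit. */
    enum MetaEditOutcome { ACKNOWLEDGED, REJECTED, TIMED_OUT }

    /**
     * Fix 2: decide whether rollback must issue compensating .META. edits.
     * A timeout does not prove the edit was dropped (it may have been queued
     * and applied after the client gave up), so only an explicit rejection
     * (assumed here to mean "not applied") lets rollback skip the undo.
     */
    static boolean mustUndoMetaEdit(MetaEditOutcome outcome) {
        return outcome != MetaEditOutcome.REJECTED;
    }

    /**
     * Fix 1: tightened master-side check before deleting an offlined parent.
     * "No references in the daughter" is only meaningful if the daughter
     * exists on disk and has an entry in .META.; a missing daughter must
     * block the delete rather than permit it.
     */
    static boolean safeToDeleteParent(Set<String> regionsOnDisk,
                                      Set<String> regionsInMeta,
                                      Set<String> regionsHoldingReferences,
                                      String daughterA,
                                      String daughterB) {
        for (String daughter : new String[] {daughterA, daughterB}) {
            boolean exists = regionsOnDisk.contains(daughter)
                    && regionsInMeta.contains(daughter);
            if (!exists || regionsHoldingReferences.contains(daughter)) {
                return false;
            }
        }
        return true;
    }
}
```

In the failure described above, the daughters had already been cleaned up by the rollback, so both would be absent from regionsOnDisk; a check of this shape would refuse to delete the parent, where a check for "no references" alone would allow it.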