[ 
https://issues.apache.org/jira/browse/HBASE-21788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16765348#comment-16765348
 ] 

Bahram Chehrazy commented on HBASE-21788:
-----------------------------------------

I have attached a partial log file going back to the day before, when the 
master was restarted a few times. The log shows that one of the orphan TRSPs 
was originally initialized at 07:40, got stuck for about 18 minutes, and then 
failed while trying to update meta at 07:58. Perhaps meta also crashed at the 
same time, because I see a lot of similar errors for other procedures. Shortly 
afterwards the master crashed and became a backup. When it became the active 
master again about an hour later, it couldn't read the procWAL logs because 
some of them were corrupted. Unfortunately, the other master, which was active 
during the gap, was re-imaged, so there is no visibility into that period. But 
I think it's clear now that this problem happens when the procWALs get 
corrupted during a master transition.

> OpenRegionProcedure (after recovery?) is unreliable and needs to be improved
> ----------------------------------------------------------------------------
>
>                 Key: HBASE-21788
>                 URL: https://issues.apache.org/jira/browse/HBASE-21788
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 3.0.0
>            Reporter: Sergey Shelukhin
>            Assignee: stack
>            Priority: Critical
>         Attachments: WAL-Orphan.log
>
>
> Not much for this one yet.
> I repeatedly see the cases when the region is stuck in OPENING, and after 
> master restart RIT is recovered, and stays WAITING; its OpenRegionProcedure 
> (also recovered) is stuck in Runnable and never does anything for hours. I 
> cannot find logs on the target server indicating that it ever tried to do 
> anything after master restart.
> This procedure needs at the very least logging of what it's trying to do, and 
> maybe a timeout so it unconditionally fails after a configurable period (1 
> hour?).
> I may also investigate why it doesn't do anything and file a separate bug. I 
> wonder if it's somehow related to the region status check, but this is just a 
> hunch.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
