HBASE-21283

> On Dec 6, 2018, at 8:55 AM, Andrew Purtell <[email protected]> wrote:
> 
> I recently added a shell command "rit" that displays the list of current RIT. 
> Would that have worked? It does require that the master is responsive to a 
> GetClusterStatus request. 
> 
> 
>> On Dec 6, 2018, at 7:45 AM, Sean Busbey <[email protected]> wrote:
>> 
>> This week I've run into two cases where I needed the set of regions in
>> transition so I could recover them and I ran into what I think is a
>> gap in our operator tooling. I'm hoping folks will have some ideas
>> I've missed.
>> 
>> Depending on how this thread goes, I'll follow up on the dev@hbase
>> list about implementing changes and documentation.
>> 
>> Case 1: HBase 1.2-ish RIT following RS crash
>> 
>> Cluster had a handful of region servers fail and for whatever reason a
>> few regions were stuck in transition. The operator I was helping
>> is already used to dealing with the occasional manual recovery. Their
>> normal process looks like this:
>> 
>> 1) Go to the Master UI website
>> 2) Scroll down to the Regions in Transition list
>> 3) Find a RIT in FAILED_CLOSE / FAILED_OPEN / PENDING_OPEN
>> 4) Confirm in the RS logs that the RS associated with that RIT is now
>> in good health and doesn't expect to do anything with said region
>> 5) Run "assign" in the hbase shell for the region
>> 
>> Unfortunately, the cluster's HDFS was under duress and so listing
>> snapshot information was super slow. This caused the Master UI website
>> to hang prior to displaying the RIT list.
>> 
>> We ended up looking at the master log file.
>> 
>> Case 2: HBase 2.1-ish RIT following cluster wide crash
>> 
>> AFAICT cluster had experienced a failure of all RS and masters. Upon
>> coming back up Master was left with ~10% of ~10K regions in a state of
>> PENDING_OPEN or OPENING all with a RS that had no idea it was involved
>> with those regions. I'm pretty sure this is a bug; I'm still triaging
>> it and I don't think it's relevant to the current question.
>> 
>> Once I confirmed the given RS was not currently doing anything for any
>> of those regions I figured I'd use HBCK2 to run assigns to get
>> things fixed. However, since there were like 900 RITs, the Master UI
>> was unusable for getting a complete list. Also with that many all in
>> the same state I want to be able to automate running against each of
>> them.
>> 
>> I ended up grepping the master log file and pulling out the WARN
>> messages about RIT to tease out the list of regions, then passed those
>> to hbck2.
>> 
>> ----
>> 
>> Am I missing some obvious place where I can use a CLI tool to get a
>> list of RIT? I don't see anything in the ref guide. I looked through
>> the help of HBCK 1 and the shell and couldn't find anything.
>> 
>> I think I can use Admin.getClusterStatus() and getClusterMetrics() to
>> get this info from the Java API. That means there's some way to get it
>> in the hbase shell, but it'll probably be ugly. If there's not already
>> an easier way I'll want to wrap that so it's a simple command.
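
The log-scraping workaround from Case 2 could be sketched roughly as below. This is an illustration, not the commands actually used in the thread: it assumes the master's RIT warnings are WARN lines containing the text "Region-In-Transition" and that they mention the region by its encoded name (a 32-character hex string). The function names, the sample log lines, and the jar path are hypothetical; check the marker text against your own master log before relying on it.

```python
import re

# Encoded region names are 32-character lowercase hex strings (an MD5 digest).
ENCODED_REGION = re.compile(r"\b[0-9a-f]{32}\b")


def stuck_regions(log_lines, marker="Region-In-Transition"):
    """Return unique encoded region names, in first-seen order, from WARN
    lines that contain `marker`. The default marker is an assumption about
    the master's log wording; pass your own if it differs."""
    seen = {}
    for line in log_lines:
        if "WARN" not in line or marker not in line:
            continue
        for name in ENCODED_REGION.findall(line):
            seen.setdefault(name, None)
    return list(seen)


def assigns_commands(regions, jar="/path/to/hbase-hbck2.jar", batch=50):
    """Group regions into HBCK2 'assigns' command lines, `batch` regions per
    invocation. The jar path is a placeholder, not a real location."""
    return [
        "hbase hbck -j {} assigns {}".format(jar, " ".join(regions[i:i + batch]))
        for i in range(0, len(regions), batch)
    ]


# Hypothetical sample lines standing in for a real master log.
sample = [
    "2018-12-06 07:00:00,000 WARN  master: STUCK Region-In-Transition "
    "rit=OPENING, region=0123456789abcdef0123456789abcdef",
    "2018-12-06 07:00:01,000 INFO  master: unrelated message",
]
for command in assigns_commands(stuck_regions(sample)):
    print(command)
```

In practice you would feed it the real log, for example `stuck_regions(open("hbase-master.log"))`, and review the generated commands before running them, since the RS-side check (that no region server still thinks it owns the region) remains a manual step.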
