Josh: Thanks for all your help! You got us going down a path that led
to a solution.
Thought I would just follow up with what we ended up doing (I do not
recommend anyone attempt this).
When this problem started I'm not sure exactly what was wrong with the
hbase:meta table, but one of the steps we tried was to migrate the
database to a new cluster. I believe this cleared up the original issue
we were having, but for some reason (perhaps because of the initial
problem) the migration didn't work correctly. This left us with
references in hbase:meta to servers that didn't exist, for regions that
were marked as PENDING_OPEN. When HBase came online it behaved as if it
had already asked those region servers to open those regions, so they
were never reassigned to good servers. Another symptom was that dead
region servers from the previous cluster (the servers listed in
hbase:meta) showed up in the HBase web UI.
Since regions were marked in hbase:meta as PENDING_OPEN on region
servers that didn't exist, we were unable to close or move those
regions: the master couldn't communicate with the region servers. For
some reason hbck -fix wasn't able to repair the assignments, or didn't
realize it needed to. This might have been due to other meta
inconsistencies, like overlapping regions or duplicated start keys; I'm
not sure why it couldn't clear things up.
To repair this we first backed up the meta directory to S3 while
everything was offline. Then, while HBase was online and the tables
were disabled, we used a Scala REPL to rewrite the hbase:meta entries
for each affected region (~4500 regions), replacing 'server' and 'sn'
with valid values and setting 'state' to 'OPEN'. We then
flushed/compacted the meta table and took down HBase. After nuking
/hbase in ZK we brought everything back up. Initially there was a lot
of churn with region assignments, but after things settled everything
was online. I think this worked because of the state the meta table was
in when HBase stopped: it looked like a crash, so HBase went through
its normal recovery cycle of re-opening regions using the previous
assignments.
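For anyone curious, the rewrite was roughly along these lines. This is
only a minimal sketch, not our exact code: it assumes the stock HBase
1.x client API driven from a Scala REPL, the 'info' column family in
hbase:meta, and made-up server names and dead-server list.

    import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
    import org.apache.hadoop.hbase.client.{ConnectionFactory, Put, Scan}
    import org.apache.hadoop.hbase.util.Bytes
    import scala.collection.JavaConverters._

    val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
    val meta = conn.getTable(TableName.META_TABLE_NAME)

    val INFO   = Bytes.toBytes("info")
    val SERVER = Bytes.toBytes("server")
    val SN     = Bytes.toBytes("sn")
    val STATE  = Bytes.toBytes("state")

    // Hypothetical values: hosts from the dead cluster and a live RS to point at.
    val deadServers = Set("ip-10-0-0-99.ec2.internal:16020")
    val goodServer  = "ip-10-0-0-1.ec2.internal:16020"
    val goodSn      = "ip-10-0-0-1.ec2.internal,16020,1538400000000"

    // Scan the meta 'info' family and rewrite rows that still point at dead servers.
    val scanner = meta.getScanner(new Scan().addFamily(INFO))
    for (r <- scanner.asScala) {
      val server = Option(r.getValue(INFO, SERVER)).map(Bytes.toString)
      if (server.exists(deadServers.contains)) {
        val p = new Put(r.getRow)
        p.addColumn(INFO, SERVER, Bytes.toBytes(goodServer))
        p.addColumn(INFO, SN,     Bytes.toBytes(goodSn))
        p.addColumn(INFO, STATE,  Bytes.toBytes("OPEN"))
        meta.put(p)
      }
    }
    scanner.close()
    meta.close()
    conn.close()

Treat the 'sn'/'state' encodings above as placeholders; check what your
version actually writes to those columns before copying anything.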
Like I said, I don't recommend manually rewriting the hbase:meta table,
but it did work for us.
Thanks,
Austin
On 10/01/2018 01:28 PM, Josh Elser wrote:
That seems pretty wrong. The Master should know that old RS's are no
longer alive and not try to assign regions to them. I don't have
enough familiarity with 1.4 to say whether, hypothetically, that might
already be fixed in a later release (1.4.5-1.4.7).
I don't have specific suggestions, but can share how I'd approach it.
I'd pick one specific region and try to trace the logic around just
that one region. Start with the state in hbase:meta -- see if there is
a column in meta for this old server. Expand out to WALs in HDFS.
Since you can wipe ZK and this still happens, it seems clear it's not
coming from ZK data.
Compare the data you find with what DEBUG logging in the Master says,
and see if you can figure out more about how the Master made the
decision it did. That will help lead you to what the appropriate "fix"
should be.
On 10/1/18 10:46 AM, Austin Heyne wrote:
I'm running HBase 1.4.4 on EMR. In following your suggestions I
realized that the master is trying to assign the regions to
dead/non-existent region servers. While trying to fix this problem I
had killed the EMR cluster and started a new one, and it's still trying
to assign some regions to those region servers from the previous
cluster. I tried to manually move one of the regions to a good region
server, but I'm getting 'ERROR: No route to host' when I try to close
the region.
I've tried nuking the /hbase directory in ZooKeeper but that didn't
seem to help, so I'm not sure where it's getting these references from.
-Austin
On 09/30/2018 02:38 PM, Josh Elser wrote:
First off: you're on EMR? What version of HBase are you using? (Maybe
Zach or Stephen can help here too.) Can you figure out the
RegionServer(s) which are stuck opening these PENDING_OPEN regions?
Can you get a jstack/thread-dump from those RS's?
In terms of how the system is supposed to work: the PENDING_OPEN
state for a Region "R" means: the active Master has asked a
RegionServer to open R. That RS should have an active thread which
is trying to open R. Upon success, the state of R will move from
PENDING_OPEN to OPEN. Otherwise, the Master will try to assign R again.
In the absence of any custom coprocessors (including Phoenix), this
would mean some subset of RegionServers are in a bad state. Figuring
out what those RS's are trying to do will be the next step in
figuring out why they're stuck like that. It might be obvious from
the UI, or you might have to look at hbase:meta or the master log to
figure it out.
One caveat: it's possible that the Master is just not doing the right
thing as described above. If the steps described above don't seem to
match what your system is doing, you might have to look closer at the
Master log. Make sure you have DEBUG on to get anything of value out of
the system.
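On 1.x, turning DEBUG on is typically a log4j.properties change in the
Master's conf directory, for example something like the lines below
(the logger names are an assumption of the usual ones, not taken from
this thread):

    # conf/log4j.properties on the active Master; restart it to pick this up
    log4j.logger.org.apache.hadoop.hbase.master=DEBUG
    log4j.logger.org.apache.hadoop.hbase.master.AssignmentManager=DEBUG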
On 9/30/18 1:43 PM, Austin Heyne wrote:
I'm having a strange problem that my usual bag of tricks is having
trouble sorting out. On Friday queries stoped returning for some
reason. You could see them come in and there would be a resource
utilization spike that would fade out after an appropriate amount
of time, however, the query would never actually return. This could
be related to our client code but I wasn't able to dig into it
since this was the middle of the day on a production system. Since
this had happened before and bouncing HBase cleared it up, I
proceeded to disable tables and restart HBase. Upon bringing HBase
backup a few thousand regions are stuck in PENDING_OPEN state and
refuse to move from that state. I've run hbck -repair a number of
times under a few conditions (even the offline repair), have
deleted everything out of /hbase in zookeeper and even migrated the
cluster to new servers (EMR) with no luck. When I spin HBase up the
regions are already at PENDING_OPEN even though the tables are
offline.
Any ideas on what's going on here would be a huge help.
Thanks,
Austin
--
Austin L. Heyne