The Master stacktrace you have there does read as a bug, but it shouldn't be affecting balancing.

That Chore does the work of applying space quotas, but the quota you have here is only an RPC (throttle) quota. It might be something already fixed since the version you're on; I'll see if anything jumps out at me on Jira.
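
For what it's worth, a quick sanity check from the shell that nothing space-quota-related is feeding that chore would be something like the sketch below (list_quotas you've already run; list_snapshots shows the snapshots the chore would try to compute sizes for):

hbase(main):001:0> list_quotas       # throttle and space quotas
hbase(main):002:0> list_snapshots    # snapshots the chore tries to size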

If the Master isn't giving you any good logging, you could set the Log4j level to DEBUG for org.apache.hadoop.hbase, either via CM or the HBase UI for the active master (assuming that feature isn't disabled for security reasons in your org -- the hbase.master.ui.readonly property in hbase-site.xml).
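
If the UI route is locked down, the same thing can be done in the Master's Log4j configuration instead -- a sketch, assuming the stock log4j.properties (in CM this would go into the Master's logging safety valve):

log4j.logger.org.apache.hadoop.hbase=DEBUG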

If DEBUG doesn't help, I'd set TRACE level for org.apache.hadoop.hbase.master.balancer. Granted, it might not be obvious to the untrained eye, but if you can share that DEBUG/TRACE output after you manually invoke the balancer again via hbase shell, it should be enough for those watching here.
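
Roughly this in the shell is enough to kick it and capture that logging (a sketch; run it as an admin user with a valid ticket):

hbase(main):001:0> balancer_enabled      # confirm the balancer is switched on
hbase(main):002:0> balance_switch true   # turn it on if it was off
hbase(main):003:0> balancer              # run one balance pass now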

On 1/11/21 5:32 AM, Marc Hoppins wrote:
OK. So I tried again after running kinit and got the following:

Took 0.0010 seconds
hbase(main):001:0> list_quotas
OWNER                                            QUOTAS
  USER => robot_urlrs                            TYPE => THROTTLE, THROTTLE_TYPE => REQUEST_NUMBER, LIMIT => 100req/sec, SCOPE => MACHINE
1 row(s)

Not sure what to make of it, but it doesn't seem like it would be enough to prevent 
balancing. There are other tables and (probably) other users.
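
In case it is useful: the "Failed to find any Kerberos tgt" errors further below just meant the shell session had no ticket, and the kinit was along these lines (keytab path, principal and realm here are placeholders, not the actual values used):

# obtain a ticket before starting the shell
kinit -kt /path/to/hbase.keytab hbase/$(hostname -f)@EXAMPLE.REALM
hbase shell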

-----Original Message-----
From: Marc Hoppins <[email protected]>
Sent: Monday, January 11, 2021 9:52 AM
To: [email protected]
Subject: RE: Region server idle

I tried. It appears to have failed reading data from hbase:meta. These are 
repeated errors for the whole run of list_quotas.

A balance task was run on Friday. It took 9+ hours. The affected host had 6 
regions - no procedures/locks or processes were running for those 6 regions. 
Today, that host has 8 regions. No real work is being performed on them. The 
other server - which went idle as a result of removing the hbase19 host from 
HBase and re-inserting it - is still doing nothing and has no regions assigned.

I ran it as the hbase user (su - hbase, then hbase shell).

****************

HBase Shell
Use "help" to get list of supported commands.
Use "exit" to quit this interactive shell.
For Reference, please visit: http://hbase.apache.org/2.0/book.html#shell
Version 2.1.0-cdh6.3.2, rUnknown, Fri Nov  8 05:44:07 PST 2019
Took 0.0011 seconds
hbase(main):001:0> list_quotas
OWNER                                      QUOTAS
org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=8, exceptions:
Mon Jan 11 09:16:46 CET 2021, RpcRetryingCaller{globalStartTime=1610353006298, pause=100, maxAttempts=8}, javax.security.sasl.SaslException: Call to dr1-hbase18.jumbo.hq.com/10.1.140.36:16020 failed on local exception: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)] [Caused by javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]]
Mon Jan 11 09:16:46 CET 2021, RpcRetryingCaller{globalStartTime=1610353006298, pause=100, maxAttempts=8}, java.io.IOException: Call to dr1-hbase18.jumbo.hq.com/10.1.140.36:16020 failed on local exception: java.io.IOException: Can not send request because relogin is in progress.
Mon Jan 11 09:16:46 CET 2021, RpcRetryingCaller{globalStartTime=1610353006298, pause=100, maxAttempts=8}, java.io.IOException: Call to dr1-hbase18.jumbo.hq.com/10.1.140.36:16020 failed on local exception: java.io.IOException: Can not send request because relogin is in progress.
Mon Jan 11 09:16:47 CET 2021, RpcRetryingCaller{globalStartTime=1610353006298, pause=100, maxAttempts=8}, java.io.IOException: Call to dr1-hbase18.jumbo.hq.com/10.1.140.36:16020 failed on local exception: java.io.IOException: Can not send request because relogin is in progress.
Mon Jan 11 09:16:47 CET 2021, RpcRetryingCaller{globalStartTime=1610353006298, pause=100, maxAttempts=8}, java.io.IOException: Call to dr1-hbase18.jumbo.hq.com/10.1.140.36:16020 failed on local exception: java.io.IOException: Can not send request because relogin is in progress.
Mon Jan 11 09:16:48 CET 2021, RpcRetryingCaller{globalStartTime=1610353006298, pause=100, maxAttempts=8}, javax.security.sasl.SaslException: Call to dr1-hbase18.jumbo.hq.com/10.1.140.36:16020 failed on local exception: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)] [Caused by javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]]
Mon Jan 11 09:16:50 CET 2021, RpcRetryingCaller{globalStartTime=1610353006298, pause=100, maxAttempts=8}, java.io.IOException: Call to dr1-hbase18.jumbo.hq.com/10.1.140.36:16020 failed on local exception: java.io.IOException: Can not send request because relogin is in progress.
Mon Jan 11 09:16:54 CET 2021, RpcRetryingCaller{globalStartTime=1610353006298, pause=100, maxAttempts=8}, javax.security.sasl.SaslException: Call to dr1-hbase18.jumbo.hq.com/10.1.140.36:16020 failed on local exception: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)] [Caused by javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]]

-----Original Message-----
From: Stack <[email protected]>
Sent: Saturday, January 9, 2021 1:52 AM
To: Hbase-User <[email protected]>
Subject: Re: Region server idle

Looking at the code around the exception, can you check your quota settings? See 
the refguide on how to list quotas. Look for a table or namespace that is empty or 
non-existent and fill in the missing portion.

Is this from the master-side log? It is from a periodic task, so perhaps something 
else is in the way of the non-assign? Anything else in there about balancing, or 
about why we are skipping assignment to these servers? Try a balance run in the 
shell and then check the master log to see why no work was done?
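
For the master-log check, something like the following usually surfaces the balancer's reasoning -- a sketch, where the log path is a guess for a CM-managed CDH 6.x install:

# run on the active master host; adjust the path to wherever the master log lives
grep -iE 'balancer|BaseLoadBalancer' /var/log/hbase/hbase-cmf-hbase-MASTER-*.log.out | tail -n 50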

S

On Fri, Jan 8, 2021 at 2:51 AM Marc Hoppins <[email protected]> wrote:

Apologies again.  Here is the full error message.

2021-01-08 11:34:15,831 ERROR org.apache.hadoop.hbase.ScheduledChore: Caught error
java.lang.IllegalStateException: Expected only one of namespace and tablename to be null
        at org.apache.hadoop.hbase.quotas.SnapshotQuotaObserverChore.getSnapshotsToComputeSize(SnapshotQuotaObserverChore.java:198)
        at org.apache.hadoop.hbase.quotas.SnapshotQuotaObserverChore._chore(SnapshotQuotaObserverChore.java:126)
        at org.apache.hadoop.hbase.quotas.SnapshotQuotaObserverChore.chore(SnapshotQuotaObserverChore.java:113)
        at org.apache.hadoop.hbase.ScheduledChore.run(ScheduledChore.java:186)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
        at org.apache.hadoop.hbase.JitterScheduledThreadPoolExecutorImpl$JitteredRunnableScheduledFuture.run(JitterScheduledThreadPoolExecutorImpl.java:111)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

-----Original Message-----
From: Marc Hoppins <[email protected]>
Sent: Friday, January 8, 2021 10:57 AM
To: [email protected]
Subject: RE: Region server idle

So, I tried decommissioning that RS and recommissioning it. No change.
Server still idle.

I tried decommissioning another server to see if HBASE sets itself right.
Now I have two RS that are idle.

ba-hbase18.jumbo.hq.com,16020,1604413480001   Tue Nov 03 15:24:40 CET 2020   1 s   2.1.0-cdh6.3.2   13   471
ba-hbase19.jumbo.hq.com,16020,1610095488001   Fri Jan 08 09:44:48 CET 2021   0 s   2.1.0-cdh6.3.2    0     6
ba-hbase20.jumbo.hq.com,16020,1610096850259   Fri Jan 08 10:07:30 CET 2021   0 s   2.1.0-cdh6.3.2    0     0
ba-hbase21.jumbo.hq.com,16020,1604414101652   Tue Nov 03 15:35:01 CET 2020   1 s   2.1.0-cdh6.3.2   15   447

From the logs:
2021-01-08 10:25:36,875 ERROR org.apache.hadoop.hbase.ScheduledChore: Caught error
java.lang.IllegalStateException: Expected only one of namespace and tablename to be null

This keeps reappearing in the HBase master log.

M

-----Original Message-----
From: Sean Busbey <[email protected]>
Sent: Thursday, January 7, 2021 7:30 PM
To: Hbase-User <[email protected]>
Subject: Re: Region server idle

Sounds like https://issues.apache.org/jira/browse/HBASE-24139

The description of that jira has a workaround.

On Thu, Jan 7, 2021, 05:23 Marc Hoppins <[email protected]> wrote:

Hi all,

I have a setup with 67 region servers. On 29 Dec, one system had to be
shut down to have its EMM module swapped out, which took one working day.
The host was back online 30 Dec.

My HBASE is very basic so I appreciate your patience.

My understanding of the default setup is that a major compaction should
occur every 7 days. Moreover, am I right to assume that more extensive
balancing may occur after this happens?
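
For what it's worth, I believe the 7-day figure comes from the default major-compaction interval; the relevant hbase-site.xml property (assuming we haven't overridden it here) would look like this:

<property>
  <name>hbase.hregion.majorcompaction</name>
  <value>604800000</value>  <!-- 7 days, in milliseconds -->
</property>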

When I check (via hbase master UI) the status of HBASE, I see the
following:

ServerName                                    Start time                     Last contact   Version          Requests Per Second   Num. Regions
ba-hbase16.jumbo.hq.com,16020,1604413068640   Tue Nov 03 15:17:48 CET 2020   3 s            2.1.0-cdh6.3.2   46                    462
ba-hbase17.jumbo.hq.com,16020,1604413274393   Tue Nov 03 15:21:14 CET 2020   1 s            2.1.0-cdh6.3.2   19                    462
ba-hbase18.jumbo.hq.com,16020,1604413480001   Tue Nov 03 15:24:40 CET 2020   2 s            2.1.0-cdh6.3.2   62                    461
ba-hbase19.jumbo.hq.com,16020,1609326754985   Wed Dec 30 12:12:34 CET 2020   2 s            2.1.0-cdh6.3.2    0                      0
ba-hbase20.jumbo.hq.com,16020,1604413895967   Tue Nov 03 15:31:35 CET 2020   2 s            2.1.0-cdh6.3.2   62                    503
ba-hbase21.jumbo.hq.com,16020,1604414101652   Tue Nov 03 15:35:01 CET 2020   3 s            2.1.0-cdh6.3.2   59                    442
ba-hbase22.jumbo.hq.com,16020,1604414308289   Tue Nov 03 15:38:28 CET 2020   0 s            2.1.0-cdh6.3.2   40                    438


Why, after more than 7 days, is this host not hosting more (any) regions?
Should I initiate some kind of rebalancing?

Thanks in advance.

M

