[jira] [Resolved] (HBASE-25032) Wait for region server to become online before adding it to online servers in Master

2021-03-25 Thread Bharath Vissapragada (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-25032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bharath Vissapragada resolved HBASE-25032.
--
Resolution: Fixed

Great work on this [~caroliney14], thanks for patiently addressing the review 
comments.

> Wait for region server to become online before adding it to online servers in 
> Master
> 
>
> Key: HBASE-25032
> URL: https://issues.apache.org/jira/browse/HBASE-25032
> Project: HBase
>  Issue Type: Bug
>Reporter: Sandeep Guggilam
>Assignee: Caroline
>Priority: Major
>  Labels: master, regionserver
> Fix For: 3.0.0-alpha-1, 1.7.0, 2.5.0, 2.4.3
>
>
> As part of RS startup, the RS reports for duty to the Master. The Master 
> acknowledges the request and adds the RS to the onlineServers list so that 
> regions can be assigned to it.
> Once the Master acknowledges the reportForDuty and sends back the response, 
> the RS still does more work, such as initializing replication sources, before 
> it becomes online. Sometimes initializing replication sources is slow, for 
> example when the RS is unable to connect to peer clusters because of a 
> Kerberos configuration problem, and the RS can take around 20 minutes to 
> become online.
>  
> Since the Master already considers the RS online, it tries to assign regions 
> to it, which fails with a ServerNotRunningYet exception. The Master then tries 
> to unassign, which fails with the same exception and leaves the region in the 
> FAILED_CLOSE state.
>  
> It would be good to check whether the RS is ready to accept assignment 
> requests before adding it to the online servers list, which would account 
> for delays such as the one described above.
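
The check proposed in the description could look roughly like the sketch below. It is only an illustration: OnlineServerRegistry, reportForDuty, markOnline and canAssignTo are hypothetical names, not the HBase classes or the committed patch. It assumes the Master keeps a "pending" set for servers that have reported for duty and only moves a server to the online set once that server signals it has finished starting up.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch; none of these names come from the HBase code base.
class OnlineServerRegistry {
    // Servers that reported for duty but have not finished initializing.
    private final Set<String> pending = ConcurrentHashMap.newKeySet();
    // Servers that are eligible for region assignment.
    private final Set<String> online = ConcurrentHashMap.newKeySet();

    // Acknowledge the report, but do not make the server assignable yet.
    void reportForDuty(String serverName) {
        pending.add(serverName);
    }

    // Called once the server signals that startup has finished.
    // Returns false if the server never reported for duty.
    boolean markOnline(String serverName) {
        if (!pending.remove(serverName)) {
            return false;
        }
        online.add(serverName);
        return true;
    }

    // Assignment code would consult this before sending regions to a server.
    boolean canAssignTo(String serverName) {
        return online.contains(serverName);
    }

    public static void main(String[] args) {
        OnlineServerRegistry registry = new OnlineServerRegistry();
        registry.reportForDuty("rs1");
        System.out.println("assignable after report: " + registry.canAssignTo("rs1"));
        registry.markOnline("rs1");
        System.out.println("assignable after online: " + registry.canAssignTo("rs1"));
    }
}
```

With this split, a server stuck for 20 minutes in replication source setup simply stays in the pending set and never receives assignments that would fail.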



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: Split Meta Design Reset Status

2021-03-25 Thread Stack
A few of us met this afternoon to talk some split-meta design. It was Zach,
Bharath, Francis, and myself. Duo showed up 'on-time' but we all were on
daylight savings so we were an hour early! (My fault).

Most of this kick-off meeting was Francis and I rehearsing where we (Duo,
Francis, and I) have gotten to-date (see [1]), the current thinking and
implementation ideas.

We went over the ConnectionRegistry API, and how it might change and we
pulled up RegionLocator too...

Talked about possibly moving ConnectionRegistry impl to a set of
'bootstrapping' RegionServers...

Suggestion was that we break up the problem into aspects to make the
problem more tractable.

We'll try another in a week or so but meantime let's try and make some
progress on the design doc...

Thanks,
S

1.
https://docs.google.com/document/d/11ChsSb2LGrSzrSJz8pDCAw5IewmaMV0ZDN1LrMkAj4s/edit#heading=h.hdf0rnyevxz2




On Mon, Mar 22, 2021 at 4:28 PM Stack wrote:

> Now the requirements are in [1], we're going to move to the next stage --
> actual design for split-meta -- and have set up a chat for this Thursday
> afternoon (4PM California time/8AM Beijing time) to get the ball rolling.
> Please come if interested. Zoom details are below.
>
> Yours,
> S
> 1.
> https://docs.google.com/document/d/11ChsSb2LGrSzrSJz8pDCAw5IewmaMV0ZDN1LrMkAj4s/edit#heading=h.hdf0rnyevxz2
>
>
> Topic: hbase split-meta design warmup chat
> Time: Mar 25, 2021 04:00 PM Pacific Time (US and Canada)
>
> Join Zoom Meeting
> https://us04web.zoom.us/j/75988003798?pwd=Wi9mU0w0T2ZjTFNBaE9lUmtTbHRpQT09
>
> Meeting ID: 759 8800 3798
> Passcode: hbase
>
>
> On Tue, Jan 5, 2021 at 9:13 AM Stack wrote:
>
>> FYI, a few of us have been working on the redo/reset of the split meta
>> design (HBASE-25382). We (think we've) finished the requirements. Are there
>> any others to consider?
>>
>> Feedback and contribs welcome. Otherwise, on to the next phase -- design.
>>
>> Thanks,
>> S
>>
>


[jira] [Created] (HBASE-25699) UnsupportedOperationException in DumpClusterStatusAction

2021-03-25 Thread Andrew Kyle Purtell (Jira)
Andrew Kyle Purtell created HBASE-25699:
---

 Summary: UnsupportedOperationException in DumpClusterStatusAction
 Key: HBASE-25699
 URL: https://issues.apache.org/jira/browse/HBASE-25699
 Project: HBase
  Issue Type: Bug
  Components: integration tests
Affects Versions: 1.7.0
Reporter: Andrew Kyle Purtell
Assignee: Andrew Kyle Purtell
 Fix For: 1.7.0


2021-03-26 00:09:59,443 ERROR [main] util.AbstractHBaseTool: Error running 
command-line tool
java.lang.UnsupportedOperationException
at 
java.util.Collections$UnmodifiableCollection.addAll(Collections.java:1067)
at 
org.apache.hadoop.hbase.chaos.actions.DumpClusterStatusAction.collectKnownRegionServers(DumpClusterStatusAction.java:67)
at 
org.apache.hadoop.hbase.chaos.actions.DumpClusterStatusAction.init(DumpClusterStatusAction.java:49)
at 
org.apache.hadoop.hbase.chaos.policies.PeriodicRandomActionPolicy.init(PeriodicRandomActionPolicy.java:70)
at 
org.apache.hadoop.hbase.chaos.monkies.PolicyBasedChaosMonkey.start(PolicyBasedChaosMonkey.java:127)
at 
org.apache.hadoop.hbase.IntegrationTestBase.startMonkey(IntegrationTestBase.java:199)
at 
org.apache.hadoop.hbase.IntegrationTestBase.setUpMonkey(IntegrationTestBase.java:189)
at 
org.apache.hadoop.hbase.IntegrationTestBase.setUp(IntegrationTestBase.java:170)
at 
org.apache.hadoop.hbase.IntegrationTestBase.doWork(IntegrationTestBase.java:152)
at 
org.apache.hadoop.hbase.util.AbstractHBaseTool.run(AbstractHBaseTool.java:106)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
at 
org.apache.hadoop.hbase.test.IntegrationTestBigLinkedList.main(IntegrationTestBigLinkedList.java:1990)
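
The top frame of the trace is addAll on a java.util.Collections unmodifiable wrapper. The sketch below reproduces that failure mode in isolation and shows one common way around it, which is to copy into a fresh mutable collection before merging. UnmodifiableAddAllDemo and the server names are made up for illustration; this is not the code in DumpClusterStatusAction and not necessarily how the issue is fixed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Illustrative only; not taken from DumpClusterStatusAction.
class UnmodifiableAddAllDemo {
    // Returns true if addAll on the target throws UnsupportedOperationException.
    static boolean addAllFails(Collection<String> target, Collection<String> extra) {
        try {
            target.addAll(extra);
            return false;
        } catch (UnsupportedOperationException e) {
            return true;
        }
    }

    // Merges into a new mutable set instead of mutating a possibly read-only view.
    static Set<String> merge(Collection<String> known, Collection<String> extra) {
        Set<String> merged = new HashSet<>(known);
        merged.addAll(extra);
        return merged;
    }

    public static void main(String[] args) {
        Collection<String> readOnly =
            Collections.unmodifiableCollection(new ArrayList<>(Arrays.asList("rs1")));
        System.out.println("addAll fails: " + addAllFails(readOnly, Arrays.asList("rs2")));
        System.out.println("merged size: " + merge(readOnly, Arrays.asList("rs2")).size());
    }
}
```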





[jira] [Created] (HBASE-25698) Persistent IllegalReferenceCountException at scanner open

2021-03-25 Thread Andrew Kyle Purtell (Jira)
Andrew Kyle Purtell created HBASE-25698:
---

 Summary: Persistent IllegalReferenceCountException at scanner open
 Key: HBASE-25698
 URL: https://issues.apache.org/jira/browse/HBASE-25698
 Project: HBase
  Issue Type: Bug
Affects Versions: 2.4.2
Reporter: Andrew Kyle Purtell
 Fix For: 3.0.0-alpha-1, 2.5.0, 2.4.3


Persistent scanner open failure. Not sure how it happened. Test scenario was 
HBase 1 cluster replicating to HBase 2 cluster. ITBLL as data generator at 
source, calm policy only. Scanner open errors on sink HBase 2 cluster later 
during ITBLL verify phase. Sink schema settings bloom=ROW encoding=FAST_DIFF 
compression=NONE.

{noformat}
Caused by: 
org.apache.hbase.thirdparty.io.netty.util.IllegalReferenceCountException: 
refCnt: 0, decrement: 1
at 
org.apache.hbase.thirdparty.io.netty.util.internal.ReferenceCountUpdater.toLiveRealRefCnt(ReferenceCountUpdater.java:74)
at 
org.apache.hbase.thirdparty.io.netty.util.internal.ReferenceCountUpdater.release(ReferenceCountUpdater.java:138)
at 
org.apache.hbase.thirdparty.io.netty.util.AbstractReferenceCounted.release(AbstractReferenceCounted.java:76)
at org.apache.hadoop.hbase.nio.ByteBuff.release(ByteBuff.java:79)
at 
org.apache.hadoop.hbase.io.hfile.HFileBlock.release(HFileBlock.java:429)
at 
org.apache.hadoop.hbase.io.hfile.CompoundBloomFilter.contains(CompoundBloomFilter.java:109)
at 
org.apache.hadoop.hbase.regionserver.StoreFileReader.checkGeneralBloomFilter(StoreFileReader.java:433)
at 
org.apache.hadoop.hbase.regionserver.StoreFileReader.passesGeneralRowBloomFilter(StoreFileReader.java:322)
at 
org.apache.hadoop.hbase.regionserver.StoreFileReader.passesBloomFilter(StoreFileReader.java:251)
at 
org.apache.hadoop.hbase.regionserver.StoreFileScanner.shouldUseScanner(StoreFileScanner.java:491)
at 
org.apache.hadoop.hbase.regionserver.StoreScanner.selectScannersFrom(StoreScanner.java:471)
at 
org.apache.hadoop.hbase.regionserver.StoreScanner.<init>(StoreScanner.java:249)
at 
org.apache.hadoop.hbase.regionserver.HStore.createScanner(HStore.java:2177)
at 
org.apache.hadoop.hbase.regionserver.HStore.getScanner(HStore.java:2168)
at 
org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.initializeScanners(HRegion.java:7172)
{noformat}
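
For readers unfamiliar with the message, "refCnt: 0, decrement: 1" says that a release was attempted on an object whose reference count had already reached zero. The toy counter below shows that condition in isolation. It is a hand-written illustration (ToyRefCounted is a made-up class that throws IllegalStateException), not Netty's implementation, and it says nothing about where the extra release in this report comes from.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Toy reference counter for illustration; not Netty's implementation.
class ToyRefCounted {
    // A new object starts with one reference held by its creator.
    private final AtomicInteger refCnt = new AtomicInteger(1);

    int refCnt() {
        return refCnt.get();
    }

    // Drops one reference. Returns true when the count reaches zero.
    // Throws if the count is already zero, i.e. on an over-release.
    boolean release() {
        while (true) {
            int current = refCnt.get();
            if (current <= 0) {
                throw new IllegalStateException("refCnt: " + current + ", decrement: 1");
            }
            if (refCnt.compareAndSet(current, current - 1)) {
                return current == 1;
            }
        }
    }

    public static void main(String[] args) {
        ToyRefCounted block = new ToyRefCounted();
        block.release(); // count goes from 1 to 0
        try {
            block.release(); // second release of the same reference
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```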

Bloom filter type on all files here is ROW, block encoding is FAST_DIFF:

{noformat}
hbase:017:0> describe "IntegrationTestBigLinkedList"
Table IntegrationTestBigLinkedList is ENABLED   

IntegrationTestBigLinkedList

COLUMN FAMILIES DESCRIPTION 

{NAME => 'big', BLOOMFILTER => 'ROW', IN_MEMORY => 'false', VERSIONS => '1', 
KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'FAST_DIFF', 
COMPRESSION => 'NONE', TTL => 'FOREVER', MIN_VERSIONS => '0', BLOCKCACHE => 
'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '1'} 
{NAME => 'meta', BLOOMFILTER => 'ROW', IN_MEMORY => 'false', VERSIONS => '1', 
KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'FAST_DIFF', 
COMPRESSION => 'NONE', TTL => 'FOREVER', MIN_VERSIONS => '0', BLOCKCACHE 
=> 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '1'}
{NAME => 'tiny', BLOOMFILTER => 'ROW', IN_MEMORY => 'false', VERSIONS => '1', 
KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'FAST_DIFF', 
COMPRESSION => 'NONE', TTL => 'FOREVER', MIN_VERSIONS => '0', BLOCKCACHE 
=> 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '1'}
{noformat}





[jira] [Created] (HBASE-25697) StochasticBalancer improvement for large scale clusters

2021-03-25 Thread Clara Xiong (Jira)
Clara Xiong created HBASE-25697:
---

 Summary: StochasticBalancer improvement for large scale clusters
 Key: HBASE-25697
 URL: https://issues.apache.org/jira/browse/HBASE-25697
 Project: HBase
  Issue Type: Improvement
  Components: Balancer, master, UI
Reporter: Clara Xiong


h2. Findings on a large scale cluster (100,000 regions on 300 nodes)
 * Balancer starts and stops before getting a plan
 * Adding new racks doesn’t trigger balancer
 * Balancer stops leaving some racks at 50% lower region counts
 * Regions for large tables don’t get evenly distributed
 * Observability is poor
 * Too many knobs make tuning empirical and require many experiments

h2. Improvements made and being made
 * Cost function enhancement to capture outliers, especially table skew: 
https://issues.apache.org/jira/browse/HBASE-25625?filter=-2 
 * Explain why the balancer stops: 
https://issues.apache.org/jira/browse/HBASE-25666 
Will backport https://issues.apache.org/jira/browse/HBASE-24528 too.

h2. More proposals
 * minCostNeedBalance for each cost function instead of weights. We want to 
trigger balancing if any factor is out of balance, instead of trying to combine 
the factors with arbitrary weights. This makes operation and configuration much 
easier.
 * Simulated annealing: periodically lower minCostNeedBalance to get the 
balancer unstuck from a sub-optimum, then gradually raise it again to keep the 
system stable. Also add the cost of a move as a countermeasure in the decision. 
[https://opensourcelibs.com/lib/tempest]
 * Orchestrated scheduling of compaction, normalizer and balancer
 * PID approach [https://www.amazon.com/dp/1449361692/ref=rdr_ext_tmb]
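
To make the first proposal concrete, the sketch below contrasts a single weighted total checked against one minCostNeedBalance with a per-function threshold check. It is a simplification under stated assumptions: BalanceTriggerSketch, the cost names, the weights and the thresholds are illustrative values chosen for this example, and the weighted form is only an approximation of how a combined cost might be compared to a threshold, not the balancer's actual code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative comparison; names and numbers are assumptions, not HBase code.
class BalanceTriggerSketch {
    // One weighted, normalized total compared against a single threshold.
    static boolean weightedTrigger(Map<String, Double> costs,
                                   Map<String, Double> weights,
                                   double minCostNeedBalance) {
        double total = 0.0;
        double weightSum = 0.0;
        for (Map.Entry<String, Double> e : costs.entrySet()) {
            double w = weights.getOrDefault(e.getKey(), 0.0);
            total += w * e.getValue();
            weightSum += w;
        }
        return weightSum > 0.0 && total / weightSum > minCostNeedBalance;
    }

    // Proposed style: balance if any single factor exceeds its own threshold.
    static boolean perFunctionTrigger(Map<String, Double> costs,
                                      Map<String, Double> thresholds) {
        for (Map.Entry<String, Double> e : costs.entrySet()) {
            Double threshold = thresholds.get(e.getKey());
            if (threshold != null && e.getValue() > threshold) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Map<String, Double> costs = new LinkedHashMap<>();
        costs.put("regionCount", 0.02); // region counts are nearly even
        costs.put("tableSkew", 0.4);    // one table is badly skewed

        Map<String, Double> weights = new LinkedHashMap<>();
        weights.put("regionCount", 500.0);
        weights.put("tableSkew", 35.0);

        Map<String, Double> thresholds = new LinkedHashMap<>();
        thresholds.put("regionCount", 0.05);
        thresholds.put("tableSkew", 0.1);

        System.out.println("weighted: " + weightedTrigger(costs, weights, 0.05));
        System.out.println("per-function: " + perFunctionTrigger(costs, thresholds));
    }
}
```

In this example the heavily weighted, well-balanced region count dilutes the table skew, so the weighted total stays under the threshold, while the per-function check still triggers on table skew alone.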





[jira] [Resolved] (HBASE-25480) NPE when getting metrics of backup master

2021-03-25 Thread Andrew Kyle Purtell (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-25480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Kyle Purtell resolved HBASE-25480.
-
Resolution: Duplicate

> NPE when getting metrics of backup master
> -
>
> Key: HBASE-25480
> URL: https://issues.apache.org/jira/browse/HBASE-25480
> Project: HBase
>  Issue Type: Bug
>  Components: master
>Affects Versions: 2.4.0, 2.4.1
>Reporter: Andrey Elenskiy
>Assignee: Anjan Das
>Priority: Major
>  Labels: JMX, NullPointerException, master
>
> Getting a NullPointerException in MetricsMasterWrapperImpl.getMergePlanCount() 
> when getting metrics via JMX on a backup master. This appears to happen because 
> regionNormalizerManager is null on backup masters, as it is only 
> initialized by HMaster.finishActiveMasterInitialization().
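
One generic way to avoid this kind of failure is a null guard in the metrics getter, sketched below. NormalizerManagerStub and MetricsWrapperSketch are hypothetical stand-ins, not the HBase classes, and since this issue was closed as a duplicate, the fix applied under the other issue may well differ. Returning 0 when the manager is absent is an assumption made for the example.

```java
// Hypothetical stand-in for the component that is only created on the active master.
class NormalizerManagerStub {
    private final long mergePlanCount;

    NormalizerManagerStub(long mergePlanCount) {
        this.mergePlanCount = mergePlanCount;
    }

    long getMergePlanCount() {
        return mergePlanCount;
    }
}

// Hypothetical metrics wrapper; not MetricsMasterWrapperImpl itself.
class MetricsWrapperSketch {
    // Null models a backup master that never ran active-master initialization.
    private final NormalizerManagerStub normalizerManager;

    MetricsWrapperSketch(NormalizerManagerStub normalizerManager) {
        this.normalizerManager = normalizerManager;
    }

    // Reports 0 instead of throwing when the manager was never initialized.
    long getMergePlanCount() {
        if (normalizerManager == null) {
            return 0L;
        }
        return normalizerManager.getMergePlanCount();
    }

    public static void main(String[] args) {
        System.out.println(new MetricsWrapperSketch(null).getMergePlanCount());
        System.out.println(
            new MetricsWrapperSketch(new NormalizerManagerStub(7L)).getMergePlanCount());
    }
}
```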





[jira] [Resolved] (HBASE-25590) Bulkload replication HFileRefs cannot be cleared in some cases where set exclude-namespace/exclude-table-cfs

2021-03-25 Thread Huaxiang Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-25590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huaxiang Sun resolved HBASE-25590.
--
Resolution: Fixed

Resolving it for 2.3.5 release. Please reopen when landing the 2.2 patch.

> Bulkload replication HFileRefs cannot be cleared in some cases where set 
> exclude-namespace/exclude-table-cfs
> 
>
> Key: HBASE-25590
> URL: https://issues.apache.org/jira/browse/HBASE-25590
> Project: HBase
>  Issue Type: Bug
>  Components: Replication
>Affects Versions: 3.0.0-alpha-1, 2.2.6, 2.3.4, 2.4.1
>Reporter: Sun Xin
>Assignee: Sun Xin
>Priority: Major
> Fix For: 3.0.0-alpha-1, 2.3.5
>
>
> In 
> [ReplicationSource#addHFileRefs|https://github.com/apache/hbase/blob/ed90a14995acd87111d2b9849f07d84418ca43d4/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java#L264],
>  we may add unwanted hfiles to the _HFileRefs_ if a peer has 
> _replicate_all_ set to true and also sets _exclude-namespace/exclude-table-cfs_.
> These unwanted _HFileRefs_ will not be replicated to the remote cluster and 
> will never be cleared.
> Two problems are caused by this bug:
>  # The metric sizeOfHFileRefsQueue cannot be zeroed.
>  # Referenced HFiles cannot be deleted by _ReplicationHFileCleaner._
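
The kind of check that would keep excluded hfiles out of the _HFileRefs_ can be sketched as follows. This models only a peer with _replicate_all_ set to true. PeerExcludeSketch and shouldTrackHFileRef are hypothetical names, and the rule that an empty family list excludes the whole table is an assumption made for this sketch; the actual patch should be consulted for the real logic.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch for a peer with replicate_all = true; not the HBase patch.
class PeerExcludeSketch {
    // Returns true if an hfile of the given table and family should be
    // recorded for the peer, false if the peer's exclusions cover it.
    static boolean shouldTrackHFileRef(String namespace, String table, String family,
                                       Set<String> excludeNamespaces,
                                       Map<String, List<String>> excludeTableCfs) {
        if (excludeNamespaces.contains(namespace)) {
            return false; // whole namespace excluded
        }
        List<String> excludedFamilies = excludeTableCfs.get(table);
        if (excludedFamilies == null) {
            return true; // table not mentioned in the exclusions
        }
        if (excludedFamilies.isEmpty()) {
            return false; // assumption: empty list excludes every family
        }
        return !excludedFamilies.contains(family);
    }

    public static void main(String[] args) {
        Set<String> excludeNs = Collections.singleton("ns1");
        Map<String, List<String>> excludeCfs =
            Collections.singletonMap("ns2:t1", Arrays.asList("cf1"));
        System.out.println(shouldTrackHFileRef("ns1", "ns1:t9", "cf1", excludeNs, excludeCfs));
        System.out.println(shouldTrackHFileRef("ns2", "ns2:t1", "cf2", excludeNs, excludeCfs));
    }
}
```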





[jira] [Created] (HBASE-25696) Need to initialize SLF4JBridgeHandler in jul-to-slf4j for redirecting jul to slf4j

2021-03-25 Thread Duo Zhang (Jira)
Duo Zhang created HBASE-25696:
-

 Summary: Need to initialize SLF4JBridgeHandler in jul-to-slf4j for 
redirecting jul to slf4j
 Key: HBASE-25696
 URL: https://issues.apache.org/jira/browse/HBASE-25696
 Project: HBase
  Issue Type: Sub-task
  Components: logging
Reporter: Duo Zhang
Assignee: Duo Zhang





