[jira] [Created] (HBASE-26449) The way we add or clear failedReplicas may have race

2021-11-10 Thread Duo Zhang (Jira)
Duo Zhang created HBASE-26449:
-

 Summary: The way we add or clear failedReplicas may have race
 Key: HBASE-26449
 URL: https://issues.apache.org/jira/browse/HBASE-26449
 Project: HBase
  Issue Type: Sub-task
  Components: read replicas
Reporter: Duo Zhang


We add a replica to failedReplicas in the callback of the replay method call, 
and we clear failedReplicas in the add method when we meet a flush-all edit.

These run on different threads, so it is possible that we have already cleared 
failedReplicas because of a flush-all edit, and then the replay callback adds a 
replica to failedReplicas because of a replication failure, even though that 
failure actually happened before the flush-all edit.

We should find a way to fix this.
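
A minimal sketch of one possible way to avoid the stale add, assuming we tag 
each replay call with the flush generation that was current when it started 
(all names here are hypothetical; this is not the actual HBase code):

{code:java}
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration of guarding failedReplicas against stale callbacks.
public class FailedReplicaTracker {
  private final Set<Integer> failedReplicas = new HashSet<>();
  // Bumped every time a flush-all edit clears the failed set.
  private long flushGeneration = 0;

  /** Called when a flush-all edit is seen: clear failures and start a new generation. */
  public synchronized void onFlushAll() {
    flushGeneration++;
    failedReplicas.clear();
  }

  /** Snapshot the generation before issuing a replay call. */
  public synchronized long currentGeneration() {
    return flushGeneration;
  }

  /**
   * In the replay callback, only record the failure if no flush-all edit was
   * processed after the replay call was issued.
   */
  public synchronized void onReplayFailure(int replicaId, long generationAtCallTime) {
    if (generationAtCallTime == flushGeneration) {
      failedReplicas.add(replicaId);
    }
    // Otherwise the failure predates the flush-all edit and can be dropped.
  }
}
{code}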



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (HBASE-26448) Make sure we do not flush a region too frequently

2021-11-10 Thread Duo Zhang (Jira)
Duo Zhang created HBASE-26448:
-

 Summary: Make sure we do not flush a region too frequently
 Key: HBASE-26448
 URL: https://issues.apache.org/jira/browse/HBASE-26448
 Project: HBase
  Issue Type: Sub-task
Reporter: Duo Zhang


In HBASE-26412, once we find an error replicating the edits we trigger a flush, 
and in HBASE-26413 we are likely to also trigger a flush when the pending edits 
grow too large.

We should find a way to avoid flushing the region too often.

A replication failure usually means the region has moved or the target region 
server has crashed; only a flush issued after the secondary replica comes back 
online is useful, so flushing the region many times before the secondary 
replica is online is just wasted work.
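
One possible shape of such a throttle, sketched with hypothetical names (this 
is not the actual HBase code): remember when the last replication-triggered 
flush happened and skip requests that arrive within a minimum interval.

{code:java}
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of rate-limiting replication-triggered flushes.
public class FlushThrottle {
  private final long minIntervalMs;
  // 0 means "never flushed"; works with System.currentTimeMillis() timestamps.
  private final AtomicLong lastFlushTime = new AtomicLong(0);

  public FlushThrottle(long minIntervalMs) {
    this.minIntervalMs = minIntervalMs;
  }

  /** Returns true if enough time has passed since the last flush we triggered. */
  public boolean shouldFlush(long nowMs) {
    long last = lastFlushTime.get();
    if (nowMs - last < minIntervalMs) {
      return false; // too soon, skip this flush request
    }
    // Only one caller wins the CAS and actually triggers the flush.
    return lastFlushTime.compareAndSet(last, nowMs);
  }
}
{code}

A caller would do something like "if (throttle.shouldFlush(System.currentTimeMillis())) 
requestFlush();", where requestFlush stands in for the real flush path.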



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Resolved] (HBASE-26311) Balancer gets stuck in cohosted replica distribution

2021-11-10 Thread Duo Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Duo Zhang resolved HBASE-26311.
---
Fix Version/s: 2.5.0
   3.0.0-alpha-2
   2.4.9
 Hadoop Flags: Reviewed
   Resolution: Fixed

> Balancer gets stuck in cohosted replica distribution
> 
>
> Key: HBASE-26311
> URL: https://issues.apache.org/jira/browse/HBASE-26311
> Project: HBase
>  Issue Type: Bug
>  Components: Balancer
>Reporter: Clara Xiong
>Assignee: Clara Xiong
>Priority: Major
> Fix For: 2.5.0, 3.0.0-alpha-2, 2.4.9
>
>
> In production, we found a corner case where the balancer cannot make progress 
> when there are cohosted replicas. This was reproduced on the master branch using 
> the test added in HBASE-26310.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Resolved] (HBASE-26327) Replicas cohosted on a rack shouldn't keep triggering Balancer

2021-11-10 Thread Duo Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Duo Zhang resolved HBASE-26327.
---
Fix Version/s: 2.5.0
   3.0.0-alpha-2
   2.4.9
 Hadoop Flags: Reviewed
   Resolution: Fixed

> Replicas cohosted on a rack shouldn't keep triggering Balancer
> --
>
> Key: HBASE-26327
> URL: https://issues.apache.org/jira/browse/HBASE-26327
> Project: HBase
>  Issue Type: Sub-task
>  Components: Balancer
>Reporter: Clara Xiong
>Assignee: Clara Xiong
>Priority: Major
> Fix For: 2.5.0, 3.0.0-alpha-2, 2.4.9
>
>
> Currently, the Balancer has a shortcut check for cohosted replicas of the same 
> region on a host or rack and will keep triggering the balancer while that count 
> is non-zero.
> With the trend toward Kubernetes and cloud deployments of HBase, operators don't 
> have full control of the topology, or are not even aware of it. There are cases 
> where the rack constraint cannot be satisfied, or satisfying it requires 
> sacrificing other constraints such as region count balancing across region 
> servers. We want to keep the per RS/host check for the availability of regions, 
> especially the meta region. We haven't heard of a problem at the rack level so 
> far. The cost functions will still be considered during balancing.
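>
> A hypothetical sketch of the trigger change this describes (not the real 
> BaseLoadBalancer code): keep forcing a balancer run for per-host co-hosting, but 
> stop treating rack-level co-hosting as a forced trigger and leave it to the rack 
> cost function during normal balancing.
> {code:java}
> // Hypothetical shortcut check; class and parameter names are made up.
> public final class CohostedReplicaTrigger {
>   private CohostedReplicaTrigger() {}
>
>   public static boolean needsBalance(int cohostedReplicasPerHost,
>       int cohostedReplicasPerRack) {
>     if (cohostedReplicasPerHost > 0) {
>       return true; // availability risk (including meta), still force a run
>     }
>     // Rack-level co-hosting (cohostedReplicasPerRack) no longer forces a run;
>     // the rack cost function still penalizes it when the balancer runs anyway.
>     return false;
>   }
> }
> {code}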



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Resolved] (HBASE-26308) Sum of multiplier of cost functions is not populated properly when we have a shortcut for trigger

2021-11-10 Thread Duo Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Duo Zhang resolved HBASE-26308.
---
Fix Version/s: 2.5.0
   3.0.0-alpha-2
   2.4.9
 Hadoop Flags: Reviewed
   Resolution: Fixed

> Sum of multiplier of cost functions is not populated properly when we have a 
> shortcut for trigger
> -
>
> Key: HBASE-26308
> URL: https://issues.apache.org/jira/browse/HBASE-26308
> Project: HBase
>  Issue Type: Sub-task
>  Components: Balancer
>Reporter: Clara Xiong
>Assignee: Clara Xiong
>Priority: Critical
> Fix For: 2.5.0, 3.0.0-alpha-2, 2.4.9
>
>
> We have a couple of scenarios where we force balancing:
>  * idle servers
>  * co-hosted regions
> In these cases the code path exits before populating the sum of the cost 
> function multipliers. This causes a wrong value to be reported in the logging: 
> as shown below, the weighted average is not divided by the total weight, which 
> makes the log inconsistent across iterations (see the sketch after the log 
> excerpt).
> {quote}2021-09-24 21:46:57,881 INFO 
> org.apache.hadoop.hbase.master.balancer.StochasticLoadBalancer: Running 
> balancer because at least one server hosts replicas of the same region.
> 2021-09-24 21:46:57,881 INFO 
> org.apache.hadoop.hbase.master.balancer.StochasticLoadBalancer: Start 
> StochasticLoadBalancer.balancer, initial weighted average 
> imbalance=6389.260497305375, functionCost=RegionCountSkewCostFunction : 
> (multiplier=500.0, imbalance=0.06659036267913739); 
> PrimaryRegionCountSkewCostFunction : (multiplier=500.0, 
> imbalance=0.05296760285663541); MoveCostFunction : (multiplier=7.0, 
> imbalance=0.0, balanced); ServerLocalityCostFunction : (multiplier=25.0, 
> imbalance=0.46286750487559114); RackLocalityCostFunction : (multiplier=15.0, 
> imbalance=0.2569525347374165); TableSkewCostFunction : (multiplier=500.0, 
> imbalance=0.3760689783169534); RegionReplicaHostCostFunction : 
> (multiplier=10.0, imbalance=0.0553889913899139); 
> RegionReplicaRackCostFunction : (multiplier=1.0, 
> imbalance=0.05854089790897909); ReadRequestCostFunction : (multiplier=5.0, 
> imbalance=0.06969346106898068); WriteRequestCostFunction : (multiplier=5.0, 
> imbalance=0.07834116112410174); MemStoreSizeCostFunction : (multiplier=5.0, 
> imbalance=0.12533769793201735); StoreFileCostFunction : (multiplier=5.0, 
> imbalance=0.06921401085082914);  computedMaxSteps=5577401600
> {quote}
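>
> A minimal sketch of the intended arithmetic (hypothetical names, not the 
> balancer code): the reported figure should be the weighted average of the 
> per-function imbalance, divided by the total weight that the shortcut paths 
> skipped populating.
> {code:java}
> public final class WeightedImbalance {
>   private WeightedImbalance() {}
>
>   /** Weighted average of imbalance values; returns 0 if every weight is disabled. */
>   public static double weightedAverage(double[] multipliers, double[] imbalances) {
>     double weightedSum = 0.0;
>     double sumMultiplier = 0.0;
>     for (int i = 0; i < multipliers.length; i++) {
>       if (multipliers[i] <= 0) {
>         continue; // disabled cost functions contribute nothing
>       }
>       weightedSum += multipliers[i] * imbalances[i];
>       sumMultiplier += multipliers[i];
>     }
>     return sumMultiplier == 0.0 ? 0.0 : weightedSum / sumMultiplier;
>   }
> }
> {code}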



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Resolved] (HBASE-26337) Optimization for weighted random generators

2021-11-10 Thread Duo Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Duo Zhang resolved HBASE-26337.
---
Fix Version/s: 2.5.0
   3.0.0-alpha-2
   2.4.9
 Hadoop Flags: Reviewed
   Resolution: Fixed

Merged to branch-2.4+.

Thanks [~claraxiong]!

> Optimization for weighted random generators
> ---
>
> Key: HBASE-26337
> URL: https://issues.apache.org/jira/browse/HBASE-26337
> Project: HBase
>  Issue Type: Improvement
>  Components: Balancer
>Reporter: Clara Xiong
>Assignee: Clara Xiong
>Priority: Major
> Fix For: 2.5.0, 3.0.0-alpha-2, 2.4.9
>
>
> Currently we use four move candidate generators and pick one at random for 
> every move with equal probability; each generator is tuned for a certain group 
> of cost functions. We can instead weight the random pick of generators based on 
> the balancing pressure of each group.
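>
> A minimal sketch of the weighted pick itself (hypothetical class, not the 
> balancer's generator code; the real change would derive the weights from the 
> cost pressure of the group each generator optimizes):
> {code:java}
> import java.util.concurrent.ThreadLocalRandom;
>
> public final class WeightedPick {
>   private WeightedPick() {}
>
>   /**
>    * Pick an index with probability proportional to its weight.
>    * Weights must be non-negative and at least one must be positive.
>    */
>   public static int pick(double[] weights) {
>     double total = 0.0;
>     for (double w : weights) {
>       total += w;
>     }
>     double r = ThreadLocalRandom.current().nextDouble(total);
>     double cumulative = 0.0;
>     for (int i = 0; i < weights.length; i++) {
>       cumulative += weights[i];
>       if (r < cumulative) {
>         return i;
>       }
>     }
>     return weights.length - 1; // guard against floating point rounding
>   }
> }
> {code}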



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (HBASE-26447) Make enableReplicationPeer/disableReplicationPeer idempotent

2021-11-10 Thread Nishtha Shah (Jira)
Nishtha Shah created HBASE-26447:


 Summary: Make enableReplicationPeer/disableReplicationPeer 
idempotent
 Key: HBASE-26447
 URL: https://issues.apache.org/jira/browse/HBASE-26447
 Project: HBase
  Issue Type: Improvement
  Components: Admin
Reporter: Nishtha Shah
Assignee: Nishtha Shah


When enableReplicationPeer is called and the peer is already enabled, a 
DoNotRetryIOException is thrown as part of preEnablePeer 
[here|https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/master/replication/ReplicationPeerManager.java#L164]; 
similarly for disableReplicationPeer 
[here|https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/master/replication/ReplicationPeerManager.java#L171].

java.lang.RuntimeException: org.apache.hadoop.hbase.DoNotRetryIOException: 
Replication peer 1 has already been enabled

Ideally, it should not throw a RuntimeException if the peer is already in the 
desired state.


Either:
1. Add a check before trying to enable/disable the peer: if it is already in the 
desired state, return; otherwise enable/disable it, or
2. Log a message instead of throwing an exception in preEnablePeer/preDisablePeer.
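
A minimal sketch of option 1, with a made-up class that only illustrates the 
shape of the check (this is not the ReplicationPeerManager code):

{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical illustration: enable/disable become no-ops when the peer is
// already in the desired state, instead of throwing DoNotRetryIOException.
public class IdempotentPeerToggle {
  private final Map<String, Boolean> peerEnabled = new ConcurrentHashMap<>();

  public void addPeer(String peerId, boolean enabled) {
    peerEnabled.put(peerId, enabled);
  }

  /** Returns true if the state changed, false if the peer was already enabled. */
  public boolean enablePeer(String peerId) {
    Boolean previous = peerEnabled.replace(peerId, true);
    if (previous == null) {
      throw new IllegalArgumentException("Unknown peer " + peerId);
    }
    return !previous; // already enabled -> quiet no-op rather than an exception
  }

  /** Returns true if the state changed, false if the peer was already disabled. */
  public boolean disablePeer(String peerId) {
    Boolean previous = peerEnabled.replace(peerId, false);
    if (previous == null) {
      throw new IllegalArgumentException("Unknown peer " + peerId);
    }
    return previous; // already disabled -> quiet no-op rather than an exception
  }
}
{code}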



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (HBASE-26446) CellCounter should report serialized cell size counts too

2021-11-10 Thread Andrew Kyle Purtell (Jira)
Andrew Kyle Purtell created HBASE-26446:
---

 Summary: CellCounter should report serialized cell size counts too
 Key: HBASE-26446
 URL: https://issues.apache.org/jira/browse/HBASE-26446
 Project: HBase
  Issue Type: Improvement
Reporter: Andrew Kyle Purtell
Assignee: Andrew Kyle Purtell
 Fix For: 2.5.0, 3.0.0-alpha-2






--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (HBASE-26445) Procedure state pretty-printing should use toStringBinary not base64 encoding

2021-11-10 Thread Andrew Kyle Purtell (Jira)
Andrew Kyle Purtell created HBASE-26445:
---

 Summary: Procedure state pretty-printing should use toStringBinary 
not base64 encoding
 Key: HBASE-26445
 URL: https://issues.apache.org/jira/browse/HBASE-26445
 Project: HBase
  Issue Type: Task
Affects Versions: 2.4.8
Reporter: Andrew Kyle Purtell
 Fix For: 2.5.0, 3.0.0-alpha-2, 2.4.9


The shell 'list_procedures' command produces output like:

 889 org.apache.hadoop.hbase.master.assignment.SplitTableRegionProcedure 
SUCCESS 2021-11-10 22:20:34 UTC 2021-11-10 22:20:35 UTC [{"state"=>[1, 2, 3, 
11, 4, 5, 6, 7, 8, 9, 10, 2147483648]}, {"regionId"=>"1636579678894", 
"tableName"=>{"namespace"=>"ZGVmYXVsdA==", 
"qualifier"=>"SW50ZWdyYXRpb25UZXN0TG9hZENvbW1vbkNyYXds"}, 
"startKey"=>"dWsuY28uZ3Jhbml0ZXRyYW5zZm9ybWF0aW9ucy53d3d8L2dhbGxlcnkvdA==", 
"endKey"=>"dXMuYmFuZHwvYmFuZC81OA==", "offline"=>false, "split"=>false, 
"replicaId"=>0}, {"userInfo"=>{"effectiveUser"=>"apurtell"}, 
"parentRegionInfo"=>{"regionId"=>"1636579678894", 
"tableName"=>{"namespace"=>"ZGVmYXVsdA==", 
"qualifier"=>"SW50ZWdyYXRpb25UZXN0TG9hZENvbW1vbkNyYXds"}, 
"startKey"=>"dWsuY28uZ3Jhbml0ZXRyYW5zZm9ybWF0aW9ucy53d3d8L2dhbGxlcnkvdA==", 
"endKey"=>"dXMuYmFuZHwvYmFuZC81OA==", "offline"=>false, "split"=>false, 
"replicaId"=>0}, "childRegionInfo"=>[{"regionId"=>"1636582834759", 
"tableName"=>{"namespace"=>"ZGVmYXVsdA==", 
"qualifier"=>"SW50ZWdyYXRpb25UZXN0TG9hZENvbW1vbkNyYXds"}, 
"startKey"=>"dWsuY28uZ3Jhbml0ZXRyYW5zZm9ybWF0aW9ucy53d3d8L2dhbGxlcnkvdA==", 
"endKey"=>"dWsuY28uc2ltb25hbmRzY2h1c3Rlci53d3d8L2Jvb2tzL1RoZS1P", 
"offline"=>false, "split"=>false, "replicaId"=>0}, 
{"regionId"=>"1636582834759", "tableName"=>{"namespace"=>"ZGVmYXVsdA==", 
"qualifier"=>"SW50ZWdyYXRpb25UZXN0TG9hZENvbW1vbkNyYXds"}, 
"startKey"=>"dWsuY28uc2ltb25hbmRzY2h1c3Rlci53d3d8L2Jvb2tzL1RoZS1P", 
"endKey"=>"dXMuYmFuZHwvYmFuZC81OA==", "offline"=>false, "split"=>false, 
"replicaId"=>0}]}]

The base64 encoding of byte[] values hurts usability. It would be better to use 
Bytes.toStringBinary. Generally, table names and similar values are printable 
characters stored in byte[]; base64 encoding them completely obscures 
information that is important to see at a glance.
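
A small comparison for the table qualifier from the output above, assuming 
hbase-common is on the classpath for Bytes.toStringBinary:

{code:java}
import java.util.Base64;
import org.apache.hadoop.hbase.util.Bytes;

public class EncodingDemo {
  public static void main(String[] args) {
    byte[] qualifier = Bytes.toBytes("IntegrationTestLoadCommonCrawl");
    // Base64, as in the current pretty-printed procedure state: opaque at a glance.
    System.out.println(Base64.getEncoder().encodeToString(qualifier));
    // prints: SW50ZWdyYXRpb25UZXN0TG9hZENvbW1vbkNyYXds
    // toStringBinary: printable bytes stay readable, others become \xNN escapes.
    System.out.println(Bytes.toStringBinary(qualifier));
    // prints: IntegrationTestLoadCommonCrawl
  }
}
{code}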



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (HBASE-26444) BucketCacheWriter should log only the BucketAllocatorException message, not the full stack trace

2021-11-10 Thread Andrew Kyle Purtell (Jira)
Andrew Kyle Purtell created HBASE-26444:
---

 Summary: BucketCacheWriter should log only the 
BucketAllocatorException message, not the full stack trace
 Key: HBASE-26444
 URL: https://issues.apache.org/jira/browse/HBASE-26444
 Project: HBase
  Issue Type: Task
Affects Versions: 2.4.8
Reporter: Andrew Kyle Purtell
Assignee: Andrew Kyle Purtell
 Fix For: 2.5.0, 3.0.0-alpha-2, 2.4.9


We have recently improved the logging of BucketAllocatorException so that it does 
not overwhelm the log when there are a lot of blocks that are too large, but when 
we do log it, we still include the full stack trace. That is wasteful: the 
messages always come from the same location in the source code, so the stack 
trace provides no value.
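
A minimal sketch of the intended change (hypothetical class and messages, not 
the actual BucketCache writer code), using the SLF4J API that HBase logging is 
built on:

{code:java}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class MessageOnlyLogging {
  private static final Logger LOG = LoggerFactory.getLogger(MessageOnlyLogging.class);

  static void handleAllocationFailure(Exception allocatorException) {
    // Before: passing the throwable logs the full stack trace on every occurrence.
    // LOG.warn("Failed to allocate for block", allocatorException);
    // After: the message alone identifies the problem (e.g. block too large).
    LOG.warn("Failed to allocate for block: {}", allocatorException.getMessage());
  }
}
{code}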



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (HBASE-26443) Some BaseLoadBalancer log lines should be at DEBUG level

2021-11-10 Thread Andrew Kyle Purtell (Jira)
Andrew Kyle Purtell created HBASE-26443:
---

 Summary: Some BaseLoadBalancer log lines should be at DEBUG level
 Key: HBASE-26443
 URL: https://issues.apache.org/jira/browse/HBASE-26443
 Project: HBase
  Issue Type: Task
Affects Versions: 2.4.9
Reporter: Andrew Kyle Purtell
Assignee: Andrew Kyle Purtell
 Fix For: 2.5.0, 3.0.0-alpha-2, 2.4.9


These are printed per chore run, per host:

[00]2021-11-10 22:07:32,984 INFO  [master/ip-172-31-58-47:8100.Chore.1] 
balancer.BaseLoadBalancer: Start Generate Balance plan for cluster.
[00]2021-11-10 22:07:32,998 INFO  [master/ip-172-31-58-47:8100.Chore.1] 
balancer.BaseLoadBalancer: server  is on rack 


One log line per server. On a large cluster, this is a lot of unnecessary 
logging. 

The 'Start Generate Balance plan for cluster.' log line should be at DEBUG.


The 'server  is on rack ' log line arguably should be at TRACE. It will 
never change. It is not interesting unless you are debugging the balancer.
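
A sketch of the proposed levels (hypothetical class, not the actual 
BaseLoadBalancer code), again using the SLF4J API:

{code:java}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class BalancerLogging {
  private static final Logger LOG = LoggerFactory.getLogger(BalancerLogging.class);

  static void logPlanStart() {
    // Was INFO; emitted once per chore run.
    LOG.debug("Start Generate Balance plan for cluster.");
  }

  static void logServerRack(String server, String rack) {
    // Was INFO; emitted once per server and essentially never changes.
    LOG.trace("server {} is on rack {}", server, rack);
  }
}
{code}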



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (HBASE-26442) TestReplicationEndpoint#testInterClusterReplication fails in branch-1

2021-11-10 Thread Rushabh Shah (Jira)
Rushabh Shah created HBASE-26442:


 Summary: TestReplicationEndpoint#testInterClusterReplication fails 
in branch-1
 Key: HBASE-26442
 URL: https://issues.apache.org/jira/browse/HBASE-26442
 Project: HBase
  Issue Type: Bug
Affects Versions: 1.7.1
Reporter: Rushabh Shah
Assignee: Rushabh Shah


{noformat}
[INFO] --- maven-surefire-plugin:2.22.2:test (default-test) @ hbase-server ---
[INFO] 
[INFO] ---
[INFO]  T E S T S
[INFO] ---
[INFO] Running org.apache.hadoop.hbase.replication.TestReplicationEndpoint
[ERROR] Tests run: 2, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 20.978 
s <<< FAILURE! - in org.apache.hadoop.hbase.replication.TestReplicationEndpoint
[ERROR] org.apache.hadoop.hbase.replication.TestReplicationEndpoint  Time 
elapsed: 3.921 s  <<< FAILURE!
java.lang.AssertionError
at org.junit.Assert.fail(Assert.java:86)
at org.junit.Assert.assertTrue(Assert.java:41)
at org.junit.Assert.assertTrue(Assert.java:52)
at 
org.apache.hadoop.hbase.replication.TestReplicationEndpoint.tearDownAfterClass(TestReplicationEndpoint.java:88)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
at 
org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:33)
at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
at 
org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
at 
org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
at 
org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
at 
org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)

[INFO] 
[INFO] Results:
[INFO] 
[ERROR] Failures: 
[ERROR]   TestReplicationEndpoint.tearDownAfterClass:88
[INFO] 
[ERROR] Tests run: 2, Failures: 1, Errors: 0, Skipped: 0
{noformat}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Resolved] (HBASE-26384) Segment already flushed to hfile may still be remained in CompactingMemStore

2021-11-10 Thread Andrew Kyle Purtell (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Kyle Purtell resolved HBASE-26384.
-
Resolution: Fixed

> Segment already flushed to hfile may still be remained in CompactingMemStore 
> -
>
> Key: HBASE-26384
> URL: https://issues.apache.org/jira/browse/HBASE-26384
> Project: HBase
>  Issue Type: Bug
>  Components: in-memory-compaction
>Affects Versions: 3.0.0-alpha-1, 2.4.8
>Reporter: chenglei
>Assignee: chenglei
>Priority: Major
>  Labels: branch-2
> Fix For: 2.5.0, 3.0.0-alpha-2
>
>
> When {{CompactingMemStore}} prepares to flush, 
> {{CompactingMemStore.snapshot}} invokes the following 
> {{CompactingMemStore.pushPipelineToSnapshot}} method to get the {{Snapshot}}. 
> Lines 570 and 575 use {{CompactionPipeline#version}} to track whether the 
> Segments in {{CompactionPipeline#pipeline}} have changed between getting the 
> {{VersionedSegmentsList}} in line 570 and emptying 
> {{CompactionPipeline#pipeline}} in line 575.
>   {code:java}
>   565private void pushPipelineToSnapshot() {
>   566int iterationsCnt = 0;
>   567boolean done = false;
>   568while (!done) {
>   569  iterationsCnt++;
>   570  VersionedSegmentsList segments = 
> pipeline.getVersionedList();
>   571  pushToSnapshot(segments.getStoreSegments());
>   572  // swap can return false in case the pipeline was updated 
> by ongoing compaction
>   573 // and the version increase, the chance of it happenning is 
> very low
>   574 // In Swap: don't close segments (they are in snapshot now) 
> and don't update the region size
>   575done = pipeline.swap(segments, null, false, false);
> ...
>   }
>{code}
> However, when {{CompactingMemStore#inMemoryCompaction}} executes 
> {{CompactionPipeline#flattenOneSegment}}, it does not change 
> {{CompactionPipeline#version}}. If an in-memory compaction executes 
> {{CompactingMemStore#flattenOneSegment}} between the above line 570 and line 
> 575, {{CompactionPipeline#version}} does not change, but the {{Segment}} in 
> {{CompactionPipeline}} has changed. Because {{CompactionPipeline#version}} has 
> not changed, {{pipeline.swap}} in line 575 thinks it is safe to invoke the 
> following {{CompactionPipeline#swapSuffix}} method to remove the {{Segment}} 
> from {{CompactionPipeline}}, but that {{Segment}} has been replaced by 
> {{CompactingMemStore#flattenOneSegment}}, so the {{Segment}} is not removed in 
> the following line 295 and remains in {{CompactionPipeline}}.
>   {code:java}
>   293  private void swapSuffix(List<? extends Segment> suffix, 
> ImmutableSegment segment,
>   294 boolean closeSegmentsInSuffix) {
>   295  pipeline.removeAll(suffix);
>   296  if(segment != null) pipeline.addLast(segment);
>  
> {code}
> However, {{CompactingMemStore.snapshot}} thinks it succeeded and continues to 
> flush the {{Segment}} it got as normal, but a {{Segment}} with the same cells 
> is still left in {{CompactingMemStore}}. Leaving an already-flushed {{Segment}} 
> in the {{MemStore}} is dangerous: if a major compaction runs before the 
> leftover {{Segment}} is flushed, the data may become erroneous.
> My fix in the PR is as follows:
> # Increase {{CompactionPipeline#version}} in 
> {{CompactingMemStore#flattenOneSegment}}. 
> Branch-2 has this problem but master does not, because the branch-2 patch for 
> HBASE-18375 omitted this. 
> # In {{CompactionPipeline#swapSuffix}}, explicitly check that each {{Segment}} 
> in the {{suffix}} input parameter is the same as the {{Segment}} in 
> {{pipeline}}, one by one from the last element to the first element of 
> {{suffix}}. I think explicitly throwing an Exception is better than hiding the 
> error and causing a subtle problem.
> I made separate PRs for master and branch-2 so the code for master and 
> branch-2 stays consistent and master also gets UTs for this problem.
> [PR#3777|https://github.com/apache/hbase/pull/3777] is for master and 
> [PR#3779|https://github.com/apache/hbase/pull/3779] is for branch-2. The 
> difference between them is that the branch-2 patch includes the following code 
> in {{CompactionPipeline.replaceAtIndex}}, which was not included in the 
> branch-2 patch for HBASE-18375:
> {code:java}
> // the version increment is indeed needed, because the swap uses 
> removeAll() method of the
> // linked-list that compares the objects to find what to remove.
> // The flattening changes the segment object completely (creation 
> pattern) and so
> // swap will not proceed 

[jira] [Created] (HBASE-26441) Add metrics for BrokenStoreFileCleaner

2021-11-10 Thread Szabolcs Bukros (Jira)
Szabolcs Bukros created HBASE-26441:
---

 Summary: Add metrics for BrokenStoreFileCleaner
 Key: HBASE-26441
 URL: https://issues.apache.org/jira/browse/HBASE-26441
 Project: HBase
  Issue Type: Sub-task
  Components: metrics
Reporter: Szabolcs Bukros
Assignee: Szabolcs Bukros


This is a follow-up for HBASE-26271.
Cleaner chores lacking visibility is a recurring issue, so I would like to add 
metrics for BrokenStoreFileCleaner to get a better idea of the tasks it 
performs.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (HBASE-26440) Region opens sometimes fail when bucket cache is in use

2021-11-10 Thread Andrew Kyle Purtell (Jira)
Andrew Kyle Purtell created HBASE-26440:
---

 Summary: Region opens sometimes fail when bucket cache is in use
 Key: HBASE-26440
 URL: https://issues.apache.org/jira/browse/HBASE-26440
 Project: HBase
  Issue Type: Bug
  Components: BlockCache, BucketCache, regionserver
Affects Versions: 2.4.8
Reporter: Andrew Kyle Purtell


After a split, I think, the region is reopened, and:

2021-11-10 16:47:48,929 ERROR 
[RS_OPEN_REGION-regionserver/ip-172-31-63-83:8120-1] regionserver.HRegion: 
Could not initialize all stores for the 
region=IntegrationTestLoadCommonCrawl,,1636562865609.e2df2061bcdc037070041de734a187ff.

Caused by: java.io.IOException: java.lang.RuntimeException: Cached block 
contents differ, which should not have happened.
cacheKey:1e395fa6f0584ebc9fd831f1451dc4ba.cbb4e3a74e703805898a2a3bf0af94e7_48543272
        at 
org.apache.hadoop.hbase.regionserver.HStore.openStoreFiles(HStore.java:577)
        at 
org.apache.hadoop.hbase.regionserver.HStore.loadStoreFiles(HStore.java:534)
        at org.apache.hadoop.hbase.regionserver.HStore.<init>(HStore.java:298)
        at 
org.apache.hadoop.hbase.regionserver.HRegion.instantiateHStore(HRegion.java:6546)

 

Caused by: java.lang.RuntimeException: Cached block contents differ, which 
should not have happened.
cacheKey:1e395fa6f0584ebc9fd831f1451dc4ba.cbb4e3a74e703805898a2a3bf0af94e7_48543272
        at 
org.apache.hadoop.hbase.io.hfile.BlockCacheUtil.validateBlockAddition(BlockCacheUtil.java:205)
        at 
org.apache.hadoop.hbase.io.hfile.BlockCacheUtil.shouldReplaceExistingCacheBlock(BlockCacheUtil.java:237)
        at 
org.apache.hadoop.hbase.io.hfile.bucket.BucketCache.shouldReplaceExistingCacheBlock(BucketCache.java:446)
        at 
org.apache.hadoop.hbase.io.hfile.bucket.BucketCache.cacheBlockWithWait(BucketCache.java:431)
        at 
org.apache.hadoop.hbase.io.hfile.bucket.BucketCache.cacheBlock(BucketCache.java:417)
        at 
org.apache.hadoop.hbase.io.hfile.CombinedBlockCache.cacheBlock(CombinedBlockCache.java:64)
        at 
org.apache.hadoop.hbase.io.hfile.HFileReaderImpl.lambda$readBlock$2(HFileReaderImpl.java:1346)


The open fails but another attempt somewhere else succeeds.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Resolved] (HBASE-26439) improve upgrading doc

2021-11-10 Thread Duo Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Duo Zhang resolved HBASE-26439.
---
Fix Version/s: 3.0.0-alpha-2
 Hadoop Flags: Reviewed
   Resolution: Fixed

Merged to master.

Thanks [~philipse] for contributing!

> improve upgrading doc
> -
>
> Key: HBASE-26439
> URL: https://issues.apache.org/jira/browse/HBASE-26439
> Project: HBase
>  Issue Type: Improvement
>  Components: documentation
>Affects Versions: 2.4.8
>Reporter: guo
>Assignee: guo
>Priority: Minor
> Fix For: 3.0.0-alpha-2
>
>
> improve upgrading doc



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (HBASE-26439) improve upgrading doc

2021-11-10 Thread guo (Jira)
guo created HBASE-26439:
---

 Summary: improve upgrading doc
 Key: HBASE-26439
 URL: https://issues.apache.org/jira/browse/HBASE-26439
 Project: HBase
  Issue Type: Improvement
  Components: documentation
Affects Versions: 2.4.8
Reporter: guo


improve upgrading doc



--
This message was sent by Atlassian Jira
(v8.20.1#820001)