[jira] [Commented] (HBASE-23881) Netty SASL implementation does not wait for challenge response causing TestShadeSaslAuthenticationProvider failures

2023-09-25 Thread Josh Elser (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17768763#comment-17768763
 ] 

Josh Elser commented on HBASE-23881:


I'll see if I have any time after work to backport this to branch-2, but I 
would not wait for me to do it if this is actively broken :) 

> Netty SASL implementation does not wait for challenge response causing 
> TestShadeSaslAuthenticationProvider failures
> ---
>
> Key: HBASE-23881
> URL: https://issues.apache.org/jira/browse/HBASE-23881
> Project: HBase
>  Issue Type: Bug
>  Components: test
>Affects Versions: 3.0.0-alpha-1, 2.3.0
>Reporter: Bharath Vissapragada
>Assignee: Josh Elser
>Priority: Major
> Fix For: 3.0.0-alpha-1
>
>
> TestShadeSaslAuthenticationProvider now fails deterministically with the
> following exception:
> {noformat}
> java.lang.Exception: Unexpected exception, 
> expected but 
> was
>   at 
> org.apache.hadoop.hbase.security.provider.example.TestShadeSaslAuthenticationProvider.testNegativeAuthentication(TestShadeSaslAuthenticationProvider.java:233)
> {noformat}
> The test now fails in a different place than before merging HBASE-18095 because
> the RPCs are also a part of connection setup. We might need to rewrite the
> test.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HBASE-27486) HTable MetricsTableLatencies not remove trigger memory leak

2022-11-16 Thread Josh Elser (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-27486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17634939#comment-17634939
 ] 

Josh Elser commented on HBASE-27486:


Nope. I'd suggest you try to write a unit test to demonstrate this happening if
you plan to fix it. Relatedly, if you're still using HBase 2.0, I think you have
much larger problems, as that release has been EOL for quite some time.

> HTable MetricsTableLatencies not remove trigger memory leak 
> 
>
> Key: HBASE-27486
> URL: https://issues.apache.org/jira/browse/HBASE-27486
> Project: HBase
>  Issue Type: Bug
>  Components: metrics, regionserver
>Affects Versions: 2.0.0
>Reporter: Moran
>Priority: Major
>
> MetricsTableLatenciesImpl's histogramsByTable map is only ever added to and
> never removed from, so entries accumulate for every table over the lifetime of
> the process. Maybe we should remove a table's entry when the table is disabled.
> Additionally, MetricsTableQueryMeterImpl's metersByTable map has the same problem.
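
For illustration only, here is a minimal sketch (not the actual MetricsTableLatenciesImpl
code) of the leak pattern being described and the proposed cleanup, modeled with a plain
map keyed by table name:
{code:java}
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.LongAdder;

// Simplified model of a per-table metrics registry: entries are created lazily on
// first use but, as described in the report above, nothing ever removes them.
class PerTableMetrics {
  private final ConcurrentMap<String, LongAdder> latenciesByTable = new ConcurrentHashMap<>();

  void record(String tableName, long latencyMillis) {
    // computeIfAbsent keeps adding an entry for every table ever touched.
    latenciesByTable.computeIfAbsent(tableName, t -> new LongAdder()).add(latencyMillis);
  }

  // Proposed shape of the fix: drop the entry when the table is disabled (or deleted),
  // so the map cannot grow without bound over the lifetime of the RegionServer.
  void onTableDisabled(String tableName) {
    latenciesByTable.remove(tableName);
  }
}
{code}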



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: Please add me to Jira

2022-07-01 Thread Josh Elser

You're now a contributor :)

On 6/27/22 10:19 AM, Luca Kovács wrote:

Hello,

My name is Luca, and I would like to contribute to the Apache project.
Please add me to HBase Jira project.

My username is: lkovacs

Many thanks,
Luca



[jira] [Resolved] (HBASE-20951) Ratis LogService backed WALs

2022-06-13 Thread Josh Elser (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-20951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Elser resolved HBASE-20951.

Resolution: Later

> Ratis LogService backed WALs
> 
>
> Key: HBASE-20951
> URL: https://issues.apache.org/jira/browse/HBASE-20951
> Project: HBase
>  Issue Type: New Feature
>  Components: wal
>    Reporter: Josh Elser
>Priority: Major
>
> Umbrella issue for the Ratis+WAL work:
> Design doc: 
> [https://docs.google.com/document/d/1Su5py_T5Ytfh9RoTTX2s20KbSJwBHVxbO7ge5ORqbCk/edit#|https://docs.google.com/document/d/1Su5py_T5Ytfh9RoTTX2s20KbSJwBHVxbO7ge5ORqbCk/edit]
> The (over-simplified) goal is to re-think the current WAL APIs we have now, 
> ensure that they are de-coupled from the notion of being backed by HDFS, swap 
> the current implementations over to the new API, and then wire up the Ratis 
> LogService to the new WAL API.
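
As a hedged illustration only (not the design-doc API), "de-coupled from the notion of
being backed by HDFS" could mean a WAL abstraction roughly this small, with no HDFS types
in its signatures so that an HDFS- or Ratis LogService-backed implementation can sit
behind it; the interface and method names here are hypothetical:
{code:java}
import java.io.IOException;

public interface WriteAheadLog extends AutoCloseable {
  /** Appends an encoded edit and returns the sequence id assigned to it. */
  long append(byte[] encodedEdit) throws IOException;

  /** Blocks until all edits up to and including sequenceId are durable. */
  void sync(long sequenceId) throws IOException;

  @Override
  void close() throws IOException;
}
{code}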



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (HBASE-20951) Ratis LogService backed WALs

2022-06-13 Thread Josh Elser (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-20951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17553775#comment-17553775
 ] 

Josh Elser commented on HBASE-20951:


{quote}I am thinking of resolving this and all subtasks as WontFix or Abandoned.
{quote}
Yeah, I think a "Later" is appropriate. Let me do this.

> Ratis LogService backed WALs
> 
>
> Key: HBASE-20951
> URL: https://issues.apache.org/jira/browse/HBASE-20951
> Project: HBase
>  Issue Type: New Feature
>  Components: wal
>Reporter: Josh Elser
>Priority: Major
>
> Umbrella issue for the Ratis+WAL work:
> Design doc: 
> [https://docs.google.com/document/d/1Su5py_T5Ytfh9RoTTX2s20KbSJwBHVxbO7ge5ORqbCk/edit#|https://docs.google.com/document/d/1Su5py_T5Ytfh9RoTTX2s20KbSJwBHVxbO7ge5ORqbCk/edit]
> The (over-simplified) goal is to re-think the current WAL APIs we have now, 
> ensure that they are de-coupled from the notion of being backed by HDFS, swap 
> the current implementations over to the new API, and then wire up the Ratis 
> LogService to the new WAL API.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (HBASE-26708) Netty "leak detected" and OutOfDirectMemoryError due to direct memory buffering

2022-06-08 Thread Josh Elser (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17551703#comment-17551703
 ] 

Josh Elser commented on HBASE-26708:


{quote}Any luck with reproducing this on test clusters
{quote}
No, sadly, but we did have another report of it against HBase 2.2ish.
{quote}Does increase MaxDirectMemorySize can solve the problem?
{quote}
In my experience, no. It has crept up constantly to whatever limit we've met in
cases where a user is hitting it. I think we saw this going up to 50G, but I
know we were also not running against the latest from 2.2.x, so I can't say we
weren't hitting already-fixed bugs.
{quote}And what I mean is do you only hit this problem when sasl authentication 
is enabled? This is also an important information, as we will setup more netty 
handlers when sasl is enabled, which may not be covered too much in our tests.
{quote}
SASL has been enabled in the cases we've seen it (users I talk to are rarely 
running without security on these days).

> Netty "leak detected" and OutOfDirectMemoryError due to direct memory 
> buffering
> ---
>
> Key: HBASE-26708
> URL: https://issues.apache.org/jira/browse/HBASE-26708
> Project: HBase
>  Issue Type: Bug
>  Components: rpc
>Affects Versions: 2.5.0, 2.4.6
>Reporter: Viraj Jasani
>Priority: Major
>
> Under constant data ingestion, using default Netty based RpcServer and 
> RpcClient implementation results in OutOfDirectMemoryError, supposedly caused 
> by leaks detected by Netty's LeakDetector.
> {code:java}
> 2022-01-25 17:03:10,084 ERROR [S-EventLoopGroup-1-3] 
> util.ResourceLeakDetector - java:115)
>   
> org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.expandCumulation(ByteToMessageDecoder.java:538)
>   
> org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder$1.cumulate(ByteToMessageDecoder.java:97)
>   
> org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:274)
>   
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
>   
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
>   
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
>   
> org.apache.hbase.thirdparty.io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
>   
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
>   
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
>   
> org.apache.hbase.thirdparty.io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
>   
> org.apache.hbase.thirdparty.io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:795)
>   
> org.apache.hbase.thirdparty.io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:480)
>   
> org.apache.hbase.thirdparty.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378)
>   
> org.apache.hbase.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
>   
> org.apache.hbase.thirdparty.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>   
> org.apache.hbase.thirdparty.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>   java.lang.Thread.run(Thread.java:748)
>  {code}
> {code:java}
> 2022-01-25 17:03:14,014 ERROR [S-EventLoopGroup-1-3] 
> util.ResourceLeakDetector - 
> apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:507)
>   
> org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:446)
>   
> org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:276)
>   
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
>   
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
>   
> org.apache.hbase.thirdparty.io.netty.channel.Abstra

[jira] [Resolved] (HBASE-27042) hboss doesn't compile against hadoop branch-3.3 now that s3guard is cut

2022-05-23 Thread Josh Elser (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Elser resolved HBASE-27042.

Hadoop Flags: Reviewed
Release Note: Adds support for Apache Hadoop 3.3.3 and removes S3Guard 
vestiges.
  Resolution: Fixed

Thanks Steve!

> hboss doesn't compile against hadoop branch-3.3 now that s3guard is cut
> ---
>
> Key: HBASE-27042
> URL: https://issues.apache.org/jira/browse/HBASE-27042
> Project: HBase
>  Issue Type: Bug
>  Components: hboss
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Minor
> Fix For: hbase-filesystem-1.0.0-alpha2
>
>
> HBoss doesn't compile against hadoop builds containing HADOOP-17409, "remove 
> s3guard", as test setup tries to turn it off.
> there's no need for s3guard any more, so hboss can just avoid all settings 
> and expect it to be disabled (hadoop 3.3.3. or earlier) or removed (3.4+)
> (hboss version is 1.0.0-alpha2-SNAPSHOT)
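
As a rough sketch of the "avoid all settings" point (the property name below is the
pre-removal Hadoop S3A key and is shown only for illustration; this is not the actual
HBoss patch):
{code:java}
import org.apache.hadoop.conf.Configuration;

public class S3GuardFreeTestConf {
  public static Configuration create() {
    Configuration conf = new Configuration();
    // The old test setup explicitly disabled S3Guard, e.g. by configuring
    // "fs.s3a.metadatastore.impl", via constants/classes that HADOOP-17409 removed.
    // Setting nothing works for both worlds: S3Guard is disabled by default on
    // Hadoop 3.3.3 and earlier, and simply does not exist on 3.4+.
    return conf;
  }
}
{code}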



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Assigned] (HBASE-27042) hboss doesn't compile against hadoop branch-3.3 now that s3guard is cut

2022-05-23 Thread Josh Elser (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Elser reassigned HBASE-27042:
--

Assignee: Steve Loughran  (was: Steve Loughran)

> hboss doesn't compile against hadoop branch-3.3 now that s3guard is cut
> ---
>
> Key: HBASE-27042
> URL: https://issues.apache.org/jira/browse/HBASE-27042
> Project: HBase
>  Issue Type: Bug
>  Components: hboss
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Minor
>
> HBoss doesn't compile against hadoop builds containing HADOOP-17409, "remove 
> s3guard", as test setup tries to turn it off.
> there's no need for s3guard any more, so hboss can just avoid all settings 
> and expect it to be disabled (hadoop 3.3.3. or earlier) or removed (3.4+)
> (hboss version is 1.0.0-alpha2-SNAPSHOT)



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HBASE-27042) hboss doesn't compile against hadoop branch-3.3 now that s3guard is cut

2022-05-23 Thread Josh Elser (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Elser updated HBASE-27042:
---
Fix Version/s: hbase-filesystem-1.0.0-alpha2

> hboss doesn't compile against hadoop branch-3.3 now that s3guard is cut
> ---
>
> Key: HBASE-27042
> URL: https://issues.apache.org/jira/browse/HBASE-27042
> Project: HBase
>  Issue Type: Bug
>  Components: hboss
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Minor
> Fix For: hbase-filesystem-1.0.0-alpha2
>
>
> HBoss doesn't compile against hadoop builds containing HADOOP-17409, "remove 
> s3guard", as test setup tries to turn it off.
> there's no need for s3guard any more, so hboss can just avoid all settings 
> and expect it to be disabled (hadoop 3.3.3. or earlier) or removed (3.4+)
> (hboss version is 1.0.0-alpha2-SNAPSHOT)



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Assigned] (HBASE-27042) hboss doesn't compile against hadoop branch-3.3 now that s3guard is cut

2022-05-23 Thread Josh Elser (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Elser reassigned HBASE-27042:
--

Assignee: Steve Loughran

> hboss doesn't compile against hadoop branch-3.3 now that s3guard is cut
> ---
>
> Key: HBASE-27042
> URL: https://issues.apache.org/jira/browse/HBASE-27042
> Project: HBase
>  Issue Type: Bug
>  Components: hboss
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Minor
>
> HBoss doesn't compile against hadoop builds containing HADOOP-17409, "remove 
> s3guard", as test setup tries to turn it off.
> there's no need for s3guard any more, so hboss can just avoid all settings 
> and expect it to be disabled (hadoop 3.3.3. or earlier) or removed (3.4+)
> (hboss version is 1.0.0-alpha2-SNAPSHOT)



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (HBASE-27044) Serialized procedures which point to users from other Kerberos domains can prevent master startup

2022-05-16 Thread Josh Elser (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-27044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17537812#comment-17537812
 ] 

Josh Elser commented on HBASE-27044:


We could do a pretty naive "change" here where we just return an "unknown"
{{User}} when we fail to parse the serialized protobuf, which would be enough to
fix this problem on the surface.

However, I think this change is missing the root of the problem (the
expectation that HBase should just be able to "reattach" itself to an
hbase.rootdir).

I can't think of any way in which the above exception would be thrown other
than the cloud storage reattachment case I described. I'm happy to put up a
patch to gracefully handle the failure to create the UGI if folks think there
is merit in that.
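
Roughly, the "naive" fallback described above could look like the following sketch; the
class and method here are hypothetical, and the real code path is
MasterProcedureUtil.toUserInfo:
{code:java}
import org.apache.hadoop.hbase.security.User;
import org.apache.hadoop.security.UserGroupInformation;

final class LenientUserParsing {
  private LenientUserParsing() {}

  // If auth_to_local has no rule for the stored principal (e.g. it came from a
  // different Kerberos realm), return a placeholder user instead of letting the
  // IllegalArgumentException abort procedure loading and master startup.
  static User toUserLenient(String effectiveUser) {
    try {
      return User.create(UserGroupInformation.createRemoteUser(effectiveUser));
    } catch (IllegalArgumentException e) {
      return User.create(UserGroupInformation.createRemoteUser("unknown"));
    }
  }
}
{code}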

> Serialized procedures which point to users from other Kerberos domains can 
> prevent master startup
> -
>
> Key: HBASE-27044
> URL: https://issues.apache.org/jira/browse/HBASE-27044
> Project: HBase
>  Issue Type: Bug
>  Components: proc-v2
>Reporter: Josh Elser
>Priority: Major
>
> We ran into an interesting bug when test teams were running HBase against 
> cloud storage without ensuring that the previous location was cleaned. This 
> resulted in an hbase.rootdir that had:
>  * A valid HBase MasterData Region
>  * A valid hbase:meta
>  * A valid collection of HBase tables
>  * An empty ZooKeeper
> Through the changes that we've worked on previously, those described in
> HBASE-24286 were effective in getting everything _except_ the Procedures back
> online without issue. Parsing the existing procedures produced an interesting
> error:
> {noformat}
> java.lang.IllegalArgumentException: Illegal principal name 
> hbase/wrong-hostname.domain@WRONG_REALM: 
> org.apache.hadoop.security.authentication.util.KerberosName$NoMatchingRule: 
> No rules applied to hbase/wrong-hostname.domain@WRONG_REALM
>   at org.apache.hadoop.security.User.(User.java:51)
>   at org.apache.hadoop.security.User.(User.java:43)
>   at 
> org.apache.hadoop.security.UserGroupInformation.createRemoteUser(UserGroupInformation.java:1418)
>   at 
> org.apache.hadoop.security.UserGroupInformation.createRemoteUser(UserGroupInformation.java:1402)
>   at 
> org.apache.hadoop.hbase.master.procedure.MasterProcedureUtil.toUserInfo(MasterProcedureUtil.java:60)
>   at 
> org.apache.hadoop.hbase.master.procedure.ModifyTableProcedure.deserializeStateData(ModifyTableProcedure.java:262)
>   at 
> org.apache.hadoop.hbase.procedure2.ProcedureUtil.convertToProcedure(ProcedureUtil.java:294)
>   at 
> org.apache.hadoop.hbase.procedure2.store.ProtoAndProcedure.getProcedure(ProtoAndProcedure.java:43)
>   at 
> org.apache.hadoop.hbase.procedure2.store.InMemoryProcedureIterator.next(InMemoryProcedureIterator.java:90)
>   at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.loadProcedures(ProcedureExecutor.java:411)
>   at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$400(ProcedureExecutor.java:78)
>   at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$2.load(ProcedureExecutor.java:339)
>   at 
> org.apache.hadoop.hbase.procedure2.store.region.RegionProcedureStore.load(RegionProcedureStore.java:285)
>   at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.load(ProcedureExecutor.java:330)
>   at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.init(ProcedureExecutor.java:600)
>   at 
> org.apache.hadoop.hbase.master.HMaster.createProcedureExecutor(HMaster.java:1581)
>   at 
> org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:835)
>   at 
> org.apache.hadoop.hbase.master.HMaster.startActiveMasterManager(HMaster.java:2205)
>   at org.apache.hadoop.hbase.master.HMaster.lambda$run$0(HMaster.java:514)
>   at java.lang.Thread.run(Thread.java:750) {noformat}
> What's actually happening is that we are storing the {{User}} into the 
> procedure and then relying on UserGroupInformation to parse the {{User}} 
> protobuf into a UGI to get the "short" username.
> When the serialized procedure (whether in the MasterData region or via PV2
> WAL files, I think) gets loaded, we end up needing Hadoop auth_to_local
> configuration to be able to parse that kerberos principal back to a name. 
> However, Hadoop's KerberosName will only unwrap Kerberos principals which 
> match the local Kerberos realm (defined by the krb5.conf's default_realm, 
> [r

[jira] [Created] (HBASE-27044) Serialized procedures which point to users from other Kerberos domains can prevent master startup

2022-05-16 Thread Josh Elser (Jira)
Josh Elser created HBASE-27044:
--

 Summary: Serialized procedures which point to users from other 
Kerberos domains can prevent master startup
 Key: HBASE-27044
 URL: https://issues.apache.org/jira/browse/HBASE-27044
 Project: HBase
  Issue Type: Bug
  Components: proc-v2
Reporter: Josh Elser


We ran into an interesting bug when test teams were running HBase against cloud 
storage without ensuring that the previous location was cleaned. This resulted 
in an hbase.rootdir that had:
 * A valid HBase MasterData Region
 * A valid hbase:meta
 * A valid collection of HBase tables
 * An empty ZooKeeper

Through the changes that we've worked on previously, those described in HBASE-24286
were effective in getting everything _except_ the Procedures back online without
issue. Parsing the existing procedures produced an interesting error:
{noformat}
java.lang.IllegalArgumentException: Illegal principal name 
hbase/wrong-hostname.domain@WRONG_REALM: 
org.apache.hadoop.security.authentication.util.KerberosName$NoMatchingRule: No 
rules applied to hbase/wrong-hostname.domain@WRONG_REALM
at org.apache.hadoop.security.User.(User.java:51)
at org.apache.hadoop.security.User.(User.java:43)
at 
org.apache.hadoop.security.UserGroupInformation.createRemoteUser(UserGroupInformation.java:1418)
at 
org.apache.hadoop.security.UserGroupInformation.createRemoteUser(UserGroupInformation.java:1402)
at 
org.apache.hadoop.hbase.master.procedure.MasterProcedureUtil.toUserInfo(MasterProcedureUtil.java:60)
at 
org.apache.hadoop.hbase.master.procedure.ModifyTableProcedure.deserializeStateData(ModifyTableProcedure.java:262)
at 
org.apache.hadoop.hbase.procedure2.ProcedureUtil.convertToProcedure(ProcedureUtil.java:294)
at 
org.apache.hadoop.hbase.procedure2.store.ProtoAndProcedure.getProcedure(ProtoAndProcedure.java:43)
at 
org.apache.hadoop.hbase.procedure2.store.InMemoryProcedureIterator.next(InMemoryProcedureIterator.java:90)
at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.loadProcedures(ProcedureExecutor.java:411)
at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$400(ProcedureExecutor.java:78)
at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor$2.load(ProcedureExecutor.java:339)
at 
org.apache.hadoop.hbase.procedure2.store.region.RegionProcedureStore.load(RegionProcedureStore.java:285)
at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.load(ProcedureExecutor.java:330)
at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.init(ProcedureExecutor.java:600)
at 
org.apache.hadoop.hbase.master.HMaster.createProcedureExecutor(HMaster.java:1581)
at 
org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:835)
at 
org.apache.hadoop.hbase.master.HMaster.startActiveMasterManager(HMaster.java:2205)
at org.apache.hadoop.hbase.master.HMaster.lambda$run$0(HMaster.java:514)
at java.lang.Thread.run(Thread.java:750) {noformat}
What's actually happening is that we are storing the {{User}} into the 
procedure and then relying on UserGroupInformation to parse the {{User}} 
protobuf into a UGI to get the "short" username.

When the serialized procedure (whether in the MasterData region or via PV2
WAL files, I think) gets loaded, we end up needing Hadoop auth_to_local
configuration to be able to parse that kerberos principal back to a name. 
However, Hadoop's KerberosName will only unwrap Kerberos principals which match 
the local Kerberos realm (defined by the krb5.conf's default_realm, 
[ref|https://github.com/frohoff/jdk8u-jdk/blob/master/src/share/classes/sun/security/krb5/Config.java#L978-L983])

The interesting part is that we don't seem to ever use the user _other_ than to 
display the {{owner}} attribute for procedures on the HBase UI. There is a 
method in hbase-procedure which can filter procedures based on Owner, but I 
didn't see any usages of that method.

Given the pushback against HBASE-24286, I assume that, for the same reasons, we 
would see pushback against fixing this issue. However, I wanted to call it out 
for posterity. The expectation of users is that HBase _should_ implicitly 
handle this case.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (RANGER-3758) Decrease log-level when no HBase remote client address is found

2022-05-12 Thread Josh Elser (Jira)
Josh Elser created RANGER-3758:
--

 Summary: Decrease log-level when no HBase remote client address is 
found
 Key: RANGER-3758
 URL: https://issues.apache.org/jira/browse/RANGER-3758
 Project: Ranger
  Issue Type: Task
  Components: Ranger
Reporter: Josh Elser


We deal with really annoying logging in HBase services because of this one line
{noformat}
2022-04-17 17:51:05,481 INFO 
org.apache.ranger.authorization.hbase.RangerAuthorizationCoprocessor: Unable to 
get remote Address {noformat}
20% of all HBase logging in one RegionServer is from this one log message. 
There is zero value derived from this log message as it is completely expected 
that HBase will perform operations on its own which Ranger would audit.
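
The proposed change is just a log-level demotion; a minimal sketch (shown with SLF4J for
illustration, since Ranger's actual logging setup may differ) would be:
{code:java}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class RemoteAddressLogging {
  private static final Logger LOG = LoggerFactory.getLogger(RemoteAddressLogging.class);

  void onMissingRemoteAddress() {
    // A missing remote client address is expected for HBase-internal operations,
    // so log it at DEBUG instead of INFO.
    LOG.debug("Unable to get remote Address");
  }
}
{code}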



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (HBASE-27013) Introduce read all bytes when using pread for prefetch

2022-05-10 Thread Josh Elser (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-27013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17534638#comment-17534638
 ] 

Josh Elser commented on HBASE-27013:


{quote}In the case of the input stream read short and when the input stream 
read passed the length of the necessary data block with few more bytes within 
the size of next block header, the 
[BlockIOUtils#preadWithExtra|https://github.com/apache/hbase/blob/9c8c9e7fbf8005ea89fa9b13d6d063b9f0240443/hbase-common/src/main/java/org/apache/hadoop/hbase/io/util/BlockIOUtils.java#L214-L257]
 returns to the caller without a cached the next block header. As a result, 
before HBase tries to read the next block, 
[HFileBlock#readBlockDataInternal|https://github.com/apache/hbase/blob/9c8c9e7fbf8005ea89fa9b13d6d063b9f0240443/hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlock.java#L1648-L1664]
 in hbase tries to re-read the next block header from the input stream.
{quote}
If we read the comment on the code that Stephen called out in 
readBlockDataInternal, you can find:
{code:java}
If header was not cached (see getCachedHeader above), need to seek to pull it 
in. This is costly and should happen very rarely {code}
And then you had also said:
{quote}The root cause of above issue was due to 
[BlockIOUtils#preadWithExtra|https://github.com/apache/hbase/blob/9c8c9e7fbf8005ea89fa9b13d6d063b9f0240443/hbase-common/src/main/java/org/apache/hadoop/hbase/io/util/BlockIOUtils.java#L214-L257]
 is reading an input stream that does not guarrentee to return the data block 
and the next block header as an option data to be cached.
{quote}
I think what you're saying is the following.
 # Read header for block1
 # Read block1 and try to read block2's header
 # Read block2 and try to read block3's header
 # Repeat

This would align with the comment, too: the last time we read, we tried to get
the header cached, so that the _next_ time we come back to read, we already have
that header cached and can avoid another {{seek()}} (through the pread).

The very high-level reading of the HBase code would indicate to me that we
_expect_ to read the (n+1)th block header when reading the nth block. I would
assume that we also want this for HDFS-based clusters, but HDFS just does a good
enough job that we haven't noticed this being a problem (short-circuit reads
making our lives happy?).

I think attempting to read off the end of a file is not a big concern since 
we're just pulling those extra bytes off in the current read. I am thinking 
about a different drawback where, if the InputStream isn't giving us the bytes 
we asked for back, why was that? Did it take over some threshold of time? If we 
go back and ask HDFS (or S3) again "give me those extra bytes", would we 
increase the overall latency? Genuinely not sure.

I think, long-term, it makes sense for this configuration to be on by default,
but I am motivated to expose this configuration property for additional testing
on HDFS while committing this change to help the S3-based prefetching workload.
I'm leaning towards putting this in since the risk is low (given my
understanding).

WDYT, Duo?
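
To make the proposal concrete, here is a hedged sketch of what "read all bytes" means for
the pread path (this is not the actual BlockIOUtils change; the method name and signature
are illustrative, and it assumes org.apache.hadoop.fs.FSDataInputStream):
{code:java}
import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;

final class ReadAllBytesSketch {
  private ReadAllBytesSketch() {}

  /**
   * Keeps issuing positioned reads until necessaryLen (the data block) plus extraLen
   * (the next block's header) bytes are filled, instead of returning after the first
   * short read. Returns true if the extra header bytes were read as well.
   */
  static boolean preadAllBytes(FSDataInputStream in, byte[] buf, long fileOffset,
      int necessaryLen, int extraLen) throws IOException {
    int total = 0;
    final int target = necessaryLen + extraLen;
    while (total < target) {
      int n = in.read(fileOffset + total, buf, total, target - total);
      if (n < 0) {
        // EOF is acceptable once the block itself is complete (the last block in a
        // file has no following header); otherwise the file is truncated.
        if (total >= necessaryLen) {
          return false;
        }
        throw new IOException("Premature EOF: read " + total + " of " + necessaryLen
            + " required bytes at offset " + fileOffset);
      }
      total += n;
    }
    return extraLen > 0;
  }
}
{code}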

> Introduce read all bytes when using pread for prefetch
> --
>
> Key: HBASE-27013
> URL: https://issues.apache.org/jira/browse/HBASE-27013
> Project: HBase
>  Issue Type: Improvement
>  Components: HFile, Performance
>Affects Versions: 2.5.0, 2.6.0, 3.0.0-alpha-3, 2.4.13
>Reporter: Tak-Lon (Stephen) Wu
>Assignee: Tak-Lon (Stephen) Wu
>Priority: Major
>
> h2. Problem statement
> When prefetching HFiles from blob storage like S3 and use it with the storage 
> implementation like S3A, we found there is a logical issue in HBase pread 
> that causes the reading of the remote HFile aborts the input stream multiple 
> times. This aborted stream and reopen slow down the reads and trigger many 
> aborted bytes and waste time in recreating the connection especially when SSL 
> is enabled.
> h2. ROOT CAUSE
> The root cause of above issue was due to 
> [BlockIOUtils#preadWithExtra|https://github.com/apache/hbase/blob/9c8c9e7fbf8005ea89fa9b13d6d063b9f0240443/hbase-common/src/main/java/org/apache/hadoop/hbase/io/util/BlockIOUtils.java#L214-L257]
>  is reading an input stream that does not guarrentee to return the data block 
> and the next block header as an option data to be cached.
> In the case of the input stream read short and when the input stream read 
> passed the length of the necessary data block with few more bytes within the 
> size of next block header, the 
> [BlockIOUtils#preadWithExtra|https://github.com/apache/hbase/blob/9c8c9e7fbf8005ea89fa9b13d6d063b9f0240443/hbase-common/sr

[jira] [Comment Edited] (HBASE-27013) Introduce read all bytes when using pread for prefetch

2022-05-10 Thread Josh Elser (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-27013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17534638#comment-17534638
 ] 

Josh Elser edited comment on HBASE-27013 at 5/11/22 1:46 AM:
-

{quote}In the case of the input stream read short and when the input stream 
read passed the length of the necessary data block with few more bytes within 
the size of next block header, the 
[BlockIOUtils#preadWithExtra|https://github.com/apache/hbase/blob/9c8c9e7fbf8005ea89fa9b13d6d063b9f0240443/hbase-common/src/main/java/org/apache/hadoop/hbase/io/util/BlockIOUtils.java#L214-L257]
 returns to the caller without a cached the next block header. As a result, 
before HBase tries to read the next block, 
[HFileBlock#readBlockDataInternal|https://github.com/apache/hbase/blob/9c8c9e7fbf8005ea89fa9b13d6d063b9f0240443/hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlock.java#L1648-L1664]
 in hbase tries to re-read the next block header from the input stream.
{quote}
If we read the comment on the code that Stephen called out in 
readBlockDataInternal, you can find:
{code:java}
If header was not cached (see getCachedHeader above), need to seek to pull it 
in. This is costly and should happen very rarely {code}
And then you had also said:
{quote}The root cause of above issue was due to 
[BlockIOUtils#preadWithExtra|https://github.com/apache/hbase/blob/9c8c9e7fbf8005ea89fa9b13d6d063b9f0240443/hbase-common/src/main/java/org/apache/hadoop/hbase/io/util/BlockIOUtils.java#L214-L257]
 is reading an input stream that does not guarrentee to return the data block 
and the next block header as an option data to be cached.
{quote}
I think what you're saying is the following.
 # Read header for block1
 # Read block1 and try to read block2's header
 # Read block2 and try to read block3's header
 # Repeat

This would align with the comment, too: the last time we read, we tried to get
the header cached, so that the _next_ time we come back to read, we already have
that header cached and can avoid another {{seek()}} (through the pread).

The very high-level reading of the HBase code would indicate to me that we
_expect_ to read the (n+1)th block header when reading the nth block. I would
assume that we also want this for HDFS-based clusters, but HDFS just does a good
enough job that we haven't noticed this being a problem (short-circuit reads
making our lives happy?).

I think attempting to read off the end of a file is not a big concern since 
we're just pulling those extra bytes off in the current read. I am thinking 
about a different drawback where, if the InputStream isn't giving us the bytes 
we asked for back, why was that? Did it take over some threshold of time? If we 
go back and ask HDFS (or S3) again "give me those extra bytes", would we 
increase the overall latency? Genuinely not sure.

I think, long-term, it makes sense for this configuration to be on by default,
but I am motivated to expose this configuration property for additional testing
on HDFS while committing this change to help the S3-based prefetching workload.
I'm leaning towards putting this in since the risk is low (given my
understanding).

WDYT, Duo? Stephen, did I get this all correct? (please correct me if I'm wrong)


was (Author: elserj):
{quote}In the case of the input stream read short and when the input stream 
read passed the length of the necessary data block with few more bytes within 
the size of next block header, the 
[BlockIOUtils#preadWithExtra|https://github.com/apache/hbase/blob/9c8c9e7fbf8005ea89fa9b13d6d063b9f0240443/hbase-common/src/main/java/org/apache/hadoop/hbase/io/util/BlockIOUtils.java#L214-L257]
 returns to the caller without a cached the next block header. As a result, 
before HBase tries to read the next block, 
[HFileBlock#readBlockDataInternal|https://github.com/apache/hbase/blob/9c8c9e7fbf8005ea89fa9b13d6d063b9f0240443/hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlock.java#L1648-L1664]
 in hbase tries to re-read the next block header from the input stream.
{quote}
If we read the comment on the code that Stephen called out in 
readBlockDataInternal, you can find:
{code:java}
If header was not cached (see getCachedHeader above), need to seek to pull it 
in. This is costly and should happen very rarely {code}
And then you had also said:
{quote}The root cause of above issue was due to 
[BlockIOUtils#preadWithExtra|https://github.com/apache/hbase/blob/9c8c9e7fbf8005ea89fa9b13d6d063b9f0240443/hbase-common/src/main/java/org/apache/hadoop/hbase/io/util/BlockIOUtils.java#L214-L257]
 is reading an input stream that does not guarrentee to return the data block 
and the next block header as an option data to be cached.
{quote}
I think what you're saying is the following.
 # Read header for block1
 # Read block1 and try to read block2's header
 # Read block2 and try to read block

[jira] [Commented] (HBASE-27013) Introduce read all bytes when using pread for prefetch

2022-05-09 Thread Josh Elser (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-27013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17533995#comment-17533995
 ] 

Josh Elser commented on HBASE-27013:


{quote}So the problem here is, the implementation of S3A is not HDFS, we can 
not reuse the stream to send multiple pread requests with random offset. Seems 
not like a good enough pread implementation...
{quote}
Yeah, s3a != hdfs is definitely a major pain point. IIUC, neither HBase nor HDFS
is doing anything wrong, per se. HDFS just happens to handle this super fast and
s3a... doesn't.
{quote}In general, in pread mode, a FSDataInputStream may be used by different 
read requests so even if you fixed this problem, it could still introduce a lot 
of aborts as different read request may read from different offsets...
{quote}
Right again. The focus is being put on prefetching, as we know that once hfiles
are cached, things are super fast. Thus, this is the first problem to chase.
However, any operation over a table which isn't fully cached would end up
over-reading from s3. I had thought about whether we should just write a custom
Reader for the prefetch case, but then we wouldn't address the rest of the
access paths (e.g. scans).

Stephen's worst-case numbers are still ~130MB/s to pull down HFiles from S3 to
cache, which is good on the surface but not so good when you compare to the
closer to 1GB/s that you can get through awscli (and whatever their
parallelized downloader was called). One optimization at a time :)

> Introduce read all bytes when using pread for prefetch
> --
>
> Key: HBASE-27013
> URL: https://issues.apache.org/jira/browse/HBASE-27013
> Project: HBase
>  Issue Type: Improvement
>  Components: HFile, Performance
>Affects Versions: 2.5.0, 2.6.0, 3.0.0-alpha-3, 2.4.13
>Reporter: Tak-Lon (Stephen) Wu
>Assignee: Tak-Lon (Stephen) Wu
>Priority: Major
>
> h2. Problem statement
> When prefetching HFiles from blob storage like S3 and use it with the storage 
> implementation like S3A, we found there is a logical issue in HBase pread 
> that causes the reading of the remote HFile aborts the input stream multiple 
> times. This aborted stream and reopen slow down the reads and trigger many 
> aborted bytes and waste time in recreating the connection especially when SSL 
> is enabled.
> h2. ROOT CAUSE
> The root cause of above issue was due to 
> [BlockIOUtils#preadWithExtra|https://github.com/apache/hbase/blob/9c8c9e7fbf8005ea89fa9b13d6d063b9f0240443/hbase-common/src/main/java/org/apache/hadoop/hbase/io/util/BlockIOUtils.java#L214-L257]
>  is reading an input stream that does not guarrentee to return the data block 
> and the next block header as an option data to be cached.
> In the case of the input stream read short and when the input stream read 
> passed the length of the necessary data block with few more bytes within the 
> size of next block header, the 
> [BlockIOUtils#preadWithExtra|https://github.com/apache/hbase/blob/9c8c9e7fbf8005ea89fa9b13d6d063b9f0240443/hbase-common/src/main/java/org/apache/hadoop/hbase/io/util/BlockIOUtils.java#L214-L257]
>  returns to the caller without a cached the next block header. As a result, 
> before HBase tries to read the next block, 
> [HFileBlock#readBlockDataInternal|https://github.com/apache/hbase/blob/9c8c9e7fbf8005ea89fa9b13d6d063b9f0240443/hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlock.java#L1648-L1664]
>  in hbase tries to re-read the next block header from the input stream. Here, 
> the reusable input stream has move the current position pointer ahead from 
> the offset of the last read data block, when using with the [S3A 
> implementation|https://github.com/apache/hadoop/blob/29401c820377d02a992eecde51083cf87f8e57af/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AInputStream.java#L339-L361],
>  the input stream is then closed, aborted all the remaining bytes and reopen 
> a new input stream at the offset of the last read data block .
> h2. How do we fix it?
> S3A is doing the right job that HBase is telling to move the offset from 
> position A back to A - N, so there is not much thing we can do on how S3A 
> handle the inputstream. meanwhile in the case of HDFS, this operation is fast.
> Such that, we should fix in HBase level, and try always to read datablock + 
> next block header when we're using blob storage to avoid expensive draining 
> the bytes in a stream and reopen the socket with the remote storage.
> h2. Draw back and discussion
>  * A known drawback is, when we're at the last block, we will read extra 
> length that should not be a header, and we still read that into the b

[jira] [Commented] (HBASE-26999) HStore should try write WAL compaction marker before replacing compacted files in StoreEngine

2022-05-05 Thread Josh Elser (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17532263#comment-17532263
 ] 

Josh Elser commented on HBASE-26999:


bq. there is no way to still allow the RS to delete any HFiles on HDFS

Agree. Had a long talk with Szabolcs and Wellington yesterday and we came to 
that same conclusion (sorry I didn't make upstream Jira issues for it yet).

My first reaction was that BrokenStoreFileCleaner needs to collect candidate
files and then only delete them after a double-check that we still hold the lock
(or some similar kind of thing). There's a bigger point lingering here: we've
created a new class of breakage that we didn't have before, because we used to
always archive files (never delete them immediately). That archiving acted as a
"buffer" which would have precluded this kind of problem in the past, I think.
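
As a generic sketch of that "collect, then double-check before deleting" idea (this is
not the actual BrokenStoreFileCleaner code; the lock and the set of live store files are
stand-ins for whatever state the store exposes):
{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Set;
import java.util.concurrent.locks.ReadWriteLock;

final class CollectThenRecheckCleaner {
  private CollectThenRecheckCleaner() {}

  static void clean(List<Path> candidates, Set<Path> liveStoreFiles,
      ReadWriteLock storeLock) throws IOException {
    // Candidates were collected earlier, outside the lock; re-validate each one under
    // the lock, because a concurrent compaction commit may have made it live again.
    storeLock.readLock().lock();
    try {
      for (Path candidate : candidates) {
        if (!liveStoreFiles.contains(candidate)) {
          Files.deleteIfExists(candidate);
        }
      }
    } finally {
      storeLock.readLock().unlock();
    }
  }
}
{code}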

> HStore should try write WAL compaction marker before replacing compacted 
> files in StoreEngine
> -
>
> Key: HBASE-26999
> URL: https://issues.apache.org/jira/browse/HBASE-26999
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Wellington Chevreuil
>Assignee: Wellington Chevreuil
>Priority: Major
>
> On HBASE-26064, it seems we altered the order we update different places with 
> the results of a compaction:
> {noformat}
> @@ -1510,14 +1149,13 @@ public class HStore implements Store, HeapSize, 
> StoreConfigInformation,
>        List newFiles) throws IOException {
>      // Do the steps necessary to complete the compaction.
>      setStoragePolicyFromFileName(newFiles);
> -    List sfs = commitStoreFiles(newFiles, true);
> +    List sfs = storeEngine.commitStoreFiles(newFiles, true);
>      if (this.getCoprocessorHost() != null) {
>        for (HStoreFile sf : sfs) {
>          getCoprocessorHost().postCompact(this, sf, cr.getTracker(), cr, 
> user);
>        }
>      }
> -    writeCompactionWalRecord(filesToCompact, sfs);
> -    replaceStoreFiles(filesToCompact, sfs);
> +    replaceStoreFiles(filesToCompact, sfs, true);
> ...
> @@ -1581,25 +1219,24 @@ public class HStore implements Store, HeapSize, 
> StoreConfigInformation,
>          this.region.getRegionInfo(), compactionDescriptor, 
> this.region.getMVCC());
>    }
>  
> -  void replaceStoreFiles(Collection compactedFiles, 
> Collection result)
> -      throws IOException {
> -    this.lock.writeLock().lock();
> -    try {
> -      
> this.storeEngine.getStoreFileManager().addCompactionResults(compactedFiles, 
> result);
> -      synchronized (filesCompacting) {
> -        filesCompacting.removeAll(compactedFiles);
> -      }
> -
> -      // These may be null when the RS is shutting down. The space quota 
> Chores will fix the Region
> -      // sizes later so it's not super-critical if we miss these.
> -      RegionServerServices rsServices = region.getRegionServerServices();
> -      if (rsServices != null && 
> rsServices.getRegionServerSpaceQuotaManager() != null) {
> -        updateSpaceQuotaAfterFileReplacement(
> -            
> rsServices.getRegionServerSpaceQuotaManager().getRegionSizeStore(), 
> getRegionInfo(),
> -            compactedFiles, result);
> -      }
> -    } finally {
> -      this.lock.writeLock().unlock();
> +  @RestrictedApi(explanation = "Should only be called in TestHStore", link = 
> "",
> +    allowedOnPath = ".*/(HStore|TestHStore).java")
> +  void replaceStoreFiles(Collection compactedFiles, 
> Collection result,
> +    boolean writeCompactionMarker) throws IOException {
> +    storeEngine.replaceStoreFiles(compactedFiles, result);
> +    if (writeCompactionMarker) {
> +      writeCompactionWalRecord(compactedFiles, result);
> +    }
> +    synchronized (filesCompacting) {
> +      filesCompacting.removeAll(compactedFiles);
> +    }
> +    // These may be null when the RS is shutting down. The space quota 
> Chores will fix the Region
> +    // sizes later so it's not super-critical if we miss these.
> +    RegionServerServices rsServices = region.getRegionServerServices();
> +    if (rsServices != null && rsServices.getRegionServerSpaceQuotaManager() 
> != null) {
> +      updateSpaceQuotaAfterFileReplacement(
> +        rsServices.getRegionServerSpaceQuotaManager().getRegionSizeStore(), 
> getRegionInfo(),
> +        compactedFiles, result); {noformat}
> While running some large scale load test, we run into File SFT metafiles 
> inconsistency that we believe could have been avoided if the original order 
> was in 

[jira] [Created] (CALCITE-5129) Exception thrown writing to a closed stream with SPNEGO authentication at DEBUG

2022-05-03 Thread Josh Elser (Jira)
Josh Elser created CALCITE-5129:
---

 Summary: Exception thrown writing to a closed stream with SPNEGO 
authentication at DEBUG
 Key: CALCITE-5129
 URL: https://issues.apache.org/jira/browse/CALCITE-5129
 Project: Calcite
  Issue Type: Bug
Reporter: Josh Elser
Assignee: Josh Elser


{noformat}
2022-05-03 18:27:57,651 WARN org.eclipse.jetty.server.HttpChannelState: 
unhandled due to prior sendError
org.eclipse.jetty.io.EofException: Closed
        at 
org.eclipse.jetty.server.HttpOutput.checkWritable(HttpOutput.java:762)
        at org.eclipse.jetty.server.HttpOutput.write(HttpOutput.java:792)
        at java.io.OutputStream.write(OutputStream.java:75)
        at 
org.apache.calcite.avatica.server.AbstractAvaticaHandler.isUserPermitted(AbstractAvaticaHandler.java:71)
        at 
org.apache.calcite.avatica.server.AvaticaProtobufHandler.handle(AvaticaProtobufHandler.java:103)
        at 
org.eclipse.jetty.server.handler.HandlerList.handle(HandlerList.java:59)
        at 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
        at org.eclipse.jetty.server.Server.handle(Server.java:516)
        at 
org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:388)
        at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:633)
        at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:380)
        at 
org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:277)
        at 
org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)
        at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105)
        at 
org.eclipse.jetty.io.ssl.SslConnection$DecryptedEndPoint.onFillable(SslConnection.java:540)
        at 
org.eclipse.jetty.io.ssl.SslConnection.onFillable(SslConnection.java:395)
        at 
org.eclipse.jetty.io.ssl.SslConnection$2.succeeded(SslConnection.java:161)
        at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105)
        at org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104)
        at 
org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:336)
        at 
org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:313)
        at 
org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:171)
        at 
org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:129)
        at 
org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:383)
        at 
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:882)
        at 
org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1036)
 {noformat}
While trying to test CALCITE-4152 behind Apache Knox, I noticed the above in the
server-side logs.

It appears that we end up spitting out an exception when another layer of code
has already called {{sendError()}}, which prevents any further writes to the
OutputStream (destined back to the client). I think this is cosmetic, but I'm
not 100% certain at this point.
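
If this turns out to need more than a logging tweak, one hedged possibility (purely a
sketch, not the actual Avatica handler code) is to check whether the response was already
committed before attempting the write:
{code:java}
import java.io.IOException;
import javax.servlet.http.HttpServletResponse;

final class CommittedResponseGuard {
  private CommittedResponseGuard() {}

  // If another layer has already called sendError(), the response is committed and any
  // further write would only produce the EofException noise shown above, so skip it.
  static void writeErrorIfPossible(HttpServletResponse response, byte[] body)
      throws IOException {
    if (response.isCommitted()) {
      return;
    }
    response.setStatus(HttpServletResponse.SC_FORBIDDEN);
    response.getOutputStream().write(body);
  }
}
{code}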



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (PHOENIX-6701) Do not set ReaderType.STREAM in IndexHalfStoreFileReader

2022-04-29 Thread Josh Elser (Jira)


[ 
https://issues.apache.org/jira/browse/PHOENIX-6701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17530231#comment-17530231
 ] 

Josh Elser commented on PHOENIX-6701:
-

Thanks, Istvan.

We had a lot of performance issues which were chalked up to the stream switchover 
after the pread threshold was exceeded. I'm reluctant to push that into the 
IndexHalfStoreFileReader, as I think we'd experience the same slowdown.
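
For context (not part of this change): the pread/stream switchover in question is the 
store-scanner behavior governed by {{hbase.storescanner.pread.max.bytes}}; on the 
client side a scan can also be pinned to pread explicitly. A minimal sketch, assuming 
the HBase 2.x client API:
{code:java}
// Illustrative only (assumes the HBase 2.x client API): pin a scan to PREAD so
// it never switches over to STREAM reads, regardless of the server-side
// switchover threshold.
import org.apache.hadoop.hbase.client.Scan;

public final class PreadScanExample {
  public static Scan preadScan(byte[] family) {
    return new Scan().addFamily(family).setReadType(Scan.ReadType.PREAD);
  }
}
{code}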

> Do not set ReaderType.STREAM in IndexHalfStoreFileReader
> 
>
> Key: PHOENIX-6701
> URL: https://issues.apache.org/jira/browse/PHOENIX-6701
> Project: Phoenix
>  Issue Type: Improvement
>  Components: core
>Reporter: Istvan Toth
>Priority: Major
>
> We generally use ReaderType.PREAD for performance reasons.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


Re: [DISCUSS] Releasing the next Omid version

2022-04-28 Thread Josh Elser

+1

If we're dropping Phoenix 4.x imminently, that means dropping HBase 1.x 
and we should follow suit in Omid.


A "beta" HBase 3.0 is probably not too, too far away. I would consider 
how "nice" the current shim logic is in Omid (i.e. is it actually 
helpful? nice to work with? effective?), and make the call on that. 
However, HBase 3.0 should not drop any API from HBase 2.x, so we should 
not _have_ to shim anything.


1.1.0 as a release version makes sense to me for the API reason you gave.

On 4/19/22 4:53 AM, Istvan Toth wrote:

Hi!

When Geoffrey proposed releasing Phoenix 5.2.0, I asked for time to release
a new Omid version first, as there are a lot of unreleased fixes in  master.

One of those fixes removes the need to add a lot of explicit excludes for
the Omid HBase-1 artifacts when depending on Omid for HBase 2.

However, the discussion on dropping HBase 1.x support from Phoenix has been
re-opened, and so far there are no objections.

We can either release the Omid master as is (perhaps with some dependency
version bumps), or we could just drop HBase 1.x support, and simplify the
project structure quite a bit for the next version.

In case we drop support for HBase 1, we also need to decide whether to keep
the Maven build infrastructure (shims and flatten-maven-plugin) for
supporting different HBase releases for an upcoming HBase 3 release (if the
API changes require it), or to remove it altogether?

We'll need to update the dependencies and exclusions in Phoenix either way.

What do you think ?
Can we make an official decision to drop Phoenix 4.x soon, and drop HBase 1
support from Omid for the next release,
or should I just go ahead with the Omid next release process, and worry
about removing the HBase 1.x support from Omid later ?

Also, as we're making incompatible changes to the way Omid is to be
consumed via maven, I think that we should bump the version either to
1.1.0, or 2.0.0. (I prefer 1.1.0, as the API doesn't change.)

Looking forward to your input,

Istvan



Re: [DISCUSS] Drop support for HBase 2.1 and 2.2 in Phoenix 5.2 ?

2022-04-28 Thread Josh Elser
Definitely makes sense to drop 2.1 and 2.2 (which are long gone in 
upstream support).


2.3 isn't mentioned on HBase downloads.html anymore so I think that's 
also good to go, but 2.4 is still very much alive.


On 4/19/22 11:32 AM, Geoffrey Jacoby wrote:

+1 to dropping support for 2.1 and 2.2.

Because of some incompatible 2.0-era changes to coprocessor interfaces, and
a bug around raw filters, we weren't able to support the newer global
indexes at all on 2.1, and even on 2.2 we have an issue where we can't
protect index consistency during major compaction. Getting rid of 2.1 and
2.2 support would let us simplify a lot.

Geoffrey

On Tue, Apr 19, 2022 at 6:28 AM Istvan Toth  wrote:


We can also consider dropping support for 2.4.0.

On Tue, Apr 19, 2022 at 12:21 PM Istvan Toth  wrote:


Hi!

Both HBase 2.1 and 2.2 have been EOL for a little more than a year.

Do we want to keep supporting them in Phoenix 5.2?

Keeping them is not a big burden, as the compatibility modules are ready,
but we could simplify the compatibility module interface a bit, and free up
resources in the multibranch test builds.

WDYT ?

Istvan







Re: [DISCUSS] Switching Phoenix to log4j2

2022-04-28 Thread Josh Elser

Agree on your solution proposed, Istvan.

I think a Phoenix 5.2 is the right time to take that on, too.

On 4/26/22 2:21 PM, Andrew Purtell wrote:

Thanks, I understand better.


What I am proposing is keeping phoenix-client-embedded, but dropping the
legacy (non embedded) phoenix-client jar/artifact from 5.2.

+1, for what it's worth. Embedding a logging back end is a bad idea as we
have learned. Only the facade (SLF4J) should be necessary.


On Tue, Apr 26, 2022 at 11:14 AM Istvan Toth 
wrote:


Andrew, what you describe is the phoenix-client-embedded jar, and it is the
(or at least my) preferred way to consume the phoenix thick client.

However, we still build and publish the legacy phoenix-client (non
embedded) JAR, that DOES include the slf4j + logging backend libraries (as
well as sqlline + jline)

What I am proposing is keeping phoenix-client-embedded, but dropping the
legacy (non embedded) phoenix-client jar/artifact from 5.2.

sqlline.py and friends used to use the non-embedded jar, so that they get
logging and sqlline, but I have since modified all scripts to use the
embedded client, and add the logging backend and sqlline from /lib, so
nothing we ship depends on the legacy phoenix-client JAR any longer.

regards
Istvan






Re: [ANNOUNCE] New HBase committer Bryan Beaudreault

2022-04-26 Thread Josh Elser

Congrats Bryan!

On 4/9/22 7:44 AM, 张铎(Duo Zhang) wrote:

On behalf of the Apache HBase PMC, I am pleased to announce that Bryan
Beaudreault(bbeaudreault) has accepted the PMC's invitation to become a
committer on the project. We appreciate all of Bryan's generous
contributions thus far and look forward to his continued involvement.

Congratulations and welcome, Bryan Beaudreault!

On behalf of the Apache HBase PMC, I am pleased to announce that Bryan Beaudreault 
has accepted our invitation to become a committer on the Apache HBase project. Thank 
you, Bryan Beaudreault, for your contributions to HBase all along; we look forward to 
you taking on more responsibilities in the future.

Welcome, Bryan Beaudreault!


[jira] [Commented] (HBASE-26708) Netty Leak detected and eventually results in OutOfDirectMemoryError

2022-04-25 Thread Josh Elser (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17527766#comment-17527766
 ] 

Josh Elser commented on HBASE-26708:


Well, a few more data points for me:
 * Only workload on the system was replication
 * leakDetection=advanced gave similar-looking allocators/usages to what Viraj shared
 * Running leakDetection=paranoid on {{mvn verify -PrunReplicationTests}} showed nothing
 * Switching from NettyRPCServer back to BlockingRpcServer (config sketch below) appears 
to have fixed (at least most of) the problem. Running over a weekend, there appears to 
be little growth of native memory, where it was previously obvious after a few hours.

Happening on a cluster that isn't mine. Next steps would require trying to 
reproduce it in house.
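
For anyone trying to reproduce the comparison: the switch above is just a configuration 
change. A minimal sketch, assuming the HBase 2.x {{hbase.rpc.server.impl}} key:
{code:java}
// Minimal sketch (assumes the HBase 2.x "hbase.rpc.server.impl" key): force the
// blocking RPC server instead of the default NettyRpcServer for comparison runs.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public final class RpcServerSwitch {
  public static Configuration blockingRpcServerConf() {
    Configuration conf = HBaseConfiguration.create();
    conf.set("hbase.rpc.server.impl", "org.apache.hadoop.hbase.ipc.SimpleRpcServer");
    return conf;
  }
}
{code}
The same key can of course be set in hbase-site.xml on the region servers instead.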

> Netty Leak detected and eventually results in OutOfDirectMemoryError
> 
>
> Key: HBASE-26708
> URL: https://issues.apache.org/jira/browse/HBASE-26708
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 2.4.6
>Reporter: Viraj Jasani
>Priority: Major
> Fix For: 2.5.0, 3.0.0-alpha-3, 2.4.12
>
>
> Under constant data ingestion, using default Netty based RpcServer and 
> RpcClient implementation results in OutOfDirectMemoryError, supposedly caused 
> by leaks detected by Netty's LeakDetector.
> {code:java}
> 2022-01-25 17:03:10,084 ERROR [S-EventLoopGroup-1-3] 
> util.ResourceLeakDetector - java:115)
>   
> org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.expandCumulation(ByteToMessageDecoder.java:538)
>   
> org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder$1.cumulate(ByteToMessageDecoder.java:97)
>   
> org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:274)
>   
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
>   
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
>   
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
>   
> org.apache.hbase.thirdparty.io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
>   
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
>   
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
>   
> org.apache.hbase.thirdparty.io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
>   
> org.apache.hbase.thirdparty.io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:795)
>   
> org.apache.hbase.thirdparty.io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:480)
>   
> org.apache.hbase.thirdparty.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378)
>   
> org.apache.hbase.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
>   
> org.apache.hbase.thirdparty.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>   
> org.apache.hbase.thirdparty.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>   java.lang.Thread.run(Thread.java:748)
>  {code}
> {code:java}
> 2022-01-25 17:03:14,014 ERROR [S-EventLoopGroup-1-3] 
> util.ResourceLeakDetector - 
> apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:507)
>   
> org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:446)
>   
> org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:276)
>   
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
>   
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
>   
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
>   
> org.apache.hbase.thirdparty.io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
>   
> org.apache.hbase.th

[jira] [Commented] (HBASE-26708) Netty Leak detected and eventually results in OutOfDirectMemoryError

2022-04-21 Thread Josh Elser (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17526094#comment-17526094
 ] 

Josh Elser commented on HBASE-26708:


[~vjasani] did you ever get to the bottom of this one? We have a case where 
we're seeing a slow leak and warnings from Netty in the same "Created at" point.

> Netty Leak detected and eventually results in OutOfDirectMemoryError
> 
>
> Key: HBASE-26708
> URL: https://issues.apache.org/jira/browse/HBASE-26708
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 2.4.6
>Reporter: Viraj Jasani
>Priority: Major
> Fix For: 2.5.0, 3.0.0-alpha-3, 2.4.12
>
>
> Under constant data ingestion, using default Netty based RpcServer and 
> RpcClient implementation results in OutOfDirectMemoryError, supposedly caused 
> by leaks detected by Netty's LeakDetector.
> {code:java}
> 2022-01-25 17:03:10,084 ERROR [S-EventLoopGroup-1-3] 
> util.ResourceLeakDetector - java:115)
>   
> org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.expandCumulation(ByteToMessageDecoder.java:538)
>   
> org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder$1.cumulate(ByteToMessageDecoder.java:97)
>   
> org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:274)
>   
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
>   
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
>   
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
>   
> org.apache.hbase.thirdparty.io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
>   
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
>   
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
>   
> org.apache.hbase.thirdparty.io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
>   
> org.apache.hbase.thirdparty.io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:795)
>   
> org.apache.hbase.thirdparty.io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:480)
>   
> org.apache.hbase.thirdparty.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378)
>   
> org.apache.hbase.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
>   
> org.apache.hbase.thirdparty.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>   
> org.apache.hbase.thirdparty.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>   java.lang.Thread.run(Thread.java:748)
>  {code}
> {code:java}
> 2022-01-25 17:03:14,014 ERROR [S-EventLoopGroup-1-3] 
> util.ResourceLeakDetector - 
> apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:507)
>   
> org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:446)
>   
> org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:276)
>   
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
>   
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
>   
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
>   
> org.apache.hbase.thirdparty.io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
>   
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
>   
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
>   
> org.apache.hbase.thirdparty.io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
>   
> org.apache.hba

[jira] [Comment Edited] (HBASE-26938) Compaction failures after StoreFileTracker integration

2022-04-14 Thread Josh Elser (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17522580#comment-17522580
 ] 

Josh Elser edited comment on HBASE-26938 at 4/14/22 11:10 PM:
--

I hope I didn't push you to abandon your PR with my comments, Andrew, I just 
intended them as a part of normal review. I think your approach was perfectly 
fine. That said, agree wholly with your comments that we need to choose one 
approach and go with it.

[~zhangduo] I went through your commit and left some [comments dangling on 
there|https://github.com/Apache9/hbase/commit/79fa3a9d72aade0bfe490b301f909b3d5722de06].
 I'd suggest converting that into a PR and getting some tests run against it. 
I'll see if I can steal some time to throw it up on a cluster and test it as 
well. I think your approach is a little cleaner (though a little less explicit 
– more layers to unwrap to find the "magic" that actually uses that 
StoreFileWriterTracker :))


was (Author: elserj):
I hope I didn't push you to abandon your PR with my comments, Andrew, I just 
intended them as a part of normal review. I think your approach was perfectly 
fine. That said, agree wholly with your comments that we need to choose one 
approach and go with it.

[~zhangduo] I went through your commit and left some [comments dangling on 
there|[https://github.com/Apache9/hbase/commit/79fa3a9d72aade0bfe490b301f909b3d5722de06]|https://github.com/Apache9/hbase/commit/79fa3a9d72aade0bfe490b301f909b3d5722de06].].
 I'd suggest converting that into a PR and getting some tests run against it. 
I'll see if I can steal some time to throw it up on a cluster and test it as 
well. I think your approach is a little cleaner (though a little less explicit 
– more layers to unwrap to find the "magic" that actually uses that 
StoreFileWriterTracker :))

> Compaction failures after StoreFileTracker integration
> --
>
> Key: HBASE-26938
> URL: https://issues.apache.org/jira/browse/HBASE-26938
> Project: HBase
>  Issue Type: Bug
>  Components: Compaction
>Affects Versions: 2.5.0, 3.0.0-alpha-2, 2.6.0
>Reporter: Andrew Kyle Purtell
>Assignee: Duo Zhang
>Priority: Blocker
> Fix For: 2.5.0, 3.0.0-alpha-3
>
>
> [ Currently this has only been tested with branch-2.5 and branch-2. Testing 
> with master next, will update afterward. ]
> Test cluster of 10 regionservers is configured each RS with 5 flush threads, 
> 5 large compaction threads, and 10 small compaction threads. 
> Hadoop is 3.3.2. Java is 11. HFiles are on HDFS. 
> All the StoreFileTracker implementations, DEFAULT or FILE, exhibit compaction 
> time store writer errors in an ingest heavy use case. Unit tests don't seem 
> to cover whatever this is. Most compactions succeed, but some do not. Those 
> that do not are failing with state or sanity check assertions. Below errors 
> are all from DEFAULT. They seem related... store writer instance 
> usage/close/locking issues during compactions.
> Warnings like "writer exists when it should not":
> {noformat}
> 2022-04-07T23:13:11,351 WARN  
> [regionserver/ip-172-31-63-83:8120-shortCompactions-8]
> compactions.Compactor: Writer exists when it should not: {
>   
> hdfs://ip-172-31-58-47.us-west-2.compute.internal:8020/hbase/data/default/IntegrationTestLoadCommonCrawl/b518f72941d4427e7e1923407643df67/.tmp/c/29d7b88c4c214ddcbba4f747514a2cf5
>  }
> {noformat}
> Errors like:
> IllegalStateException thrown from 
> HFileBlockIndex$BlockIndexWriter.shouldWriteBlock:
> {noformat}
> 2022-04-07T23:13:11,508 ERROR 
> [regionserver/ip-172-31-63-83:8120-shortCompactions-6] 
> regionserver.CompactSplit: Compaction failed 
> region=IntegrationTestLoadCommonCrawl,,1649373172576.b518f72941d4427e7e1923407643df67.,
>  storeName=b518f72941d4427e7e1923407643df67/c, priority=10, 
> startTime=1649373185476
> java.lang.IllegalStateException: curInlineChunk is null; has shouldWriteBlock 
> been called with closing=true and then called again?
> at 
> org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexWriter.shouldWriteBlock(HFileBlockIndex.java:1258)
>  ~[hbase-server-2.5.0-SNAPSHOT.jar:2.5.0-SNAPSHOT]
> at 
> org.apache.hadoop.hbase.io.hfile.HFileWriterImpl.writeInlineBlocks(HFileWriterImpl.java:523)
>  ~[hbase-server-2.5.0-SNAPSHOT.jar:2.5.0-SNAPSHOT]
> at 
> org.apache.hadoop.hbase.io.hfile.HFileWriterImpl.close(HFileWriterImpl.java:608)
>  ~[hbase-server-2.5.0-SNAPSHOT.jar:2.5.0-SNAPSHOT]
> at 
> org.apache.hadoop.hbase.regionserver.StoreFileWriter.close(StoreFileWriter.java:377)
>  ~[hbase-server

[jira] [Commented] (HBASE-26938) Compaction failures after StoreFileTracker integration

2022-04-14 Thread Josh Elser (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17522580#comment-17522580
 ] 

Josh Elser commented on HBASE-26938:


I hope I didn't push you to abandon your PR with my comments, Andrew, I just 
intended them as a part of normal review. I think your approach was perfectly 
fine. That said, agree wholly with your comments that we need to choose one 
approach and go with it.

[~zhangduo] I went through your commit and left some [comments dangling on 
there|https://github.com/Apache9/hbase/commit/79fa3a9d72aade0bfe490b301f909b3d5722de06].
 I'd suggest converting that into a PR and getting some tests run against it. 
I'll see if I can steal some time to throw it up on a cluster and test it as 
well. I think your approach is a little cleaner (though a little less explicit 
– more layers to unwrap to find the "magic" that actually uses that 
StoreFileWriterTracker :))

> Compaction failures after StoreFileTracker integration
> --
>
> Key: HBASE-26938
> URL: https://issues.apache.org/jira/browse/HBASE-26938
> Project: HBase
>  Issue Type: Bug
>  Components: Compaction
>Affects Versions: 2.5.0, 3.0.0-alpha-2, 2.6.0
>Reporter: Andrew Kyle Purtell
>Assignee: Duo Zhang
>Priority: Blocker
> Fix For: 2.5.0, 3.0.0-alpha-3
>
>
> [ Currently this has only been tested with branch-2.5 and branch-2. Testing 
> with master next, will update afterward. ]
> Test cluster of 10 regionservers is configured each RS with 5 flush threads, 
> 5 large compaction threads, and 10 small compaction threads. 
> Hadoop is 3.3.2. Java is 11. HFiles are on HDFS. 
> All the StoreFileTracker implementations, DEFAULT or FILE, exhibit compaction 
> time store writer errors in an ingest heavy use case. Unit tests don't seem 
> to cover whatever this is. Most compactions succeed, but some do not. Those 
> that do not are failing with state or sanity check assertions. Below errors 
> are all from DEFAULT. They seem related... store writer instance 
> usage/close/locking issues during compactions.
> Warnings like "writer exists when it should not":
> {noformat}
> 2022-04-07T23:13:11,351 WARN  
> [regionserver/ip-172-31-63-83:8120-shortCompactions-8]
> compactions.Compactor: Writer exists when it should not: {
>   
> hdfs://ip-172-31-58-47.us-west-2.compute.internal:8020/hbase/data/default/IntegrationTestLoadCommonCrawl/b518f72941d4427e7e1923407643df67/.tmp/c/29d7b88c4c214ddcbba4f747514a2cf5
>  }
> {noformat}
> Errors like:
> IllegalStateException thrown from 
> HFileBlockIndex$BlockIndexWriter.shouldWriteBlock:
> {noformat}
> 2022-04-07T23:13:11,508 ERROR 
> [regionserver/ip-172-31-63-83:8120-shortCompactions-6] 
> regionserver.CompactSplit: Compaction failed 
> region=IntegrationTestLoadCommonCrawl,,1649373172576.b518f72941d4427e7e1923407643df67.,
>  storeName=b518f72941d4427e7e1923407643df67/c, priority=10, 
> startTime=1649373185476
> java.lang.IllegalStateException: curInlineChunk is null; has shouldWriteBlock 
> been called with closing=true and then called again?
> at 
> org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexWriter.shouldWriteBlock(HFileBlockIndex.java:1258)
>  ~[hbase-server-2.5.0-SNAPSHOT.jar:2.5.0-SNAPSHOT]
> at 
> org.apache.hadoop.hbase.io.hfile.HFileWriterImpl.writeInlineBlocks(HFileWriterImpl.java:523)
>  ~[hbase-server-2.5.0-SNAPSHOT.jar:2.5.0-SNAPSHOT]
> at 
> org.apache.hadoop.hbase.io.hfile.HFileWriterImpl.close(HFileWriterImpl.java:608)
>  ~[hbase-server-2.5.0-SNAPSHOT.jar:2.5.0-SNAPSHOT]
> at 
> org.apache.hadoop.hbase.regionserver.StoreFileWriter.close(StoreFileWriter.java:377)
>  ~[hbase-server-2.5.0-SNAPSHOT.jar:2.5.0-SNAPSHOT]
> at 
> org.apache.hadoop.hbase.regionserver.compactions.DefaultCompactor.abortWriter(DefaultCompactor.java:84)
>  ~[hbase-server-2.5.0-SNAPSHOT.jar:2.5.0-SNAPSHOT]
> at 
> org.apache.hadoop.hbase.regionserver.compactions.DefaultCompactor.abortWriter(DefaultCompactor.java:76)
>  ~[hbase-server-2.5.0-SNAPSHOT.jar:2.5.0-SNAPSHOT]
> at 
> org.apache.hadoop.hbase.regionserver.compactions.Compactor.compact(Compactor.java:384)
>  ~[hbase-server-2.5.0-SNAPSHOT.jar:2.5.0-SNAPSHOT]
> at 
> org.apache.hadoop.hbase.regionserver.compactions.DefaultCompactor.compact(DefaultCompactor.java:62)
>  ~[hbase-server-2.5.0-SNAPSHOT.jar:2.5.0-SNAPSHOT]
> at 
> org.apache.hadoop.hbase.regionserver.DefaultStoreEngine$DefaultCompactionContext

[jira] [Commented] (HBASE-26938) Compaction failures after StoreFileTracker integration (branch-2, branch-2.5)

2022-04-08 Thread Josh Elser (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17519662#comment-17519662
 ] 

Josh Elser commented on HBASE-26938:


Thanks much! I've shared this Jira with the folks internally helping out and 
we'll converge here -if- when we figure something out :flex:

> Compaction failures after StoreFileTracker integration (branch-2, branch-2.5)
> -
>
> Key: HBASE-26938
> URL: https://issues.apache.org/jira/browse/HBASE-26938
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 2.5.0, 2.6.0
>Reporter: Andrew Kyle Purtell
>Priority: Blocker
> Fix For: 2.5.0
>
>
> [ Currently this has only been tested with branch-2.5 and branch-2. Testing 
> with master next, will update afterward. ]
> Test cluster of 10 regionservers is configured each RS with 5 flush threads, 
> 5 large compaction threads, and 10 small compaction threads. 
> Hadoop is 3.3.2. Java is 11. HFiles are on HDFS. 
> All the StoreFileTracker implementations, DEFAULT or FILE, exhibit compaction 
> time store writer errors in an ingest heavy use case. Unit tests don't seem 
> to cover whatever this is. Most compactions succeed, but some do not. Those 
> that do not are failing with state or sanity check assertions. Below errors 
> are all from DEFAULT. They seem related... store writer instance 
> usage/close/locking issues during compactions.
> Warnings like "writer exists when it should not":
> {noformat}
> 2022-04-07T23:13:11,351 WARN  
> [regionserver/ip-172-31-63-83:8120-shortCompactions-8]
> compactions.Compactor: Writer exists when it should not: {
>   
> hdfs://ip-172-31-58-47.us-west-2.compute.internal:8020/hbase/data/default/IntegrationTestLoadCommonCrawl/b518f72941d4427e7e1923407643df67/.tmp/c/29d7b88c4c214ddcbba4f747514a2cf5
>  }
> {noformat}
> Errors like:
> IllegalStateException thrown from 
> HFileBlockIndex$BlockIndexWriter.shouldWriteBlock:
> {noformat}
> 2022-04-07T23:13:11,508 ERROR 
> [regionserver/ip-172-31-63-83:8120-shortCompactions-6] 
> regionserver.CompactSplit: Compaction failed 
> region=IntegrationTestLoadCommonCrawl,,1649373172576.b518f72941d4427e7e1923407643df67.,
>  storeName=b518f72941d4427e7e1923407643df67/c, priority=10, 
> startTime=1649373185476
> java.lang.IllegalStateException: curInlineChunk is null; has shouldWriteBlock 
> been called with closing=true and then called again?
> at 
> org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexWriter.shouldWriteBlock(HFileBlockIndex.java:1258)
>  ~[hbase-server-2.5.0-SNAPSHOT.jar:2.5.0-SNAPSHOT]
> at 
> org.apache.hadoop.hbase.io.hfile.HFileWriterImpl.writeInlineBlocks(HFileWriterImpl.java:523)
>  ~[hbase-server-2.5.0-SNAPSHOT.jar:2.5.0-SNAPSHOT]
> at 
> org.apache.hadoop.hbase.io.hfile.HFileWriterImpl.close(HFileWriterImpl.java:608)
>  ~[hbase-server-2.5.0-SNAPSHOT.jar:2.5.0-SNAPSHOT]
> at 
> org.apache.hadoop.hbase.regionserver.StoreFileWriter.close(StoreFileWriter.java:377)
>  ~[hbase-server-2.5.0-SNAPSHOT.jar:2.5.0-SNAPSHOT]
> at 
> org.apache.hadoop.hbase.regionserver.compactions.DefaultCompactor.abortWriter(DefaultCompactor.java:84)
>  ~[hbase-server-2.5.0-SNAPSHOT.jar:2.5.0-SNAPSHOT]
> at 
> org.apache.hadoop.hbase.regionserver.compactions.DefaultCompactor.abortWriter(DefaultCompactor.java:76)
>  ~[hbase-server-2.5.0-SNAPSHOT.jar:2.5.0-SNAPSHOT]
> at 
> org.apache.hadoop.hbase.regionserver.compactions.Compactor.compact(Compactor.java:384)
>  ~[hbase-server-2.5.0-SNAPSHOT.jar:2.5.0-SNAPSHOT]
> at 
> org.apache.hadoop.hbase.regionserver.compactions.DefaultCompactor.compact(DefaultCompactor.java:62)
>  ~[hbase-server-2.5.0-SNAPSHOT.jar:2.5.0-SNAPSHOT]
> at 
> org.apache.hadoop.hbase.regionserver.DefaultStoreEngine$DefaultCompactionContext.compact(DefaultStoreEngine.java:125)
>  ~[hbase-server-2.5.0-SNAPSHOT.jar:2.5.0-SNAPSHOT]
> {noformat}
> and IllegalStateException thrown from HFileBlock$Writer.expectState:
> {noformat}
> 2022-04-07T23:13:11,559 ERROR 
> [regionserver/ip-172-31-63-83:8120-shortCompactions-8] 
> regionserver.CompactSplit: Compaction failed 
> region=IntegrationTestLoadCommonCrawl,,1649373172576.b518f72941d4427e7e1923407643df67.,
>  storeName=b518f72941d4427e7e1923407643df67/c, priority=0, 
> startTime=1649373191325
> java.lang.IllegalStateException: Expected state: BLOCK_READY, actual state: 
> WRITING
> at 
> org.apache.hadoop.hbase.io.hfile.HFileBlock$Writer.expectState(HFileBlock.java:1190)
>  ~[hbase-server-2.5.0-SNA

[jira] [Commented] (HBASE-26938) Compaction failures after StoreFileTracker integration (branch-2, branch-2.5)

2022-04-08 Thread Josh Elser (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17519626#comment-17519626
 ] 

Josh Elser commented on HBASE-26938:


Yeah, I can confirm that we were seeing compaction issues on an HBase 2.4-based 
(quite current) branch of HBase which looked very similar to the original 
HBASE-26675. My gut reaction was that it was likely a change that cherry-picked 
cleanly but whose locking has subtly changed compared to the master branch.

This was one of the issues we talked about yesterday morning. Not sure if 
anyone got to the bottom of it yet. Thanks for mentioning, Andrew.

> Compaction failures after StoreFileTracker integration (branch-2, branch-2.5)
> -
>
> Key: HBASE-26938
> URL: https://issues.apache.org/jira/browse/HBASE-26938
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 2.5.0, 2.6.0
>Reporter: Andrew Kyle Purtell
>Priority: Blocker
> Fix For: 2.5.0
>
>
> [ Currently this has only been tested with branch-2.5 and branch-2. Testing 
> with master next, will update afterward. ]
> Test cluster of 10 regionservers is configured each RS with 5 flush threads, 
> 5 large compaction threads, and 10 small compaction threads. 
> Hadoop is 3.3.2. Java is 11. HFiles are on HDFS. 
> All the StoreFileTracker implementations, DEFAULT or FILE, exhibit compaction 
> time store writer errors in an ingest heavy use case. Unit tests don't seem 
> to cover whatever this is. Most compactions succeed, but some do not. Those 
> that do not are failing with state or sanity check assertions. Below errors 
> are all from DEFAULT. They seem related... store writer instance 
> usage/close/locking issues during compactions.
> Warnings like "writer exists when it should not":
> {noformat}
> 2022-04-07T23:13:11,351 WARN  
> [regionserver/ip-172-31-63-83:8120-shortCompactions-8]
> compactions.Compactor: Writer exists when it should not: {
>   
> hdfs://ip-172-31-58-47.us-west-2.compute.internal:8020/hbase/data/default/IntegrationTestLoadCommonCrawl/b518f72941d4427e7e1923407643df67/.tmp/c/29d7b88c4c214ddcbba4f747514a2cf5
>  }
> {noformat}
> Errors like:
> IllegalStateException thrown from 
> HFileBlockIndex$BlockIndexWriter.shouldWriteBlock:
> {noformat}
> 2022-04-07T23:13:11,508 ERROR 
> [regionserver/ip-172-31-63-83:8120-shortCompactions-6] 
> regionserver.CompactSplit: Compaction failed 
> region=IntegrationTestLoadCommonCrawl,,1649373172576.b518f72941d4427e7e1923407643df67.,
>  storeName=b518f72941d4427e7e1923407643df67/c, priority=10, 
> startTime=1649373185476
> java.lang.IllegalStateException: curInlineChunk is null; has shouldWriteBlock 
> been called with closing=true and then called again?
> at 
> org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexWriter.shouldWriteBlock(HFileBlockIndex.java:1258)
>  ~[hbase-server-2.5.0-SNAPSHOT.jar:2.5.0-SNAPSHOT]
> at 
> org.apache.hadoop.hbase.io.hfile.HFileWriterImpl.writeInlineBlocks(HFileWriterImpl.java:523)
>  ~[hbase-server-2.5.0-SNAPSHOT.jar:2.5.0-SNAPSHOT]
> at 
> org.apache.hadoop.hbase.io.hfile.HFileWriterImpl.close(HFileWriterImpl.java:608)
>  ~[hbase-server-2.5.0-SNAPSHOT.jar:2.5.0-SNAPSHOT]
> at 
> org.apache.hadoop.hbase.regionserver.StoreFileWriter.close(StoreFileWriter.java:377)
>  ~[hbase-server-2.5.0-SNAPSHOT.jar:2.5.0-SNAPSHOT]
> at 
> org.apache.hadoop.hbase.regionserver.compactions.DefaultCompactor.abortWriter(DefaultCompactor.java:84)
>  ~[hbase-server-2.5.0-SNAPSHOT.jar:2.5.0-SNAPSHOT]
> at 
> org.apache.hadoop.hbase.regionserver.compactions.DefaultCompactor.abortWriter(DefaultCompactor.java:76)
>  ~[hbase-server-2.5.0-SNAPSHOT.jar:2.5.0-SNAPSHOT]
> at 
> org.apache.hadoop.hbase.regionserver.compactions.Compactor.compact(Compactor.java:384)
>  ~[hbase-server-2.5.0-SNAPSHOT.jar:2.5.0-SNAPSHOT]
> at 
> org.apache.hadoop.hbase.regionserver.compactions.DefaultCompactor.compact(DefaultCompactor.java:62)
>  ~[hbase-server-2.5.0-SNAPSHOT.jar:2.5.0-SNAPSHOT]
> at 
> org.apache.hadoop.hbase.regionserver.DefaultStoreEngine$DefaultCompactionContext.compact(DefaultStoreEngine.java:125)
>  ~[hbase-server-2.5.0-SNAPSHOT.jar:2.5.0-SNAPSHOT]
> {noformat}
> and IllegalStateException thrown from HFileBlock$Writer.expectState:
> {noformat}
> 2022-04-07T23:13:11,559 ERROR 
> [regionserver/ip-172-31-63-83:8120-shortCompactions-8] 
> regionserver.CompactSplit: Compaction failed 
> region=IntegrationTestLoadCommonCrawl,,1649373172576.b518f72941d4427e7e1923407643df67.,
>  st

[jira] [Resolved] (CALCITE-4971) update httpclient and httpcore to latest 5.1 release

2022-04-05 Thread Josh Elser (Jira)


 [ 
https://issues.apache.org/jira/browse/CALCITE-4971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Elser resolved CALCITE-4971.
-
Fix Version/s: avatica-1.21.0
   Resolution: Fixed

> update httpclient and httpcore to latest 5.1 release
> 
>
> Key: CALCITE-4971
> URL: https://issues.apache.org/jira/browse/CALCITE-4971
> Project: Calcite
>  Issue Type: Improvement
>  Components: avatica
>Reporter: Istvan Toth
>Assignee: Istvan Toth
>Priority: Major
> Fix For: avatica-1.21.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Apache commons httpclient/httpcomponent 4.5 depend on commons-logging and not 
> slf4j. This means that phoenix-thin requires explicit log4j configuration to 
> work.
> We want all logging to go through SLF4j, and to be able to use any supported 
> backend.
> Based on an offline conversation with [~elserj]:
> As these are new major versions, it's probably going to involve more than a 
> version bump.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (HBASE-26907) Update Hadoop3 versions for JEP 223 compliance

2022-04-04 Thread Josh Elser (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17517158#comment-17517158
 ] 

Josh Elser commented on HBASE-26907:


{quote}Should I drop 2.3 from the matrix?
{quote}
+1

Thanks for digging through this compat stuff. Everything you've called out 
(here and on github) makes sense to me.

> Update Hadoop3 versions for JEP 223 compliance
> --
>
> Key: HBASE-26907
> URL: https://issues.apache.org/jira/browse/HBASE-26907
> Project: HBase
>  Issue Type: Task
>  Components: build
>Affects Versions: 2.5.0, 3.0.0-alpha-3
>Reporter: Nick Dimiduk
>Assignee: Nick Dimiduk
>Priority: Major
> Fix For: 2.5.0, 3.0.0-alpha-3, 2.4.12
>
>
> It happened that my JDK version upgraded to 11.0.14.1. Running unit tests 
> involving the HDFS mini cluster now fails with a stack trace that ends with
> {noformat}
> Caused by: java.lang.IllegalArgumentException: Invalid Java version 11.0.14.1
> at org.eclipse.jetty.util.JavaVersion.parseJDK9(JavaVersion.java:71)
> at org.eclipse.jetty.util.JavaVersion.parse(JavaVersion.java:49)
> at org.eclipse.jetty.util.JavaVersion.(JavaVersion.java:43)
> {noformat}
> We are using hadoop-3.2.0, which uses jetty-9.3.24. This is a Jetty issue that 
> has been fixed upstream in Jetty via 
> https://github.com/eclipse/jetty.project/issues/2090. Hadoop has upgraded its 
> Jetty version to 9.4.20 in HADOOP-16152, which is available as of 
> hadoop-3.2.2.
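
As an aside (illustrative, not part of the fix): {{11.0.14.1}} is a valid JEP 223 
version string, and the JDK's own parser accepts the fourth component that tripped up 
the old Jetty class:
{code:java}
// Illustrative only: "11.0.14.1" parses cleanly with the JDK's Runtime.Version
// API (Java 9+), including the fourth (patch) component.
public final class Jep223Check {
  public static void main(String[] args) {
    Runtime.Version v = Runtime.Version.parse("11.0.14.1");
    // Prints: 11.0.14.1
    System.out.println(v.feature() + "." + v.interim() + "." + v.update() + "." + v.patch());
  }
}
{code}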



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Resolved] (HBASE-26588) Implement a migration tool to help users migrate SFT implementation for a large set of tables

2022-04-04 Thread Josh Elser (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Elser resolved HBASE-26588.

Resolution: Later

Closing since we have HBASE-26673. Can re-open this if we have a reason that 
HBASE-26673 is insufficient.

> Implement a migration tool to help users migrate SFT implementation for a 
> large set of tables
> -
>
> Key: HBASE-26588
> URL: https://issues.apache.org/jira/browse/HBASE-26588
> Project: HBase
>  Issue Type: Sub-task
>  Components: tooling
>Reporter: Duo Zhang
>Priority: Major
>
> It will be very useful for our users who deploy HBase on S3 like systems.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (CALCITE-5082) Ensure client_reference updated on site for user-facing properties

2022-04-01 Thread Josh Elser (Jira)
Josh Elser created CALCITE-5082:
---

 Summary: Ensure client_reference updated on site for user-facing 
properties
 Key: CALCITE-5082
 URL: https://issues.apache.org/jira/browse/CALCITE-5082
 Project: Calcite
  Issue Type: Task
  Components: site
Reporter: Josh Elser
Assignee: Josh Elser


In CALCITE-5009's code review, I noticed that the client reference is out of 
date. Give it a refresh.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Resolved] (CALCITE-5009) Transparent JDBC connection re-creation may lead to data loss

2022-04-01 Thread Josh Elser (Jira)


 [ 
https://issues.apache.org/jira/browse/CALCITE-5009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Elser resolved CALCITE-5009.
-
Fix Version/s: avatica-1.21.0
   Resolution: Fixed

> Transparent JDBC connection re-creation may lead to data loss
> -
>
> Key: CALCITE-5009
> URL: https://issues.apache.org/jira/browse/CALCITE-5009
> Project: Calcite
>  Issue Type: Bug
>  Components: avatica
>Reporter: Istvan Toth
>Assignee: Istvan Toth
>Priority: Major
> Fix For: avatica-1.21.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Currently, if the server-side JDBC connection goes away because it is expired 
> from the server-side connection cache, we attempt to transparently create a 
> new "real" JDBC connection, and continue using that instead of the original 
> connection
> [https://github.com/apache/calcite-avatica/blob/fbdcc62745a0e8920db759fb6bdce564d854e407/core/src/main/java/org/apache/calcite/avatica/AvaticaConnection.java#L796]
> This is fine for most read-only connections, but it can break transaction 
> semantics, which is captured in the "real" connection object.
> {noformat}
> conn.setAutocommit(false)
> stmt = conn.createStatement()
> execute(insert A)
> //Connection lost and object recreated which now proxies a new "real" 
> connection
> execute(insert B)
> conn.commit()
> //We have lost "insert A"{noformat}
> I'm not sure if we synchronize autocommit state of the new connection to the 
> lost one or not, but it's bad either way.
>  
> We should either completely drop this feature, add some logic that avoids it 
> if there is an open transaction and/or only allow it for connections that 
> have the readOnly flag set.
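
For illustration (not from this ticket), the failure mode above in plain JDBC, 
assuming a generic {{javax.sql.DataSource}} and a table {{t}}:
{code:java}
// Illustrative only: the commit at the end silently misses the first insert if
// the server-side connection was expired and transparently re-created between
// the two statements.
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;
import javax.sql.DataSource;

public final class LostInsertDemo {
  static void run(DataSource ds) throws SQLException {
    try (Connection conn = ds.getConnection()) {
      conn.setAutoCommit(false);
      try (Statement stmt = conn.createStatement()) {
        stmt.executeUpdate("INSERT INTO t VALUES (1)"); // "insert A"
        // ... server-side connection expires here and is transparently
        // re-created; the new "real" connection has no pending, uncommitted writes ...
        stmt.executeUpdate("INSERT INTO t VALUES (2)"); // "insert B"
      }
      conn.commit(); // only "insert B" is durable -> "insert A" is lost
    }
  }
}
{code}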



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (CALCITE-5009) Transparent JDBC connection re-creation may lead to data loss

2022-04-01 Thread Josh Elser (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-5009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17516071#comment-17516071
 ] 

Josh Elser commented on CALCITE-5009:
-

Of course! I was the one who wrote this apparently half-baked idea :). Happy to 
review.

> Transparent JDBC connection re-creation may lead to data loss
> -
>
> Key: CALCITE-5009
> URL: https://issues.apache.org/jira/browse/CALCITE-5009
> Project: Calcite
>  Issue Type: Bug
>  Components: avatica
>Reporter: Istvan Toth
>Assignee: Istvan Toth
>Priority: Major
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Currently, if the server-side JDBC connection goes away because it is expired 
> from the server-side connection cache, we attempt to transparently create a 
> new "real" JDBC connection, and continue using that instead of the original 
> connection
> [https://github.com/apache/calcite-avatica/blob/fbdcc62745a0e8920db759fb6bdce564d854e407/core/src/main/java/org/apache/calcite/avatica/AvaticaConnection.java#L796]
> This is fine for most read-only connections, but it can break transaction 
> semantics, which is captured in the "real" connection object.
> {noformat}
> conn.setAutocommit(false)
> stmt = conn.createStatement()
> execute(insert A)
> //Connection lost and object recreated which now proxies a new "real" 
> connection
> execute(insert B)
> conn.commit()
> //We have lost "insert A"{noformat}
> I'm not sure if we synchronize autocommit state of the new connection to the 
> lost one or not, but it's bad either way.
>  
> We should either completely drop this feature, add some logic that avoids it 
> if there is an open transaction and/or only allow it for connections that 
> have the readOnly flag set.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HBASE-26667) Integrate user-experience for hbase-client

2022-03-25 Thread Josh Elser (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Elser updated HBASE-26667:
---
Hadoop Flags: Reviewed
  Resolution: Fixed
  Status: Resolved  (was: Patch Available)

> Integrate user-experience for hbase-client
> --
>
> Key: HBASE-26667
> URL: https://issues.apache.org/jira/browse/HBASE-26667
> Project: HBase
>  Issue Type: Sub-task
>    Reporter: Josh Elser
>Assignee: Andor Molnar
>Priority: Major
> Fix For: HBASE-26553
>
>
> Today, we have two mechanism in order to get the tokens needed to 
> authenticate:
>  # Kerberos, we rely on a Kerberos ticket being present in a well-known 
> location (defined by JVM properties) or via programmatic invocation of 
> UserGroupInformation
>  # Delegation tokens, we rely on special API to be called (our mapreduce API) 
> which loads the token into the current UserGroupInformation "context" (the 
> JAAS PrivilegedAction).
> The JWT bearer token approach is very similar to the delegation token 
> mechanism, but HBase does not generate this JWT (as we do with delegation 
> tokens). How does a client provide this token to the hbase-client (i.e. 
> {{ConnectionFactory.getConnection()}} or a {{UserGroupInformation}} call)? We 
> should be mindful of all of the different "entrypoints" to HBase ({{hbase ...}} 
> commands, {{java -cp}} commands, Phoenix commands, Spark commands, 
> etc). Our solution should be effective for all of these approaches and not 
> require downstream changes.
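
For reference, a minimal sketch of mechanism #1 above (the Kerberos/UGI path); the 
principal and keytab path are placeholders, and this is illustrative rather than the 
proposed JWT flow:
{code:java}
// Illustrative sketch of today's Kerberos path: log in via UGI from a keytab,
// then open the HBase connection inside doAs(). Principal and keytab path are
// placeholders, not real values.
import java.security.PrivilegedExceptionAction;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.security.UserGroupInformation;

public final class KerberosLoginSketch {
  public static Connection connect() throws Exception {
    Configuration conf = HBaseConfiguration.create();
    UserGroupInformation.setConfiguration(conf);
    UserGroupInformation ugi = UserGroupInformation.loginUserFromKeytabAndReturnUGI(
        "hbase-client@EXAMPLE.COM", "/etc/security/keytabs/client.keytab");
    return ugi.doAs(
        (PrivilegedExceptionAction<Connection>) () -> ConnectionFactory.createConnection(conf));
  }
}
{code}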



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


Re: A tweak to our checkstyle configuration

2022-03-20 Thread Josh Elser

Going through my inbox...

1. Great to have tooling which can validate (and fix) code which is not 
currently to style.
2. I would prefer a style guide (such as Google's Java Style) which is 
"generally accepted" by the Java industry at large and we can use as-is. 
However, I don't feel strongly on this.
3. I have no objections to Nick's original ask to allow one-line if 
blocks on the same line of code with brackets.
4. I prefer brackets for one line if/else blocks over no-brackets for 
the same (as Andrew indicates about avoiding dangling if-else blocks), 
but would not -1 a change if the majority felt otherwise.
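
For reference, the dangling-else hazard behind point 4, as a tiny illustrative 
example (not from our codebase):

    class BraceExample {
      // Illustrative only: despite the indentation, the "else" binds to the
      // nearest "if (b)", not to "if (a)". Braces make the intended pairing explicit.
      static String describe(boolean a, boolean b) {
        if (a)
          if (b)
            return "a and b";
        else
          return "a but not b"; // runs for (a && !b), even though it reads as the !a branch
        return "not a";
      }
    }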


On 1/16/22 3:01 AM, 张铎(Duo Zhang) wrote:

On enforcing the coding standards, I've filed HBASE-26617, to introduce the
spotless plugin to HBase.

We can add 'mvn spotless:check' to our pre-commit checks, so we can
enforce the coding standards.

And 'mvn spotless:apply' will format everything for you.

Andrew Purtell  于2022年1月16日周日 07:39写道:


There are a handful of anti patterns to avoid, like dangling if-elses.
(Always use braces around code blocks!) Otherwise we have been following
the Java basic guidelines with modifications for indent width and maximum
line length and I see no pressing reason why this needs to change. Happy
with the status quo. That said I see no reason to reject Nicks’s small
proposed changes. We definitely don’t need to adopt a totally different
style guide in response to a modest proposal. This seems out of proportion
to the ask.

If we are going to change checkstyle rules it would be necessary for the
proposer to provide a linter for the rest of us to use as well as a Yetus
precommit phase that implements the checks. Otherwise it would be a half
completed proposal and worse than making no changes. Please also provide
HOWTOs for configuring the IDEA and Eclipse IDEs.


On Jan 15, 2022, at 1:07 AM, 张铎  wrote:

What about just switching to use google java style?

Nick Dimiduk  于2022年1月13日周四 03:22写道:


Hey all.

Discussion on the PR has resulted in an impasse of opinion, but also
renewed interest in improvements to static analysis in general
(HBASE-26617).

I think that this kind of code hygiene is very important for the long-term
maintenance of a large project like ours and especially one that accepts
contributions from a broad audience. I would really appreciate it if some
more folks would chime into these discussions on PRs, or bring your
concerns back up to this thread. I'm game to help see the work done, but we
need more voices to participate in defining what is required by the
community.

Thanks in advance,
Nick


On Thu, Dec 9, 2021 at 3:58 PM Nick Dimiduk 

wrote:


Heya,

I have posted a small change to our checkstyle configuration on
HBASE-26536. This change will relax the whitespace rules regarding the
left-curly-bracket ('{') character. Specifically, I intend this change to
allow short expressions that include a nested scope that fits entirely on
one line. The example I provide is:

if (foo == null) { return null; }

This whitespace style is already present (though I think not in popular
usage) within the codebase. Please take a look and let me know if you have
any concerns about making this change.

Thanks,
Nick

https://issues.apache.org/jira/browse/HBASE-26536
https://github.com/apache/hbase/pull/3913









[jira] [Comment Edited] (HBASE-26791) Memstore flush fencing issue for SFT

2022-03-12 Thread Josh Elser (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17505292#comment-17505292
 ] 

Josh Elser edited comment on HBASE-26791 at 3/12/22, 4:22 PM:
--

{quote}isn't the broader issue here the fact RS1 doesn't abort immediately upon 
the loss of its ZK lock? Shouldn't we rather ensure an RS abort is triggered 
and all ongoing operations (including any hstore flushes) are interrupted right 
away?
{quote}
-Yes and no. In normal cases, yeah, we should just be able to interrupt the 
threads and expect them all to exit gracefully. However, when you start to 
consider JVM pauses and the like, it's non-deterministic if we can expect one 
thread in the RS to notice that we lost the RS lock, send an interrupt to all 
other flush/compaction threads, and then those threads to notice and take 
action on that.-

-If we can avoid it another way, there's value in that.-

edit: I really have to get better at making sure I refresh the page before 
commenting :(


was (Author: elserj):
{quote}isn't the broader issue here the fact RS1 doesn't abort immediately upon 
the loss of its ZK lock? Shouldn't we rather ensure an RS abort is triggered 
and all ongoing operations (including any hstore flushes) are interrupted right 
away?
{quote}
Yes and no. In normal cases, yeah, we should just be able to interrupt the 
threads and expect them all to exit gracefully. However, when you start to 
consider JVM pauses and the like, it's non-deterministic if we can expect one 
thread in the RS to notice that we lost the RS lock, send an interrupt to all 
other flush/compaction threads, and then those threads to notice and take 
action on that.

If we can avoid it another way, there's value in that.

> Memstore flush fencing issue for SFT
> 
>
> Key: HBASE-26791
> URL: https://issues.apache.org/jira/browse/HBASE-26791
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 2.6.0, 3.0.0-alpha-3
>Reporter: Szabolcs Bukros
>Assignee: Duo Zhang
>Priority: Major
>
> The scenario is the following:
>  # rs1 is flushing file to S3 for region1
>  # rs1 loses ZK lock
>  # region1 gets assigned to rs2
>  # rs2 opens region1
>  # rs1 completes flush and updates sft file for region1
>  # rs2 has a different “version” of the sft file for region1
> The flush should fail at the end, but the SFT file gets overwritten before 
> that, resulting in potential data loss.
>  
> Potential solutions include:
>  * Adding a timestamp to the tracker file names. This and creating a new 
> tracker file when an rs opens the region would allow us to list available 
> tracker files before an update and compare the found timestamps to the one 
> stored in memory to verify the store still owns the latest tracker file
>  * Using the existing timestamp in the tracker file content. This would also 
> require us to create a new tracker file when a new rs opens the region, but 
> instead of listing the available tracker files, we could try to load and 
> de-serialize the last tracker file and compare the timestamp found in it to 
> the one stored in memory.
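
As a generic model of the first option above (illustrative only; this is not HBase 
code, and the {{tracker.<timestamp>}} file-name scheme is an assumption):
{code:java}
// Generic model of the fencing check: before writing a new tracker file, verify
// that the newest tracker file on storage is still the one this process wrote
// last; a newer file means another server has taken over the region.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

public final class TrackerFenceSketch {
  static boolean stillOwner(Path trackerDir, long lastTimestampWrittenByUs) throws IOException {
    try (Stream<Path> files = Files.list(trackerDir)) {
      long newest = files
          .map(p -> p.getFileName().toString()) // e.g. "tracker.1649373185476" (assumed format)
          .map(name -> Long.parseLong(name.substring(name.lastIndexOf('.') + 1)))
          .max(Comparator.naturalOrder())
          .orElse(-1L);
      return newest <= lastTimestampWrittenByUs; // anything newer means we lost the region
    }
  }
}
{code}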



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (HBASE-26791) Memstore flush fencing issue for SFT

2022-03-12 Thread Josh Elser (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17505292#comment-17505292
 ] 

Josh Elser commented on HBASE-26791:


{quote}isn't the broader issue here the fact RS1 doesn't abort immediately upon 
the loss of its ZK lock? Shouldn't we rather ensure an RS abort is triggered 
and all ongoing operations (including any hstore flushes) are interrupted right 
away?
{quote}
Yes and no. In normal cases, yeah, we should just be able to interrupt the 
threads and expect them all to exit gracefully. However, when you start to 
consider JVM pauses and the like, it's non-deterministic if we can expect one 
thread in the RS to notice that we lost the RS lock, send an interrupt to all 
other flush/compaction threads, and then those threads to notice and take 
action on that.

If we can avoid it another way, there's value in that.

> Memstore flush fencing issue for SFT
> 
>
> Key: HBASE-26791
> URL: https://issues.apache.org/jira/browse/HBASE-26791
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 2.6.0, 3.0.0-alpha-3
>Reporter: Szabolcs Bukros
>Assignee: Duo Zhang
>Priority: Major
>
> The scenario is the following:
>  # rs1 is flushing file to S3 for region1
>  # rs1 loses ZK lock
>  # region1 gets assigned to rs2
>  # rs2 opens region1
>  # rs1 completes flush and updates sft file for region1
>  # rs2 has a different “version” of the sft file for region1
> The flush should fail at the end, but the SFT file gets overwritten before 
> that, resulting in potential data loss.
>  
> Potential solutions include:
>  * Adding a timestamp to the tracker file names. This and creating a new 
> tracker file when an rs opens the region would allow us to list available 
> tracker files before an update and compare the found timestamps to the one 
> stored in memory to verify the store still owns the latest tracker file
>  * Using the existing timestamp in the tracker file content. This would also 
> require us to create a new tracker file when a new rs opens the region, but 
> instead of listing the available tracker files, we could try to load and 
> de-serialize the last tracker file and compare the timestamp found in it to 
> the one stored in memory.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (PHOENIX-3654) Client-side PQS discovery for thin client

2022-03-09 Thread Josh Elser (Jira)


[ 
https://issues.apache.org/jira/browse/PHOENIX-3654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17503867#comment-17503867
 ] 

Josh Elser commented on PHOENIX-3654:
-

I've come back to change the name of this Jira issue because it continues to 
give me grief. People see this and believe that this Jira issue gives some 
support for a load balancer in front of PQS which is not what this commit 
actually does.

This commit does client-based service discovery of PQS instances which are 
registered in ZK. This commit does not enable any ability for HTTP load 
balancers (e.g. haproxy, httpd, F5, etc) to be used with PQS.

> Client-side PQS discovery for thin client
> -
>
> Key: PHOENIX-3654
> URL: https://issues.apache.org/jira/browse/PHOENIX-3654
> Project: Phoenix
>  Issue Type: New Feature
>Affects Versions: 4.8.0
> Environment: Linux 3.13.0-107-generic kernel, v4.9.0-HBase-0.98
>Reporter: Rahul Shrivastava
>Assignee: Rahul Shrivastava
>Priority: Major
> Fix For: 4.12.0
>
> Attachments: LoadBalancerDesign.pdf, Loadbalancer.patch
>
>   Original Estimate: 240h
>  Remaining Estimate: 240h
>
> We have been having internal discussion on load balancer for thin client for 
> PQS. The general consensus we have is to have an embedded load balancer with 
> the thin client instead of using external load balancer such as haproxy. The 
> idea is not to have another layer between client and PQS. This reduces 
> operational cost for the system, which currently leads to delays in executing 
> projects.
> But this also comes with challenge of having an embedded load balancer which 
> can maintain sticky sessions, do fair load balancing knowing the load 
> downstream of PQS server. In addition, load balancer needs to know location 
> of multiple PQS server. Now, the thin client needs to keep track of PQS 
> servers via zookeeper ( or other means). 
> In the new design, the client ( PQS client) , it is proposed to  have an 
> embedded load balancer.
> Where will the load balancer sit?
> The load balancer will be embedded within the app server client.
> How will the load balancer work ? 
> Load balancer will contact zookeeper to get location of PQS. In this case, 
> PQS needs to register to ZK itself once it comes online. Zookeeper location 
> is in hbase-site.xml. It will maintain a small cache of connection to the 
> PQS. When a request comes in, it will check for an open connection from the 
> cache. 
> How will load balancer know load on PQS ?
> To start with, it will pick a random open connection to PQS. This means that 
> load balancer does not know PQS load. Later , we can augment the code so that 
> thin client can receive load info from PQS and make intelligent decisions.  
> How will load balancer maintain sticky sessions ?
> While we still need to investigate how to implement sticky sessions. We can 
> look for some open source implementation for the same.
> How will PQS register itself to service locator ?
> PQS will have location of zookeeper in hbase-site.xml and it would register 
> itself to the zookeeper. Thin client will find out PQS location using 
> zookeeper.
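
To make the client-side discovery flow described above concrete, here is a
minimal sketch. It assumes a hypothetical registration layout in which each PQS
instance creates an ephemeral child znode named host:port under
/phoenix/queryserver; the actual Phoenix implementation may use a different
path and format.
{code:java}
import java.util.List;
import java.util.Random;
import org.apache.zookeeper.ZooKeeper;

// Minimal sketch only: discover registered PQS instances from ZooKeeper and
// pick one at random, as described above. The znode path and registration
// format are hypothetical.
public class PqsDiscoverySketch {
  private static final String PQS_PARENT_ZNODE = "/phoenix/queryserver"; // hypothetical
  private final Random random = new Random();

  public String pickServer(String zkQuorum) throws Exception {
    // One-shot lookup, so no watcher logic; 30s session timeout.
    ZooKeeper zk = new ZooKeeper(zkQuorum, 30_000, event -> { });
    try {
      // Each PQS instance is assumed to register an ephemeral child named "host:port".
      List<String> servers = zk.getChildren(PQS_PARENT_ZNODE, false);
      if (servers.isEmpty()) {
        throw new IllegalStateException("No PQS registered under " + PQS_PARENT_ZNODE);
      }
      // "To start with, it will pick a random open connection to PQS."
      return servers.get(random.nextInt(servers.size()));
    } finally {
      zk.close();
    }
  }
}
{code}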



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (PHOENIX-3654) Client-side PQS discovery for thin client

2022-03-09 Thread Josh Elser (Jira)


 [ 
https://issues.apache.org/jira/browse/PHOENIX-3654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Elser updated PHOENIX-3654:

Summary: Client-side PQS discovery for thin client  (was: Load Balancer for 
thin client)

> Client-side PQS discovery for thin client
> -
>
> Key: PHOENIX-3654
> URL: https://issues.apache.org/jira/browse/PHOENIX-3654
> Project: Phoenix
>  Issue Type: New Feature
>Affects Versions: 4.8.0
> Environment: Linux 3.13.0-107-generic kernel, v4.9.0-HBase-0.98
>Reporter: Rahul Shrivastava
>Assignee: Rahul Shrivastava
>Priority: Major
> Fix For: 4.12.0
>
> Attachments: LoadBalancerDesign.pdf, Loadbalancer.patch
>
>   Original Estimate: 240h
>  Remaining Estimate: 240h
>
> We have been having internal discussion on load balancer for thin client for 
> PQS. The general consensus we have is to have an embedded load balancer with 
> the thin client instead of using external load balancer such as haproxy. The 
> idea is to not to have another layer between client and PQS. This reduces 
> operational cost for system, which currently leads to delay in executing 
> projects.
> But this also comes with challenge of having an embedded load balancer which 
> can maintain sticky sessions, do fair load balancing knowing the load 
> downstream of PQS server. In addition, load balancer needs to know location 
> of multiple PQS server. Now, the thin client needs to keep track of PQS 
> servers via zookeeper ( or other means). 
> In the new design, it is proposed that the client (PQS client) have an 
> embedded load balancer.
> Where will the load balancer sit?
> The load balancer will be embedded within the app server client.
> How will the load balancer work ? 
> Load balancer will contact zookeeper to get location of PQS. In this case, 
> PQS needs to register to ZK itself once it comes online. Zookeeper location 
> is in hbase-site.xml. It will maintain a small cache of connection to the 
> PQS. When a request comes in, it will check for an open connection from the 
> cache. 
> How will load balancer know load on PQS ?
> To start with, it will pick a random open connection to PQS. This means that 
> load balancer does not know PQS load. Later , we can augment the code so that 
> thin client can receive load info from PQS and make intelligent decisions.  
> How will load balancer maintain sticky sessions ?
> While we still need to investigate how to implement sticky sessions. We can 
> look for some open source implementation for the same.
> How will PQS register itself to service locator ?
> PQS will have location of zookeeper in hbase-site.xml and it would register 
> itself to the zookeeper. Thin client will find out PQS location using 
> zookeeper.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (HBASE-26791) Memstore flush fencing issue for SFT

2022-03-09 Thread Josh Elser (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17503647#comment-17503647
 ] 

Josh Elser commented on HBASE-26791:


ICYMI [~zhangduo] 

> Memstore flush fencing issue for SFT
> 
>
> Key: HBASE-26791
> URL: https://issues.apache.org/jira/browse/HBASE-26791
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 2.6.0, 3.0.0-alpha-3
>Reporter: Szabolcs Bukros
>Priority: Major
>
> The scenarios is the following:
>  # rs1 is flushing file to S3 for region1
>  # rs1 loses ZK lock
>  # region1 gets assigned to rs2
>  # rs2 opens region1
>  # rs1 completes flush and updates sft file for region1
>  # rs2 has a different “version” of the sft file for region1
> The flush should fail at the end, but the SFT file gets overwritten before 
> that, resulting in potential data loss.
>  
> Potential solutions include:
>  * Adding timestamp to the tracker file names. This and creating a new 
> tracker file when an rs opens the region would allow us to list available 
> tracker files before an update and compare the found timestamps to the one 
> stored in memory to verify the store still owns the latest tracker file
>  * Using the existing timestamp in the tracker file content. This would also 
> require us to create a new tracker file when a new rs opens the region, but 
> instead of listing the available tracker files, we could try to load and 
> de-serialize the last tracker file and compare the timestamp found in it to 
> the one stored in memory.
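
To illustrate the first option above, a rough sketch of the ownership check an
RS could perform before updating the tracker file. The helper methods and field
names here are entirely hypothetical; this is not the actual StoreFileTracker
API.
{code:java}
import java.io.IOException;
import java.util.List;

// Rough sketch of the "timestamp in the tracker file name" idea.
// listTrackerFileTimestamps/writeTrackerFile and ownedTrackerTimestamp are
// hypothetical placeholders.
class TimestampedTrackerSketch {
  // Timestamp of the tracker file this RS created when it opened the region.
  private long ownedTrackerTimestamp;

  void updateTrackerFile(byte[] newContent) throws IOException {
    // List the tracker files currently on the filesystem and find the newest one.
    List<Long> timestamps = listTrackerFileTimestamps();
    long newest = timestamps.stream().max(Long::compare).orElse(-1L);
    // If a newer RS created a tracker file after ours, we no longer own the store.
    if (newest > ownedTrackerTimestamp) {
      throw new IOException("Found newer tracker file (" + newest
        + "); this region server no longer owns the store, aborting update");
    }
    // Still the owner: write a new timestamped tracker file and remember its timestamp.
    ownedTrackerTimestamp = writeTrackerFile(newContent);
  }

  private List<Long> listTrackerFileTimestamps() throws IOException {
    throw new UnsupportedOperationException("hypothetical helper");
  }

  private long writeTrackerFile(byte[] content) throws IOException {
    throw new UnsupportedOperationException("hypothetical helper");
  }
}
{code}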



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


Re: [DISCUSS] operator tools, HBase 3 and StoreFileTracking

2022-03-01 Thread Josh Elser
I tend to lean towards what Andrew is saying here, but I will also admit 
that this is in part from not having a good user-experience about 
getting up an HMaster in maintenance mode to do surgical stuff (feels 
like two steps instead of just one).


Naively, rebuilding the SFT meta files from the filesystem doesn't 
require the HMaster to be up because there isn't any other "state" to 
consider (which was a big reason behind pushing the work that hbck2 was 
doing into the active master to avoid split-brain).


Is doing logic in HBCK2 that doesn't talk to the HMaster a -1 from you, 
Duo? Similarly, is a utility in hbase-operator-tools (not a part of the 
hbck2 wrapper command) also a -1?


Either is feasible, but I do think trying to build this SFT 
rebuilding/recovery into a maintenance-mode HMaster will be more work.


On 2/21/22 12:27 PM, Andrew Purtell wrote:

There are some recovery cases where the cluster cannot be expected to be up
and running. What happens if we have no tooling for those? The user has a
dead cluster. So I don't think a requirement that the cluster be up and
running always is sufficient. For this type of recovery operator-tools must
be able to parse and write on disk formats. On the other hand hopefully the
cases for which that is not true are rare. In HBase 1, we had
OffineMetaRebuild. For my operations occasionally it has been necessary, in
test environments especially where users are not always clueful, and it has
shortened incident time from many hours to less than one hour. The
alternative would have been rebuild from scratch with total data loss,
which is a totally unsatisfying user experience.


On Sun, Feb 20, 2022 at 4:29 AM 张铎(Duo Zhang)  wrote:


Sorry a bit late...

IIRC, the design of HBCK2 is that most of the actual fix logic should be
done inside hbase (usually as a procedure), and the hbase-operator-tools is
just a facade for calling these methods. It will query the cluster to find
out which features are supported. So in general, the design here is to
always have the cluster up when fixing. We have a maintenance mode where we
will just bring up HMaster and make the meta table online, without loading any
other regions.

So I prefer we just use snapshot dependencies of hbase in HBCK2. It is not
a big deal for end users since, if we have not made the release yet, the new
fixing options can never be actually used against a production cluster.

Anyway, this means we need to publish nightly builds then.

Thanks.

On Fri, Feb 18, 2022 at 06:40 Peter Somogyi  wrote:


Makes sense. Thanks Andrew for clarifying!

On Thu, Feb 17, 2022, 21:28 Andrew Purtell  wrote:


On Thu, Feb 17, 2022 at 12:19 PM Peter Somogyi  wrote:


I like the idea of including the store file tracking in 2.5.0 to unblock
the HBCK development efforts.

Unfortunately, I was not following its development that much. Can it cause
any issues if 2.5.0 has the feature but later an incompatible change is
needed for SFT? Can it be marked as a beta feature where we are free to
modify interfaces?


Yes, this is what I meant when I suggested we could mark it as
'experimental'. We have done this in the past. The word 'experimental' is
prominently included adjacent to any discussion of the feature in
documentation and release notes. When we feel for sure it is stable that
word is removed. We can do something different this time of course but that
has been our past practice when introducing new functionality into
releasing code lines. And I presume we would use the Evolving interface
annotation everywhere.

Peter


On Tue, Feb 15, 2022 at 11:07 PM Andrew Purtell <andrew.purt...@gmail.com>
wrote:


Another option which I do not see mentioned yet is to extract the relevant
common proto and source files from the ‘hbase’ repository into a new
repository (‘hbase-storage’?), from which we would release artifacts to be
consumed by both hbase and hbase-operator-tools. This maintains D.R.Y.
through refactoring although it may down the road cause some complexity in
coordinating evolution among the three (if not more) repositories and
releases produced from them. This is like Josh’s Option 1 but without
duplication.

Regarding the option 2 issue… If it would help we can drop SFT into
branch-2.5 along with the log4j2 changes and release 2.5.0 afterward. We
are taking the opportunity of this minor increment to accelerate log4j1
retirement, which is why it’s still waiting (but not for long). We can use
the same opportunity to release SFT even if we designate it as an
experimental feature if that would simplify some other logistics. For what
it’s worth.


On Feb 15, 2022, at 7:44 AM, Josh Elser  wrote:


I was talking with Szabolcs prior to him sending this one, and it’s a
tricky issue for sure.

To date, we've solved any HBase API issues by copying code into HBCK2
e.g. HBCKMetaTableAccessor which copies parts of MetaTableAccessor, or we
push the logic down server-side to 

[jira] [Commented] (HBASE-26777) BufferedDataBlockEncoder$OffheapDecodedExtendedCell.deepClone throws UnsupportedOperationException

2022-02-28 Thread Josh Elser (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17499160#comment-17499160
 ] 

Josh Elser commented on HBASE-26777:


I can see how throwing an UnsupportedOperationException in HBASE-26036 to catch 
anyone else doing this "works", but is pretty aggressive. If there is API that 
we can't safely support (given how we want the semantics of that API to work, 
e.g. #get() does an on-heap copy and we want to avoid on-heap copies), it would 
be better to remove the API than let it fail later.

> BufferedDataBlockEncoder$OffheapDecodedExtendedCell.deepClone throws 
> UnsupportedOperationException
> --
>
> Key: HBASE-26777
> URL: https://issues.apache.org/jira/browse/HBASE-26777
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver
>Affects Versions: 2.4.10
>Reporter: Istvan Toth
>Assignee: Istvan Toth
>Priority: Major
>
> BufferedDataBlockEncoder$OffheapDecodedExtendedCell.deepClone throws an 
> unsupportedException.
> However, org.apache.hadoop.hbase.regionserver.HRegion.get(Get, boolean, long, 
> long)
> calls the method:
> {code:java}
>       // Copy EC to heap, then close the scanner.
>       // This can be an EXPENSIVE call. It may make an extra copy from 
> offheap to onheap buffers.
>       // See more details in HBASE-26036.
>   for (Cell cell : tmp) {
>         results.add(cell instanceof ByteBufferExtendedCell ?
>           ((ByteBufferExtendedCell) cell).deepClone(): cell);
>       } {code}
> According to the comment above, this is probably caused by HBASE-26036.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Resolved] (HBASE-26767) Rest server should not use a large Header Cache.

2022-02-23 Thread Josh Elser (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Elser resolved HBASE-26767.

Hadoop Flags: Reviewed
  Resolution: Fixed

Pushed! Thanks for the great work, Sergey.

> Rest server should not use a large Header Cache.
> 
>
> Key: HBASE-26767
> URL: https://issues.apache.org/jira/browse/HBASE-26767
> Project: HBase
>  Issue Type: Bug
>  Components: REST
>Affects Versions: 2.4.9
>Reporter: Sergey Soldatov
>Assignee: Sergey Soldatov
>Priority: Major
> Fix For: 2.5.0, 3.0.0-alpha-3, 2.4.10
>
>
> In the RESTServer we set the HeaderCache size to DEFAULT_HTTP_MAX_HEADER_SIZE 
> (65536). That's not compatible with jetty-9.4.x because the cache size is 
> limited by Character.MAX_VALUE - 1  (65534) there. According to the Jetty 
> source code comments, it's possible to have a buffer overflow in the cache 
> for higher values and that might lead to wrong/incomplete values returned by 
> cache and following incorrect header handling.  
> There are a couple of ways to fix it:
> 1. change the value of DEFAULT_HTTP_MAX_HEADER_SIZE to 65534
> 2. make header cache size configurable and set its size separately from the 
> header size. 
> I believe that the second would give us more flexibility.
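
A minimal sketch of option 2 against the Jetty 9.4 API, decoupling the header
cache size from the maximum header size. The configuration key names and
defaults below are hypothetical, for illustration only.
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.eclipse.jetty.server.HttpConfiguration;

// Sketch: size the request header and the header cache independently.
public class RestHttpConfigSketch {
  public static HttpConfiguration build(Configuration conf) {
    HttpConfiguration httpConfig = new HttpConfiguration();
    // Large headers (e.g. a SPNEGO Authorization header) may legitimately need 64 KiB...
    httpConfig.setRequestHeaderSize(conf.getInt("hbase.rest.http.header.size", 65536));
    // ...but Jetty's header cache must stay below Character.MAX_VALUE (65535).
    httpConfig.setHeaderCacheSize(conf.getInt("hbase.rest.http.header.cache.size", 1024));
    return httpConfig;
  }
}
{code}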



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HBASE-26767) Rest server should not use a large Header Cache.

2022-02-23 Thread Josh Elser (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Elser updated HBASE-26767:
---
Fix Version/s: 2.5.0

> Rest server should not use a large Header Cache.
> 
>
> Key: HBASE-26767
> URL: https://issues.apache.org/jira/browse/HBASE-26767
> Project: HBase
>  Issue Type: Bug
>  Components: REST
>Affects Versions: 2.4.9
>Reporter: Sergey Soldatov
>Assignee: Sergey Soldatov
>Priority: Major
> Fix For: 2.5.0, 3.0.0-alpha-3, 2.4.10
>
>
> In the RESTServer we set the HeaderCache size to DEFAULT_HTTP_MAX_HEADER_SIZE 
> (65536). That's not compatible with jetty-9.4.x because the cache size is 
> limited by Character.MAX_VALUE - 1  (65534) there. According to the Jetty 
> source code comments, it's possible to have a buffer overflow in the cache 
> for higher values and that might lead to wrong/incomplete values returned by 
> cache and following incorrect header handling.  
> There are a couple of ways to fix it:
> 1. change the value of DEFAULT_HTTP_MAX_HEADER_SIZE to 65534
> 2. make header cache size configurable and set its size separately from the 
> header size. 
> I believe that the second would give us more flexibility.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (HBASE-26767) Rest server should not use a large Header Cache.

2022-02-23 Thread Josh Elser (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17497048#comment-17497048
 ] 

Josh Elser commented on HBASE-26767:


I also talked to Sergey off-Jira who said that branch-1 doesn't have this 
header cache size set, so not targeting a corresponding branch-1 change there.

> Rest server should not use a large Header Cache.
> 
>
> Key: HBASE-26767
> URL: https://issues.apache.org/jira/browse/HBASE-26767
> Project: HBase
>  Issue Type: Bug
>  Components: REST
>Affects Versions: 2.4.9
>Reporter: Sergey Soldatov
>Assignee: Sergey Soldatov
>Priority: Major
> Fix For: 3.0.0-alpha-3, 2.4.10
>
>
> In the RESTServer we set the HeaderCache size to DEFAULT_HTTP_MAX_HEADER_SIZE 
> (65536). That's not compatible with jetty-9.4.x because the cache size is 
> limited by Character.MAX_VALUE - 1  (65534) there. According to the Jetty 
> source code comments, it's possible to have a buffer overflow in the cache 
> for higher values and that might lead to wrong/incomplete values returned by 
> cache and following incorrect header handling.  
> There are a couple of ways to fix it:
> 1. change the value of DEFAULT_HTTP_MAX_HEADER_SIZE to 65534
> 2. make header cache size configurable and set its size separately from the 
> header size. 
> I believe that the second would give us more flexibility.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HBASE-26767) Rest server should not use a large Header Cache.

2022-02-23 Thread Josh Elser (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Elser updated HBASE-26767:
---
Fix Version/s: 3.0.0-alpha-3
   2.4.10

> Rest server should not use a large Header Cache.
> 
>
> Key: HBASE-26767
> URL: https://issues.apache.org/jira/browse/HBASE-26767
> Project: HBase
>  Issue Type: Bug
>  Components: REST
>Affects Versions: 2.4.9
>Reporter: Sergey Soldatov
>Assignee: Sergey Soldatov
>Priority: Major
> Fix For: 3.0.0-alpha-3, 2.4.10
>
>
> In the RESTServer we set the HeaderCache size to DEFAULT_HTTP_MAX_HEADER_SIZE 
> (65536). That's not compatible with jetty-9.4.x because the cache size is 
> limited by Character.MAX_VALUE - 1  (65534) there. According to the Jetty 
> source code comments, it's possible to have a buffer overflow in the cache 
> for higher values and that might lead to wrong/incomplete values returned by 
> cache and following incorrect header handling.  
> There are a couple of ways to fix it:
> 1. change the value of DEFAULT_HTTP_MAX_HEADER_SIZE to 65534
> 2. make header cache size configurable and set its size separately from the 
> header size. 
> I believe that the second would give us more flexibility.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (HBASE-26767) Rest server should not use a large Header Cache.

2022-02-23 Thread Josh Elser (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17496947#comment-17496947
 ] 

Josh Elser commented on HBASE-26767:


Paraphrasing (please correct me if I get this wrong), but the manifestation of 
this issue is authorization failures as a result of this cache "breaking". 
Specifically, you observed issues where the headers were getting malformed 
(specifically, the Authorization SPNEGO header).

In short, this was breaking basic authentication against the REST server.

> Rest server should not use a large Header Cache.
> 
>
> Key: HBASE-26767
> URL: https://issues.apache.org/jira/browse/HBASE-26767
> Project: HBase
>  Issue Type: Bug
>  Components: REST
>Affects Versions: 2.4.9
>Reporter: Sergey Soldatov
>Assignee: Sergey Soldatov
>Priority: Major
>
> In the RESTServer we set the HeaderCache size to DEFAULT_HTTP_MAX_HEADER_SIZE 
> (65536). That's not compatible with jetty-9.4.x because the cache size is 
> limited by Character.MAX_VALUE - 1  (65534) there. According to the Jetty 
> source code comments, it's possible to have a buffer overflow in the cache 
> for higher values and that might lead to wrong/incomplete values returned by 
> cache and following incorrect header handling.  
> There are a couple of ways to fix it:
> 1. change the value of DEFAULT_HTTP_MAX_HEADER_SIZE to 65534
> 2. make header cache size configurable and set its size separately from the 
> header size. 
> I believe that the second would give us more flexibility.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


Re: [DISCUSS] deprecating jdk8, progress on LTS jdk support

2022-02-17 Thread Josh Elser

On 2/16/22 12:24 AM, Sean Busbey wrote:

Regarding the original question, I would be in favor of the proposal. Time
marches on. I assume just to state the obvious that our destination of
minimum LTS would shift from 8 to 11.


Yes, sorry I should have expressly stated JDK11 would become the minimum
with some release after HBase 3.

I got here because I wanted to start working on qualifying JDK17 as a
runtime environment but then realized we were putting more caveats on JDK11
than I expected.

Hadoop 2 isn’t exactly dead, at least the source branch is still receiving
occasional updates, but is not releasing. We should probably consider it
effectively EOL.


IIRC we've already dropped Hadoop 2 support for HBase 3.


Correct.


The Hadoop minimum could become 3.3. The primary consideration to my mind
is the state of S3A: in what version it can be said to be stable and
feature complete. I think 3.3 is the appropriate code line for that
criteria but perhaps 3.2 could serve as well.


I really like this as a criteria. Anyone else have an idea on this?


I believe we've been benefiting from S3A changes from Hadoop 3.3 internally 
at Cloudera already. However, I believe that we'll actually see more 
"pains" once we get the storefile tracking feature solid (whereas today, 
transient/perf problems we might face in S3A would be hidden by the fact 
that we're doubling our I/O costs on compaction, memstore flushes, etc).


I have not been following super-closely, but let me see if I can bring 
this in front of Steve or someone else from Cloudera to chime in.


Re: [DISCUSS] deprecating jdk8, progress on LTS jdk support

2022-02-15 Thread Josh Elser

Deprecating jdk8 for HBase 3 and requiring minJdk=11 seems reasonable to me.

Gotta start pushing the issue somehow.

On 2/15/22 1:47 PM, Sean Busbey wrote:

Hi folks!

It's been some time since we decided to stick to LTS JDK releases as a way
of getting a handle on the JDK treadmill.

What do folks think about deprecating JDK8? The openjdk8u project is still
going and there are commercial support options at least through 2030.

Deprecating it in HBase 3 would mean we could remove it in HBase 4, not
that we would _have_ to remove it. The way I think about likely timing of
these events goes like this:

* HBase 2 started alphas in June 2017, betas in January 2018, and came out
in April 2018
* HBase 3 started alphas in July 2021, and as of Feb 2022 we haven't
discussed how close we are to our stated beta goals (upgrades from active
2.x releases and removal of not-ready features).

Given the above, in the absence of us specifically pushing to roll through
major version numbers for some reason, I think a reasonably conservative
estimate is for HBase 3 to arrive in late 2022 or early 2023 and then HBase
4 to start alphas in ~2025. An HBase 5 prior to 2030 seems unlikely.

That all said, our current reference guide section on java versions does
not sound very confident about JDK11 support.


A Note on JDK11
Preliminary support for JDK11 is introduced with HBase 2.3.0. This support is
limited to compilation and running the full test suite. There are open questions
regarding the runtime compatibility of JDK11 with Apache ZooKeeper and Apache
Hadoop (HADOOP-15338). Significantly, neither project has yet released a version
with explicit runtime support for JDK11. The remaining known issues in HBase are
catalogued in HBASE-22972.



Since that blurb was written, Hadoop has added JDK11 support [1] as has
ZooKeeper[2]. As a part of buttoning up our JDK11 support we could update
our minimum supported versions of these projects to match that support.

What do folks think?

[1]: https://hadoop.apache.org/docs/r3.3.0/index.html
[2]:
https://zookeeper.apache.org/doc/r3.6.0/zookeeperAdmin.html#sc_systemReq



Re: [DISCUSS] operator tools, HBase 3 and StoreFileTracking

2022-02-15 Thread Josh Elser
I was talking with Szabolcs prior to him sending this one, and it's a 
tricky issue for sure.


To date, we've solved any HBase API issues by copying code into HBCK2 
e.g. HBCKMetaTableAccessor which copies parts of MetaTableAccessor, or 
we push the logic down server-side to the HBase Master and invoke it 
over the Hbck RPC interface.


I definitely want to avoid HBase version specific builds of the 
operator-tools, so that is not an option in my mind for 2.x. The 
discussions we had (that I remember) around HBCK2 were limited in scope 
to HBase 2.x.


Option 1: we copy the necessary proto files from HBase into the 
operator-tools and try to remember that, if we make any change to the 
serialization of the storefile list files, we have to copy that change 
to HBCK2. Brittle on the surface but effective.


Option 2: We bump HBCK2 to hbase-2.6.0-SNAPSHOT. Problematic until we 
make an HBase 2.6.0[-alpha] release. We should already have wire compat 
between all of HBase 2.x which makes that a non-issue.


Option 3: We create an HBCK3 targeted for HBase 3.x. I'm not convinced 
we need to do that (hbck for hbase 3.x would be just like hbck for hbase 
2.x). This would also not solve the problem for the SFT feature in hbase 
2.6.


I think option 3 is a no-go. I am leaning towards option 1 at this 
point. Hopefully my thought process is helpful for others to weigh in.



On 2/14/22 11:31 AM, Szabolcs Bukros wrote:

Hi Folks!

While working on adding tools to handle potential FileBased
StoreFileTracker issues to HBCK2 (HBASE-26624
) I ran into multiple
problems I'm unsure how to solve.

First of all the tools would rely on files not yet available in any of the
released hbase artifacts. I tried to solve this without changing the hbase
dependency version to keep HBCK2 as hbase version independent as possible,
but none of the solutions I have found looked acceptable:
  - Pushing the logic to the hbase side (as far as I can tell) is not
feasible because it has to be able to repair meta which is easier when
hbase is down and the tool should be able to run without a working hbase.
  - The files tracking the store content are serialized proto objects so
while replicating those files in the operator tools is possible, it would
not be pretty.

Bumping operator tools to use hbase 2.6.0-SNAPSHOT (branch-2 has the SFT
changes) would mean that now we need that or a newer version to build the
project and a version check to avoid runtime problems with the new tools,
but otherwise this looks rather painless and backwards compatible. I know
operator tools tries to avoid having a hbase-specific release, but having
2.6 as a min version to build against might be acceptable.

While looking into this I also checked what needs to be done to make
operator tools work with hbase 3.0.0-alpha-3-SNAPSHOT. Most of the changes
are backwards compatible but not all of them and the ones that aren't would
make a big chunk of Fsck unusable with older hbases. For me that looks
acceptable since this is a major version change, but that would mean I can
not rely on a potential HBCK3 to fix SFT issues, I would also need a
solution for HBCK2.

I tried to look for plans/direction regarding the new 1.3 operator tools
but could not find any.

Do you think it would be possible to bump the hbase version it uses to
2.6.0-SNAPSHOT?
Do you think it would make sense to start working on a hbase3 compatible
branch or is it too early?

NOTE:
I'm aware hbase has not published SNAPSHOT builds for years, but I do not
know how the internal build system works and whether these artifacts would be
available for internal builds or not. I also do not know whether, if
necessary, they could be made available.


[jira] [Resolved] (HBASE-26644) Spurious compaction failures with file tracker

2022-02-10 Thread Josh Elser (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Elser resolved HBASE-26644.

Resolution: Not A Problem

Yep, all good. I believe you fixed this in HBASE-26675

> Spurious compaction failures with file tracker
> --
>
> Key: HBASE-26644
> URL: https://issues.apache.org/jira/browse/HBASE-26644
> Project: HBase
>  Issue Type: Sub-task
>  Components: Compaction
>    Reporter: Josh Elser
>    Assignee: Josh Elser
>Priority: Major
>
> Noticed when running a basic {{{}hbase pe randomWrite{}}}, we'll see 
> compactions failing at various points.
> One example:
> {noformat}
> 2022-01-03 17:41:18,319 ERROR 
> [regionserver/localhost:16020-shortCompactions-0] 
> regionserver.CompactSplit(670): Compaction failed 
> region=TestTable,0004054490,1641249249856.2dc7251c6eceb660b9c7bb0b587db913.,
>  storeName=2dc7251c6eceb660b9c7bb0b587db913/info0,       priority=6, 
> startTime=1641249666161
> java.io.IOException: Root-level entries already added in single-level mode
>   at 
> org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexWriter.writeSingleLevelIndex(HFileBlockIndex.java:1136)
>   at 
> org.apache.hadoop.hbase.io.hfile.CompoundBloomFilterWriter$MetaWriter.write(CompoundBloomFilterWriter.java:279)
>   at 
> org.apache.hadoop.hbase.io.hfile.HFileWriterImpl$1.writeToBlock(HFileWriterImpl.java:713)
>   at 
> org.apache.hadoop.hbase.io.hfile.HFileBlock$Writer.writeBlock(HFileBlock.java:1205)
>   at 
> org.apache.hadoop.hbase.io.hfile.HFileWriterImpl.close(HFileWriterImpl.java:660)
>   at 
> org.apache.hadoop.hbase.regionserver.StoreFileWriter.close(StoreFileWriter.java:377)
>   at 
> org.apache.hadoop.hbase.regionserver.compactions.DefaultCompactor.commitWriter(DefaultCompactor.java:70)
>   at 
> org.apache.hadoop.hbase.regionserver.compactions.Compactor.compact(Compactor.java:386)
>   at 
> org.apache.hadoop.hbase.regionserver.compactions.DefaultCompactor.compact(DefaultCompactor.java:62)
>   at 
> org.apache.hadoop.hbase.regionserver.DefaultStoreEngine$DefaultCompactionContext.compact(DefaultStoreEngine.java:125)
>   at org.apache.hadoop.hbase.regionserver.HStore.compact(HStore.java:1141)
>   at org.apache.hadoop.hbase.regionserver.HRegion.compact(HRegion.java:2388)
>   at 
> org.apache.hadoop.hbase.regionserver.CompactSplit$CompactionRunner.doCompaction(CompactSplit.java:654)
>   at 
> org.apache.hadoop.hbase.regionserver.CompactSplit$CompactionRunner.run(CompactSplit.java:697)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)  {noformat}
> This isn't a super-critical issue because compactions will be retried 
> automatically and they appear to eventually succeed. However, when the max 
> storefiles limit is reached, this does cause ingest to hang (as I was doing 
> with my modest configuration).
> We had seen a similar kind of problem in our testing when backporting to 
> HBase 2.4 (not upstream as the decision was to not do this) which we 
> eventually tracked down to a bad merge-conflict resolution to the new HFile 
> Cleaner. However, initial investigations don't have the same exact problem.
> It seems that we have some kind of generic race condition. Would be good to 
> add more logging to catch this in the future (since we have two separate 
> instances of this category of bug already).



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HBASE-26236) Simple travis build for hbase-filesystem

2022-01-24 Thread Josh Elser (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Elser updated HBASE-26236:
---
Hadoop Flags: Reviewed
  Resolution: Fixed
  Status: Resolved  (was: Patch Available)

Thanks so much for finishing this, Peter

> Simple travis build for hbase-filesystem
> 
>
> Key: HBASE-26236
> URL: https://issues.apache.org/jira/browse/HBASE-26236
> Project: HBase
>  Issue Type: Improvement
>  Components: hboss
>    Reporter: Josh Elser
>Assignee: Peter Somogyi
>Priority: Major
> Fix For: hbase-filesystem-1.0.0-alpha2
>
>
> Noticed that we don't have any kind of precommit checks. Time to make a quick 
> one.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (HBASE-26007) java.io.IOException: Invalid token in javax.security.sasl.qop: ^DDI

2022-01-20 Thread Josh Elser (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17479746#comment-17479746
 ] 

Josh Elser commented on HBASE-26007:


Maybe Venkat can report if recompiling HBase against his version of Hadoop will 
fix the issue for him too.

IIRC, we have documented in the HBase book that it's a best practice to compile 
HBase against the specific version of Hadoop you're using.

> java.io.IOException: Invalid token in javax.security.sasl.qop: ^DDI
> ---
>
> Key: HBASE-26007
> URL: https://issues.apache.org/jira/browse/HBASE-26007
> Project: HBase
>  Issue Type: Bug
>  Components: master
>Affects Versions: 2.3.5
>Reporter: Venkat A
>Priority: Major
>
> Hi All,
> We have Hadoop 3.2.2 and HBase 2.3.5 Versions installed. (java version 
> "1.8.0_291")
>  
> While bringing up HBase master, I'm seeing following error messages in HBase 
> master log.
>  
> Other HDFS. clients like Spark,MapReduce,Solr etc are able to write HDFS but 
> HBase is unable to write its meta files in HDFS with following exceptions.
>  
> > Summary of Error logs from hbase master 
> 2021-06-15 03:57:45,968 INFO [Thread-7] hdfs.DataStreamer: Exception in 
> createBlockOutputStream
>  java.io.IOException: Invalid token in javax.security.sasl.qop: ^DD
>  2021-06-15 03:57:45,939 WARN [Thread-7] hdfs.DataStreamer: Abandoning 
> BP-1583998547-10.10.10.3-1622148262434:blk_1073743393_2570
>  2021-06-15 03:57:45,946 WARN [Thread-7] hdfs.DataStreamer: Excluding 
> datanode 
> DatanodeInfoWithStorage[10.10.10.3:50010,DS-281c3377-2bc1-47ea-8302-43108ee69430,DISK]
>  2021-06-15 03:57:45,994 WARN [Thread-7] hdfs.DataStreamer: DataStreamer 
> Exception
>  org.apache.hadoop.ipc.RemoteException(java.io.IOException): File 
> /hbase/data/data/hbase/meta/.tmp/.tableinfo.01 could only be written 
> to 0 of the 1 minReplication nodes. There are 3 datanode(s) running and 3 
> node(s) are excluded in this operation.
>  2021-06-15 03:57:46,023 INFO [Thread-9] hdfs.DataStreamer: Exception in 
> createBlockOutputStream
>  java.io.IOException: Invalid token in javax.security.sasl.qop: ^DDI
>  2021-06-15 03:57:46,035 INFO [Thread-9] hdfs.DataStreamer: Exception in 
> createBlockOutputStream
>  java.io.IOException: Invalid token in javax.security.sasl.qop: ^DD
>  2021-06-15 03:57:46,508 ERROR [main] regionserver.HRegionServer: Failed 
> construction RegionServer
>  java.io.IOException: Failed update hbase:meta table descriptor
>  2021-06-15 03:57:46,509 ERROR [main] master.HMasterCommandLine: Master 
> exiting
>  java.lang.RuntimeException: Failed construction of Master: class 
> org.apache.hadoop.hbase.master.HMaster.
>  Caused by: java.io.IOException: Failed update hbase:meta table descriptor
>  
> Not sure what is the root cause behind this. Any comments/suggestions on this 
> is much appreciated.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Resolved] (HBASE-26655) Initial commit with basic functionality and example code

2022-01-20 Thread Josh Elser (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Elser resolved HBASE-26655.

Hadoop Flags: Reviewed
  Resolution: Fixed

> Initial commit with basic functionality and example code
> 
>
> Key: HBASE-26655
> URL: https://issues.apache.org/jira/browse/HBASE-26655
> Project: HBase
>  Issue Type: Sub-task
>  Components: security
>Reporter: Andor Molnar
>Assignee: Andor Molnar
>Priority: Major
> Fix For: HBASE-26553
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Resolved] (HBASE-26687) Account for HBASE-24500 in regionInfoMismatch tool

2022-01-19 Thread Josh Elser (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Elser resolved HBASE-26687.

Hadoop Flags: Reviewed
  Resolution: Fixed

Thanks for the speedy review, Peter!

> Account for HBASE-24500 in regionInfoMismatch tool
> --
>
> Key: HBASE-26687
> URL: https://issues.apache.org/jira/browse/HBASE-26687
> Project: HBase
>  Issue Type: Bug
>  Components: hbck2
>    Reporter: Josh Elser
>    Assignee: Josh Elser
>Priority: Minor
> Fix For: hbase-operator-tools-1.3.0
>
>
> Had a coworker try to use the RegionInfoMismatch tool I added in HBASE-26656. 
> Curiously, the tool failed on the sanity check I added.
> {noformat}
> Aborting: sanity-check failed on updated RegionInfo. Expected encoded region 
> name 736ee6186975de6967cd9e9e242423f0 but got 
> 323748c77dde5b05982df0285b013232.
> Incorrectly created RegionInfo was: {ENCODED => 
> 323748c77dde5b05982df0285b013232, NAME => 
> 'test4,,1642405560420_0002.323748c77dde5b05982df0285b013232.', STARTKEY => 
> '', ENDKEY => ''}
> {noformat}
> I couldn't understand why the tool wasn't working until I hooked up a 
> debugger and realized that the problem wasn't in my code :). The version of 
> HBase on the system did not have the fix from HBASE-24500 included which 
> meant that I was hitting the same "strange behavior", as Duo put it, in the 
> RegionInfoBuilder "copy constructor".
> While the versions of HBase which do not have this fix are EOL in terms of 
> Apache releases, we can easily work around this in operator-tools (which may 
> be used by any hbase 2.x release still in the wild).
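
For context, a rough sketch of the kind of sanity check described above:
rebuild the RegionInfo field by field (rather than through the copy builder)
and verify the encoded region name is unchanged. This is illustrative only and
not the actual RegionInfoMismatch implementation.
{code:java}
import java.io.IOException;
import org.apache.hadoop.hbase.client.RegionInfo;
import org.apache.hadoop.hbase.client.RegionInfoBuilder;

final class RegionInfoSanityCheckSketch {
  // Rebuild a RegionInfo field by field and verify the encoded name still
  // matches; this is the check that tripped on pre-HBASE-24500 versions.
  static RegionInfo rebuildAndVerify(RegionInfo original) throws IOException {
    RegionInfo rebuilt = RegionInfoBuilder.newBuilder(original.getTable())
        .setStartKey(original.getStartKey())
        .setEndKey(original.getEndKey())
        .setRegionId(original.getRegionId())
        .setReplicaId(original.getReplicaId())
        .setSplit(original.isSplit())
        .setOffline(original.isOffline())
        .build();
    if (!original.getEncodedName().equals(rebuilt.getEncodedName())) {
      throw new IOException("Sanity-check failed: expected encoded region name "
          + original.getEncodedName() + " but got " + rebuilt.getEncodedName());
    }
    return rebuilt;
  }
}
{code}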



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (HBASE-26687) Account for HBASE-24500 in regionInfoMismatch tool

2022-01-19 Thread Josh Elser (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17478911#comment-17478911
 ] 

Josh Elser commented on HBASE-26687:


Looks like there isn't an easy way to test this because operator-tools cannot 
compile against HBase 2.2.5 (which was the last release which didn't have this 
fixed).
{noformat}
[ERROR] 
/.../hbase-operator-tools.git/hbase-hbck2/src/main/java/org/apache/hbase/HBCK2.java:[423,16]
 cannot find symbol
[ERROR]   symbol:   method scheduleSCPsForUnknownServers()
[ERROR]   location: variable hbck of type org.apache.hadoop.hbase.client.Hbck 
{noformat}

> Account for HBASE-24500 in regionInfoMismatch tool
> --
>
> Key: HBASE-26687
> URL: https://issues.apache.org/jira/browse/HBASE-26687
> Project: HBase
>  Issue Type: Bug
>  Components: hbck2
>    Reporter: Josh Elser
>    Assignee: Josh Elser
>Priority: Minor
> Fix For: hbase-operator-tools-1.3.0
>
>
> Had a coworker try to use the RegionInfoMismatch tool I added in HBASE-26656. 
> Curiously, the tool failed on the sanity check I added.
> {noformat}
> Aborting: sanity-check failed on updated RegionInfo. Expected encoded region 
> name 736ee6186975de6967cd9e9e242423f0 but got 
> 323748c77dde5b05982df0285b013232.
> Incorrectly created RegionInfo was: {ENCODED => 
> 323748c77dde5b05982df0285b013232, NAME => 
> 'test4,,1642405560420_0002.323748c77dde5b05982df0285b013232.', STARTKEY => 
> '', ENDKEY => ''}
> {noformat}
> I couldn't understand why the tool wasn't working until I hooked up a 
> debugger and realized that the problem wasn't in my code :). The version of 
> HBase on the system did not have the fix from HBASE-24500 included which 
> meant that I was hitting the same "strange behavior", as Duo put it, in the 
> RegionInfoBuilder "copy constructor".
> While the versions of HBase which do not have this fix are EOL in terms of 
> Apache releases, we can easily work around this in operator-tools (which may 
> be used by any hbase 2.x release still in the wild).



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (HBASE-26687) Account for HBASE-24500 in regionInfoMismatch tool

2022-01-19 Thread Josh Elser (Jira)
Josh Elser created HBASE-26687:
--

 Summary: Account for HBASE-24500 in regionInfoMismatch tool
 Key: HBASE-26687
 URL: https://issues.apache.org/jira/browse/HBASE-26687
 Project: HBase
  Issue Type: Bug
  Components: hbck2
Reporter: Josh Elser
Assignee: Josh Elser
 Fix For: hbase-operator-tools-1.3.0


Had a coworker try to use the RegionInfoMismatch tool I added in HBASE-26656. 
Curiously, the tool failed on the sanity check I added.
{noformat}
Aborting: sanity-check failed on updated RegionInfo. Expected encoded region 
name 736ee6186975de6967cd9e9e242423f0 but got 323748c77dde5b05982df0285b013232.
Incorrectly created RegionInfo was: {ENCODED => 
323748c77dde5b05982df0285b013232, NAME => 
'test4,,1642405560420_0002.323748c77dde5b05982df0285b013232.', STARTKEY => '', 
ENDKEY => ''}

{noformat}
I couldn't understand why the tool wasn't working until I hooked up a debugger 
and realized that the problem wasn't in my code :). The version of HBase on the 
system did not have the fix from HBASE-24500 included which meant that I was 
hitting the same "strange behavior", as Duo put it, in the RegionInfoBuilder 
"copy constructor".

While the versions of HBase which do not have this fix are EOL in terms of 
Apache releases, we can easily work around this in operator-tools (which may be 
used by any hbase 2.x release still in the wild).



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (HBASE-26674) TestWriteHeavyIncrementObserver is flaky

2022-01-15 Thread Josh Elser (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17476658#comment-17476658
 ] 

Josh Elser commented on HBASE-26674:


Thanks, Duo. I would guess the same root cause as HBASE-26644. I think I had 
tried to revert HBASE-26271 to see if that addressed the issue. I'll try to dig 
into this next week.

> TestWriteHeavyIncrementObserver is flaky
> 
>
> Key: HBASE-26674
> URL: https://issues.apache.org/jira/browse/HBASE-26674
> Project: HBase
>  Issue Type: Bug
>Reporter: Duo Zhang
>Priority: Major
>
> The stacktrace
> {noformat}
> java.lang.IllegalArgumentException
>   at 
> org.apache.hbase.thirdparty.com.google.common.base.Preconditions.checkArgument(Preconditions.java:131)
>   at 
> org.apache.hadoop.hbase.regionserver.compactions.SortedCompactionPolicy.getCurrentEligibleFiles(SortedCompactionPolicy.java:173)
>   at 
> org.apache.hadoop.hbase.regionserver.compactions.SortedCompactionPolicy.preSelectCompactionForCoprocessor(SortedCompactionPolicy.java:44)
>   at 
> org.apache.hadoop.hbase.regionserver.DefaultStoreEngine$DefaultCompactionContext.preSelect(DefaultStoreEngine.java:130)
>   at 
> org.apache.hadoop.hbase.regionserver.HStore.requestCompaction(HStore.java:1438)
>   at 
> org.apache.hadoop.hbase.regionserver.HStore.requestCompaction(HStore.java:1419)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegion.compact(HRegion.java:2238)
>   at 
> org.apache.hadoop.hbase.coprocessor.example.TestWriteHeavyIncrementObserver.test(TestWriteHeavyIncrementObserver.java:70)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
>   at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
>   at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>   at 
> org.apache.hadoop.hbase.SystemExitRule$1.evaluate(SystemExitRule.java:38)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:299)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:293)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at java.lang.Thread.run(Thread.java:748)
> {noformat}
> Haven't seen it flaky before. And this is a runtime exception in our non test 
> code base, which seems critical.
> Not sure if it has the same root cause with HBASE-26644.
> Need to dig more.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (HBASE-26669) Add JWT section to HBase book

2022-01-13 Thread Josh Elser (Jira)
Josh Elser created HBASE-26669:
--

 Summary: Add JWT section to HBase book
 Key: HBASE-26669
 URL: https://issues.apache.org/jira/browse/HBASE-26669
 Project: HBase
  Issue Type: Sub-task
  Components: documentation
Reporter: Josh Elser
 Fix For: HBASE-26553


Add a chapter to the hbase book about JWT authentication and everything that 
users and admins need to know.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (HBASE-26668) Define user experience for JWT renewal

2022-01-13 Thread Josh Elser (Jira)
Josh Elser created HBASE-26668:
--

 Summary: Define user experience for JWT renewal
 Key: HBASE-26668
 URL: https://issues.apache.org/jira/browse/HBASE-26668
 Project: HBase
  Issue Type: Sub-task
Reporter: Josh Elser
 Fix For: HBASE-26553


We need to define what our level of support will be for an HBase application 
which must run longer than the lifetime of a JWT token.

The JWT 2.0 RFCs mention different kinds of tokens, notably a Refresh token may 
be helpful [https://datatracker.ietf.org/doc/html/rfc8693]

This is inter-twined with HBASE-26667. For example, if we maintained a Refresh 
token in the client, we would have to build in logic (like we have for Kerberos 
credentials) to automatically launch a thread and know where to obtain a new 
JWT token from.
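
Purely as an illustration of the renewal-thread idea mentioned above, a small
sketch in which a Supplier stands in for "wherever the client obtains a new JWT
from" (which is exactly the open question):
{code:java}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Illustrative only: refresh a bearer token on a daemon thread before it
// expires, similar in spirit to Kerberos ticket renewal.
public class TokenRenewalSketch implements AutoCloseable {
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor(r -> {
        Thread t = new Thread(r, "jwt-renewer");
        t.setDaemon(true);
        return t;
      });
  private volatile String currentToken;

  public TokenRenewalSketch(Supplier<String> tokenSource, long lifetimeSeconds) {
    currentToken = tokenSource.get();
    // Renew at roughly 80% of the token lifetime.
    long renewEvery = Math.max(1, lifetimeSeconds * 8 / 10);
    scheduler.scheduleAtFixedRate(
        () -> currentToken = tokenSource.get(), renewEvery, renewEvery, TimeUnit.SECONDS);
  }

  public String getToken() {
    return currentToken;
  }

  @Override
  public void close() {
    scheduler.shutdownNow();
  }
}
{code}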



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (HBASE-26667) Integrate user-experience for hbase-client

2022-01-13 Thread Josh Elser (Jira)
Josh Elser created HBASE-26667:
--

 Summary: Integrate user-experience for hbase-client
 Key: HBASE-26667
 URL: https://issues.apache.org/jira/browse/HBASE-26667
 Project: HBase
  Issue Type: Sub-task
Reporter: Josh Elser
 Fix For: HBASE-26553


Today, we have two mechanisms for getting the tokens needed to authenticate:
 # Kerberos, we rely on a Kerberos ticket being present in a well-known 
location (defined by JVM properties) or via programmatic invocation of 
UserGroupInformation
 # Delegation tokens, we rely on special API to be called (our mapreduce API) 
which loads the token into the current UserGroupInformation "context" (the JAAS 
PrivilegedAction).

The JWT bearer token approach is very similar to the delegation token 
mechanism, but HBase does not generate this JWT (as we do with delegation 
tokens). How does a client provide this token to the hbase-client (i.e. 
{{ConnectionFactory.getConnection()}} or a {{UserGroupInformation}} call)? We 
should be mindful of all of the different "entrypoints" to HBase ({{{}hbase 
...{}}} commands, {{java -cp}} commands, Phoenix commands, Spark commands, etc). 
Our solution should be effective for all of these approaches and not require 
downstream changes.
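
For illustration, a minimal sketch of what the programmatic entrypoint could look 
like, mirroring today's delegation-token flow. The {{OAuthBearerTokenUtil}} helper 
below is an assumption standing in for whatever credential-attaching helper the 
feature ends up shipping; the rest uses existing client APIs.
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.security.User;
import org.apache.hadoop.security.UserGroupInformation;

public class JwtClientEntrypoint {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // The JWT is obtained out-of-band (env var, file, Knox call, ...); how that
    // happens is exactly the user-experience question of this issue.
    String jwt = System.getenv("HBASE_JWT");

    User user = User.create(UserGroupInformation.createRemoteUser("app-user"));
    // Hypothetical helper (not an existing API): attaches the bearer token to the
    // user's credentials, much like the mapreduce helpers attach delegation tokens.
    OAuthBearerTokenUtil.addTokenForUser(user, jwt);

    try (Connection conn = ConnectionFactory.createConnection(conf, user)) {
      // normal Table/Admin usage follows
    }
  }
}
{code}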



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (HBASE-26666) Address bearer token being sent over wire before RPC encryption is enabled

2022-01-13 Thread Josh Elser (Jira)
Josh Elser created HBASE-2:
--

 Summary: Address bearer token being sent over wire before RPC 
encryption is enabled
 Key: HBASE-2
 URL: https://issues.apache.org/jira/browse/HBASE-2
 Project: HBase
  Issue Type: Sub-task
Reporter: Josh Elser
 Fix For: HBASE-26553


Today, HBase must complete the SASL handshake (saslClient.complete()) prior to 
turning on any RPC encryption (hbase.rpc.protection=privacy, 
sasl.QOP=auth-conf).

This is a problem because we have to transmit the bearer token to the server 
before we can complete the SASL handshake. That means the bearer token (which is 
equivalent to any other password) would be transmitted insecurely, which is a bad 
smell.

Ideally, if we can solve this problem for the oauth bearer mechanism, we could 
also apply it to our delegation token interface for digest-md5 (which, I 
believe, suffers the same problem).
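
To make the ordering concrete, a minimal sketch against the plain 
{{javax.security.sasl}} API (it assumes an OAUTHBEARER SaslClient provider is 
registered, which the stock JDK does not ship):
{code:java}
import java.util.HashMap;
import java.util.Map;
import javax.security.auth.callback.CallbackHandler;
import javax.security.sasl.Sasl;
import javax.security.sasl.SaslClient;

public class SaslOrderingSketch {
  public static void main(String[] args) throws Exception {
    Map<String, String> props = new HashMap<>();
    props.put(Sasl.QOP, "auth-conf"); // what hbase.rpc.protection=privacy requests

    CallbackHandler handler = callbacks -> { /* would supply the bearer token */ };
    // Returns null unless an OAUTHBEARER provider is registered on the classpath.
    SaslClient sc = Sasl.createSaslClient(new String[] { "OAUTHBEARER" }, null, "hbase",
      "regionserver.example.com", props, handler);

    // The initial response already embeds the bearer token...
    byte[] initial = sc.hasInitialResponse() ? sc.evaluateChallenge(new byte[0]) : new byte[0];

    // ...but sc.wrap()/sc.unwrap() (the auth-conf protection) only becomes usable
    // once sc.isComplete() returns true, i.e. after "initial" has already crossed
    // the wire, so the token itself is never covered by the negotiated QOP.
    System.out.println("initial response bytes: " + initial.length);
  }
}
{code}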



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HBASE-26665) Standalone unit test in hbase-examples

2022-01-13 Thread Josh Elser (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Elser updated HBASE-26665:
---
Fix Version/s: HBASE-26553

> Standalone unit test in hbase-examples
> --
>
> Key: HBASE-26665
> URL: https://issues.apache.org/jira/browse/HBASE-26665
> Project: HBase
>  Issue Type: Sub-task
>    Reporter: Josh Elser
>Assignee: Andor Molnar
>Priority: Major
> Fix For: HBASE-26553
>
>
> Andor is already working on this with nimbus, but filing this for him.
> We should have a unit test which exercises the oauth bearer authentication 
> mechanism so that we know if the feature is functional at a basic level 
> (without having to set up an OAuth server).



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HBASE-26655) Initial commit with basic functionality and example code

2022-01-13 Thread Josh Elser (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Elser updated HBASE-26655:
---
Fix Version/s: HBASE-26553

> Initial commit with basic functionality and example code
> 
>
> Key: HBASE-26655
> URL: https://issues.apache.org/jira/browse/HBASE-26655
> Project: HBase
>  Issue Type: Sub-task
>  Components: security
>Reporter: Andor Molnar
>Assignee: Andor Molnar
>Priority: Major
> Fix For: HBASE-26553
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HBASE-26553) OAuth Bearer authentication mech plugin for SASL

2022-01-13 Thread Josh Elser (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Elser updated HBASE-26553:
---
Fix Version/s: HBASE-26553

> OAuth Bearer authentication mech plugin for SASL
> 
>
> Key: HBASE-26553
> URL: https://issues.apache.org/jira/browse/HBASE-26553
> Project: HBase
>  Issue Type: New Feature
>  Components: security
>Reporter: Andor Molnar
>Assignee: Andor Molnar
>Priority: Major
> Fix For: HBASE-26553
>
>
> Implementation of a new SASL plugin to add support for OAuth Bearer token 
> authentication for HBase client RPC.
>  * The plugin supports secured (cryptographically signed) JSON Web Token 
> authentication as defined in 
> [RFC-7628|https://datatracker.ietf.org/doc/html/rfc7628]  and the JWT format 
> in [RFC-7519|https://datatracker.ietf.org/doc/html/rfc7519] .
>  * The implementation is inspired by [Apache Kafka's OAuth Bearer 
> token|https://docs.confluent.io/platform/current/kafka/authentication_sasl/authentication_sasl_oauth.html]
> support, with the important difference that the HBase version is intended for 
> production usage. The two main differences are that Kafka supports only unsecured 
> tokens and that it issues the tokens for itself, which breaks the principle of 
> OAuth token authentication.
>  * We use the [Nimbus JOSE + 
> JWT|https://bitbucket.org/connect2id/nimbus-jose-jwt/src/master/] Java 
> library for signature verification and token processing and we add it as a 
> new dependency to HBase.
>  * We add secure JWT support and verification of digital signatures with 
> multiple algorithms as supported by Nimbus. Json-formatted JWK set is 
> required for the signature verification as defined in 
> [RFC-7517|https://datatracker.ietf.org/doc/html/rfc7517].
>  * The impl is verified with Apache Knox issued tokens, because that's the 
> primary use case of this new feature.
>  * New client example is added to the hbase-examples project to showcase the 
> feature.
>  * It's important that this Jira does not cover the solution for obtaining a 
> token from Knox. The assumption is that the client already has a valid token 
> as a base64-encoded string and we only provide a helper method for adding it to 
> the user's credentials.
>  * Renewing expired tokens is also the responsibility of the client. We don't 
> provide a mechanism for that in this Jira, but it's planned to be covered in 
> a follow-up ticket.
> The following new parameters are introduced in hbase-site.xml:
>  * hbase.security.oauth.jwt.jwks.file - Path of a local file for JWK set. 
> (required if URL not specified)
>  * hbase.security.oauth.jwt.jwks.url - URL to download the JWK set. (required 
> if File not specified)
>  * hbase.security.oauth.jwt.audience - Required audience, "aud" claim of the 
> JWT. (optional)
>  * hbase.security.oauth.jwt.issuer - Required issuer, "iss" claim of the JWT. 
> (optional)
> The feature will be behind a feature flag. No code in this feature is executed 
> unless the following configuration is set in hbase-site.xml:
> {noformat}
>   
>     hbase.client.sasl.provider.extras
>     
> org.apache.hadoop.hbase.security.provider.OAuthBearerSaslClientAuthenticationProvider
>   
>   
>     hbase.server.sasl.provider.extras
>     
> org.apache.hadoop.hbase.security.provider.OAuthBearerSaslServerAuthenticationProvider
>   
>   
>     hbase.client.sasl.provider.class
>     
> org.apache.hadoop.hbase.security.provider.OAuthBearerSaslProviderSelector
>   
> {noformat}
> Example of Knox provided JWKS file:
> {noformat}
> {
>   "keys":
>   [{
> "kty": "RSA",
> "e": "",
> "use": "sig",
> "kid": "",
> "alg": "RS256",
> "n": ""
>   }]
> }{noformat}
> Example of Knox issued JWT header:
> {noformat}
> {
> "jku": "https://path/to/homepage/knoxtoken/api/v1/jwks.json;,
> "kid": "",
> "alg": "RS256"
> }{noformat}
> And payload:
> {noformat}
> {
>   "sub": "user_andor",
>   "aud": "knox-proxy-token",
>   "jku": "https://path/to/homepage/knoxtoken/api/v1/jwks.json;,
>   "kid": "",
>   "iss": "KNOXSSO",
>   "exp": 1636644029,
>   "managed.token": "true",
>   "knox.id": ""
> }{noformat}
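
A minimal sketch of the Nimbus calls involved in the JWKS-based verification 
described above (illustrative only, not the provider code from this change):
{code:java}
import java.io.File;
import com.nimbusds.jose.JWSAlgorithm;
import com.nimbusds.jose.jwk.JWKSet;
import com.nimbusds.jose.jwk.source.ImmutableJWKSet;
import com.nimbusds.jose.proc.JWSVerificationKeySelector;
import com.nimbusds.jose.proc.SecurityContext;
import com.nimbusds.jwt.JWTClaimsSet;
import com.nimbusds.jwt.proc.DefaultJWTProcessor;

public class JwtVerificationSketch {
  public static JWTClaimsSet verify(String jwt, File jwksFile,
      String requiredIssuer, String requiredAudience) throws Exception {
    // Corresponds to hbase.security.oauth.jwt.jwks.file
    JWKSet jwks = JWKSet.load(jwksFile);

    DefaultJWTProcessor<SecurityContext> processor = new DefaultJWTProcessor<>();
    processor.setJWSKeySelector(
      new JWSVerificationKeySelector<>(JWSAlgorithm.RS256, new ImmutableJWKSet<>(jwks)));

    // Parses the compact JWT, selects the key by "kid" and checks the signature.
    JWTClaimsSet claims = processor.process(jwt, null);

    // Corresponds to hbase.security.oauth.jwt.issuer / .audience (both optional)
    if (requiredIssuer != null && !requiredIssuer.equals(claims.getIssuer())) {
      throw new SecurityException("Unexpected issuer: " + claims.getIssuer());
    }
    if (requiredAudience != null && !claims.getAudience().contains(requiredAudience)) {
      throw new SecurityException("Unexpected audience: " + claims.getAudience());
    }
    return claims;
  }
}
{code}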



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (HBASE-26665) Standalone unit test in hbase-examples

2022-01-13 Thread Josh Elser (Jira)
Josh Elser created HBASE-26665:
--

 Summary: Standalone unit test in hbase-examples
 Key: HBASE-26665
 URL: https://issues.apache.org/jira/browse/HBASE-26665
 Project: HBase
  Issue Type: Sub-task
Reporter: Josh Elser
Assignee: Andor Molnar


Andor is already working on this with nimbus, but filing this for him.

We should have a unit test which exercises the oauth bearer authentication 
mechanism so that we know if the feature is functional at a basic level 
(without having to set up an OAuth server).
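
A minimal sketch, assuming only the Nimbus library, of how such a test could mint 
its own key pair, JWK set and signed token without any external OAuth server 
(names are illustrative, not the actual test code):
{code:java}
import java.util.Date;
import com.nimbusds.jose.JWSAlgorithm;
import com.nimbusds.jose.JWSHeader;
import com.nimbusds.jose.crypto.RSASSASigner;
import com.nimbusds.jose.jwk.JWKSet;
import com.nimbusds.jose.jwk.RSAKey;
import com.nimbusds.jose.jwk.gen.RSAKeyGenerator;
import com.nimbusds.jwt.JWTClaimsSet;
import com.nimbusds.jwt.SignedJWT;

public class TestJwtFixture {
  public static void main(String[] args) throws Exception {
    // Key pair generated inside the test; no external OAuth server needed.
    RSAKey key = new RSAKeyGenerator(2048).keyID("test-key").generate();

    // JWK set (public part only) the server-side provider would be pointed at.
    String jwksJson = new JWKSet(key.toPublicJWK()).toString();

    JWTClaimsSet claims = new JWTClaimsSet.Builder()
      .subject("test-user")
      .issuer("KNOXSSO")
      .audience("knox-proxy-token")
      .expirationTime(new Date(System.currentTimeMillis() + 60_000))
      .build();

    SignedJWT jwt = new SignedJWT(
      new JWSHeader.Builder(JWSAlgorithm.RS256).keyID(key.getKeyID()).build(), claims);
    jwt.sign(new RSASSASigner(key));

    String token = jwt.serialize(); // handed to the client-side provider in the test
    System.out.println(jwksJson);
    System.out.println(token);
  }
}
{code}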



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HBASE-26656) [operator-tools] Provide a utility to detect and correct incorrect RegionInfo's in hbase:meta

2022-01-12 Thread Josh Elser (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Elser updated HBASE-26656:
---
Hadoop Flags: Reviewed
Release Note: This tool will read hbase:meta and report any regions whose 
rowkey and cell value differ in their encoded region name. HBASE-23328 
illustrates a problem for read-replica enabled tables in which the encoded 
region name (the MD5 hash) does not match between the rowkey and the value. 
This problem is generally harmless for normal operation, but can break other 
HBCK2 tools.
  Resolution: Fixed
  Status: Resolved  (was: Patch Available)

Thanks for your reviews, Peter.

> [operator-tools] Provide a utility to detect and correct incorrect 
> RegionInfo's in hbase:meta
> -
>
> Key: HBASE-26656
> URL: https://issues.apache.org/jira/browse/HBASE-26656
> Project: HBase
>  Issue Type: Improvement
>  Components: hbase-operator-tools, hbck2
>    Reporter: Josh Elser
>Assignee: Josh Elser
>Priority: Major
> Fix For: hbase-operator-tools-1.3.0
>
>
> HBASE-23328 describes a problem in which the serialized RegionInfo in the 
> value of hbase:meta cells has an encoded region name which doesn't match the 
> encoded region name in the rowkey for that cell.
> This problem is normally harmless as assignment only consults the rowkey to 
> get the encoded region name. However, this problem does break other HBCK2 
> tooling, like {{{}extraRegionsInMeta{}}}. 
> Rather than try to update each tool to account for when this problem may be 
> present, create a new tool which an operator can run to correct meta and then 
> use any subsequent tools as originally intended.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Assigned] (HBASE-26644) Spurious compaction failures with file tracker

2022-01-10 Thread Josh Elser (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Elser reassigned HBASE-26644:
--

Assignee: Josh Elser

> Spurious compaction failures with file tracker
> --
>
> Key: HBASE-26644
> URL: https://issues.apache.org/jira/browse/HBASE-26644
> Project: HBase
>  Issue Type: Sub-task
>  Components: Compaction
>    Reporter: Josh Elser
>    Assignee: Josh Elser
>Priority: Major
>
> Noticed when running a basic {{{}hbase pe randomWrite{}}}, we'll see 
> compactions failing at various points.
> One example:
> {noformat}
> 2022-01-03 17:41:18,319 ERROR 
> [regionserver/localhost:16020-shortCompactions-0] 
> regionserver.CompactSplit(670): Compaction failed 
> region=TestTable,0004054490,1641249249856.2dc7251c6eceb660b9c7bb0b587db913.,
>  storeName=2dc7251c6eceb660b9c7bb0b587db913/info0,       priority=6, 
> startTime=1641249666161
> java.io.IOException: Root-level entries already added in single-level mode
>   at 
> org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexWriter.writeSingleLevelIndex(HFileBlockIndex.java:1136)
>   at 
> org.apache.hadoop.hbase.io.hfile.CompoundBloomFilterWriter$MetaWriter.write(CompoundBloomFilterWriter.java:279)
>   at 
> org.apache.hadoop.hbase.io.hfile.HFileWriterImpl$1.writeToBlock(HFileWriterImpl.java:713)
>   at 
> org.apache.hadoop.hbase.io.hfile.HFileBlock$Writer.writeBlock(HFileBlock.java:1205)
>   at 
> org.apache.hadoop.hbase.io.hfile.HFileWriterImpl.close(HFileWriterImpl.java:660)
>   at 
> org.apache.hadoop.hbase.regionserver.StoreFileWriter.close(StoreFileWriter.java:377)
>   at 
> org.apache.hadoop.hbase.regionserver.compactions.DefaultCompactor.commitWriter(DefaultCompactor.java:70)
>   at 
> org.apache.hadoop.hbase.regionserver.compactions.Compactor.compact(Compactor.java:386)
>   at 
> org.apache.hadoop.hbase.regionserver.compactions.DefaultCompactor.compact(DefaultCompactor.java:62)
>   at 
> org.apache.hadoop.hbase.regionserver.DefaultStoreEngine$DefaultCompactionContext.compact(DefaultStoreEngine.java:125)
>   at org.apache.hadoop.hbase.regionserver.HStore.compact(HStore.java:1141)
>   at org.apache.hadoop.hbase.regionserver.HRegion.compact(HRegion.java:2388)
>   at 
> org.apache.hadoop.hbase.regionserver.CompactSplit$CompactionRunner.doCompaction(CompactSplit.java:654)
>   at 
> org.apache.hadoop.hbase.regionserver.CompactSplit$CompactionRunner.run(CompactSplit.java:697)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)  {noformat}
> This isn't a super-critical issue because compactions will be retried 
> automatically and they appear to eventually succeed. However, when the max 
> storefiles limit is reached, this does cause ingest to hang (as happened with 
> my modest configuration).
> We had seen a similar kind of problem in our testing when backporting to 
> HBase 2.4 (not upstream as the decision was to not do this) which we 
> eventually tracked down to a bad merge-conflict resolution to the new HFile 
> Cleaner. However, initial investigations don't have the same exact problem.
> It seems that we have some kind of generic race condition. Would be good to 
> add more logging to catch this in the future (since we have two separate 
> instances of this category of bug already).



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (HBASE-26644) Spurious compaction failures with file tracker

2022-01-10 Thread Josh Elser (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17472375#comment-17472375
 ] 

Josh Elser commented on HBASE-26644:


No, sorry. Have been pulled into other stuff. I'll try to come back here.

> Spurious compaction failures with file tracker
> --
>
> Key: HBASE-26644
> URL: https://issues.apache.org/jira/browse/HBASE-26644
> Project: HBase
>  Issue Type: Sub-task
>  Components: Compaction
>    Reporter: Josh Elser
>Priority: Major
>
> Noticed when running a basic {{{}hbase pe randomWrite{}}}, we'll see 
> compactions failing at various points.
> One example:
> {noformat}
> 2022-01-03 17:41:18,319 ERROR 
> [regionserver/localhost:16020-shortCompactions-0] 
> regionserver.CompactSplit(670): Compaction failed 
> region=TestTable,0004054490,1641249249856.2dc7251c6eceb660b9c7bb0b587db913.,
>  storeName=2dc7251c6eceb660b9c7bb0b587db913/info0,       priority=6, 
> startTime=1641249666161
> java.io.IOException: Root-level entries already added in single-level mode
>   at 
> org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexWriter.writeSingleLevelIndex(HFileBlockIndex.java:1136)
>   at 
> org.apache.hadoop.hbase.io.hfile.CompoundBloomFilterWriter$MetaWriter.write(CompoundBloomFilterWriter.java:279)
>   at 
> org.apache.hadoop.hbase.io.hfile.HFileWriterImpl$1.writeToBlock(HFileWriterImpl.java:713)
>   at 
> org.apache.hadoop.hbase.io.hfile.HFileBlock$Writer.writeBlock(HFileBlock.java:1205)
>   at 
> org.apache.hadoop.hbase.io.hfile.HFileWriterImpl.close(HFileWriterImpl.java:660)
>   at 
> org.apache.hadoop.hbase.regionserver.StoreFileWriter.close(StoreFileWriter.java:377)
>   at 
> org.apache.hadoop.hbase.regionserver.compactions.DefaultCompactor.commitWriter(DefaultCompactor.java:70)
>   at 
> org.apache.hadoop.hbase.regionserver.compactions.Compactor.compact(Compactor.java:386)
>   at 
> org.apache.hadoop.hbase.regionserver.compactions.DefaultCompactor.compact(DefaultCompactor.java:62)
>   at 
> org.apache.hadoop.hbase.regionserver.DefaultStoreEngine$DefaultCompactionContext.compact(DefaultStoreEngine.java:125)
>   at org.apache.hadoop.hbase.regionserver.HStore.compact(HStore.java:1141)
>   at org.apache.hadoop.hbase.regionserver.HRegion.compact(HRegion.java:2388)
>   at 
> org.apache.hadoop.hbase.regionserver.CompactSplit$CompactionRunner.doCompaction(CompactSplit.java:654)
>   at 
> org.apache.hadoop.hbase.regionserver.CompactSplit$CompactionRunner.run(CompactSplit.java:697)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)  {noformat}
> This isn't a super-critical issue because compactions will be retried 
> automatically and they appear to eventually succeed. However, when the max 
> storefiles limit is reached, this does cause ingest to hang (as happened with 
> my modest configuration).
> We had seen a similar kind of problem in our testing when backporting to 
> HBase 2.4 (not upstream as the decision was to not do this) which we 
> eventually tracked down to a bad merge-conflict resolution to the new HFile 
> Cleaner. However, initial investigations don't have the same exact problem.
> It seems that we have some kind of generic race condition. Would be good to 
> add more logging to catch this in the future (since we have two separate 
> instances of this category of bug already).



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HBASE-26656) [operator-tools] Provide a utility to detect and correct incorrect RegionInfo's in hbase:meta

2022-01-10 Thread Josh Elser (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Elser updated HBASE-26656:
---
Status: Patch Available  (was: Open)

> [operator-tools] Provide a utility to detect and correct incorrect 
> RegionInfo's in hbase:meta
> -
>
> Key: HBASE-26656
> URL: https://issues.apache.org/jira/browse/HBASE-26656
> Project: HBase
>  Issue Type: Improvement
>  Components: hbase-operator-tools, hbck2
>    Reporter: Josh Elser
>Assignee: Josh Elser
>Priority: Major
> Fix For: hbase-operator-tools-2.0.0
>
>
> HBASE-23328 describes a problem in which the serialized RegionInfo in the 
> value of hbase:meta cells has an encoded region name which doesn't match the 
> encoded region name in the rowkey for that cell.
> This problem is normally harmless as assignment only consults the rowkey to 
> get the encoded region name. However, this problem does break other HBCK2 
> tooling, like {{{}extraRegionsInMeta{}}}. 
> Rather than try to update each tool to account for when this problem may be 
> present, create a new tool which an operator can run to correct meta and then 
> use any subsequent tools as originally intended.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (HBASE-26656) [operator-tools] Provide a utility to detect and correct incorrect RegionInfo's in hbase:meta

2022-01-10 Thread Josh Elser (Jira)
Josh Elser created HBASE-26656:
--

 Summary: [operator-tools] Provide a utility to detect and correct 
incorrect RegionInfo's in hbase:meta
 Key: HBASE-26656
 URL: https://issues.apache.org/jira/browse/HBASE-26656
 Project: HBase
  Issue Type: Improvement
  Components: hbase-operator-tools, hbck2
Reporter: Josh Elser
Assignee: Josh Elser
 Fix For: hbase-operator-tools-2.0.0


HBASE-23328 describes a problem in which the serialized RegionInfo in the value 
of hbase:meta cells has an encoded region name which doesn't match the encoded 
region name in the rowkey for that cell.

This problem is normally harmless as assignment only consults the rowkey to get 
the encoded region name. However, this problem does break other HBCK2 tooling, 
like {{{}extraRegionsInMeta{}}}. 

Rather than try to update each tool to account for when this problem may be 
present, create a new tool which an operator can run to correct meta and then 
use any subsequent tools as originally intended.
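
A rough sketch of the detection half of such a tool, under the assumption that 
comparing the encoded name derived from the rowkey with the one inside the 
serialized RegionInfo is sufficient (the real operator-tools implementation may 
differ):
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HConstants;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.RegionInfo;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class MetaRegionInfoMismatchReport {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Scan scan = new Scan().addColumn(HConstants.CATALOG_FAMILY, HConstants.REGIONINFO_QUALIFIER);
    try (Connection conn = ConnectionFactory.createConnection(conf);
        Table meta = conn.getTable(TableName.META_TABLE_NAME);
        ResultScanner scanner = meta.getScanner(scan)) {
      for (Result r : scanner) {
        byte[] value = r.getValue(HConstants.CATALOG_FAMILY, HConstants.REGIONINFO_QUALIFIER);
        RegionInfo fromValue = RegionInfo.parseFromOrNull(value);
        if (fromValue == null) {
          continue; // skip cells we cannot parse
        }
        // Encoded name derived from the rowkey vs. the one serialized in the value.
        String fromRowKey = RegionInfo.encodeRegionName(r.getRow());
        if (!fromRowKey.equals(fromValue.getEncodedName())) {
          System.out.println("Mismatch at " + Bytes.toStringBinary(r.getRow()) + ": rowkey="
            + fromRowKey + " value=" + fromValue.getEncodedName());
        }
      }
    }
  }
}
{code}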



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (HBASE-26469) correct HBase shell exit behavior to match code passed to exit

2022-01-09 Thread Josh Elser (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17471356#comment-17471356
 ] 

Josh Elser commented on HBASE-26469:


Pardon the brevity: I was just also agreeing with Mike's point and the 
discussion you had with him. I agree that the second table immediately above is 
the correct thing to do (matching earlier 2.x).

> correct HBase shell exit behavior to match code passed to exit
> --
>
> Key: HBASE-26469
> URL: https://issues.apache.org/jira/browse/HBASE-26469
> Project: HBase
>  Issue Type: Bug
>  Components: shell
>Affects Versions: 2.5.0, 3.0.0-alpha-2, 2.4.8
>Reporter: Sean Busbey
>Assignee: Sean Busbey
>Priority: Critical
> Fix For: 2.5.0, 2.6.0, 3.0.0-alpha-3, 2.4.10
>
> Attachments: hbase-1.4.14-exit-behavior.log, 
> hbase-1.7.1-exit-behavior.log, hbase-2.0.6-exit-behavior.log, 
> hbase-2.1.9-exit-behavior.log, hbase-2.2.7-exit-behavior.log, 
> hbase-2.3.7-exit-behavior.log, hbase-2.4.8-exit-behavior.log, 
> hbase-3.0.0-alpha-2-exit-behavior.log
>
>
> The HBase shell has changed behavior in a way that breaks being able to exit 
> properly.
> Two example scripts to act as stand ins for hbase shell scripts to "do 
> something simple then exit":
> {code}
> tmp % echo "list\nexit" > clean_exit.rb
> tmp % echo "list\nexit 1" > error_exit.rb
> {code}
> Giving these two scripts is possible:
> * passed as a cli argument
> * via redirected stdin
> Additionally the shell invocation can be:
> * in the default compatibility mode
> * with the "non interactive" flag that gives an exit code that reflects 
> runtime errors
> I'll post logs of the details as attachments but here are some tables of the 
> exit codes.
> The {{clean_exit.rb}} invocations ought to exit with success, exit code 0.
> || || 1.4.14 || 1.7.1 || 2.0.6 || 2.1.9 || 2.2.7 || 2.3.7 || 2.4.8 || master ||
> | cli, default | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1* |
> | cli, -n | 0 | 0 | 0 | 0 | 0 | 0 | 1 | hang |
> | stdin, default | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
> | stdin, -n | 1 | 1 | 1 | 1 | 1 | 1 | 1* | 1* |
> The {{error_exit.rb}} invocation should return a non-zero exit code, unless 
> we're specifically trying to match a normal hbase shell session.
> || || 1.4.14 || 1.7.1 || 2.0.6 || 2.1.9 || 2.2.7 || 2.3.7 || 2.4.8 || master ||
> | cli, default | 1 | 1 | 1 | 1 | 1 | 1 | 1* | 1* |
> | cli, -n | 1 | 1 | 1 | 1 | 1 | 1 | 1* | hang |
> | stdin, default | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
> | stdin, -n | 1 | 1 | 1 | 1 | 1 | 1 | 1* | 1* |
> In cases marked with * the error details are different.
> The biggest concern are the new-to-2.4 non-zero exit code when we should have 
> a success and the hanging.
> The former looks like this:
> {code}
> ERROR NoMethodError: private method `exit' called for nil:NilClass
> {code}
> The change in error details for the error exit script also shows this same 
> detail.
> This behavior appears to be a side effect of HBASE-11686. As far as I can 
> tell, the IRB handling of 'exit' calls fail because we implement our own 
> handling of sessoins rather than rely on the intended session interface. We 
> never set a current session, and IRB's exit implementation presumes there 
> will be one.
> Running in debug shows this in a stacktrace:
> {code}
> Took 0.4563 seconds
> ERROR NoMethodError: private method `exit' called for nil:NilClass
> NoMethodError: private method `exit' called for nil:NilClass
>  irb_exit at 
> uri:classloader:/META-INF/jruby.home/lib/ruby/stdlib/irb/extend-command.rb:30
>  evaluate at stdin:2
>  eval at org/jruby/RubyKernel.java:1048
>  evaluate at 
> uri:classloader:/META-INF/jruby.home/lib/ruby/stdlib/irb/workspace.rb:85
>   eval_io at uri:classloader:/shell.rb:327
>  each_top_level_statement at 
> uri:classloader:/META-INF/jruby.home/lib/ruby/stdlib/irb/ruby-lex.rb:246
>  loop at org/jruby/RubyKernel.java:1442
>  each_top_level_statement at 
> uri:classloader:/META-INF/jruby.home/lib/ruby/stdlib/irb/ru

[jira] [Commented] (HBASE-26469) HBase shell has changed exit behavior

2022-01-07 Thread Josh Elser (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17470964#comment-17470964
 ] 

Josh Elser commented on HBASE-26469:


Read through everything (including the original HBASE-11655 doc) and agree with 
the plan set here. Moreso than anything, for a patch release, we shouldn't be 
changing functionality (especially if folks are running in a script that 
{{{}set -e{}}}'s).

 
{quote}cli, default, error exit code (top right cell) should return 1. in patch 
it returns 0. historically this has returned 1. I do not believe we are helping 
operations folks by changing what this mode returns.
{quote}
This also makes sense to me.

> HBase shell has changed exit behavior
> -
>
> Key: HBASE-26469
> URL: https://issues.apache.org/jira/browse/HBASE-26469
> Project: HBase
>  Issue Type: Bug
>  Components: shell
>Affects Versions: 2.5.0, 3.0.0-alpha-2, 2.4.8
>Reporter: Sean Busbey
>Assignee: Sean Busbey
>Priority: Critical
> Fix For: 2.5.0, 2.6.0, 2.4.10
>
> Attachments: hbase-1.4.14-exit-behavior.log, 
> hbase-1.7.1-exit-behavior.log, hbase-2.0.6-exit-behavior.log, 
> hbase-2.1.9-exit-behavior.log, hbase-2.2.7-exit-behavior.log, 
> hbase-2.3.7-exit-behavior.log, hbase-2.4.8-exit-behavior.log, 
> hbase-3.0.0-alpha-2-exit-behavior.log
>
>
> The HBase shell has changed behavior in a way that breaks being able to exit 
> properly.
> Two example scripts to act as stand ins for hbase shell scripts to "do 
> something simple then exit":
> {code}
> tmp % echo "list\nexit" > clean_exit.rb
> tmp % echo "list\nexit 1" > error_exit.rb
> {code}
> Giving these two scripts is possible:
> * passed as a cli argument
> * via redirected stdin
> Additionally the shell invocation can be:
> * in the default compatibility mode
> * with the "non interactive" flag that gives an exit code that reflects 
> runtime errors
> I'll post logs of the details as attachments but here are some tables of the 
> exit codes.
> The {{clean_exit.rb}} invocations ought to exit with success, exit code 0.
> || || 1.4.14 || 1.7.1 || 2.0.6 || 2.1.9 || 2.2.7 || 2.3.7 || 2.4.8 || master ||
> | cli, default | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1* |
> | cli, -n | 0 | 0 | 0 | 0 | 0 | 0 | 1 | hang |
> | stdin, default | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
> | stdin, -n | 1 | 1 | 1 | 1 | 1 | 1 | 1* | 1* |
> The {{error_exit.rb}} invocation should return a non-zero exit code, unless 
> we're specifically trying to match a normal hbase shell session.
> || || 1.4.14 || 1.7.1 || 2.0.6 || 2.1.9 || 2.2.7 || 2.3.7 || 2.4.8 || master ||
> | cli, default | 1 | 1 | 1 | 1 | 1 | 1 | 1* | 1* |
> | cli, -n | 1 | 1 | 1 | 1 | 1 | 1 | 1* | hang |
> | stdin, default | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
> | stdin, -n | 1 | 1 | 1 | 1 | 1 | 1 | 1* | 1* |
> In cases marked with * the error details are different.
> The biggest concern are the new-to-2.4 non-zero exit code when we should have 
> a success and the hanging.
> The former looks like this:
> {code}
> ERROR NoMethodError: private method `exit' called for nil:NilClass
> {code}
> The change in error details for the error exit script also shows this same 
> detail.
> This behavior appears to be a side effect of HBASE-11686. As far as I can 
> tell, the IRB handling of 'exit' calls fail because we implement our own 
> handling of sessoins rather than rely on the intended session interface. We 
> never set a current session, and IRB's exit implementation presumes there 
> will be one.
> Running in debug shows this in a stacktrace:
> {code}
> Took 0.4563 seconds
> ERROR NoMethodError: private method `exit' called for nil:NilClass
> NoMethodError: private method `exit' called for nil:NilClass
>  irb_exit at 
> uri:classloader:/META-INF/jruby.home/lib/ruby/stdlib/irb/extend-command.rb:30
>  evaluate at stdin:2
>  eval at org/jruby/RubyKernel.java:1048
>  evaluate at 
> uri:classloader:/META-INF/jruby.home/lib/ruby/stdlib/irb/workspace.rb:85
>   eval_io at uri:classloader:/shell.rb:327
>  each_top_level_statement at 
> ur

[jira] [Created] (HBASE-26644) Spurious compaction failures with file tracker

2022-01-04 Thread Josh Elser (Jira)
Josh Elser created HBASE-26644:
--

 Summary: Spurious compaction failures with file tracker
 Key: HBASE-26644
 URL: https://issues.apache.org/jira/browse/HBASE-26644
 Project: HBase
  Issue Type: Sub-task
Reporter: Josh Elser


Noticed when running a basic {{{}hbase pe randomWrite{}}}, we'll see 
compactions failing at various points.

One example:
{noformat}
2022-01-03 17:41:18,319 ERROR [regionserver/localhost:16020-shortCompactions-0] 
regionserver.CompactSplit(670): Compaction failed 
region=TestTable,0004054490,1641249249856.2dc7251c6eceb660b9c7bb0b587db913.,
 storeName=2dc7251c6eceb660b9c7bb0b587db913/info0,       priority=6, 
startTime=1641249666161
java.io.IOException: Root-level entries already added in single-level mode
  at 
org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexWriter.writeSingleLevelIndex(HFileBlockIndex.java:1136)
  at 
org.apache.hadoop.hbase.io.hfile.CompoundBloomFilterWriter$MetaWriter.write(CompoundBloomFilterWriter.java:279)
  at 
org.apache.hadoop.hbase.io.hfile.HFileWriterImpl$1.writeToBlock(HFileWriterImpl.java:713)
  at 
org.apache.hadoop.hbase.io.hfile.HFileBlock$Writer.writeBlock(HFileBlock.java:1205)
  at 
org.apache.hadoop.hbase.io.hfile.HFileWriterImpl.close(HFileWriterImpl.java:660)
  at 
org.apache.hadoop.hbase.regionserver.StoreFileWriter.close(StoreFileWriter.java:377)
  at 
org.apache.hadoop.hbase.regionserver.compactions.DefaultCompactor.commitWriter(DefaultCompactor.java:70)
  at 
org.apache.hadoop.hbase.regionserver.compactions.Compactor.compact(Compactor.java:386)
  at 
org.apache.hadoop.hbase.regionserver.compactions.DefaultCompactor.compact(DefaultCompactor.java:62)
  at 
org.apache.hadoop.hbase.regionserver.DefaultStoreEngine$DefaultCompactionContext.compact(DefaultStoreEngine.java:125)
  at org.apache.hadoop.hbase.regionserver.HStore.compact(HStore.java:1141)
  at org.apache.hadoop.hbase.regionserver.HRegion.compact(HRegion.java:2388)
  at 
org.apache.hadoop.hbase.regionserver.CompactSplit$CompactionRunner.doCompaction(CompactSplit.java:654)
  at 
org.apache.hadoop.hbase.regionserver.CompactSplit$CompactionRunner.run(CompactSplit.java:697)
  at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:748)  {noformat}
This isn't a super-critical issue because compactions will be retried 
automatically and they appear to eventually succeed. However, when the max 
storefiles limit is reached, this does cause ingest to hang (as happened with my 
modest configuration).

We had seen a similar kind of problem in our testing when backporting to HBase 
2.4 (not upstream as the decision was to not do this) which we eventually 
tracked down to a bad merge-conflict resolution to the new HFile Cleaner. 
However, initial investigations don't have the same exact problem.

It seems that we have some kind of generic race condition. Would be good to add 
more logging to catch this in the future (since we have two separate instances 
of this category of bug already).



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (HBASE-26067) Change the way on how we track store file list

2022-01-04 Thread Josh Elser (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17468703#comment-17468703
 ] 

Josh Elser commented on HBASE-26067:


{quote}Let me go for a merge and then resolve this issue
{quote}
Yeah, looks like this only affects the FILE tracker. Interesting. Thanks for 
merging, Duo! Will pick up more over in HBASE-26584.

> Change the way on how we track store file list
> --
>
> Key: HBASE-26067
> URL: https://issues.apache.org/jira/browse/HBASE-26067
> Project: HBase
>  Issue Type: Umbrella
>  Components: HFile
>Reporter: Duo Zhang
>Assignee: Duo Zhang
>Priority: Major
> Fix For: 2.6.0, 3.0.0-alpha-3
>
>
> Open a separated jira to track the work since it can not be fully included in 
> HBASE-24749.
> I think this could be a landed prior to HBASE-24749, as if this works, we 
> could have different implementations for tracking store file list.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


Re: [DISCUSS] The future of Tephra

2022-01-04 Thread Josh Elser
Agreed. As the person who did the work of pulling Tephra in from the 
incubator, I think we were already then in the state of "does someone 
actually care about Tephra?".


Without digging into the archives, I think someone was interested, but 
it seems like this never manifested.


+1 to remove Tephra integration from Phoenix.

On 1/3/22 1:38 PM, Viraj Jasani wrote:

+1 (unless any volunteer comes forward to support Tephra going forward)


On Mon, 3 Jan 2022 at 4:34 PM, Istvan Toth  wrote:


Hi!

As recently noticed by Lars, Tephra hasn't been working in Phoenix since
5.1/4.16 due to a bug.

The fact that this went unnoticed for a year, and the fact that generally
there seems to be minimal interest in Tephra suggests that we should
re-visit the decision to maintain Tephra within the Phoenix project.

The last two commits that were not aimed at fighting bit-rot, but were real
fixes were committed in Jun 2019 by Lars. In the last two and a half years,
all we did was try to keep ahead of bit-rot, so that Tephra keeps up with
new HBase and maven releases, and the changes in the CI infra.

Tephra uses an old Guava version, and depends heavily on the retired Apache
Twill project.
This is a major tech debt, and an adoption blocker (CVEs in direct Tephra
dependencies), which is also carried over into the Phoenix dependencies and
shaded artifacts that we should rectify.
PHOENIX-6064, which
broke Tephra support, itself is a workaround so that we can avoid shipping
Tephra, and its problematic dependencies.

Ripping out Twill, and updating Guava and other dependencies is a
non-trivial amount of work (I estimate 1-4 weeks, depending on familiarity
with Tephra/Twill/Guava).

At the moment, no-one seems to be interested enough in Tephra to bring its
tech debt to acceptable levels, and in fact no-one seems to be using it
with any recent Phoenix release (as it doesn't work in them).

I suggest that you also check out the discussion between Lars and me in
https://issues.apache.org/jira/browse/PHOENIX-6615 for some more details
and background.

Based on the above, I propose retiring Tephra, and removing Tephra support
from Phoenix 5.2 / 4.17, unless someone steps up to solve the above issues
and maintain Tephra.

Note that this would not mean dropping transaction support from Phoenix, as
Omid support is in much better shape, and is actively used.

Please share your thoughts on the issue, if you are using Tephra and/or can
commit to solving the issues above, or if you agree on its removal, or any
other suggestions or objections.

regards
Istvan





[jira] [Commented] (HBASE-26067) Change the way on how we track store file list

2022-01-04 Thread Josh Elser (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17468655#comment-17468655
 ] 

Josh Elser commented on HBASE-26067:


{quote}We can file an issue to fix the above problem first. I think if all the 
UTs are OK, we can merge it first, and then start to fix other problems.
{quote}
Let me test real quick to figure out if this also affects HBase when the 
default tracker is being used. If the compaction failure only happens with the 
FILE tracker, I agree we can move ahead with a merge :)

> Change the way on how we track store file list
> --
>
> Key: HBASE-26067
> URL: https://issues.apache.org/jira/browse/HBASE-26067
> Project: HBase
>  Issue Type: Umbrella
>  Components: HFile
>Reporter: Duo Zhang
>Assignee: Duo Zhang
>Priority: Major
> Fix For: 3.0.0-alpha-3
>
>
> Open a separated jira to track the work since it can not be fully included in 
> HBASE-24749.
> I think this could be a landed prior to HBASE-24749, as if this works, we 
> could have different implementations for tracking store file list.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Comment Edited] (HBASE-26067) Change the way on how we track store file list

2022-01-03 Thread Josh Elser (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17468268#comment-17468268
 ] 

Josh Elser edited comment on HBASE-26067 at 1/3/22, 11:20 PM:
--

{quote}I'm looking to see if I can spot something obvious, but maybe the same 
thing happened here.
{quote}
I think I see one thing in DefaultCompactor. -I'll tag you on a PR-

This is what I thought was a problem, but after looking at the master branch, 
maybe it's not the cause of the IOException above

[https://github.com/apache/hbase/blob/HBASE-26067-branch-2/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/compactions/DefaultCompactor.java#L87]
{code:java}
  protected void abortWriter(StoreFileWriter writer) throws IOException {
Path leftoverFile = writer.getPath();
try {
  writer.close();
} catch (IOException e) {
  LOG.warn("Failed to close the writer after an unfinished compaction.", e);
} finally {
  //this step signals that the target file is no longer written and can be 
cleaned up
  writer = null;
}
...
  }{code}
Instead of null'ing out the member {{writer}} on the parent Compactor.java 
class, we're just null'ing out the local variable {{{}writer{}}}. I assume this 
is wrong, but I think we have this wrong on the branch-2 backport and in master.


was (Author: elserj):
{quote}I'm looking to see if I can spot something obvious, but maybe the same 
thing happened here.
{quote}
I think I see one thing in DefaultCompactor. I'll tag you on a PR

> Change the way on how we track store file list
> --
>
> Key: HBASE-26067
> URL: https://issues.apache.org/jira/browse/HBASE-26067
> Project: HBase
>  Issue Type: Umbrella
>  Components: HFile
>Reporter: Duo Zhang
>Assignee: Duo Zhang
>Priority: Major
> Fix For: 3.0.0-alpha-3
>
>
> Open a separated jira to track the work since it can not be fully included in 
> HBASE-24749.
> I think this could be a landed prior to HBASE-24749, as if this works, we 
> could have different implementations for tracking store file list.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (HBASE-26067) Change the way on how we track store file list

2022-01-03 Thread Josh Elser (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17468268#comment-17468268
 ] 

Josh Elser commented on HBASE-26067:


{quote}I'm looking to see if I can spot something obvious, but maybe the same 
thing happened here.
{quote}
I think I see one thing in DefaultCompactor. I'll tag you on a PR

> Change the way on how we track store file list
> --
>
> Key: HBASE-26067
> URL: https://issues.apache.org/jira/browse/HBASE-26067
> Project: HBase
>  Issue Type: Umbrella
>  Components: HFile
>Reporter: Duo Zhang
>Assignee: Duo Zhang
>Priority: Major
> Fix For: 3.0.0-alpha-3
>
>
> Open a separated jira to track the work since it can not be fully included in 
> HBASE-24749.
> I think this could be a landed prior to HBASE-24749, as if this works, we 
> could have different implementations for tracking store file list.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (HBASE-26067) Change the way on how we track store file list

2022-01-03 Thread Josh Elser (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17468264#comment-17468264
 ] 

Josh Elser commented on HBASE-26067:


Ran a simple {{hbase pe randomWrite}} and I'm noticing one thing that we also 
saw when backporting this to branch-2.4 (internally)
{noformat}
2022-01-03 17:40:47,184 INFO  [MemStoreFlusher.1] 
regionserver.DefaultStoreFlusher(81): Flushed memstore data size=73.26 MB at 
sequenceid=4537 (bloomFilter=true), 
to=hdfs://mizar.cloudera:8020/hbase-2.6/data/default/TestTable/951d5e954f95ab58f224fe80a77bea56/info0/7732094
      d69a94641a56965c4fb0d1947 {noformat}
and
{noformat}
2022-01-03 17:41:18,319 ERROR [regionserver/localhost:16020-shortCompactions-0] 
regionserver.CompactSplit(670): Compaction failed 
region=TestTable,0004054490,1641249249856.2dc7251c6eceb660b9c7bb0b587db913.,
 storeName=2dc7251c6eceb660b9c7bb0b587db913/info0,       priority=6, 
startTime=1641249666161
java.io.IOException: Root-level entries already added in single-level mode
  at 
org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexWriter.writeSingleLevelIndex(HFileBlockIndex.java:1136)
  at 
org.apache.hadoop.hbase.io.hfile.CompoundBloomFilterWriter$MetaWriter.write(CompoundBloomFilterWriter.java:279)
  at 
org.apache.hadoop.hbase.io.hfile.HFileWriterImpl$1.writeToBlock(HFileWriterImpl.java:713)
  at 
org.apache.hadoop.hbase.io.hfile.HFileBlock$Writer.writeBlock(HFileBlock.java:1205)
  at 
org.apache.hadoop.hbase.io.hfile.HFileWriterImpl.close(HFileWriterImpl.java:660)
  at 
org.apache.hadoop.hbase.regionserver.StoreFileWriter.close(StoreFileWriter.java:377)
  at 
org.apache.hadoop.hbase.regionserver.compactions.DefaultCompactor.commitWriter(DefaultCompactor.java:70)
  at 
org.apache.hadoop.hbase.regionserver.compactions.Compactor.compact(Compactor.java:386)
  at 
org.apache.hadoop.hbase.regionserver.compactions.DefaultCompactor.compact(DefaultCompactor.java:62)
  at 
org.apache.hadoop.hbase.regionserver.DefaultStoreEngine$DefaultCompactionContext.compact(DefaultStoreEngine.java:125)
  at org.apache.hadoop.hbase.regionserver.HStore.compact(HStore.java:1141)
  at org.apache.hadoop.hbase.regionserver.HRegion.compact(HRegion.java:2388)
  at 
org.apache.hadoop.hbase.regionserver.CompactSplit$CompactionRunner.doCompaction(CompactSplit.java:654)
  at 
org.apache.hadoop.hbase.regionserver.CompactSplit$CompactionRunner.run(CompactSplit.java:697)
  at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:748) {noformat}
We had figured out that this started happening when we backported HBASE-26271. 
[~bszabolcs] actually realized that we had transposed the arguments in 
{{sinkFactory.createWriter}} (substituting the dropCache for 
request.isMajor()). I'm looking to see if I can spot something obvious, but 
maybe the same thing happened here.

> Change the way on how we track store file list
> --
>
> Key: HBASE-26067
> URL: https://issues.apache.org/jira/browse/HBASE-26067
> Project: HBase
>  Issue Type: Umbrella
>  Components: HFile
>Reporter: Duo Zhang
>Assignee: Duo Zhang
>Priority: Major
> Fix For: 3.0.0-alpha-3
>
>
> Open a separated jira to track the work since it can not be fully included in 
> HBASE-24749.
> I think this could be a landed prior to HBASE-24749, as if this works, we 
> could have different implementations for tracking store file list.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

