> This stuff was designed in years before the Procedure framework was written. What we had to work with is ZK watchers.
I see, this aligns with what I got to know today about the workflow. Yes, w.r.t ZK watchers, there is no change b/ HBase 1 and 2. Thanks Andrew for the clarification! > The propagation of cache updates could be made synchronous with the update itself if the grant and revoke admin apis were refactored as Procedures. The Procedure would have a sub-procedure that establishes a barrier around the cache refresh at all live regionservers, presumably. There is a JIRA for that work filed somewhere, an old one. Yes, I saw the Jira as well today, HBASE-21602. Good to see significant no of sub-tasks are already completed. But overall I understand that since this is more of an operator initiated action, I don't think there is a problem with existing design. The only thing I wanted to make sure, HBase 1 and 2 don't have any behavioural change in the workflow (other than HBase 1 using coproc endpoint directly on RS vs HBase 2 using Admin APIs). On Mon, May 31, 2021 at 10:07 PM Andrew Purtell <[email protected]> wrote: > In the HBase security unit tests we wait for the zk mediated cache reload > to complete before continuing the test. If you don’t do that your test will > be flaky. Check that you are doing this. If you are not, then your tests > will be flaky. In the real world, the operator issues the grant or revoke > command and we have the human time scale of ~seconds (at least) for the > change to propagate. This is not a bug. This stuff was designed in years > before the Procedure framework was written. What we had to work with is ZK > watchers. Those are processed asynchronously with respect to the admin API. > Because a unit test runs much faster than a human administrative act, you > have to wait for more than just the admin API to complete. The cache in > each regionserver’s access controller has a last update timestamp and you > should wait for them to all update. Look at the HBase unit tests to see how > that is done. > > The propagation of cache updates could be made synchronous with the update > itself if the grant and revoke admin apis were refactored as Procedures. > The Procedure would have a sub-procedure that establishes a barrier around > the cache refresh at all live regionservers, presumably. There is a JIRA > for that work filed somewhere, an old one. It’s not been important enough > for someone to pick up and perform. Don’t expect this to change. > > > > On May 31, 2021, at 6:52 AM, Viraj Jasani <[email protected]> wrote: > > > > For Permission APIs grant/revoke, I just checked the diff b/ HBase 1 and > > HBase 2. > > Phoenix MetaDataClient uses AccessControlClient to grant/revoke > permission > > to a user. HBase 1 AccessControlClient invokes coproc > > endpoint AccessController, which puts records in hbase:acl table on > > RegionServer hosting acl's single region. HBase 2 AccessControlClient on > > the other hand invokes Admin APIs to let master take care of adding > records > > in hbase:acl table (HBASE-21739). This is the main diff b/ both versions, > > however, both workflows seem synchronous. > > For both HBase 1 and 2, post-put hook updates ZNodes with updated entries > > in acl table, which internally triggers a refresh of table cache in > > AuthManager, and this table cache could be used to verify Auth of user on > > table/namespace in subsequent request (based on our test case: remove > > access, followed by verification of grant permission). If table cache has > > race condition, it should also be applicable to HBase 1 as well I > believe, > > yet to compare both at length. > > > > I understand, Phoenix is not doing anything different w.r.t GRANT/REVOKE > > statements so some sync issue might have been introduced in HBase 2 > > unknowingly. I will try to dig more sometime next week and file HBase > Jira > > if I could find some evidence of race conditions or could reproduce with > > HBase tests directly. > > > >> On Mon, May 31, 2021 at 11:40 AM Istvan Toth <[email protected]> wrote: > >> > >> I agree that we are ready for the RC, and that deflaking the tests is > >> great. > >> > >> However, I think that Permissions test failures are exposing a real > problem > >> (probably in HBase), > >> and we've only adopted the tests for the buggy behaviour of HBase, and > >> should track down / fix > >> the root cause. > >> > >> On Sun, May 30, 2021 at 4:46 PM Viraj Jasani <[email protected]> > wrote: > >> > >>>> IMHO, a flapper is worse than a failing test because it undermines the > >>> confidence in a test run. > >>> > >>> I concur. In fact, in the last few builds, I have seen a few flappers > >> (e.g > >>> testUpsertSelectWithMultiByteCharsAutoCommit, testAsyncRebuildAll) > whose > >>> resolution was attempted just a few months back and now they are > showing > >> up > >>> again (with much lower frequency than before), so yes we need some > >>> attention again. > >>> > >>> After fixing consistent failure with testPherfMain (PHOENIX-6482) and > >> high > >>> frequency flappers with BasePermissionsIT and AuditLoggingIT > >>> (PHOENIX-6483), CI builds are in better shape now. > >>> In the recent build, I can see no test failures with 3/4 HBase > profiles. > >> I > >>> will file a Jira for JaCoCo report generation failure, which keeps > >> showing > >>> up often (flapper with JaCoCo issue, not related to test failure). > >>> > >>> Given that CI builds are in better shape, planning to start an RC soon. > >>> Thanks > >>> > >>> > >>> On Sat, May 29, 2021 at 9:45 PM [email protected] <[email protected]> > >> wrote: > >>> > >>>> Nice! > >>>> > >>>> I noticed we had some successful 5.1 runs too. > >>>> Fixing the tests would need some attention again. Flappers should > >> either > >>>> be fixed or disabled. > >>>> > >>>> IMHO, a flapper is worse than a failing test because it undermines the > >>>> confidence in a test run. > >>>> If the tests always pass you'll notice a failure right away. > >>>> But when most of the runs produce some kind of failure it is human > >> nature > >>>> to start ignoring them, and then new, real failures pile up. > >>>> > >>>> > >>>> -- Lars > >>>> > >>>> > >>>> On Friday, May 28, 2021, 11:36:53 PM PDT, Viraj Jasani < > >>> [email protected]> > >>>> wrote: > >>>> > >>>> > >>>> > >>>> > >>>> > >>>> Due to incorrect/missing fix versions, found some discrepancies b/ > >> master > >>>> and 5.1 in the Pherf module. > >>>> Backported recent improvements to 5.1 including test fix: > >>>> PHOENIX-6417, PHOENIX-6118, PHOENIX-6430, PHOENIX-6429, PHOENIX-6431, > >>>> PHOENIX-6432 > >>>> > >>>> On Sat, May 29, 2021 at 12:20 AM Viraj Jasani <[email protected]> > >>> wrote: > >>>> > >>>>> I looked at CI build today and there are a couple of flakies in > >>>> core/pherf. > >>>>> > >>>>> 3 major flaky tests: PermissionNSEnabledIT, PermissionsCacheIT, > >>>>> AuditLoggingIT (reported with HBase 2.2/2.3/2.4). Permission tests > >> have > >>>>> been flaky for quite some time. IIRC, there was some attempt to fix > >>> them > >>>> as > >>>>> well but still they appear with AccessDeniedException at times. Not > >>> much > >>>>> sure about AuditLoggingIT though. > >>>>> > >>>>> Pherf test fails with HBase 2.1. > >>>>> > >>>>> > >>>>> On Fri, 28 May 2021 at 10:44 PM, [email protected] <[email protected]> > >>>>> wrote: > >>>>> > >>>>>> 5.1 CI builds seems to fail consistently - at least for the past two > >>>>>> weeks. > >>>>>> > >>>>>> > >>>> > >>> > >> > https://ci-hadoop.apache.org/job/Phoenix/job/Phoenix-mulitbranch/job/5.1/ > >>>>>> > >>>>>> Might be something to look into. > >>>>>> I have an hour today between meetings, I'll see what I can do. :) > >>>>>> > >>>>>> -- Lars > >>>>>> > >>>>>> > >>>>>> On Wednesday, May 26, 2021, 10:54:13 PM PDT, Viraj Jasani < > >>>>>> [email protected]> wrote: > >>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>>> Thanks Lars. I had an offline chat with Istvan yesterday and I will > >>>>>> prepare > >>>>>> RCs over the weekend. > >>>>>> > >>>>>> In the meantime, I will again take a look at Jira fixVersion and git > >>>>>> commits compatibility and also try to help with any backport (if > >> still > >>>>>> pending). > >>>>>> > >>>>>> > >>>>>> On Wed, 26 May 2021 at 11:53 PM, [email protected] <[email protected] > >>> > >>>>>> wrote: > >>>>>> > >>>>>>> I cherry-picked three changes into branch 5.1. > >>>>>>> Assuming Jira is up-to-date, there are no 4.16.x issues that are > >> not > >>>>>> also > >>>>>>> in 5.1.x. > >>>>>>> > >>>>>>> There are 12 issues left that are in 4.17.0 and 5.2.0, but not in > >>>> 5.1.x. > >>>>>>> But they are presumably not in 4.16.x because they violate > >> backward > >>>>>>> compatibility, and - if so - should not be in 5.1.2. > >>>>>>> > >>>>>>> -- Lars > >>>>>>> > >>>>>>> On Wednesday, May 26, 2021, 10:15:31 AM PDT, [email protected] < > >>>>>>> [email protected]> wrote: > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> Awesome. Thanks Istvan. > >>>>>>> > >>>>>>> I'll do a pass through the issues too. > >>>>>>> > >>>>>>> -- Lars > >>>>>>> > >>>>>>> > >>>>>>> On Tuesday, May 25, 2021, 8:45:00 PM PDT, Istvan Toth < > >>>> [email protected] > >>>>>>> > >>>>>>> wrote: > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> I am not aware of any blockers in 5.1. > >>>>>>> If there is no objection, and no other volunteers for the 5.1.2 RM > >>>>>> role, I > >>>>>>> intend to start the process on Monday. > >>>>>>> > >>>>>>> I will also look for missing fixes before release, but if anyone > >> is > >>>>>> aware > >>>>>>> of any fix that is needed in 5.1.2 , but > >>>>>>> hasn't been backported, then please try to get it resolved by > >>> Monday. > >>>>>>> > >>>>>>> regards > >>>>>>> Istvan > >>>>>>> > >>>>>>> On Wed, May 26, 2021 at 4:19 AM [email protected] < > >> [email protected]> > >>>>>> wrote: > >>>>>>> > >>>>>>>> There are 17 resolved issues against it. > >>>>>>>> PHOENIX-6436 is needed for further integration with Trino > >>> (formerly > >>>>>>>> Presto). > >>>>>>>> > >>>>>>>> It is time for another point release? > >>>>>>>> > >>>>>>>> -- Lars > >>>>>>>> > >>>>>>> > >>>>>> > >>>>> > >>>> > >>> > >> >
