> This stuff was designed in years before the Procedure framework was
written. What we had to work with is ZK watchers.

I see, this aligns with what I got to know today about the workflow. Yes,
w.r.t ZK watchers, there is no change b/ HBase 1 and 2. Thanks Andrew for
the clarification!

> The propagation of cache updates could be made synchronous with the
update itself if the grant and revoke admin apis were refactored as
Procedures. The Procedure would have a sub-procedure that establishes a
barrier around the cache refresh at all live regionservers, presumably.
There is a JIRA for that work filed somewhere, an old one.

Yes, I saw the Jira as well today, HBASE-21602. Good to see significant no
of sub-tasks are already completed.
But overall I understand that since this is more of an operator initiated
action, I don't think there is a problem with existing design. The only
thing I wanted to make sure, HBase 1 and 2 don't have any behavioural
change in the workflow (other than HBase 1 using coproc endpoint directly
on RS vs HBase 2 using Admin APIs).

On Mon, May 31, 2021 at 10:07 PM Andrew Purtell <[email protected]>
wrote:

> In the HBase security unit tests we wait for the zk mediated cache reload
> to complete before continuing the test. If you don’t do that your test will
> be flaky. Check that you are doing this. If you are not, then your tests
> will be flaky. In the real world, the operator issues the grant or revoke
> command and we have the human time scale of ~seconds (at least) for the
> change to propagate. This is not a bug. This stuff was designed in years
> before the Procedure framework was written. What we had to work with is ZK
> watchers. Those are processed asynchronously with respect to the admin API.
> Because a unit test runs much faster than a human administrative act, you
> have to wait for more than just the admin API to complete. The cache in
> each regionserver’s access controller has a last update timestamp and you
> should wait for them to all update. Look at the HBase unit tests to see how
> that is done.
>
> The propagation of cache updates could be made synchronous with the update
> itself if the grant and revoke admin apis were refactored as Procedures.
> The Procedure would have a sub-procedure that establishes a barrier around
> the cache refresh at all live regionservers, presumably. There is a JIRA
> for that work filed somewhere, an old one. It’s not been important enough
> for someone to pick up and perform. Don’t expect this to change.
>
>
> > On May 31, 2021, at 6:52 AM, Viraj Jasani <[email protected]> wrote:
> >
> > For Permission APIs grant/revoke, I just checked the diff b/ HBase 1 and
> > HBase 2.
> > Phoenix MetaDataClient uses AccessControlClient to grant/revoke
> permission
> > to a user. HBase 1 AccessControlClient invokes coproc
> > endpoint AccessController, which puts records in hbase:acl table on
> > RegionServer hosting acl's single region. HBase 2 AccessControlClient on
> > the other hand invokes Admin APIs to let master take care of adding
> records
> > in hbase:acl table (HBASE-21739). This is the main diff b/ both versions,
> > however, both workflows seem synchronous.
> > For both HBase 1 and 2, post-put hook updates ZNodes with updated entries
> > in acl table, which internally triggers a refresh of table cache in
> > AuthManager, and this table cache could be used to verify Auth of user on
> > table/namespace in subsequent request (based on our test case: remove
> > access, followed by verification of grant permission). If table cache has
> > race condition, it should also be applicable to HBase 1 as well I
> believe,
> > yet to compare both at length.
> >
> > I understand, Phoenix is not doing anything different w.r.t GRANT/REVOKE
> > statements so some sync issue might have been introduced in HBase 2
> > unknowingly. I will try to dig more sometime next week and file HBase
> Jira
> > if I could find some evidence of race conditions or could reproduce with
> > HBase tests directly.
> >
> >> On Mon, May 31, 2021 at 11:40 AM Istvan Toth <[email protected]> wrote:
> >>
> >> I agree that we are ready for the RC, and that deflaking the tests is
> >> great.
> >>
> >> However, I think that Permissions test failures are exposing a real
> problem
> >> (probably in HBase),
> >> and we've only adopted the tests for the buggy behaviour of HBase, and
> >> should track down / fix
> >> the root cause.
> >>
> >> On Sun, May 30, 2021 at 4:46 PM Viraj Jasani <[email protected]>
> wrote:
> >>
> >>>> IMHO, a flapper is worse than a failing test because it undermines the
> >>> confidence in a test run.
> >>>
> >>> I concur. In fact, in the last few builds, I have seen a few flappers
> >> (e.g
> >>> testUpsertSelectWithMultiByteCharsAutoCommit, testAsyncRebuildAll)
> whose
> >>> resolution was attempted just a few months back and now they are
> showing
> >> up
> >>> again (with much lower frequency than before), so yes we need some
> >>> attention again.
> >>>
> >>> After fixing consistent failure with testPherfMain (PHOENIX-6482) and
> >> high
> >>> frequency flappers with BasePermissionsIT and AuditLoggingIT
> >>> (PHOENIX-6483), CI builds are in better shape now.
> >>> In the recent build, I can see no test failures with 3/4 HBase
> profiles.
> >> I
> >>> will file a Jira for JaCoCo report generation failure, which keeps
> >> showing
> >>> up often (flapper with JaCoCo issue, not related to test failure).
> >>>
> >>> Given that CI builds are in better shape, planning to start an RC soon.
> >>> Thanks
> >>>
> >>>
> >>> On Sat, May 29, 2021 at 9:45 PM [email protected] <[email protected]>
> >> wrote:
> >>>
> >>>> Nice!
> >>>>
> >>>> I noticed we had some successful 5.1 runs too.
> >>>> Fixing the tests would need some attention again. Flappers should
> >> either
> >>>> be fixed or disabled.
> >>>>
> >>>> IMHO, a flapper is worse than a failing test because it undermines the
> >>>> confidence in a test run.
> >>>> If the tests always pass you'll notice a failure right away.
> >>>> But when most of the runs produce some kind of failure it is human
> >> nature
> >>>> to start ignoring them, and then new, real failures pile up.
> >>>>
> >>>>
> >>>> -- Lars
> >>>>
> >>>>
> >>>> On Friday, May 28, 2021, 11:36:53 PM PDT, Viraj Jasani <
> >>> [email protected]>
> >>>> wrote:
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> Due to incorrect/missing fix versions, found some discrepancies b/
> >> master
> >>>> and 5.1 in the Pherf module.
> >>>> Backported recent improvements to 5.1 including test fix:
> >>>> PHOENIX-6417, PHOENIX-6118, PHOENIX-6430, PHOENIX-6429, PHOENIX-6431,
> >>>> PHOENIX-6432
> >>>>
> >>>> On Sat, May 29, 2021 at 12:20 AM Viraj Jasani <[email protected]>
> >>> wrote:
> >>>>
> >>>>> I looked at CI build today and there are a couple of flakies in
> >>>> core/pherf.
> >>>>>
> >>>>> 3 major flaky tests: PermissionNSEnabledIT, PermissionsCacheIT,
> >>>>> AuditLoggingIT (reported with HBase 2.2/2.3/2.4). Permission tests
> >> have
> >>>>> been flaky for quite some time. IIRC, there was some attempt to fix
> >>> them
> >>>> as
> >>>>> well but still they appear with AccessDeniedException at times. Not
> >>> much
> >>>>> sure about AuditLoggingIT though.
> >>>>>
> >>>>> Pherf test fails with HBase 2.1.
> >>>>>
> >>>>>
> >>>>> On Fri, 28 May 2021 at 10:44 PM, [email protected] <[email protected]>
> >>>>> wrote:
> >>>>>
> >>>>>> 5.1 CI builds seems to fail consistently - at least for the past two
> >>>>>> weeks.
> >>>>>>
> >>>>>>
> >>>>
> >>>
> >>
> https://ci-hadoop.apache.org/job/Phoenix/job/Phoenix-mulitbranch/job/5.1/
> >>>>>>
> >>>>>> Might be something to look into.
> >>>>>> I have an hour today between meetings, I'll see what I can do. :)
> >>>>>>
> >>>>>> -- Lars
> >>>>>>
> >>>>>>
> >>>>>> On Wednesday, May 26, 2021, 10:54:13 PM PDT, Viraj Jasani <
> >>>>>> [email protected]> wrote:
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> Thanks Lars. I had an offline chat with Istvan yesterday and I will
> >>>>>> prepare
> >>>>>> RCs over the weekend.
> >>>>>>
> >>>>>> In the meantime, I will again take a look at Jira fixVersion and git
> >>>>>> commits compatibility and also try to help with any backport (if
> >> still
> >>>>>> pending).
> >>>>>>
> >>>>>>
> >>>>>> On Wed, 26 May 2021 at 11:53 PM, [email protected] <[email protected]
> >>>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> I cherry-picked three changes into branch 5.1.
> >>>>>>> Assuming Jira is up-to-date, there are no 4.16.x issues that are
> >> not
> >>>>>> also
> >>>>>>> in 5.1.x.
> >>>>>>>
> >>>>>>> There are 12 issues left that are in 4.17.0 and 5.2.0, but not in
> >>>> 5.1.x.
> >>>>>>> But they are presumably not in 4.16.x because they violate
> >> backward
> >>>>>>> compatibility, and - if so - should not be in 5.1.2.
> >>>>>>>
> >>>>>>> -- Lars
> >>>>>>>
> >>>>>>> On Wednesday, May 26, 2021, 10:15:31 AM PDT, [email protected] <
> >>>>>>> [email protected]> wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> Awesome. Thanks Istvan.
> >>>>>>>
> >>>>>>> I'll do a pass through the issues too.
> >>>>>>>
> >>>>>>> -- Lars
> >>>>>>>
> >>>>>>>
> >>>>>>> On Tuesday, May 25, 2021, 8:45:00 PM PDT, Istvan Toth <
> >>>> [email protected]
> >>>>>>>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> I am not aware of any blockers in 5.1.
> >>>>>>> If there is no objection, and no other volunteers for the 5.1.2 RM
> >>>>>> role, I
> >>>>>>> intend to start the process on Monday.
> >>>>>>>
> >>>>>>> I will also look for missing fixes before release, but if anyone
> >> is
> >>>>>> aware
> >>>>>>> of any fix that is needed in 5.1.2 , but
> >>>>>>> hasn't been backported, then please try to get it resolved by
> >>> Monday.
> >>>>>>>
> >>>>>>> regards
> >>>>>>> Istvan
> >>>>>>>
> >>>>>>> On Wed, May 26, 2021 at 4:19 AM [email protected] <
> >> [email protected]>
> >>>>>> wrote:
> >>>>>>>
> >>>>>>>> There are 17 resolved issues against it.
> >>>>>>>> PHOENIX-6436 is needed for further integration with Trino
> >>> (formerly
> >>>>>>>> Presto).
> >>>>>>>>
> >>>>>>>> It is time for another point release?
> >>>>>>>>
> >>>>>>>> -- Lars
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
>

Reply via email to