[jira] [Updated] (SENTRY-1891) SentryPlugin triggers full update due to concurrency bug

Vadim Spector (JIRA) Fri, 18 Aug 2017 10:25:24 -0700

     [ 
https://issues.apache.org/jira/browse/SENTRY-1891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Vadim Spector updated SENTRY-1891:
----------------------------------
    Description: 
Sentry server can trigger full update to NameNode for no good reason, in high 
load cases, due to the bug in concurrency handling. Full update is always done 
at the initialization time and for large number of permissions and/or hdfs 
paths it takes lots of time and significantly increases (temporarily) heap 
size. It is performed during normal operations only if Sentry server decides 
that it has gaps in the partial updates' sequence numbers, so it restores the 
entire snapshot. Full update during normal operations is a highly disruptive 
event, leading to huge increasw in heap size, system slowdown, and even crash. 
Below is the explanation of the problem:

====================================================================

*SentryPlugin.java* has multiple permission update methods, such as 
_onDropSentryRole()_, _onDropSentryPrivilege()_, 
_onAlterSentryRoleDeleteGroups()_, etc. They, of course, can be called 
concurrently. And they all delegate work to 
_permsUpdater.handleUpdateNotification(update)_ call, which, in turn, 
serializes those requests by turning them into the Runnable tasks and 
submitting them into a single-threaded thread pool. Each job ends up appending 
an update to the updates list. So far so good, all updates are serialized.

However, each of those methods first creates _PermissionsUpdate_ object with 
auto-incremented sequence number, and only then passes this object to 
_permsUpdate.handleUpdateNotification(update)_. Here is typical code snippet:

{quote}PermissionsUpdate update = new 
PermissionsUpdate(permSeqNum.incrementAndGet(), false);
// what if another permission update thread preempts this one right here and 
starts and finishes the whole method ???
update.addPrivilegeUpdate(authzObj).putToAddPrivileges(roleName, 
privilege.getAction().toUpperCase());
// ... or here ???
permsUpdater.handleUpdateNotification(update); 
LOGGER.debug("Authz Perm preUpdate [" + update.getSeqNum() + "]..");{quote}

The problem is that sequence number assignment to _PermissionsUpdate_ object 
and appending this object to the updates list (by calling 
_permsUpdater.handleUpdateNotification(update)_ ) must be an atomic pair of 
operations, and it is not. In multi-threaded environment sooner or later we can 
end up with appending updates into the updates list with  out of order sequence 
numbers.

The problem is not that the updates may end up appended to the updates list not 
in the order of their arrival. We never guaranteed the right order for 
concurrent requests coming from different clients. The problem is that they end 
up appended to the list with their sequence numbers out of order, e.g. #101, 
#100, instead of #100, #101. Then handleUpdateNotification() method has the 
following lines:

{quote}final boolean editNotMissed = lastSeenSeqNum.incrementAndGet() == 
update.getSeqNum();{quote}

Suppose lastSeenSeqNum was 99, and the next appended update is #101. Then, 
obviously, 99+1 != 101, so editNotMissed is set to false. It triggers full 
update logic a few lines of code later:

bq. if (imageRetreiver != null) {
bq.               try {
bq.                 toUpdate = 
imageRetreiver.retrieveFullImage(update.getSeqNum());
bq.               } catch (Exception e) {
bq.                 LOGGER.warn("failed to retrieve full image: ", e);
bq.               }
bq.               updateable = updateable.updateFull(toUpdate);
bq. }

Since imageRetriever != null for permissions updater, all the hell breaks loose.

It can be fixed in _SentryPlugin.java_ for all permission update methods as 
follows:

{quote}PermissionsUpdate update;
synchronized (permsUpdateLock) { // now sequence number increment and adding 
update to the list are performed atomically
  update = new PermissionsUpdate(permSeqNum.incrementAndGet(), false);
   update.addPrivilegeUpdate(authzObj).putToAddPrivileges(roleName, 
privilege.getAction().toUpperCase());
   permsUpdater.handleUpdateNotification(update);
}
LOGGER.debug("Authz Perm preUpdate [" + update.getSeqNum() + "]..");
{quote}

 _handleUpdateNotification()_ is always fast since it submits real work to 
thread pool; so, it does not block, and introducing additional synchronization 
should not affect concurrency.

  was:
Sentry server can trigger full update to NameNode for no good reason, in high 
load cases, due to the bug in concurrency handling. Full update is always done 
at the initialization time and for large number of permissions and/or hdfs 
paths it takes lots of time and significantly increases (temporarily) heap 
size. It is performed during normal operations only if Sentry server decides 
that it has gaps in the partial updates' sequence numbers, so it restores the 
entire snapshot. Full update during normal operations is a highly disruptive 
event, leading to huge increasw in heap size, system slowdown, and even crash. 
Below is the explanation of the problem:

====================================================================

*SentryPlugin.java* has multiple permission update methods, such as 
_onDropSentryRole()_, _onDropSentryPrivilege()_, 
_onAlterSentryRoleDeleteGroups()_, etc. They, of course, can be called 
concurrently. And they all delegate work to 
_permsUpdater.handleUpdateNotification(update)_ call, which, in turn, 
serializes those requests by turning them into the Runnable tasks and 
submitting them into a single-threaded thread pool. Each job ends up appending 
an update to the updates list. So far so good, all updates are serialized.

However, each of those methods first creates _PermissionsUpdate_ object with 
auto-incremented sequence number, and only then passes this object to 
_permsUpdate.handleUpdateNotification(update)_. Here is typical code snippet:

{quote}PermissionsUpdate update = new 
PermissionsUpdate(permSeqNum.incrementAndGet(), false);
// what if another permission update thread preempts this one right here and 
starts and finishes the whole method ???
update.addPrivilegeUpdate(authzObj).putToAddPrivileges(roleName, 
privilege.getAction().toUpperCase());
// ... or here ???
permsUpdater.handleUpdateNotification(update); 
LOGGER.debug("Authz Perm preUpdate [" + update.getSeqNum() + "]..");{quote}

The problem is that sequence number assignment to _PermissionsUpdate_ object 
and appending this object to the updates list (by calling 
_permsUpdater.handleUpdateNotification(update)_ ) must be an atomic pair of 
operations, and it is not. In multi-threaded environment sooner or later we can 
end up with appending updates into the updates list with  out of order sequence 
numbers.

The problem is not that the updates may end up appended to the updates list not 
in the order of their arrival. We never guaranteed the right order for 
concurrent requests coming from different clients. The problem is that they end 
up appended to the list with their sequence numbers out of order, e.g. #101, 
#100, instead of #100, #101. Then handleUpdateNotification() method has the 
following lines:

{quote}final boolean editNotMissed = lastSeenSeqNum.incrementAndGet() == 
update.getSeqNum();{quote}

Suppose lastSeenSeqNum was 99, and the next appended update is #101. Then, 
obviously, 99+1 != 101, so editNotMissed is set to false. It triggers full 
update logic a few lines of code later:

{quote}if (imageRetreiver != null) {
              try {
                toUpdate = imageRetreiver.retrieveFullImage(update.getSeqNum());
              } catch (Exception e) {
                LOGGER.warn("failed to retrieve full image: ", e);
              }
              updateable = updateable.updateFull(toUpdate);
}
{quote}

Since imageRetriever != null for permissions updater, all the hell breaks loose.

It can be fixed in _SentryPlugin.java_ for all permission update methods as 
follows:

{quote}PermissionsUpdate update;
synchronized (permsUpdateLock) { // now sequence number increment and adding 
update to the list are performed atomically
  update = new PermissionsUpdate(permSeqNum.incrementAndGet(), false);
   update.addPrivilegeUpdate(authzObj).putToAddPrivileges(roleName, 
privilege.getAction().toUpperCase());
   permsUpdater.handleUpdateNotification(update);
}
LOGGER.debug("Authz Perm preUpdate [" + update.getSeqNum() + "]..");
{quote}

 _handleUpdateNotification()_ is always fast since it submits real work to 
thread pool; so, it does not block, and introducing additional synchronization 
should not affect concurrency.


> SentryPlugin triggers full update due to concurrency bug
> --------------------------------------------------------
>
>                 Key: SENTRY-1891
>                 URL: https://issues.apache.org/jira/browse/SENTRY-1891
>             Project: Sentry
>          Issue Type: Bug
>            Reporter: Vadim Spector
>
> Sentry server can trigger full update to NameNode for no good reason, in high 
> load cases, due to the bug in concurrency handling. Full update is always 
> done at the initialization time and for large number of permissions and/or 
> hdfs paths it takes lots of time and significantly increases (temporarily) 
> heap size. It is performed during normal operations only if Sentry server 
> decides that it has gaps in the partial updates' sequence numbers, so it 
> restores the entire snapshot. Full update during normal operations is a 
> highly disruptive event, leading to huge increasw in heap size, system 
> slowdown, and even crash. Below is the explanation of the problem:
> ====================================================================
> *SentryPlugin.java* has multiple permission update methods, such as 
> _onDropSentryRole()_, _onDropSentryPrivilege()_, 
> _onAlterSentryRoleDeleteGroups()_, etc. They, of course, can be called 
> concurrently. And they all delegate work to 
> _permsUpdater.handleUpdateNotification(update)_ call, which, in turn, 
> serializes those requests by turning them into the Runnable tasks and 
> submitting them into a single-threaded thread pool. Each job ends up 
> appending an update to the updates list. So far so good, all updates are 
> serialized.
> However, each of those methods first creates _PermissionsUpdate_ object with 
> auto-incremented sequence number, and only then passes this object to 
> _permsUpdate.handleUpdateNotification(update)_. Here is typical code snippet:
> {quote}PermissionsUpdate update = new 
> PermissionsUpdate(permSeqNum.incrementAndGet(), false);
> // what if another permission update thread preempts this one right here and 
> starts and finishes the whole method ???
> update.addPrivilegeUpdate(authzObj).putToAddPrivileges(roleName, 
> privilege.getAction().toUpperCase());
> // ... or here ???
> permsUpdater.handleUpdateNotification(update); 
> LOGGER.debug("Authz Perm preUpdate [" + update.getSeqNum() + "]..");{quote}
> The problem is that sequence number assignment to _PermissionsUpdate_ object 
> and appending this object to the updates list (by calling 
> _permsUpdater.handleUpdateNotification(update)_ ) must be an atomic pair of 
> operations, and it is not. In multi-threaded environment sooner or later we 
> can end up with appending updates into the updates list with  out of order 
> sequence numbers.
> The problem is not that the updates may end up appended to the updates list 
> not in the order of their arrival. We never guaranteed the right order for 
> concurrent requests coming from different clients. The problem is that they 
> end up appended to the list with their sequence numbers out of order, e.g. 
> #101, #100, instead of #100, #101. Then handleUpdateNotification() method has 
> the following lines:
> {quote}final boolean editNotMissed = lastSeenSeqNum.incrementAndGet() == 
> update.getSeqNum();{quote}
> Suppose lastSeenSeqNum was 99, and the next appended update is #101. Then, 
> obviously, 99+1 != 101, so editNotMissed is set to false. It triggers full 
> update logic a few lines of code later:
> bq. if (imageRetreiver != null) {
> bq.               try {
> bq.                 toUpdate = 
> imageRetreiver.retrieveFullImage(update.getSeqNum());
> bq.               } catch (Exception e) {
> bq.                 LOGGER.warn("failed to retrieve full image: ", e);
> bq.               }
> bq.               updateable = updateable.updateFull(toUpdate);
> bq. }
> Since imageRetriever != null for permissions updater, all the hell breaks 
> loose.
> It can be fixed in _SentryPlugin.java_ for all permission update methods as 
> follows:
> {quote}PermissionsUpdate update;
> synchronized (permsUpdateLock) { // now sequence number increment and adding 
> update to the list are performed atomically
>   update = new PermissionsUpdate(permSeqNum.incrementAndGet(), false);
>    update.addPrivilegeUpdate(authzObj).putToAddPrivileges(roleName, 
> privilege.getAction().toUpperCase());
>    permsUpdater.handleUpdateNotification(update);
> }
> LOGGER.debug("Authz Perm preUpdate [" + update.getSeqNum() + "]..");
> {quote}
>  _handleUpdateNotification()_ is always fast since it submits real work to 
> thread pool; so, it does not block, and introducing additional 
> synchronization should not affect concurrency.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

[jira] [Updated] (SENTRY-1891) SentryPlugin triggers full update due to concurrency bug

Reply via email to