keith-turner opened a new pull request, #6139:
URL: https://github.com/apache/accumulo/pull/6139

   Supports partitioning fate processing across multiple manager processes. 
When multiple manager processes are started, one will become primary and do 
mostly what the manager did before this change.  However, processing of user 
fate operations is now spread across all manager processes.  The following is a 
high-level guide to these changes.
   
    * New ManagerAssistant class that supports running tasks assigned by the 
primary manager process.  Currently it only supports fate. This class runs in 
every manager process.  This class does the following.
      * Gets its own lock in zookeeper separate from the primary manager lock. 
This lock is at `/managers/assistants` in ZK.  This lock is like a tserver or 
compactor lock.  Every manager process will be involved in two locks, one to 
determine who is primary and one for the assistant functionality.
      * Starts a thrift server that can accept assignments of fate ranges to 
process. This second thrift server was needed in the manager because the 
primary thrift server is not started until after the primary manager gets its 
lock.
      * In the manager startup sequence this class is created and started 
before the manager waits for its primary lock.  This allows non-primary 
managers to receive work from the primary via RPC.
      * In the future, this lock and thrift service could be used by the 
primary manager to delegate more tasks, such as compaction coordination, table 
management, balancing, etc.  Each functionality could be partitioned in its own 
way and have its own RPCs for delegation.
      * This class creates its own ServerContext.  This was needed because the 
server context has a reference to the server lock.  Since a second lock was 
created, a second server context was needed.  Would like to try to improve this 
in a follow-on change as it seems likely to make the code harder to maintain 
and understand.
      * Because this class does not extend AbstractServer, it does not get some 
of the benefits that class offers, like monitoring of its new lock.  Would like 
to improve this in a follow-on issue.
    * New FateWorker class.  This runs in the ManagerAssistant and handles 
requests from the primary manager to adjust what range of the fate table it is 
currently working on.
    * New FateManager class that is run by the primary manager and is 
responsible for partitioning fate processing across all assistant managers. As 
manager processes come and go this will repartition the fate table evenly 
across the managers.
    * Some new RPCs for best effort notifications. Before these changes there 
were in-memory notification systems that made the manager more responsive.  
These would allow a fate operation to signal the Tablet Group Watcher to take 
action sooner.  FateWorkerEnv sends these notifications to the primary manager 
over a new RPC.  It does not matter if a notification is lost; the action will 
still happen eventually.
    * Some adjustment of the order in which metrics were set up in the startup 
sequence was needed to make things work. Have not yet tested metrics with these 
changes.
    * Broke the Fate class into Fate and FateClient classes.  The FateClient 
class supports starting and checking on fate operations. Most code uses the 
FateClient.  This breakup was needed as the primary manager will interact with 
the FateClient to start operations and check on their status. Fate extends 
FateClient to minimize code changes, but it does not need to. In a follow-on 
change would like to remove this extension.
    * Fate operations update two in-memory data structures related to bulk 
import and running compactions.  These updates are no longer done.  Would like 
to reexamine the need for these in follow-on issues.
    * Two new tests:
       * MultipleManagerIT : tests starting and stopping managers and ensures 
fate runs across all managers correctly.
       * ComprehensiveMultiManagerIT : tests all Accumulo APIs with three 
managers running.  Does not start and stop managers.
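
   To illustrate the even repartitioning described for FateManager above, here 
is a minimal Java sketch.  It is not the code in this PR.  The class name 
`FatePartitionSketch`, the `Range` type, the integer key space, and the 
`host:port` strings are all hypothetical; the real FateManager partitions 
ranges of the fate table, and how it derives those ranges is not shown here.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical sketch: evenly split a key space across the live assistants.
public class FatePartitionSketch {

  /** Half-open range [start, end) over an assumed integer key space. */
  public static final class Range {
    public final int start;
    public final int end;

    Range(int start, int end) {
      this.start = start;
      this.end = end;
    }

    @Override
    public String toString() {
      return "[" + start + ", " + end + ")";
    }
  }

  /**
   * Splits [0, keySpace) into one contiguous range per assistant. Assistants
   * are sorted first so the same set always yields the same assignment.
   */
  public static Map<String, Range> partition(Iterable<String> assistants, int keySpace) {
    TreeSet<String> sorted = new TreeSet<>();
    assistants.forEach(sorted::add);

    Map<String, Range> result = new LinkedHashMap<>();
    int n = sorted.size();
    int i = 0;
    for (String assistant : sorted) {
      // Use long arithmetic so keySpace * i cannot overflow an int.
      int start = (int) ((long) keySpace * i / n);
      int end = (int) ((long) keySpace * (i + 1) / n);
      result.put(assistant, new Range(start, end));
      i++;
    }
    return result;
  }

  public static void main(String[] args) {
    // Three assistants, then one leaves and the space is split again.
    System.out.println(partition(List.of("m1:9999", "m2:9999", "m3:9999"), 16));
    System.out.println(partition(List.of("m1:9999", "m3:9999"), 16));
  }
}
```

   Calling `partition` again whenever the set of assistants changes gives the 
"come and go" behavior: the ranges always cover the whole key space with no 
gaps or overlap, and their sizes differ by at most one key.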
   
   This change needs some user-facing follow-on work to provide information to 
the user.  Need to update the service status command.  Also need to update the 
listing of running fate operations to show where they are running.

