[ 
https://issues.apache.org/jira/browse/HELIX-659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16030755#comment-16030755
 ] 

Jiajun Wang edited comment on HELIX-659 at 7/10/17 7:19 PM:
------------------------------------------------------------

Based on all that is discussed above, let us imagine a resource represented by 
3 independent state models: MasterSlave, ReadWrite, and Versions. The following 
figure shows three possible state transitions for a replica of the resource.

!https://documents.lucidchart.com/documents/e19ab04e-aa06-4ab3-9e57-cfe273554fa1/pages/0_0?a=2213&x=-11&y=422&w=1124&h=396&store=1&accept=image%2F*&auth=LCA%20ef1c4685cd5e2f5bbded5596a92e76f1a84fb390-ts%3D1497894598!

Partition 1 has an internal error, so although it is still the master, it is 
transitioned to the "Error" state. Meanwhile, its version needs to be upgraded.
Partition 2 is changed to "R/W", probably because partition 1 is no longer 
serving as an "R/W" node.
As for partition 3, all of its states are changed.

The difficulties of supporting this request with the current Helix system 
include, but are not limited to, the following aspects.

*It is hard to define a state machine or transition constraints for all state 
models using a single state model*

For a dynamic state, a pre-defined state model won't work at all.

But even if we only consider regular states, there is still a problem. Based on 
our existing framework, supporting such a scenario requires a very complex 
state model that combines all 3 models. The result will be 2 * 3 * 4 = 24 
states and around 80 possible transition paths, which will be very hard to 
code.

*State transitions will potentially be inefficient*

Imagine that each state transition message contains the delta of a single 
state. The messages would be as follows.

||Partitions||State transitions||
|R1|(Online, R/W, 1.0.1) → (Online, Error, 1.0.1)|
|R1|(Online, Error, 1.0.1) → (Online, Error, 1.0.2)|
|R2|(Online, Init, 1.0.1) → (Online, R/W, 1.0.1)|
|R3|(Offline, Init, 1.0.1) → (Online, Init, 1.0.1)|
|R3|(Online, Init, 1.0.1) → (Online, Ready, 1.0.1)|
|R3|(Online, Ready, 1.0.1) → (Online, Ready, 1.0.2)|

Obviously, this strategy increases traffic and makes the whole transition 
process much slower.
So a simpler design is for a single message to carry all the necessary 
information.

||Partitions||State transitions||
|R1|(Online, R/W, 1.0.1) → (Online, Error, 1.0.2)|
|R2|(Online, Init, 1.0.1) → (Online, R/W, 1.0.1)|
|R3|(Offline, Init, 1.0.1) → (Online, Ready, 1.0.2)|
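
As a rough illustration of the difference between the two message styles, the following minimal sketch collapses a chain of single-state deltas into one combined transition. The {{Delta}} and {{DeltaMerger}} types are hypothetical and only serve this example; they are not part of Helix.

```java
import java.util.List;

// Hypothetical types for illustration only; not part of Helix.
final class Delta {
  final List<String> from; // e.g. [Offline, Init, 1.0.1]
  final List<String> to;   // e.g. [Online, Init, 1.0.1]

  Delta(List<String> from, List<String> to) {
    this.from = from;
    this.to = to;
  }
}

final class DeltaMerger {
  /** Collapses a chain of deltas into one transition from the first "from" to the last "to". */
  static Delta merge(List<Delta> chain) {
    if (chain.isEmpty()) {
      throw new IllegalArgumentException("empty chain");
    }
    // Each delta must start where the previous one ended.
    for (int i = 1; i < chain.size(); i++) {
      if (!chain.get(i - 1).to.equals(chain.get(i).from)) {
        throw new IllegalArgumentException("deltas do not form a chain at index " + i);
      }
    }
    return new Delta(chain.get(0).from, chain.get(chain.size() - 1).to);
  }
}
```

Merging the three R3 deltas from the first table yields the single R3 transition shown in the second table.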

But this design brings other issues.

# When a participant gets a message, it may report the new states only after 
finishing all the changes. If one state transition takes considerably longer 
than the others, the whole process is blocked.
# The controller has less control over how a participant performs state 
transitions. This is a problem if a policy such as Helix State Transition 
Priority Support needs to be applied.
# The participant needs to inspect the message and compare states, which makes 
it hard to ensure backward compatibility.

*Helix is not able to calculate the best possible state for every state model*

With dynamic state, we allow the application to manage state transition. So the 
state model is not defined with a complete constraint and requirement. Helix 
cannot calculate the best possible states.

Moreover, even for a non-dynamic state, the application may want to trigger the 
transition based on some external factors. In this case, Helix only coordinates 
the state transition; it does not plan the best possible states.

In order to let the user define such states, we need to provide a new state 
model type. And Helix should be able to interpret the definition and generate 
transition messages correctly.

h2. Additional Case Study

h3. Ambry R/W State

In Ambry, a partition has an "R/W" state in addition to OnlineOffline state. So 
the partition can be "ONLINE:READ" or "ONLINE:WRITE".
The "R/W" state is for indicating whether this partition is for read-only or 
writable.
There may be state transitions as shown below.

* The first state transition is conducted by the Ambry application.
* The second one is regular state transition managed by Helix.

!https://documents.lucidchart.com/documents/e19ab04e-aa06-4ab3-9e57-cfe273554fa1/pages/0_0?a=1272&x=13&y=824&w=647&h=352&store=1&accept=image%2F*&auth=LCA%206f398192ee541fa7519801ec33ae2ae4f6e02bef-ts%3D1496770738!

Note that the "R/W" state model is still a regular model, which means the 
states are pre-defined and the constraints are still defined as for a regular 
state.

h3. Pinot Version State

In Pinot, when a new version of data is ready, the system replaces old 
partitions with the new ones.
If the replacement is done one partition at a time, any read issued during the 
upgrade period will get inconsistent data.
Currently, the application needs a workaround for data consistency.

* Option 1: create a new resource with the latest version and replace the old 
resource after the new one is ready.
* Option 2: maintain a customized configuration or property store item for 
managing versions inside the application.

So the expected state transitions of a Pinot section are as follows.

!https://documents.lucidchart.com/documents/e19ab04e-aa06-4ab3-9e57-cfe273554fa1/pages/0_0?a=1272&x=18&y=1264&w=647&h=352&store=1&accept=image%2F*&auth=LCA%20b14e05d32645f6bf9f6ef178ca4e55853787f921-ts%3D1496770738!
 
It would be very helpful to extend Helix state transition system to support 
multiple state models.

h2. Proposal

In this document, we propose to extend existing state transition system in 
Helix. Basically, Helix should allow one resource/partition to have more than 
one state. And the states are managed separately based on different state 
models.

State transitions shall follow these rules:

* If only one state is changed, the state transition logic stays the same as 
what we have today.
* States have different priorities. If more than one state is changed, Helix 
will finish the transitions one by one based on state model priority. 
Transition messages are sent one after another.
* States may have dependencies. If state B depends on state A, a transition on 
state B requires state A's information, and if state A is in an error state, 
the state B transition is suspended. Independent state transitions do not 
block each other.
* If a state is managed by the application, Helix won't calculate its ideal 
state. The application needs to specify the desired state in the resource 
configuration.
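
The priority and dependency rules above could be sketched roughly as follows. This is only an illustration under assumed names: {{StateChange}}, {{TransitionPlanner}}, and the convention that a smaller number means a higher priority are all hypothetical and not part of Helix.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Set;

// Hypothetical types for illustration only; none of these exist in Helix.
final class StateChange {
  final String model;     // state model name, e.g. "Version"
  final int priority;     // assumption: smaller value = higher priority
  final String dependsOn; // state model this one depends on, or null if independent

  StateChange(String model, int priority, String dependsOn) {
    this.model = model;
    this.priority = priority;
    this.dependsOn = dependsOn;
  }
}

final class TransitionPlanner {
  /**
   * Returns the state models whose transition messages may be sent, in sending order.
   * A change is suspended (left out) when the model it depends on is in error.
   */
  static List<String> plan(List<StateChange> changes, Set<String> modelsInError) {
    List<StateChange> runnable = new ArrayList<>();
    for (StateChange change : changes) {
      boolean suspended =
          change.dependsOn != null && modelsInError.contains(change.dependsOn);
      if (!suspended) {
        runnable.add(change);
      }
    }
    // List.sort is stable, so changes with equal priority keep their input order.
    runnable.sort(Comparator.comparingInt(change -> change.priority));
    List<String> order = new ArrayList<>();
    for (StateChange change : runnable) {
      order.add(change.model);
    }
    return order;
  }
}
```

With MasterSlave at priority 0 and Version (priority 1) and ReadWrite (priority 2) both depending on it, the plan is MasterSlave, Version, ReadWrite. If MasterSlave is in error, the two dependent transitions are suspended.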

h3. State Dependency and Priority

A complete multi-state definition will be a hierarchical system. The states 
are divided into different levels. First-level states are the most important 
ones, and there might be additional second-level or third-level states related 
to the higher-level states. States in the same level are independent of each 
other.

For example, Admins may set master/slave (MS) state as the first level state. 
And both R/W state and Version shall depend on MS state.
That means transitions in R/W state or Version will require MS state as the 
input. And if MS state is in error condition, no transition in the other states 
is allowed.
But R/W state and Version can be changed in parallel.

!https://documents.lucidchart.com/documents/e19ab04e-aa06-4ab3-9e57-cfe273554fa1/pages/0_0?a=1841&x=1148&y=409&w=602&h=248&store=1&accept=image%2F*&auth=LCA%2015c88906f795d7ec7f0abc5ab94c072e632b07a1-ts%3D1497894598!

In addition to dependencies, Admins will be able to specify priorities for all 
related state models. Basically, if multiple states are changed concurrently, 
Helix will process high priority state transition first. As shown in the 
following figure, both R/W state and version are the level 2 states. But if 
Admins configure version to have higher priority, Helix will schedule it before 
R/W state.

!https://documents.lucidchart.com/documents/e19ab04e-aa06-4ab3-9e57-cfe273554fa1/pages/0_0?a=2379&x=794&y=1242&w=563&h=396&store=1&accept=image%2F*&auth=LCA%20ac3c5982518fe19a45b76afb89af1e8277c1f81a-ts%3D1497894598!
 
h3. Application Managed State and Dynamic State

The nature of a dynamic state makes it an application managed state by 
default. However, not all application managed states are dynamic states.

!https://documents.lucidchart.com/documents/e19ab04e-aa06-4ab3-9e57-cfe273554fa1/pages/0_0?a=2379&x=799&y=897&w=693&h=316&store=1&accept=image%2F*&auth=LCA%209e8a39c3311eb426243f288a336f0f90d811198c-ts%3D1497894598!

If we check the state model definition from different aspects, the differences 
between regular state model and new state models are obvious.
Details about dynamic state design, and how to extend current state model 
interface will be discussed as a separate topic. In this document, we only 
consider the simplest design for supporting the basic features. More 
information is discussed in the "Design Details" section.
 
|| ||States||Transition Constraint||Next State||
|Regular state definition|Fixed|State machine|Helix decides new state|
|Dynamic state definition|Dynamic|Check based on regex, or no check|Application decides new state|
|Application managed state definition|Both|Both|Application decides new state|

h3. Multiple State Models vs. Single State Model

Shall we use a separate state model for every state, or define one large state 
model that is able to handle all state transitions?

* In the first option, state models are completely treated equally. So state 
dependencies have to be resolved by Helix. But it's easier for the application 
developers to define these state models.
* In the second option, states relationship can be defined and resolved in the 
state model class. So the management logic will be simplified. But defining 
constraints and state transition rules will be difficult for the application 
developers.

In this design document, we will take the first option for limiting the change 
and ensuring backward compatibility. But we may consider the other option in 
the future.

The whole feature implementation is divided into 2 phases.

# Support secondary states (described in "The First Milestone").
# Fully support multi-states with hierarchy structure and all feature support.

h2. The First Milestone

As the first milestone, we plan to add secondary states support as an optional 
feature.

The reasons we don't implement the whole feature in one step are:

# Limit change for faster iteration.
# Ensure backward compatibility until a major version upgrade. Legacy 
participants won't be able to handle complicated multi-state transition 
requests.

h3. Secondary States

* The secondary states are configured separately but in the same way as the 
main state.
* The secondary states shall have different state models to avoid conflict. 
Also, they should have different state models from the main state model.
* The secondary states will be level 2 states, while the main state is regarded 
as the level 1 state. Admins will be able to configure the secondary states as 
dynamic states. All secondary states have the same priority.
* Helix doesn't calculate the ideal state for the secondary states. Only an 
update to the resource configuration triggers a secondary state transition. The 
state model can be a regular one with constraints or a dynamic state model.

The following figure demonstrates the workflow of secondary state registration 
and transition.
Note that, except for transition triggering, the other major steps are the same 
as in our existing state transition mechanism.

!https://documents.lucidchart.com/documents/5217edea-896d-4ddc-8a93-c31eeb273f38/pages/0_0?a=1342&x=-57&y=36&w=2142&h=968&store=1&accept=image%2F*&auth=LCA%206feef23b2ef39051ecf1880f89cb07ecf5808570-ts%3D1497894601!


was (Author: jiajunwang):
h1. Proposal
In this document, we propose to introduce an additional layer of state 
mechanism into Helix.
Considering the Pinot case, what they need is a transition from "ONLINE:V1" to 
"ONLINE:V2". Note that the "V1" to "V2" transition runs in parallel with the 
existing state transition. It is special in the following ways:
# The state is not pre-defined. New version numbers may appear after the state 
transition model is registered.
# Helix won't understand the internal logic of this additional state, so there 
is no way for Helix to compute the ideal state automatically. It relies on the 
application's configuration to update this state.

We will take the above 2 points as assumptions.

As for the expected workflow, still taking the Pinot partition version as an 
example:
# Pinot needs to register its own logic for version upgrade, which means a new 
state model (factory name).
# Helix provides an API to configure resources with an additional state 
("VERSION").
# Upon a resource configuration change, the controller triggers a state 
transition and sends messages to the participants.
# Participants handle the message by calling the corresponding state transition 
methods, then update the current state.
# The controller listens for current state changes. On any update, it processes 
the change and reflects it in the external view.
 
h1. Design
h2. Register Associate States Model / Factory
Note that since associate states may not be pre-defined, 
defaultTransitionHandler has to be implemented.
h3. State Model Factory:

import org.apache.helix.NotificationContext;
import org.apache.helix.model.Message;
import org.apache.helix.participant.statemachine.StateModel;
import org.apache.helix.participant.statemachine.StateModelFactory;
import org.apache.helix.participant.statemachine.StateTransitionError;
import org.apache.helix.participant.statemachine.Transition;

public abstract class AssociateStateModelFactory extends StateModelFactory<AssociateStateModel> {
  ...
}

// Note: `logger` is assumed to be a class-level logger declared elsewhere.
public abstract class AssociateStateModel extends StateModel {
  static final String DEFAULT_INITIAL_STATE = "UNKNOWN";
  protected String _currentState = DEFAULT_INITIAL_STATE;

  public String getCurrentState() {
    return _currentState;
  }

  // !!!!!!!!!!! Changed part !!!!!!!!!!!! //
  // "from" and "to" are placeholders; associate states may not be pre-defined.
  @Transition(from = "from", to = "to")
  public void defaultTransitionHandler(Message message, NotificationContext context) {
    logger.error("Default transition handler. The idea is to invoke this if no "
        + "transition method is found. To be implemented");
  }

  public boolean updateState(String newState) {
    _currentState = newState;
    return true;
  }

  public void rollbackOnError(Message message, NotificationContext context,
      StateTransitionError error) {
    logger.error("Default rollback method invoked on error. Error Code: " + error.getCode());
  }

  public void reset() {
    logger.warn("Default reset method invoked. Either because the process longer "
        + "own this resource or session timedout");
  }

  @Transition(to = "DROPPED", from = "ERROR")
  public void onBecomeDroppedFromError(Message message, NotificationContext context)
      throws Exception {
    logger.info("Default ERROR->DROPPED transition invoked.");
  }
}

h2. Resource Configuration
h3. Resource config with associate state VERSION:

{
  "id":"Test_Resource"
  ,"simpleFields":{
  }
  ,"listFields":{
    "ASSOCIATE_STATE_MODEL_DEF_REFS": [
        "VERSION"
    ],
    "ASSOCIATE_STATE_MODEL_FACTORY_NAMES": [
        "DEFAULT"
    ],
    "ASSOCIATE_STATES": [
        "1.0.1"
    ]
  }
  ,"mapFields":{
  }
}

h2. Additional APIs to configure associate states

/**
 * Set configuration values
 * @param scope
 * @param listProperties
 */
void setConfig(HelixConfigScope scope, Map<String, List<String>> listProperties);
  
/**
 * Get configuration values
 * @param scope
 * @param keys
 * @return configuration values ordered by the provided keys
 */
Map<String, List<String>> getConfig(HelixConfigScope scope, List<String> keys);

h2. Partition with the Associate States on the Participant State And EV
h3. Current States:

{
  "id":"example_resource"
  ,"simpleFields":{
    "STATE_MODEL_DEF":"MasterSlave"
    ,"STATE_MODEL_FACTORY_NAME":"DEFAULT"
    ,"BUCKET_SIZE":"0"
    ,"SESSION_ID":"25b2ce5dfbde0fa"
  }
  ,"listFields":{
    "ASSOCIATE_STATE_MODEL_DEF_REFS": [
        "VERSION"
    ],
    "ASSOCIATE_STATE_MODEL_FACTORY_NAMES": [
        "DEFAULT"
    ]
  }
  ,"mapFields":{
    "example_resource_0":{
      "CURRENT_STATE":"MASTER"
      ,"ASSOCIATE_STATES":"1.0.1" // Split by ":" if multiple associate states are set
      ,"INFO":""
    }
  }
}

h3. Associate state in External View:

{
  "id":"example_resource"
  ,"simpleFields":{
    "STATE_MODEL_DEF_REF":"MasterSlave"
  }
  ,"listFields":{
    "ASSOCIATE_STATE_MODEL_DEF_REFS": [
        "VERSION"
    ]
  }
  ,"mapFields":{
    "example_resource_0":{
      // Given more than one associate state, they will be split by ":". The main state will always be the first state.
      "lca1-app0004.stg.linkedin.com_11932":"MASTER:1.0.1"
      ,"lca1-app0048.stg.linkedin.com_11932":"SLAVE:1.0.0"
    }
  }
}
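
A reader of the external view would then need to split such a combined value back into the main state and the associate states. A minimal sketch follows, assuming that no state name itself contains ":". The {{CompositeState}} helper is hypothetical and not part of the Helix API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper for illustration only; not part of the Helix API.
final class CompositeState {
  final String mainState;
  final List<String> associateStates;

  private CompositeState(String mainState, List<String> associateStates) {
    this.mainState = mainState;
    this.associateStates = associateStates;
  }

  /** Parses a value such as "MASTER:1.0.1"; the main state is always the first element. */
  static CompositeState parse(String value) {
    // Assumption: ":" never appears inside a state name.
    String[] parts = value.split(":");
    List<String> associates =
        new ArrayList<>(Arrays.asList(parts).subList(1, parts.length));
    return new CompositeState(parts[0], Collections.unmodifiableList(associates));
  }
}
```

For the sample above, parsing "MASTER:1.0.1" gives the main state MASTER and the single associate state 1.0.1. A value without ":" gives an empty associate state list.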

h2. Helix Controller Updates
On resource configuration changes:
* Fill ClusterDataCache with associate states and related state models / 
factories from resource configuration.
* Merge associate states to BestPossibleStateOutput.
* Fill associate states and related state models / factories into the message 
before sending to participants.

Note that batching all concurrent state changes in one message helps to 
avoid parallel state transitions. If any error happens, processing is stopped 
immediately so as to avoid further issues. This also means participants should 
handle multiple state transitions sequentially.
An alternative design is sending separate messages on any of the states' 
change. This design implies that states have no dependency. And there is no 
guarantee that the main state will be handled before other associate states. It 
might be helpful in some conditions. But overall, this alternative design 
brings more risk than benefit.

On participant state changes:
* Besides existing read, also read and fill associate states. Then fill EV with 
complete states information.

h2. Helix Participant Updates
On receiving state transition message:
* Read main state and associate states, trigger state transitions in order.
* Do main state transition first, then do associate states transitions one by 
one.
** If any state transition failed, set an error state to cover all states and 
stop processing. User should fix problem and reset to initial states.
** If state transition succeeds, update current state.
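
The participant-side handling described above could look roughly like the following sketch. {{SequentialTransitions}} and the handler callback are hypothetical names for this example and are not part of Helix. The caller is assumed to pass the target states in processing order, with the main state first.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BiPredicate;

// Hypothetical sketch for illustration only; not part of Helix.
final class SequentialTransitions {
  static final String ERROR = "ERROR";

  /**
   * Applies target states one by one in the iteration order of orderedTargets.
   * handler.test(model, targetState) returns true when the transition succeeds.
   * On the first failure, every state is set to ERROR and processing stops.
   */
  static Map<String, String> apply(Map<String, String> current,
      Map<String, String> orderedTargets, BiPredicate<String, String> handler) {
    Map<String, String> result = new LinkedHashMap<>(current);
    for (Map.Entry<String, String> target : orderedTargets.entrySet()) {
      boolean succeeded;
      try {
        succeeded = handler.test(target.getKey(), target.getValue());
      } catch (RuntimeException ex) {
        succeeded = false; // treat an exception as a failed transition
      }
      if (!succeeded) {
        // Cover all states with the error state and skip the remaining transitions.
        result.replaceAll((model, state) -> ERROR);
        for (String model : orderedTargets.keySet()) {
          result.put(model, ERROR);
        }
        return result;
      }
      result.put(target.getKey(), target.getValue()); // update current state
    }
    return result;
  }
}
```

Passing a {{LinkedHashMap}} with the main state inserted first keeps the "main state first, then associate states" order.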

h1. Alternative options
h2. Introducing UPGRADING State for additional state transitions
A new internal state, UPGRADING, is added for partition upgrades.
The upgrade happens when the partition transitions "to" or "from" the UPGRADING 
state.
Note that the application has the freedom to define whether UPGRADING is a 
special online state or not.
In the Pinot case, a partition that is upgrading (even before it is back to 
ONLINE) might be an active partition.
The problem with this new state is that it only works for a single additional 
state.
Once there is more than one additional state to take care of, the UPGRADING 
state is not enough.
h2. Rely on resetting partition to load new states
Whenever a new version is available, the application updates the version for 
the resource and then resets all partitions.
During the state transition from offline to online, participants read the new 
version and apply it to the related partitions.
The problem with this method is that a change in the additional state affects 
the main state. A partition will be offline for a while, and during this 
period even the old version will not be available.
h2. Application registers message handler to handle upgrading message
In this method, the controller is only responsible for sending upgrade requests 
to participants. Participants are responsible for reporting their local 
versions.
Since the controller has no clue about how to control the additional state, the 
application needs to handle all of the logic.
h1. Validation
Add unit tests / integration tests to validate associate states.
Verify Pinot Version use case.

> Extend Helix to Support Resource with Multiple States
> -----------------------------------------------------
>
>                 Key: HELIX-659
>                 URL: https://issues.apache.org/jira/browse/HELIX-659
>             Project: Apache Helix
>          Issue Type: New Feature
>          Components: helix-core
>    Affects Versions: 0.6.x
>            Reporter: Jiajun Wang
>
> h1. Problem Statement
> h2. Single State Model vs. Multiple State Models
> Currently, each Helix resource is associated with a single state model, and 
> each replica of a partition can only be in one of the states defined in 
> the state model at any time. Helix manages state transitions based on this 
> single state model.
> !https://documents.lucidchart.com/documents/e19ab04e-aa06-4ab3-9e57-cfe273554fa1/pages/0_0?a=2416&x=-11&y=71&w=517&h=198&store=1&accept=image%2F*&auth=LCA%20313ced8fb855e8fc1a7043f7fe91cdfa15fffb6b-ts%3D1498857664!
> However, in many scenarios, resources could be too complicated to be modeled 
> by a single state model.
> As an example, partitions from a resource could be described in different 
> dimensions: SlaveMaster state, Read or Write state and its versions. They 
> represent different dimensions of the overall resource status. States from 
> each dimension are based on different state models. Note that we have state 
> machines simplified in this document.
> !https://documents.lucidchart.com/documents/e19ab04e-aa06-4ab3-9e57-cfe273554fa1/pages/0_0?a=2416&x=-71&y=66&w=1822&h=308&store=1&accept=image%2F*&auth=LCA%2041fa743ba130f41786dee3527de6206cebdd4534-ts%3D1498857664!
> The basic idea is that states in these 3 dimensions are in parallel and can 
> be changed independently. For instance, R/W state may be changed without 
> updating slave/master state.
> h2. Finite State Machine vs. Dynamic State Model
> In addition, Helix employs a finite state machine to define a state model. 
> However, some state models cannot be easily modeled by a finite state machine 
> with fixed states, for example, the versions. We call such a state model a 
> dynamic state model. It is read, set, and understood by the application. 
> We will need to extend Helix to support such dynamic state models. Note that 
> Helix should not and will not be able to calculate the best possible dynamic 
> states.
> The version of a software is one of the best examples to understand dynamic 
> state.
> Let's consider one application that is deployed on multiple nodes, which work 
> together as a cluster. The green node works as the master, and all dark blue 
> nodes are slaves. When Admins upgrade the service from 1.0.0 to 1.1.0, they 
> need to ensure upgrading all nodes to the new version and then claim upgrade 
> is done. After the upgrade process, it is important to ensure that all 
> software versions are consistent.
> If Helix framework is leveraged to support upgrading the cluster, it will 
> help to simplify application logic and ensure consistency. For instance, the 
> service (cluster) itself is regarded as the resource. And each node is mapped 
> as a partition. Then upgrading is simply a state transition. Admins can check 
> external view for ensuring consistency.
> Note that during this version upgrade, the master node is still master node, 
> and slave nodes are still slave nodes. So the version state is parallel to 
> the other states.
> !https://documents.lucidchart.com/documents/e19ab04e-aa06-4ab3-9e57-cfe273554fa1/pages/0_0?a=2066&x=1466&y=922&w=560&h=455&store=1&accept=image%2F*&auth=LCA%20fa3d8fc0d113a82f4e94b127161cf91818a2fe64-ts%3D1497894598!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
