[ https://issues.apache.org/jira/browse/HELIX-595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinayak Borkar updated HELIX-595:
---------------------------------
    Description: 
In my setup I have a resource that has about 160 partitions. The resource uses 
the MasterSlave state model. The partitions have been configured to have just 1 
replica. For some partitions (about 5), I am observing that there are two 
replicas, one in MASTER mode and one in SLAVE mode. In addition, I am observing 
an imbalance with respect to the MASTER replica placement on the machines I 
have.

In discussions with Kishore, the conclusion was that a deadlock occurs: as 
Helix makes state transitions to correct the imbalance, it reaches a state 
where any further transition would violate the constraints of the state model.

The MasterSlave state model allows at most one MASTER and at most R SLAVES (in 
my case R = 1).

Say the current MASTER of a partition is on hostA, but Helix wants to move it 
to hostB. Helix would run the following transitions:

hostA: t1(M -> S), t2(S -> O)
hostB: t3(O -> S), t4(S -> M)

If t1 and t2 happen before t3, then Helix would eventually achieve the correct 
placement of the master on hostB. However, if t3 runs first, then hostB will 
have a SLAVE of the partition while hostA still has MASTERship. Once this 
happens, every transition that needs to be performed violates a state machine 
constraint: t1 would create a second SLAVE and t4 would create a second MASTER. 
So we end up with a MASTER on hostA and a SLAVE on hostB for this partition.
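
The ordering argument above can be checked with a small toy model. This is only 
a sketch of the reasoning in this report, not Helix's controller logic: the 
names UPPER_BOUNDS, TRANSITIONS, is_allowed and run are hypothetical, and the 
model assumes a transition is permitted only if the resulting placement has at 
most one MASTER and at most R = 1 SLAVE.

```python
from collections import Counter

# Upper bounds taken from the report: at most one MASTER, at most R SLAVEs (R = 1).
UPPER_BOUNDS = {"M": 1, "S": 1}

# The four transitions named in the report: (host, from_state, to_state).
TRANSITIONS = {
    "t1": ("hostA", "M", "S"),
    "t2": ("hostA", "S", "O"),
    "t3": ("hostB", "O", "S"),
    "t4": ("hostB", "S", "M"),
}


def is_allowed(placement, name):
    """True if the transition applies now and keeps every state within its bound."""
    host, src, dst = TRANSITIONS[name]
    if placement[host] != src:
        return False  # the host is not in the transition's source state yet
    after = dict(placement)
    after[host] = dst
    counts = Counter(after.values())
    return all(counts[state] <= bound for state, bound in UPPER_BOUNDS.items())


def run(order):
    """Repeatedly apply the first allowed pending transition, in the given
    preference order, until nothing is pending or nothing is allowed."""
    placement = {"hostA": "M", "hostB": "O"}
    pending = list(order)
    progressed = True
    while pending and progressed:
        progressed = False
        for name in pending:
            if is_allowed(placement, name):
                placement[TRANSITIONS[name][0]] = TRANSITIONS[name][2]
                pending.remove(name)
                progressed = True
                break  # restart the scan from the front of the pending list
    return placement, pending


if __name__ == "__main__":
    print(run(["t1", "t2", "t3", "t4"]))  # hostA first: master moves to hostB
    print(run(["t3", "t1", "t2", "t4"]))  # t3 first: t1, t2 and t4 stay pending
```

In this model, running t3 first leaves hostA in M and hostB in S with t1, t2 
and t4 all blocked, which matches the MASTER/SLAVE pair observed for the 
affected partitions.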

You can find the ZK logs corresponding to the MESSAGES for such a partition 
here: http://pastebin.com/zqqSk4MA

Please let me know what other details would be necessary to get to the bottom 
of this issue.

  was:
In my setup I have a resource that has about 160 partitions. The resource uses 
the MasterSlave state model. The partitions have been configured to have just 1 
replica. For some partitions (about 5), I am observing that there are two 
replicas, one in MASTER mode and one in SLAVE mode. In addition, I am observing 
an imbalance with respect to the MASTER replica placement on the machines I 
have.

In discussions with Kishore, the conclusion was that there is a deadlock 
occurring as Helix makes state transition to rebalance the imbalance, and 
reaching a state where any further transition would violate the constraints of 
the state model.

The MasterSlave state model allows at most one MASTER and at most R SLAVES (in 
my case R = 1).

Say the current MASTER of a partition is on hostA, but Helix wants to move it 
to hostB. Helix would run the following transitions:

hostA: t1(M -> S), t2(S -> O)
hostB: t3(O -> S), t4(S -> M)

If t1 and t2 happen before t3, then eventually, helix would achieve the correct 
placement of the master on hostB. However, if t3 runs first, then hostB will 
have a SLAVE of the partition while hostA still have MASTERship. Once this 
happens, every transition that needs to be performed violates a state machine 
constraint. So we end up with a MASTER on hostA and a SLAVE on hostB for this 
partition.

You can find the ZK logs corresponding to the MESSAGES for such a partition 
here: http://pastebin.com/zqqSk4MA

Please let me know what other details would be necessary to get to the bottom 
of this issue.


> Possible deadlock in state transition sequence
> ----------------------------------------------
>
>                 Key: HELIX-595
>                 URL: https://issues.apache.org/jira/browse/HELIX-595
>             Project: Apache Helix
>          Issue Type: Bug
>            Reporter: Vinayak Borkar
>
> In my setup I have a resource that has about 160 partitions. The resource 
> uses the MasterSlave state model. The partitions have been configured to have 
> just 1 replica. For some partitions (about 5), I am observing that there are 
> two replicas, one in MASTER mode and one in SLAVE mode. In addition, I am 
> observing an imbalance with respect to the MASTER replica placement on the 
> machines I have.
> In discussions with Kishore, the conclusion was that a deadlock occurs: as 
> Helix makes state transitions to correct the imbalance, it reaches a state 
> where any further transition would violate the constraints of the state 
> model.
> The MasterSlave state model allows at most one MASTER and at most R SLAVES 
> (in my case R = 1).
> Say the current MASTER of a partition is on hostA, but Helix wants to move it 
> to hostB. Helix would run the following transitions:
> hostA: t1(M -> S), t2(S -> O)
> hostB: t3(O -> S), t4(S -> M)
> If t1 and t2 happen before t3, then Helix would eventually achieve the 
> correct placement of the master on hostB. However, if t3 runs first, then 
> hostB will have a SLAVE of the partition while hostA still has MASTERship. 
> Once this happens, every transition that needs to be performed violates a 
> state machine constraint: t1 would create a second SLAVE and t4 would create 
> a second MASTER. So we end up with a MASTER on hostA and a SLAVE on hostB for 
> this partition.
> You can find the ZK logs corresponding to the MESSAGES for such a partition 
> here: http://pastebin.com/zqqSk4MA
> Please let me know what other details would be necessary to get to the bottom 
> of this issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
