[
https://issues.apache.org/jira/browse/HELIX-26?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
kishore gopalakrishna updated HELIX-26:
---------------------------------------
Fix Version/s: 0.6.1-incubating
> Better support for handling network partition and process freeze
> ----------------------------------------------------------------
>
> Key: HELIX-26
> URL: https://issues.apache.org/jira/browse/HELIX-26
> Project: Apache Helix
> Issue Type: Improvement
> Reporter: kishore gopalakrishna
> Fix For: 0.6.1-incubating
>
>
> Handling network partitions is tricky in distributed systems. ZooKeeper lets
> us solve this to some degree through heartbeats, but that is not sufficient
> in large-scale systems with many nodes. One of the problems is that once the
> client detects a disconnect, which happens on the client side, the options
> are:
> 1. Put yourself in a paused state until you reconnect.
> 2. Continue whatever you are doing until notified of session expiry.
> Unfortunately, 1 is too aggressive and 2 is too passive. Since Helix comes
> with a centralized controller, it is possible to have a middle-ground
> solution where, once the participant receives a disconnect event, it asks
> the coordinator(s)/peers whether it can continue operating.
> The challenge here is for the node to detect whether it belongs to the same
> partition as the coordinator. So its goal is to reach the controller; if it
> cannot reach the controller, it has to disable/fence itself.
> As of now, Helix simply reports whether it is disconnected from the
> cluster, and the user can choose either 1) or 2).
> This JIRA aims to investigate better ways to enhance network partition
> detection.
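
The middle-ground option described in the issue (on disconnect, try to reach the controller and fence only if that fails) could be sketched roughly as follows. This is an illustration only, not Helix code: `DisconnectPolicy`, `Action`, the endpoint list, and the `probe` callback are all hypothetical names, and the sketch assumes the participant has some out-of-band way to contact a controller that does not go through ZooKeeper.

```java
import java.util.List;
import java.util.function.Predicate;

/**
 * Hypothetical sketch of a "check with the controller before fencing" policy.
 * None of these names are part of the Helix API.
 */
final class DisconnectPolicy {

    /** What the participant should do after a ZooKeeper disconnect. */
    enum Action { CONTINUE, FENCE }

    // Assumed: addresses where a controller can be reached without ZooKeeper.
    private final List<String> controllerEndpoints;

    // Assumed: returns true if the given endpoint answered within a short timeout.
    private final Predicate<String> probe;

    DisconnectPolicy(List<String> controllerEndpoints, Predicate<String> probe) {
        this.controllerEndpoints = controllerEndpoints;
        this.probe = probe;
    }

    /**
     * Called when the client reports a disconnect (before session expiry).
     * Continue if any controller endpoint is reachable; otherwise fence.
     */
    Action onDisconnect() {
        for (String endpoint : controllerEndpoints) {
            try {
                if (probe.test(endpoint)) {
                    // Same side of the partition as a controller.
                    return Action.CONTINUE;
                }
            } catch (RuntimeException e) {
                // A failing probe counts as "unreachable"; try the next endpoint.
            }
        }
        // No controller reachable (or none configured): fail safe.
        return Action.FENCE;
    }
}
```

In a real participant the probe might be a TCP connect or a health-check request with a short timeout; which mechanism is appropriate, and whether peers should also be consulted, is exactly what this issue leaves open.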
--
This message is automatically generated by JIRA.