I manually brought down Zookeeper and erased the Zookeeper data on purpose, as a test. The goal is to find a way to:
1) continue processing in the event of total Helix/Zookeeper failure and loss of state;
2) recover gracefully once Helix/Zookeeper are restarted.
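
To make the second goal concrete, here is a minimal sketch of the application-side approach raised in the quoted thread below: detect that the connection is no longer usable and build a fresh one, while local processing carries on. It is only an illustration under assumptions. `ClusterConnection` and `ReconnectSupervisor` are hypothetical names, not Helix APIs; in a real setup the factory would presumably create and connect a new HelixManager, and whether that is enough after a data wipe is exactly the open question.

```java
import java.util.function.Supplier;

/** Hypothetical stand-in for a cluster connection (e.g. a wrapper around a HelixManager). Not a Helix API. */
interface ClusterConnection {
    /** True only if the connection is up AND the cluster state it registered is still present. */
    boolean isHealthy();

    void close();
}

/** Hypothetical supervisor: rebuilds the connection when it is unhealthy, and never blocks local work. */
final class ReconnectSupervisor {
    private final Supplier<ClusterConnection> factory;
    private ClusterConnection current;
    private int rebuilds = 0;

    ReconnectSupervisor(Supplier<ClusterConnection> factory) {
        this.factory = factory;
        this.current = factory.get();
    }

    /** Call periodically. Returns true if the connection is healthy after this check. */
    boolean check() {
        if (current.isHealthy()) {
            return true;
        }
        try {
            current.close(); // discard the stale connection; errors here are not actionable
        } catch (RuntimeException ignored) {
        }
        try {
            current = factory.get(); // assumed to connect and re-register state from scratch
            rebuilds++;
        } catch (RuntimeException e) {
            return false; // coordination service still down: keep processing, retry on the next check
        }
        return current.isHealthy();
    }

    int rebuildCount() {
        return rebuilds;
    }
}
```

The application would call `check()` on a timer. A `false` result only means failover is unavailable for now; it does not stop the participant from doing its work.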

On Mar 4, 2013, at 12:44 AM, kishore g <[email protected]> wrote:

> Hi Ming,
>
> Helix depends on the data in zookeeper. It's ok for zookeeper to restart and
> Helix will handle it, but if zookeeper loses its state (data directory) then
> unfortunately we cannot recover the state.
>
> How did you lose the zookeeper cluster (including state)?
>
> thanks,
> Kishore G
>
>
> On Sun, Mar 3, 2013 at 8:58 PM, Ming Fang <[email protected]> wrote:
> Hi
>
> When I have a working Helix cluster, with all participants working fine, and
> for whatever reason I lose the entire Zookeeper cluster (including all state),
> what is the best way to handle this?
>
> Ideally I want all the participants to continue working, so that the only
> capability I would lose is Helix's ability to fail over.
> Upon restart of Zookeeper, the Controllers and Participants should register
> their latest state back to the new Zookeeper cluster.
> However, my tests thus far show that even though the HelixManagers
> reconnect, they do not write the necessary data into Zookeeper for the
> cluster to function correctly.
> For example, the external view callbacks are not showing the participants at
> all.
>
> Is this something Helix should handle, or is it up to the applications to
> detect the failure and then recreate new HelixManagers?
>
> Thanks
> --ming
