[
https://issues.apache.org/jira/browse/CONNECTORS-781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13826293#comment-13826293
]
Karl Wright commented on CONNECTORS-781:
----------------------------------------
Making good progress; aside from potential bugs, (1) is done except for minor
details. However, now I think we have to consider how cluster operations will
actually work for failover etc.
Basically, the way I've structured it, when each node undertakes a transient
operation, the database keeps track of which node did that. Recovery from a
failure of the node therefore requires either:
(a) Restart of the failed node
(b) Cleanup of the failed node's dangling bits (there is currently no direct
utility to do that, but one could easily be created)
(c) Cleanup of the whole cluster (involving global shutdown, cleanup of
cluster-wide dangling bits, which would also need a direct utility, and restart
of new nodes)
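
To make options (b) and (c) concrete, here is a minimal in-memory sketch of the
ownership bookkeeping described above. It is not the actual ManifoldCF schema
or code: the class name TransientOwnership, its methods, and the operation and
node ids are all hypothetical, and a TreeMap stands in for the database table.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the database table that records which node
// owns each in-progress (transient) operation.
final class TransientOwnership {
    // operation id -> id of the node currently performing it
    private final Map<String, String> ownerByOperation = new TreeMap<>();

    // A node records that it has started a transient operation.
    void claim(String operationId, String nodeId) {
        ownerByOperation.put(operationId, nodeId);
    }

    // Normal completion: the operation is no longer transient.
    void complete(String operationId) {
        ownerByOperation.remove(operationId);
    }

    boolean isClaimed(String operationId) {
        return ownerByOperation.containsKey(operationId);
    }

    // Option (b): release only what one failed node left dangling.
    // Returns the released operation ids in sorted order.
    List<String> cleanupNode(String failedNodeId) {
        List<String> released = new ArrayList<>();
        ownerByOperation.entrySet().removeIf(entry -> {
            if (entry.getValue().equals(failedNodeId)) {
                released.add(entry.getKey());
                return true;
            }
            return false;
        });
        return released;
    }

    // Option (c): release everything, assuming all nodes are shut down.
    List<String> cleanupAll() {
        List<String> released = new ArrayList<>(ownerByOperation.keySet());
        ownerByOperation.clear();
        return released;
    }
}
```

Option (a) needs no separate method in this sketch: a restarted node would
simply run the equivalent of cleanupNode on its own id before taking new work.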
My problem is that I'm not sure what the best model is for completing this kind
of failover. Critically, failure of a node requires action for the cluster to
become fully functional again. (It will remain functional in many ways, but
the transient actions the failed node was doing will be suspended, and no other
nodes will be able to pick them up until the cleanup occurs.) Other nodes
might be able to fire off cleanup in background, but then they'd need to
reliably know that the other node had died and wasn't coming back.
Coordinating what nodes are active through ZooKeeper might allow us to detect
when a node has fallen off the grid. But it would have to be a two-way street;
the node will have to know enough to shut itself down rather than rejoin should
the connection be re-established. There's also an issue of what to do if the node
undertaking the cleanup fails; sooner or later we could get to the state where
we've still got dangling stuff and all remaining nodes have lost knowledge of
the prior existence of the nodes they need to clean up after.
I have a vague idea of two ZooKeeper nodes, one which has a persistent child
for every agents node that started, and one that has transient (ephemeral) children
representing every agents node that is currently alive. It should be possible
for any surviving or restarted node, therefore, to undertake appropriate
cleanup operations based on this information.
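
The two-registry idea can be sketched as a set difference. The code below is
only an illustration under stated assumptions: it does not call ZooKeeper, the
class NodeRegistry and its method names are hypothetical, and two in-memory
sets stand in for the persistent children and the transient children.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical model of the two registries; not ZooKeeper client code.
final class NodeRegistry {
    // Stands in for the persistent children: every node that ever started
    // and has not yet been cleaned up after.
    private final Set<String> everStarted = new TreeSet<>();
    // Stands in for the transient children: nodes currently alive.
    private final Set<String> alive = new TreeSet<>();

    // A node registers itself in both places when it starts.
    void nodeStarted(String nodeId) {
        everStarted.add(nodeId);
        alive.add(nodeId);
    }

    // In a real deployment the coordination service would drop the
    // transient child when the node's session ends; here it is explicit.
    void sessionLost(String nodeId) {
        alive.remove(nodeId);
    }

    // Nodes that started at some point but are no longer alive.
    Set<String> nodesNeedingCleanup() {
        Set<String> dead = new TreeSet<>(everStarted);
        dead.removeAll(alive);
        return dead;
    }

    // Forget a dead node only after its dangling work has been released.
    // Returns false (and changes nothing) if the node is still alive.
    boolean cleanupFinished(String nodeId) {
        if (alive.contains(nodeId)) {
            return false;
        }
        return everStarted.remove(nodeId);
    }
}
```

Because the persistent entry is removed only once cleanup has finished, a
cleaner that dies halfway leaves the record in place, and any surviving or
restarted node calling nodesNeedingCleanup() would still see the dead node.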
Any comments or suggestions?
> Fault-Tolerant Setup for ManifoldCF Agent.
> ------------------------------------------
>
> Key: CONNECTORS-781
> URL: https://issues.apache.org/jira/browse/CONNECTORS-781
> Project: ManifoldCF
> Issue Type: Improvement
> Components: Framework agents process, Framework core, Framework
> crawler agent
> Affects Versions: ManifoldCF 1.5
> Reporter: Swami Rajamohan
> Assignee: Karl Wright
> Labels: agents, crawler, fault-tolerance
> Fix For: ManifoldCF 1.5
>
>
> It should be possible to set up ManifoldCF as a Fault-Tolerant infrastructure.
> The Agent component of ManifoldCF should support multiple instances of an
> agent crawling against a single crawl store, to be able to both distribute
> (share) the crawl load as well as to be able to pick up a request that gets
> abruptly terminated due to either partitioning of the instance or failure of
> the instance itself.
> Since there is a proposal to move to a store like Voldemort, it would be nice
> to be able to have a fault tolerant infrastructure.