[jira] [Commented] (CONNECTORS-781) Fault-Tolerant Setup for ManifoldCF Agent.

Karl Wright (JIRA) Mon, 18 Nov 2013 23:56:17 -0800

    [ 
https://issues.apache.org/jira/browse/CONNECTORS-781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13826293#comment-13826293
 ]


Karl Wright commented on CONNECTORS-781:
----------------------------------------

Making good progress; aside from potential bugs, (1) is done except for minor 
details.  However, I now think we have to consider how cluster operations will 
actually work for failover and the like.

Basically, the way I've structured it, whenever a node undertakes a transient 
operation, the database keeps track of which node did that.  Recovery from a 
failure of the node therefore requires one of the following:

(a) Restart of the failed node
(b) Cleanup of the failed node's dangling bits (there is currently no direct 
utility to do that, but one could easily be created)
(c) Cleanup of the whole cluster (involving global shutdown, cleanup of 
cluster-wide dangling bits, which would also need a direct utility, and restart 
of new nodes)
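
The ownership tracking described above, and option (b) in particular, can be 
sketched roughly as follows.  This is only an in-memory illustration under 
stated assumptions: the `TransientOps` class and its method names are 
hypothetical stand-ins for database rows, not ManifoldCF code.

```python
from dataclasses import dataclass, field


@dataclass
class TransientOps:
    """Toy stand-in for the database records of who owns each transient operation."""

    # Maps operation id -> owning node id, or None when nobody holds it.
    owner: dict = field(default_factory=dict)

    def add(self, op_id):
        """Register an operation that no node has claimed yet."""
        self.owner.setdefault(op_id, None)

    def claim(self, op_id, node_id):
        """Let a node take an unowned operation; refuse if another node holds it."""
        if self.owner.get(op_id) is not None:
            return False
        self.owner[op_id] = node_id
        return True

    def cleanup_node(self, failed_node):
        """Option (b): release everything the failed node left dangling."""
        released = [op for op, node in self.owner.items() if node == failed_node]
        for op in released:
            self.owner[op] = None
        return sorted(released)
```

Until `cleanup_node` runs for the failed node, `claim` keeps refusing its 
operations to every other node, which is exactly the suspended state at issue 
here.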

My problem is that I'm not sure what the best model is for completing this kind 
of failover.  Critically, failure of a node requires action for the cluster to 
become fully functional again.  (It will remain functional in many ways, but 
the transient actions the failed node was doing will be suspended, and no other 
nodes will be able to pick them up until the cleanup occurs.)  Other nodes 
might be able to fire off the cleanup in the background, but then they'd need 
to reliably know that the failed node had died and wasn't coming back.

Coordinating what nodes are active through ZooKeeper might allow us to detect 
when a node has fallen off the grid.  But it would have to be a two-way street; 
the node would have to know enough to shut itself down rather than rejoin should 
the connection be re-established.  There's also an issue of what to do if the node 
undertaking the cleanup fails; sooner or later we could get to the state where 
we've still got dangling stuff and all remaining nodes have lost knowledge of 
the prior existence of the nodes they need to clean up after.
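
The "two-way street" rule can be illustrated with a tiny state machine.  This 
is a sketch under assumptions: the `AgentNode` class, its state names, and the 
two callbacks are hypothetical, and a real agent would be driven by whatever 
connection events the coordination service delivers.

```python
class AgentNode:
    """Toy model of an agents node that must not rejoin after being declared dead."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.state = "active"

    def on_connection_lost(self):
        # Once the cluster may have seen us disappear, we can no longer assume
        # that our transient operations still belong to us.
        if self.state == "active":
            self.state = "fenced"

    def on_connection_restored(self):
        # A fenced node shuts itself down instead of rejoining, so that other
        # nodes can safely clean up after it.
        if self.state == "fenced":
            self.state = "shutdown"
        return self.state
```

A node that was never cut off stays "active" when `on_connection_restored` is 
called; only a node that went through "fenced" ends in "shutdown".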

I have a vague idea of two ZooKeeper nodes: one with a persistent child for 
every agents node that has started, and one with transient (ephemeral) children 
representing every agents node that is currently alive.  It should therefore be 
possible for any surviving or restarted node to undertake appropriate cleanup 
operations based on this information.
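
A minimal in-memory sketch of that two-list idea follows.  It does not use the 
ZooKeeper client at all; the `NodeRegistry` class and its method names are 
hypothetical, with two plain sets standing in for the persistent and transient 
children.

```python
class NodeRegistry:
    """Two sets standing in for the two parent nodes in ZooKeeper."""

    def __init__(self):
        self.started = set()  # persistent children: every node that started
        self.alive = set()    # transient children: nodes currently alive

    def node_started(self, node_id):
        self.started.add(node_id)
        self.alive.add(node_id)

    def node_lost(self, node_id):
        # Models the transient child vanishing when the node's session ends.
        self.alive.discard(node_id)

    def needing_cleanup(self):
        # Started at some point but no longer alive: dangling bits may remain.
        return sorted(self.started - self.alive)

    def cleanup_done(self, node_id):
        # Forget a dead node only after its cleanup has finished, so the record
        # survives even if the node doing the cleanup fails partway through.
        if node_id not in self.alive:
            self.started.discard(node_id)
```

For example, after nodes "A" and "B" start and "A" is lost, `needing_cleanup()` 
returns `["A"]` until some node calls `cleanup_done("A")`.  The sketch does not 
cover a failed node restarting under the same identifier.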

Any comments or suggestions?


> Fault-Tolerant Setup for ManifoldCF Agent.
> ------------------------------------------
>
>                 Key: CONNECTORS-781
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-781
>             Project: ManifoldCF
>          Issue Type: Improvement
>          Components: Framework agents process, Framework core, Framework 
> crawler agent
>    Affects Versions: ManifoldCF 1.5
>            Reporter: Swami Rajamohan
>            Assignee: Karl Wright
>              Labels: agents, crawler, fault-tolerance
>             Fix For: ManifoldCF 1.5
>
>
> It should be possible to set up ManifoldCF as a fault-tolerant infrastructure.
> The Agent component of ManifoldCF should support multiple instances of an 
> agent crawling against a single crawl store, both to distribute (share) the 
> crawl load and to pick up a request that gets abruptly terminated due to 
> either partitioning of the instance or failure of the instance itself.
> Since there is a proposal to move to a store like Voldemort, it would be nice 
> to be able to have a fault-tolerant infrastructure.



--
This message was sent by Atlassian JIRA
(v6.1#6144)
