Flumers:

Flume 0.9.x supported online reconfiguration, and the intention was for the
1.x branch to do so as well (it doesn't yet). I wanted to start a
discussion around whether people are interested in this kind of
functionality or whether simply restarting the daemon(s) is sufficient for
your deployment. There are two ways of thinking about it:

* Support reconfiguration. Agents may have multiple flows passing through
them and, ideally, adding new ones shouldn't interrupt existing flows.
Agent restarts interrupt collection and, for non-durable channels (e.g.
MemoryChannel), data *may* be lost. Reconfiguration will add significant
complexity, and it ultimately does not eliminate restarts for host-level
maintenance, software upgrades, and the like.

* Do not support reconfiguration. Accept the fact that agents may
eventually go down, and treat that as a first-class case. In other
words, embrace failure / maintenance and handle it by recommending
topologies that include multiple agents at each tier and simply
round-robin / fail over where necessary. The only downside is the agent
tier closest to the originating source (e.g. a log4j client): restarting
that agent means the client application needs to be able to find another
agent or buffer events (which impacts durability or blocks the
application).
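To make the client-side burden concrete, here is a rough sketch of what a
client talking to multiple first-tier agents would need to do: try agents
in round-robin order and fail over to the next one when a send fails. This
is hypothetical, not an existing Flume API; the `send` callable and the
agent names stand in for whatever transport the client actually uses. If
every agent is down, the caller still has to buffer or block.

```python
from typing import Callable, Sequence


class RoundRobinFailoverClient:
    """Hypothetical client: spreads events across agents round-robin and
    fails over to the next agent when a send raises OSError."""

    def __init__(self, agents: Sequence[str], send: Callable[[str, bytes], None]):
        if not agents:
            raise ValueError("at least one agent is required")
        self._agents = list(agents)
        self._send = send  # hypothetical transport: send(agent, event)
        self._next = 0     # index of the agent to try first for the next event

    def append(self, event: bytes) -> str:
        """Deliver one event and return the agent that accepted it."""
        errors = []
        for attempt in range(len(self._agents)):
            index = (self._next + attempt) % len(self._agents)
            agent = self._agents[index]
            try:
                self._send(agent, event)
            except OSError as exc:
                # Agent is down or restarting; remember why and try the next one.
                errors.append((agent, exc))
                continue
            # Start the next event at the agent after the one that succeeded.
            self._next = (index + 1) % len(self._agents)
            return agent
        # Every agent failed: the application must buffer or block.
        raise ConnectionError(f"all agents unavailable: {errors}")
```

With agents `a`, `b`, `c` and `a` restarting, events land on `b`, then
`c`, then `b` again, so the restart never surfaces to the application
unless all agents are unavailable at once.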

We can optionally support some subset of online reconfiguration, such as
only allowing new flows to be introduced or existing flows to be
"decommissioned," but not allowing alteration of existing flows.
Ultimately this feature is a ton of work and adds a ton of complexity, so
if it's not something folks are clamoring for, we should spend our time
worrying about more pressing issues.
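For the subset option, the core rule could be checked roughly like this.
It's a sketch assuming each flow is identified by a name and described by
a flat key/value config (both hypothetical, not how Flume actually models
flows): flows only in the new config get started, flows only in the old
one get decommissioned, and any change to a flow present in both is
rejected.

```python
from typing import List, Mapping, Tuple

FlowConfig = Mapping[str, str]  # hypothetical flat key/value description of a flow


class ReconfigurationRejected(Exception):
    """Raised when a proposed config alters a flow that is already running."""


def plan_reconfiguration(running: Mapping[str, FlowConfig],
                         proposed: Mapping[str, FlowConfig]) -> Tuple[List[str], List[str]]:
    """Return (flows_to_start, flows_to_stop), or raise if an existing flow changed."""
    added = sorted(set(proposed) - set(running))
    removed = sorted(set(running) - set(proposed))
    # Flows present in both configs must be byte-for-byte identical.
    altered = sorted(name for name in set(running) & set(proposed)
                     if running[name] != proposed[name])
    if altered:
        raise ReconfigurationRejected(f"existing flows may not be altered: {altered}")
    return added, removed
```

Restricting changes to whole-flow additions and removals means the agent
never has to mutate a live source, channel, or sink in place, which is
where most of the complexity would come from.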

Thoughts? Comments?
-- 
Eric Sammer
twitter: esammer
data: www.cloudera.com
