FLUME-1507. Have "Topology Design Considerations" in User Guide.

(Patrick Wendell via Jarek Jarcec Cecho)


Project: http://git-wip-us.apache.org/repos/asf/flume/repo
Commit: http://git-wip-us.apache.org/repos/asf/flume/commit/cce9a27e
Tree: http://git-wip-us.apache.org/repos/asf/flume/tree/cce9a27e
Diff: http://git-wip-us.apache.org/repos/asf/flume/diff/cce9a27e

Branch: refs/heads/cdh-1.2.0+24_intuit
Commit: cce9a27e5007e75296a1f510279a1fab1e33f81d
Parents: 26c0893
Author: Jarek Jarcec Cecho <[email protected]>
Authored: Mon Sep 3 08:20:46 2012 +0200
Committer: Mike Percy <[email protected]>
Committed: Fri Sep 7 14:03:06 2012 -0700

----------------------------------------------------------------------
 flume-ng-doc/sphinx/FlumeUserGuide.rst |   90 +++++++++++++++++++++++++++
 1 files changed, 90 insertions(+), 0 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/flume/blob/cce9a27e/flume-ng-doc/sphinx/FlumeUserGuide.rst
----------------------------------------------------------------------
diff --git a/flume-ng-doc/sphinx/FlumeUserGuide.rst 
b/flume-ng-doc/sphinx/FlumeUserGuide.rst
index e4720e0..1fb549b 100644
--- a/flume-ng-doc/sphinx/FlumeUserGuide.rst
+++ b/flume-ng-doc/sphinx/FlumeUserGuide.rst
@@ -1943,6 +1943,96 @@ Property Name            Default  Description
 =======================  =======  ========================================
 
 
+Topology Design Considerations
+==============================
+Flume is very flexible and allows a large range of possible deployment
+scenarios. If you plan to use Flume in a large, production deployment, it is
+prudent to spend some time thinking about how to express your problem in
+terms of a Flume topology. This section covers a few considerations.
+
+Is Flume a good fit for your problem?
+-------------------------------------
+If you need to ingest textual log data into Hadoop/HDFS then Flume is the
+right fit for your problem, full stop. For other use cases, here are some
+guidelines:
+
+Flume is designed to transport and ingest regularly generated event data over
+relatively stable, potentially complex topologies. The notion of "event data"
+is very broadly defined. To Flume, an event is just a generic blob of bytes.
+There are some limitations on how large an event can be - for instance, it
+cannot be larger than you can store in memory or on disk on a single machine -
+but in practice Flume events can be everything from textual log entries to
+image files. The key property of Flume events is that they are generated in a
+continuous, streaming fashion. If your data is not regularly generated
+(i.e. you are trying to do a single bulk load of data into a Hadoop cluster)
+then Flume will still work, but it is probably overkill for your situation.
+
+Flume likes relatively stable topologies. Your topologies do not need to be
+immutable, because Flume can deal with changes in topology without losing data
+and can also tolerate periodic reconfiguration due to fail-over or
+provisioning. It probably won't work well if you plan to change topologies
+every day, because reconfiguration takes some thought and overhead.
+
+Flow reliability in Flume
+-------------------------
+The reliability of a Flume flow depends on several factors. By adjusting these
+factors, you can achieve a wide array of reliability options with Flume.
+
+**What type of channel you use.** Flume has both durable channels (those which
+persist data to disk) and non-durable channels (those which will lose
+data if a machine fails). Durable channels use disk-based storage, and data
+stored in such channels will persist across machine restarts or
+non-disk-related failures.
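+
+As an illustration (a minimal sketch; the agent name ``a1`` and channel names
+are placeholders), the durability trade-off comes down to the channel ``type``
+in the agent configuration::
+
+  # Non-durable: fast, but events are lost if the process or machine fails
+  a1.channels.c1.type = memory
+
+  # Durable: events are persisted to disk and survive restarts
+  a1.channels.c2.type = file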
+
+**Whether your channels are sufficiently provisioned for the workload.**
+Channels in Flume act as buffers at various hops. These buffers have a fixed
+capacity, and once that capacity is full you will create back pressure on
+earlier points in the flow. If this pressure propagates to the source of the
+flow, Flume will become unavailable and may lose data.
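+
+For example (a hypothetical agent ``a1``; the numbers are illustrative only),
+channel capacity settings bound how much back pressure a given hop can
+absorb before it pushes back on earlier points in the flow::
+
+  # The channel holds at most 100000 events before pushing back on sources
+  a1.channels.c1.capacity = 100000
+  # At most 1000 events move per transaction between source/sink and channel
+  a1.channels.c1.transactionCapacity = 1000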
+
+**Whether you use redundant topologies.** Flume lets you replicate flows
+across redundant topologies. This can provide a very easy source of fault
+tolerance, one which overcomes both disk and machine failures.
+
+*The best way to think about reliability in a Flume topology is to consider
+various failure scenarios and their outcomes.* What happens if a disk fails?
+What happens if a machine fails? What happens if your terminal sink
+(e.g. HDFS) goes down for some time and you have back pressure? The space of
+possible designs is huge, but the underlying questions you need to ask are
+just a handful.
+
+Flume topology design
+---------------------
+The first step in designing a Flume topology is to enumerate all sources
+and destinations (terminal sinks) for your data. These will define the edge
+points of your topology. The next consideration is whether to introduce
+intermediate aggregation tiers or event routing. If you are collecting data
+from a large number of sources, it can be helpful to aggregate the data in
+order to simplify ingestion at the terminal sink. An aggregation tier can
+also smooth out burstiness from sources or unavailability at sinks, by
+acting as a buffer. If you are routing data between different locations,
+you may also want to split flows at various points: this creates
+sub-topologies which may themselves include aggregation points.
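+
+As a sketch of such an aggregation point (host names, ports, and agent names
+here are placeholders), a first-tier agent forwards events over Avro to an
+aggregating second-tier agent, which writes to the terminal HDFS sink::
+
+  # Tier 1: collector forwards to the aggregation tier over Avro
+  agent1.sinks.k1.type = avro
+  agent1.sinks.k1.hostname = aggregator-host.example.com
+  agent1.sinks.k1.port = 4545
+
+  # Tier 2: aggregator receives via Avro and writes to HDFS
+  agent2.sources.r1.type = avro
+  agent2.sources.r1.bind = 0.0.0.0
+  agent2.sources.r1.port = 4545
+  agent2.sinks.k1.type = hdfs
+  agent2.sinks.k1.hdfs.path = hdfs://namenode/flume/events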
+
+Sizing a Flume deployment
+-------------------------
+Once you have an idea of what your topology will look like, the next question
+is how much hardware and networking capacity is needed. This starts by
+quantifying how much data you generate. That is not always
+a simple task! Most data streams are bursty (for instance, due to diurnal
+patterns) and potentially unpredictable. A good starting point is to think
+about the maximum throughput you'll have in each tier of the topology, both
+in terms of *events per second* and *bytes per second*. Once you know the
+required throughput of a given tier, you can calculate a lower bound on how many
+nodes you require for that tier. To determine attainable throughput, it's
+best to experiment with Flume on your hardware, using synthetic or sampled
+event data. In general, disk-based channels
+should get tens of MB/s and memory-based channels should get hundreds of MB/s
+or more. Performance will vary widely, however, depending on hardware and
+operating environment.
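+
+As a worked example (the numbers are purely illustrative): a tier that must
+absorb a peak of 20,000 events per second at roughly 500 bytes per event
+needs to sustain about 10 MB/s of aggregate throughput. If experiments on
+your hardware show a single node with a disk-based channel sustaining
+25 MB/s, then one node is the lower bound for that tier; redundancy and
+headroom for bursts would argue for more.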
+
+Sizing aggregate throughput gives you a lower bound on the number of nodes
+you will need for each tier. There are several reasons to have additional
+nodes, such as increased redundancy and better ability to absorb bursts in
+load.
 
 Troubleshooting
 ===============
