[jira] [Commented] (FLUME-1045) Proposal to support disk based spooling

Inder SIngh (Commented) (JIRA) Thu, 22 Mar 2012 02:34:50 -0700

    [ 
https://issues.apache.org/jira/browse/FLUME-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13235479#comment-13235479
 ]


Inder SIngh commented on FLUME-1045:
------------------------------------

Proposed Solution
------------------

Sink triggered spooling
----------------------------
When a sink goes down (or, in a failover policy setup, when all sinks go 
down), data is spooled from the channel to local disk. As soon as a commit 
from the channel to one of the sinks succeeds, a de-spool from local disk 
back to the channel is triggered.

Proposed Implementation
---------------------------

1. SpooledFailoverSinkProcessor, extending FailoverSinkProcessor. 
It triggers spool() when the sinks go down and despool() when a sink comes 
back up.
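
To make the trigger logic concrete, here is a minimal, self-contained sketch 
of how such a processor might decide between spool() and despool(). It does 
not use Flume's actual classes: the Sink interface, the in-memory "channel" 
and "spool" queues, and the class name SpoolTriggerSketch are all assumptions 
made for illustration, and the real spool would be files on disk.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

/** Hypothetical stand-in for the proposed processor; not Flume's API. */
class SpoolTriggerSketch {
    /** Assumed sink abstraction: returns true when the event was committed. */
    interface Sink {
        boolean deliver(String event);
    }

    private final List<Sink> sinks;                           // in failover priority order
    private final Deque<String> channel = new ArrayDeque<>(); // stands in for the channel
    private final Deque<String> spool = new ArrayDeque<>();   // stands in for spool files on disk

    SpoolTriggerSketch(List<Sink> sinks) {
        this.sinks = sinks;
    }

    void put(String event) {
        channel.addLast(event);
    }

    int channelCount() {
        return channel.size();
    }

    int spooledCount() {
        return spool.size();
    }

    /** One pass: try each sink in order; spool only if every sink fails. */
    void process() {
        String event = channel.peekFirst();
        if (event == null) {
            return; // nothing to deliver
        }
        for (Sink sink : sinks) {
            if (sink.deliver(event)) {
                channel.removeFirst(); // commit succeeded
                despool();             // a successful commit triggers the de-spool
                return;
            }
        }
        spool(); // all sinks are down
    }

    /** Move everything buffered in the channel to the spool. */
    void spool() {
        while (!channel.isEmpty()) {
            spool.addLast(channel.removeFirst());
        }
    }

    /** Move spooled events back into the channel for normal delivery. */
    void despool() {
        while (!spool.isEmpty()) {
            channel.addLast(spool.removeFirst());
        }
    }
}
```

In this sketch, events that arrive after an outage can be delivered before 
the spooled ones, because the de-spool only starts after the first successful 
commit. Whether that ordering is acceptable is a design question the sketch 
does not settle.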

Some more design choices & assumptions
----------------------------------------
1. Persist Avro-serialized objects on local disk, which preserves both data 
and headers.
2. Use channel-based transaction semantics while spooling to avoid any data 
loss.
3. The spool location is configurable for each SinkGroup via "spool-dir". 
Events are spooled in batches whose size is controlled by "spool-batch-size". 
Spool files are rolled over once they reach the size set by 
"spoolfile-size".
4. Validate the configuration to reject overlapping spool locations across 
SinkGroups.
5. De-spooling happens one file at a time, which avoids the complexity of 
persisting offsets in the first cut.
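
The overlap validation in point 4 could look roughly like the sketch below. 
It is only an illustration: the class name SpoolDirValidator, the method 
findOverlaps, and the map from SinkGroup name to its "spool-dir" value are 
assumptions, not part of Flume. Two locations are treated as overlapping when 
they are identical or one is nested inside the other; symbolic links are not 
resolved.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper; checks "spool-dir" values of different SinkGroups. */
class SpoolDirValidator {

    /** Returns "groupA <-> groupB" for each pair with identical or nested spool dirs. */
    static List<String> findOverlaps(Map<String, String> spoolDirBySinkGroup) {
        // Sort group names so the result order is deterministic.
        List<String> groups = new ArrayList<>(new TreeMap<>(spoolDirBySinkGroup).keySet());
        List<String> conflicts = new ArrayList<>();
        for (int i = 0; i < groups.size(); i++) {
            Path a = normalize(spoolDirBySinkGroup.get(groups.get(i)));
            for (int j = i + 1; j < groups.size(); j++) {
                Path b = normalize(spoolDirBySinkGroup.get(groups.get(j)));
                // Path.startsWith compares whole path elements, so
                // /spool/a and /spool/ab are not considered nested.
                if (a.startsWith(b) || b.startsWith(a)) {
                    conflicts.add(groups.get(i) + " <-> " + groups.get(j));
                }
            }
        }
        return conflicts;
    }

    /** Absolute, normalized form so "." and ".." do not hide an overlap. */
    private static Path normalize(String dir) {
        return Paths.get(dir).toAbsolutePath().normalize();
    }
}
```

A configuration loader could call findOverlaps once at startup and refuse to 
start the agent if the returned list is not empty.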

                
> Proposal to support disk based spooling
> ---------------------------------------
>
>                 Key: FLUME-1045
>                 URL: https://issues.apache.org/jira/browse/FLUME-1045
>             Project: Flume
>          Issue Type: New Feature
>    Affects Versions: v1.0.0
>            Reporter: Inder SIngh
>            Priority: Minor
>              Labels: patch
>
> 1. Problem Description 
> A sink that is unavailable at any stage in the pipeline backs off and 
> retries after a while. Channels associated with such sinks start buffering 
> data, with the caveat that with a memory channel this can cause a domino 
> effect on the entire pipeline. There can be legitimate downtime, e.g. an 
> HDFS sink being down for name node maintenance or Hadoop upgrades. 
> 2. Why not use a durable channel (JDBC, FileChannel)?
> We want high throughput and to support sink downtime as a first-class 
> use case.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
