[jira] [Commented] (FLUME-1045) Proposal to support disk based spooling

Arvind Prabhakar (Commented) (JIRA) Wed, 28 Mar 2012 22:30:14 -0700

    [ 
https://issues.apache.org/jira/browse/FLUME-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13240989#comment-13240989
 ]


Arvind Prabhakar commented on FLUME-1045:
-----------------------------------------

bq.  Curious to know what is the right way in the current Flume architecture to 
trade off transactional guarantees against very high throughput, while 
providing a certain degree of reliability in case the next link is down?

This seems to be a confusion between design and implementation. The design 
_requires_ that the channel expose transactional semantics. The channel 
implementation _decides_ how strictly those semantics are enforced. For 
example, the transactional semantics implemented by the JDBC channel are very 
strict, whereas those implemented by the Memory channel are weak. 

However, since the design requires both these channels to expose transactional 
semantics, you can switch the channels to suit your flow needs.

The solution being discussed here - disk based spooling on the sink side - goes 
outside the scope of this design to accommodate throughput requirements. If 
implemented, the messages that are spooled will be outside of the transaction 
boundary and thus will invalidate the safety guarantee of the system.

bq. One solution I can think of, where the IO cost is incurred only on 
failures and things are still transactional: wrap the MemoryChannel and 
FileChannel in a new channel, say SpoolingMemoryChannel. Events flow via the 
memory channel; on reaching the buffer capacity of the memory channel, events 
are spooled into the FileChannel. Since the underlying channels are 
transactional, SpoolingMemoryChannel can also be easily made transactional.

This sounds like a promising solution. The key thing to watch out for here is 
the ordering requirement. In general, channels are expected to preserve the 
order of events. As long as that is taken care of and the transactional 
semantics make sense, it could be the stop-gap solution until we have a 
high-throughput file based channel implemented.
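
To make the ordering concern concrete, here is a minimal sketch of one rule 
that keeps events in order across a memory tier and a spill tier. It is not 
Flume code: the class SpoolingBufferSketch and its methods are hypothetical, 
both tiers are plain in-memory deques standing in for the MemoryChannel and 
the FileChannel, and transactions are left out entirely.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Hypothetical two-tier buffer, not a Flume API. "primary" stands in for a
 * memory channel and "overflow" for a file-backed channel; both are ordinary
 * deques here so the ordering rule can be shown in isolation.
 */
class SpoolingBufferSketch<E> {
    private final int primaryCapacity;
    private final Deque<E> primary = new ArrayDeque<>();
    private final Deque<E> overflow = new ArrayDeque<>();

    SpoolingBufferSketch(int primaryCapacity) {
        this.primaryCapacity = primaryCapacity;
    }

    /** Adds an event, spilling to the overflow tier when required. */
    synchronized void put(E event) {
        // Once anything has spilled, every later event must spill as well,
        // even if the primary tier has room again. Otherwise a newer event
        // in memory could be taken before an older one waiting in overflow.
        if (overflow.isEmpty() && primary.size() < primaryCapacity) {
            primary.addLast(event);
        } else {
            overflow.addLast(event);
        }
    }

    /** Returns the oldest event, or null if the buffer is empty. */
    synchronized E take() {
        // Everything in primary is older than everything in overflow,
        // so primary is always drained first.
        E event = primary.pollFirst();
        return (event != null) ? event : overflow.pollFirst();
    }

    /** Number of events currently held in the overflow tier. */
    synchronized int overflowSize() {
        return overflow.size();
    }
}
```

With a primary capacity of 2, putting events 1 through 5 leaves 3, 4 and 5 in 
the overflow tier. An event put after the first take still goes to overflow, 
and takes return the events in the order they were put. A real channel would 
additionally need to cover both tiers with a single transaction.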


                
> Proposal to support disk based spooling
> ---------------------------------------
>
>                 Key: FLUME-1045
>                 URL: https://issues.apache.org/jira/browse/FLUME-1045
>             Project: Flume
>          Issue Type: New Feature
>    Affects Versions: v1.0.0
>            Reporter: Inder SIngh
>            Priority: Minor
>              Labels: patch
>         Attachments: FLUME-1045-1.patch, FLUME-1045-2.patch
>
>
> 1. Problem Description 
> A sink being unavailable at any stage in the pipeline causes it to back off 
> and retry after a while. Channels associated with such sinks start buffering 
> data, with the caveat that if you are using a memory channel it can result in 
> a domino effect on the entire pipeline. There could be legitimate down times, 
> e.g. HDFS sink being down for name node maintenance or Hadoop upgrades. 
> 2. Why not use a durable channel (JDBC, FileChannel)?
> Want high throughput and support sink down times as a first class use-case.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
