[
https://issues.apache.org/jira/browse/FLUME-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13430846#comment-13430846
]
Patrick Wendell commented on FLUME-1045:
----------------------------------------
Hey, I'm just getting caught up on this discussion. One issue (or
misunderstanding) I have with Sharad's proposal, and with any proposal
suggesting that a composed "MemoryChannel + FileChannel" is what we want
here, is that the existing FileChannel provides certain transaction guarantees
that you would not want in this case.
If you are running a memory channel and you want to spill over to disk, you are
already accepting "best effort" delivery semantics for the normal case where
all of the data fits in memory.
If our spillover implementation directly uses, or functionally mirrors, the
existing FileChannel, we'll be offering much stronger semantics once the data
has spilled over to disk, at a high throughput cost.
For instance, the FileChannel flushes to disk on every transaction to avoid
data loss. If we were to build a disk-spilling extension to the existing
MemoryChannel, we'd likely want to batch these disk flushes to make the
aggregate disk throughput better. We just wouldn't want the strong semantics
offered by the FileChannel.
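To make the flush-batching idea concrete, here is a minimal Java sketch. It is
not Flume code: the class name BatchedSpillWriter, the commitsPerSync
parameter, and the length-prefixed record layout are all assumptions made up
for illustration. It only shows the general technique of forcing data to disk
once every N commits instead of on every commit.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical best-effort spill writer; not part of Flume. */
final class BatchedSpillWriter implements AutoCloseable {
    private final FileChannel out;      // java.nio file channel, not Flume's FileChannel
    private final int commitsPerSync;   // assumed tuning knob: commits per disk flush
    private int commitsSinceSync = 0;
    private long syncCount = 0;

    BatchedSpillWriter(Path file, int commitsPerSync) throws IOException {
        if (commitsPerSync < 1) {
            throw new IllegalArgumentException("commitsPerSync must be >= 1");
        }
        this.out = FileChannel.open(file,
                StandardOpenOption.CREATE,
                StandardOpenOption.WRITE,
                StandardOpenOption.APPEND);
        this.commitsPerSync = commitsPerSync;
    }

    /** Appends one batch of events; forces to disk only every commitsPerSync calls. */
    void commit(byte[][] events) throws IOException {
        for (byte[] event : events) {
            // Assumed record layout: 4-byte length prefix followed by the payload.
            ByteBuffer buf = ByteBuffer.allocate(4 + event.length);
            buf.putInt(event.length);
            buf.put(event);
            buf.flip();
            while (buf.hasRemaining()) {
                out.write(buf);
            }
        }
        commitsSinceSync++;
        if (commitsSinceSync >= commitsPerSync) {
            sync();
        }
    }

    private void sync() throws IOException {
        out.force(false);   // flush file content (not metadata) to the device
        commitsSinceSync = 0;
        syncCount++;
    }

    /** Number of times data was forced to disk so far. */
    long syncCount() {
        return syncCount;
    }

    @Override
    public void close() throws IOException {
        if (commitsSinceSync > 0) {
            sync();         // flush whatever is still pending
        }
        out.close();
    }
}
```

With commitsPerSync set to 1 this degenerates to a flush per commit; larger
values trade durability of the most recent commits for fewer disk syncs, which
is the weaker guarantee a best-effort spill would accept.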
That is why I think that just extending the MemoryChannel with some type of
best-effort disk spilling would be best, since it differs in fundamental ways
from what the FileChannel accomplishes.
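As a rough illustration of what "memory first, spill when full" could look
like, here is a small Java sketch. It is one possible shape and not a design
anyone on this issue has proposed: SpillingQueue, memoryCapacity, and
spilledCount are hypothetical names, and the overflow deque is an in-memory
stand-in for an on-disk log so that the example stays self-contained.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical FIFO buffer that spills to an overflow store when memory is full. */
final class SpillingQueue {
    private final int memoryCapacity;
    private final Deque<String> memory = new ArrayDeque<>();
    // Stand-in for an on-disk log; a real implementation would write to a file.
    private final Deque<String> overflow = new ArrayDeque<>();
    private long spilledCount = 0;

    SpillingQueue(int memoryCapacity) {
        if (memoryCapacity < 1) {
            throw new IllegalArgumentException("memoryCapacity must be >= 1");
        }
        this.memoryCapacity = memoryCapacity;
    }

    /** Adds an event, spilling it when memory is full or older events are already spilled. */
    void put(String event) {
        // Once anything has spilled, new events must queue behind it to keep FIFO order.
        if (overflow.isEmpty() && memory.size() < memoryCapacity) {
            memory.addLast(event);
        } else {
            overflow.addLast(event);
            spilledCount++;
        }
    }

    /** Returns the oldest event, or null if the queue is empty. */
    String take() {
        String head = memory.pollFirst();
        // Refill the freed memory slot from the oldest spilled event, if any.
        if (!overflow.isEmpty() && memory.size() < memoryCapacity) {
            memory.addLast(overflow.pollFirst());
        }
        return head;
    }

    /** Total number of events that were ever written to the overflow store. */
    long spilledCount() {
        return spilledCount;
    }
}
```

The point of the sketch is only the ordering rule: events go to memory while
there is room and nothing has spilled, and are drained back in arrival order.
It says nothing about transactions or durability, which is exactly where a
best-effort spill would differ from the FileChannel.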
> Proposal to support disk based spooling
> ---------------------------------------
>
> Key: FLUME-1045
> URL: https://issues.apache.org/jira/browse/FLUME-1045
> Project: Flume
> Issue Type: New Feature
> Affects Versions: v1.0.0
> Reporter: Inder SIngh
> Priority: Minor
> Labels: patch
> Attachments: FLUME-1045-1.patch, FLUME-1045-2.patch
>
>
> 1. Problem Description
> A sink being unavailable at any stage in the pipeline causes it to back off
> and retry after a while. Channels associated with such sinks start buffering
> data, with the caveat that if you are using a memory channel it can result in
> a domino effect on the entire pipeline. There could be legitimate downtimes,
> e.g. the HDFS sink being down for NameNode maintenance or Hadoop upgrades.
> 2. Why not use a durable channel (JDBC, FileChannel)?
> We want high throughput and support for sink downtime as a first-class use case.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira