[ https://issues.apache.org/jira/browse/FLUME-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13504341#comment-13504341 ]

Mike Percy commented on FLUME-1227:
-----------------------------------

Hi Roshan,
At this time, we don't have any channels that know about other channels. 
Likewise for sources and sinks - we don't provide an API to get a handle to a 
different component. This decoupling is important for reasons related to 
predictability at deploy time, maintenance (scope of expertise), and debugging. 
I don't think we should break that.

Personally I believe a Mem-FC specialized spillable channel would be easier for 
users to understand and for developers to debug if there were problems. That 
said, if you want to work on a compound channel for the reasons you mentioned 
we would need to keep all the configuration of the sub-channels within the 
compound channel config namespace. The compound channel would have to be 
responsible for the lifecycle of those underlying objects, etc. That way it 
will not need a handle into the global config namespace and we remain free to 
evolve the implementation of Flume's configuration system over time. Example of 
what I mean:

{noformat}
agent1.channels.compoundChannel.type = compound
agent1.channels.compoundChannel.channels = primary overflow
agent1.channels.compoundChannel.channels.primary.type = MEMORY
agent1.channels.compoundChannel.channels.primary.capacity = 100000
agent1.channels.compoundChannel.channels.primary.transactionCapacity = 10000
agent1.channels.compoundChannel.channels.overflow.type = FILE
agent1.channels.compoundChannel.channels.overflow.capacity = 10000000
agent1.channels.compoundChannel.channels.overflow.transactionCapacity = 10000
{noformat}
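One way the compound channel could consume a config like the one above is to carve its sub-channels' settings out of its own namespace and hand each sub-channel its own scoped property set. A minimal sketch of that parsing in plain Java (the class and method names here are made up for illustration, not Flume API):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: extract per-sub-channel settings from the compound
// channel's own config namespace, so the compound channel can construct and
// own its sub-channels without a handle into the global configuration.
public class CompoundChannelConfig {

    // Given the properties scoped to the compound channel (everything after
    // "agent1.channels.compoundChannel."), return one settings map per
    // sub-channel named in the "channels" property.
    public static Map<String, Map<String, String>> subChannelConfigs(Map<String, String> ctx) {
        Map<String, Map<String, String>> result = new LinkedHashMap<>();
        // "channels" lists the sub-channel names, e.g. "primary overflow"
        String listed = ctx.getOrDefault("channels", "").trim();
        if (!listed.isEmpty()) {
            for (String name : listed.split("\\s+")) {
                result.put(name, new LinkedHashMap<>());
            }
        }
        String prefix = "channels.";
        for (Map.Entry<String, String> e : ctx.entrySet()) {
            if (!e.getKey().startsWith(prefix)) continue;
            String rest = e.getKey().substring(prefix.length()); // e.g. "primary.type"
            int dot = rest.indexOf('.');
            if (dot < 0) continue; // the "channels" list itself, already handled
            Map<String, String> sub = result.get(rest.substring(0, dot));
            if (sub != null) sub.put(rest.substring(dot + 1), e.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> ctx = new LinkedHashMap<>();
        ctx.put("type", "compound");
        ctx.put("channels", "primary overflow");
        ctx.put("channels.primary.type", "MEMORY");
        ctx.put("channels.primary.capacity", "100000");
        ctx.put("channels.overflow.type", "FILE");
        Map<String, Map<String, String>> cfg = subChannelConfigs(ctx);
        System.out.println(cfg.get("primary").get("type"));  // MEMORY
        System.out.println(cfg.get("overflow").get("type")); // FILE
    }
}
```

The compound channel would then instantiate and manage the lifecycle of one sub-channel per entry, keeping the global config system free to evolve.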

Regarding how to keep track of which events are in which underlying channels, 
I'd recommend using an algorithm instead of an explicit mapping. Otherwise to 
maintain correct ordering in a failure scenario with durable channels, you must 
make your mapping durable, which will make the implementation more complex and 
likely slower. Such an algorithm could be something like:

Puts:
# If overflow channel (OC) is empty, put() to primary channel (PC). Otherwise, 
put() to the OC.

Takes:
# If PC is not empty: take() from the PC.
# If PC is empty and OC is not empty, take() as large a batch as possible from 
the OC and put() it onto the PC. Then go back to #1.

The potentially tricky bit about this algorithm is dealing with the 
transactions. You have to make sure you don't do a put() and a take() in the 
same transaction (same thread) on the same underlying channel, since that's not 
supported by all of the channel implementations. So if you empty your primary 
with take()s then you have to just take() from the overflow until the end of 
that transaction. My idea about moving big batches from overflow to primary 
would have to happen at the beginning of the outer transaction only, and it's 
basically just an optimization...
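As a rough illustration of the routing rules above, including the constraint that one transaction never both put()s and take()s on the same underlying channel, here is a sketch with plain in-memory queues standing in for the primary and overflow channels. All names are made up for the example; real channels would add transactional semantics, and the "spill when primary is full" trigger is an assumption on my part:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of the spill/drain routing, with queues standing in for the primary
// channel (PC) and overflow channel (OC). Only the routing logic is shown.
public class SpillableRouting {
    final Deque<String> primary = new ArrayDeque<>();
    final Deque<String> overflow = new ArrayDeque<>();
    final int primaryCapacity;

    SpillableRouting(int primaryCapacity) {
        this.primaryCapacity = primaryCapacity;
    }

    // Put: if the OC is empty (and the PC has room -- an assumed spill
    // trigger), the event goes to the PC; otherwise it must go to the OC,
    // or ordering would be violated.
    void put(String event) {
        if (overflow.isEmpty() && primary.size() < primaryCapacity) {
            primary.addLast(event);
        } else {
            overflow.addLast(event);
        }
    }

    // Take, scoped to one "transaction": drain the PC first. Once the PC is
    // empty we switch to the OC and stay pinned there until the transaction
    // ends -- refilling the PC mid-transaction would mean put() and take()
    // on the PC in the same transaction, which not all channels support.
    String take(TxnState txn) {
        if (!txn.takingFromOverflow && !primary.isEmpty()) {
            return primary.pollFirst();
        }
        txn.takingFromOverflow = true;
        return overflow.pollFirst(); // null if both channels are empty
    }

    // The batch-move optimization: only safe at the START of a transaction,
    // before any take() has happened in it.
    void refillPrimary(int batchSize) {
        for (int i = 0; i < batchSize && !overflow.isEmpty()
                && primary.size() < primaryCapacity; i++) {
            primary.addLast(overflow.pollFirst());
        }
    }

    // Per-transaction state: records that takes are pinned to the overflow.
    static class TxnState { boolean takingFromOverflow = false; }
}
```

Because every put() goes to the OC whenever the OC is non-empty, FIFO order is preserved across the spill without any durable event-to-channel mapping.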

Finally, if we were going this route, for an initial implementation I'd 
recommend only supporting a primary and a secondary, not a whole chain of 
fallbacks, again to keep the design & testing surface area simple at first.

> Introduce some sort of SpillableChannel
> ---------------------------------------
>
>                 Key: FLUME-1227
>                 URL: https://issues.apache.org/jira/browse/FLUME-1227
>             Project: Flume
>          Issue Type: New Feature
>          Components: Channel
>            Reporter: Jarek Jarcec Cecho
>
> I would like to introduce a new channel that would behave similarly to scribe 
> (https://github.com/facebook/scribe). It would be something between the memory 
> and file channels. Input events would be saved directly to memory (only) and 
> served from there. If memory filled up, the events would be spilled to file.
> Let me describe the use case behind this request. We have plenty of frontend 
> servers that are generating events. We want to send all events to just a 
> limited number of machines, from which we would send the data to HDFS (some 
> sort of staging layer). The reason for this second layer is our need to 
> decouple event aggregation and frontend code onto separate machines. Using 
> the memory channel is fully sufficient, as we can survive the loss of some 
> portion of the events. However, in order to survive maintenance windows or 
> networking issues, we would have to assign a lot of memory to those "staging" 
> machines. The referenced scribe deals with this problem by implementing the 
> following logic: events are saved in memory, similarly to our MemoryChannel. 
> However, if memory gets full (because of maintenance, networking issues, 
> ...), it spills the data to disk, where it sits until everything starts 
> working again.
> I would like to introduce a channel that implements similar logic. Its 
> durability guarantees would be the same as the MemoryChannel's - if someone 
> pulled the power cord, this channel would lose data. Based on the discussion 
> in FLUME-1201, I would propose keeping the implementation completely 
> independent of any other channel's internal code.
> Jarcec

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
