Hey Mike,

That context is super helpful. If it is a correctness problem to have
interceptors returning more events than they receive, can I propose
that we:

a) Add a check in InterceptorChain that verifies the interceptor isn't
growing the size of the event list (better to throw an error here than
somewhere down the line, where it will be harder to debug).

b) Explain in the javadoc briefly why it is a correctness issue.

c) Put a note of caution in the user or dev guide for those who want
to build custom interceptors, explaining that they are solely for
transformation and filtering, not event creation (this may already
exist; I haven't looked closely).
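To make (a) concrete, here is a rough sketch of what such a check could
look like. The Event and Interceptor types below are simplified
stand-ins for the real Flume classes, and checkedIntercept is a
hypothetical helper, not existing Flume code:

```java
import java.util.ArrayList;
import java.util.List;

public class InterceptorSizeCheck {
    // Simplified stand-in for org.apache.flume.Event.
    static class Event {
        final String body;
        Event(String body) { this.body = body; }
    }

    // Simplified stand-in for the list form of Interceptor.intercept().
    interface Interceptor {
        List<Event> intercept(List<Event> events);
    }

    // Hypothetical wrapper enforcing the documented contract: the output
    // list MUST NOT be larger than the input list (transform/remove only).
    static List<Event> checkedIntercept(Interceptor i, List<Event> in) {
        List<Event> out = i.intercept(in);
        if (out.size() > in.size()) {
            throw new IllegalStateException(
                "Interceptor returned " + out.size() + " events for "
                + in.size() + " inputs; interceptors may only transform "
                + "or remove events, not create them");
        }
        return out;
    }

    public static void main(String[] args) {
        List<Event> in = new ArrayList<>();
        in.add(new Event("a"));
        in.add(new Event(""));

        // Filtering (dropping empty-bodied events) is allowed.
        Interceptor filter = events -> {
            List<Event> kept = new ArrayList<>();
            for (Event e : events) {
                if (!e.body.isEmpty()) kept.add(e);
            }
            return kept;
        };
        System.out.println(checkedIntercept(filter, in).size()); // prints 1

        // "Splitting" an event into several grows the list and is rejected.
        Interceptor splitter = events -> {
            List<Event> out = new ArrayList<>(events);
            out.addAll(events); // duplicates every event
            return out;
        };
        try {
            checkedIntercept(splitter, in);
        } catch (IllegalStateException e) {
            System.out.println("rejected");
        }
    }
}
```

Failing fast like this at the chain boundary would surface the misuse at
the interceptor, rather than as a confusing transaction-size error later
in the flow.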

I am happy to do these myself, but do you think this makes sense?

Two other people have asked me off list whether they can do this, so I
think we need to be very clear that this is outside the specification
for interceptors.

- Patrick

On Fri, Aug 10, 2012 at 6:51 PM, Mike Percy <[email protected]> wrote:
> I put that comment there for a few reasons that I can recall off the top of
> my head (I should have done a better job documenting this when I was
> writing the code):
>
> 1. The max transaction size on the channel must currently be manually
> balanced with (or made to exceed) the batchSize setting on batching sources
> and sinks. If the number of events added or taken in a single transaction
> exceeds this maximum size, an exception will be thrown. However, if
> generating multiple events from a single event, it is no longer sufficient
> to make the batchSize less than or equal to this value, and it would be
> easier
> to blow out your transaction size in a potentially unpredictable way,
> causing potentially confusing errors.
>
> 2. An Event is what you might call the basic unit of "flow" in Flume. From
> the perspective of management and monitoring, having the same number of
> events enter and exit the system helps you know that your cluster is
> healthy. OTOH, when you generate a variable number of events from a single
> event in an Interceptor, it is really quite difficult to know how the data
> is flowing.
>
> 3. Since the interceptor typically runs in an I/O worker thread or in the
> only thread in a Source, doing any significant computation there will
> likely affect the overall throughput of the system.
>
> In my view, Interceptors as a generally applicable component are well
> suited to do header "tagging", simple transformations, and filtering, but
> they're not a good place to put batching/un-batching logic. Maybe the Exec
> Source should have a line-parsing plugin interface to allow people to take
> text lines and generate Events from them. I know this seems similar to the
> Interceptor in the context of the data flow, but I believe you are just
> trying to work around a limitation of the exec source, since it appears
> you're describing a serialization issue.
>
> Alternatively, one could use an HBase serializer to generate multiple
> increment / decrement operations, and just log the original line in HDFS
> (or use an EventSerializer).
>
> Regards,
> Mike
>
> On Fri, Aug 10, 2012 at 5:15 PM, Patrick Wendell <[email protected]> wrote:
>
>> to clarify - I mean I think it's within the scope of the design
>> intentions. I agree that it is currently disallowed (at least in
>> documentation).
>>
>> On Fri, Aug 10, 2012 at 5:14 PM, Patrick Wendell <[email protected]>
>> wrote:
>> > Hey Jeremy,
>> >
>> > That comment has been in the code now for some time, but I don't think
>> > it is actually enforced anywhere programmatically. I think the idea was
>> > just that if you are writing something which is capable of generating
>> > new event data it should be in a source - though I'm also curious to
>> > hear why this was put in there.
>> >
>> > IMHO, doing some type of event splitting seems within the scope of how
>> > interceptors are used.
>> >
>> > - Patrick
>> >
>> > On Fri, Aug 10, 2012 at 11:07 AM, Jeremy Custenborder
>> > <[email protected]> wrote:
>> >> Hello All,
>> >>
>> >> I'm wondering if you could provide some guidance for me. One of the
>> >> inputs I'm working with batches several entries to a single event.
>> >> This is a lot simpler than my data but it provides an easy example.
>> >> For example:
>> >>
>> >> timestamp - 5,4,3,2,1
>> >> timestamp - 9,7,5,5,6
>> >>
>> >> If I tail the file this results in 2 events being generated. This
>> >> example has the data for 10 events.
>> >>
>> >> Here is high level what I want to accomplish.
>> >> (web server - agent 1)
>> >> exec source tail -f /<some file path>
>> >> collector-client to (agent 2)
>> >>
>> >> (collector - agent 2)
>> >> collector-server
>> >> Custom Interceptor (input 1 event, output n events)
>> >> Multiplex to
>> >> hdfs
>> >> hbase
>> >>
>> >> An interceptor looked like the most logical spot for me to add this.
>> >> Is there a better place to add this functionality? Has anyone run into
>> >> a similar case?
>> >>
>> >> Looking at the docs for Interceptor. intercept(List<Event> events) it
>> >> says "Output list of events. The size of output list MUST NOT BE
>> >> GREATER than the size of the input list (i.e. transformation and
>> >> removal ONLY)." which tells me not to emit more events than given.
>> >> intercept(Event event) only returns a single event so I can't use it
>> >> there either. Why is there a requirement to only return 1 for 1?
>> >>
>> >> For now I'm implementing a custom source that will handle generating
>> >> multiple events from the events coming in on the web server. My
>> >> preference was to do this transformation on the collector agent before
>> >> I hand off to hdfs and hbase. I know another alternative would be to
>> >> implement custom RPC but I would prefer not to do that. I would prefer
>> >> to rely on what is currently available.
>> >>
>> >> Thanks!
>> >> j
>>