On Mon, Aug 13, 2012 at 1:55 PM, Mike Percy <[email protected]> wrote:
> Hi Jeremy,
>
> On Mon, Aug 13, 2012 at 9:55 AM, Jeremy Custenborder <
> [email protected]> wrote:
>>
>>
>> > I believe you are just
>> > trying to work around a limitation of the exec source, since it appears
>> > you're describing a serialization issue.
>>
>> > Alternatively, one could use an HBase serializer to generate multiple
>> > increment / decrement operations, and just log the original line in HDFS
>> > (or use an EventSerializer).
>>
>> This is what I'm working towards. I want a 1-for-1 entry in HDFS,
>> but increment counters in HBase.
>>
>
> HBase serializer can generate multiple operations per Event, and the HDFS
> serializer could generate whatever output Hive expects as well.
>

Yea
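
For the HBase side, a serializer that turns one event into a batch of
increments could look roughly like this (untested sketch; the tab-split
body parsing and the "count" column name are placeholders for whatever
the real input looks like):

import java.util.ArrayList;
import java.util.List;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.conf.ComponentConfiguration;
import org.apache.flume.sink.hbase.HbaseEventSerializer;
import org.apache.hadoop.hbase.client.Increment;
import org.apache.hadoop.hbase.client.Row;
import org.apache.hadoop.hbase.util.Bytes;

public class ImpressionCounterSerializer implements HbaseEventSerializer {
  private byte[] columnFamily;
  private Event event;

  @Override
  public void configure(Context context) { }

  @Override
  public void configure(ComponentConfiguration conf) { }

  @Override
  public void initialize(Event event, byte[] columnFamily) {
    this.event = event;
    this.columnFamily = columnFamily;
  }

  @Override
  public List<Row> getActions() {
    // No puts here; the raw line goes to HDFS via the other sink.
    return new ArrayList<Row>();
  }

  @Override
  public List<Increment> getIncrements() {
    List<Increment> increments = new ArrayList<Increment>();
    // Placeholder body format: the last tab-separated field is a
    // comma-separated list of impression ids.
    String[] fields = new String(event.getBody()).split("\t");
    for (String objectId : fields[fields.length - 1].split(",")) {
      Increment inc = new Increment(Bytes.toBytes(objectId.trim()));
      inc.addColumn(columnFamily, Bytes.toBytes("count"), 1L);
      increments.add(inc);
    }
    return increments;
  }

  @Override
  public void close() { }
}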

>
>> Given this, I was just planning on emitting an event early in the
>> pipeline with the body I was going to use in Hive, sending the same
>> data to HDFS and HBase, and then using a serializer on the HBase
>> side to increment the counters. This would let me add data to HDFS
>> in the format I plan to consume it in without managing two
>> serializers. My plan for the HBase serializer was literally to
>> generate a key and increment per record based on the input, so only
>> a couple lines of code.
>>
>
> Yeah, if you are doing much parsing in your serializers it's going to be a
> bit more complex.
>
>> > I pondered this a bit over the last day or so and I'm kind of
>> > lukewarm on adding preconditions checks at this time. The reason I
>> > didn't do it initially is that while I wanted a particular contract
>> > for that component, in order to make Interceptors viable to
>> > maintain and understand with the current design of the Flume core,
>> > I wasn't sure if it would be sufficient for all future use cases.
>> > So if someone wants to do something that breaks that contract, then
>> > they are "on their own", doing stuff that may break in future
>> > implementations. If they're willing to accept that risk then they
>> > have the freedom to maybe do something novel and awesome, which
>> > might prompt us to add a different kind of extension mechanism in
>> > the future to support whatever that use case is.
>>
>> I think there should be an approved method for this case; a
>> different extension point that could perform processing like this
>> would be helpful. When I looked at interceptors, I thought of them
>> as a replacement for the decorators in the old version of Flume. We
>> have a lot of code that takes a log entry and replaces the body with
>> a protocol buffer representation, and I prefer to run that code on
>> an upstream tier from the web server. Interceptors would work fine
>> for that one-in, one-out case.
>>
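
For reference, that one-in, one-out body rewrite as an Interceptor
would look roughly like this (untested sketch; the
log-line-to-protobuf transform is stubbed out):

import java.util.List;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

public class BodyRewriteInterceptor implements Interceptor {
  @Override
  public void initialize() { }

  @Override
  public Event intercept(Event event) {
    // Replace the body in place, like a decorator in old Flume.
    event.setBody(transform(event.getBody()));
    return event;
  }

  @Override
  public List<Event> intercept(List<Event> events) {
    for (Event event : events) {
      intercept(event);
    }
    return events;
  }

  @Override
  public void close() { }

  private byte[] transform(byte[] body) {
    // Stub: the real version serializes the parsed log entry into a
    // protocol buffer and returns its bytes.
    return body;
  }

  public static class Builder implements Interceptor.Builder {
    @Override
    public void configure(Context context) { }

    @Override
    public Interceptor build() {
      return new BodyRewriteInterceptor();
    }
  }
}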
>
> Have you considered using an Interceptor or a custom source to generate a
> single event that has a series of timestamps within it? You could use
> protobufs for serialization of that data structure.
>
> Since you have multiple timestamps / timings on the same log line, I
> wonder whether it isn't really a single "event" with multiple facets,
> and whether this is just a semantics thing.

I just used the multiple counters on a single line as an example. My
actual use case is much more complex, and I thought it wouldn't add
much to the conversation. I need to have the multiple objects
available to Hive. The upstream object is actually a protobuf with
hierarchy, and I was planning on flattening it for Hive. Here is an
example of what I'm collecting. The actual protobuf has many more
fields, but this gives you an idea.

requestid
page
timestamp
useragent
impressions = [12345, 43212, 12344, 12345, 43122, etc.]

Transforming for each impression gives:

requestid
page
timestamp
useragent
index
objectid

This gives me one row in Hive per impression. This might be a little
more contextual; I picked the earlier example because I didn't want to
get caught up in my use case. I could move this code to serializers,
but I would need to do similar logic twice, since I'm incrementing a
counter in HBase per impression and adding a row per impression in
HDFS (Hive). If I transformed the event into multiple events earlier
in the pipe, I would only have to write code to generate keys per
event. At this point I'm going to implement two serializers: one to
handle HDFS and one for HBase.
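
For the HDFS/Hive side, the flattening step is roughly this (sketch;
the field names follow the example above, and the real protobuf has
many more fields):

import java.util.ArrayList;
import java.util.List;

public class ImpressionFlattener {
  // One tab-separated output row per impression id. A Hive table
  // declared with FIELDS TERMINATED BY '\t' can read this directly.
  public static List<String> flatten(String requestId, String page,
      long timestamp, String userAgent, long[] impressions) {
    List<String> rows = new ArrayList<String>();
    for (int index = 0; index < impressions.length; index++) {
      long objectId = impressions[index];
      rows.add(requestId + "\t" + page + "\t" + timestamp + "\t"
          + userAgent + "\t" + index + "\t" + objectId);
    }
    return rows;
  }
}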

Thanks again for your responses!
J
>
> Regards,
> Mike
