Another thought - for streaming analytics you'd need a system which scales, so in retrospect how about something like a Storm sink, which internally could use Flume again to write the processed events to a persistent sink.
- Inder

On Fri, Feb 8, 2013 at 10:25 AM, Nitin Pawar <nitinpawar...@gmail.com> wrote:

> Hi Steve,
>
> I can understand the idea of having data processed inside Flume by streaming it to another Flume agent. But do we really need to re-engineer something inside Flume is what I am thinking? The core Flume dev team may have better ideas on this, but currently Storm is a huge candidate for streaming data processing.
> Flume does have an open JIRA on this integration: FLUME-1286 <https://issues.apache.org/jira/browse/FLUME-1286>
>
> It will be interesting to draw up the comparisons in performance if the data processing logic is added to Flume. We do currently see people doing a little bit of pre-processing of their data (they have their own custom channel types where they modify the data and sink it).
>
> On Fri, Feb 8, 2013 at 8:52 AM, Steve Yates <sya...@stevendyates.com> wrote:
>
>> Thanks for your feedback Mike. I have been thinking about this a little more, and just using Mahout as an example, I was considering the concept of somehow developing an enriched 'sink', so to speak, which would accept input streams / msgs from a Flume channel and forward them on to a specific 'service', i.e. a Mahout service, which would subsequently deliver the results to the configured sink. So yes, it would behave as an intercept->filter->process->sink for applicable data items.
>>
>> I apologise if that is still vague. It would be great to receive further feedback from the user group.
>>
>> -Steve
>>
>> Mike Percy <mpe...@apache.org> wrote:
>>
>> Hi Steven,
>> Thanks for chiming in! Please see my responses inline:
>>
>> On Thu, Feb 7, 2013 at 3:04 PM, Steven Yates <sya...@stevendyates.com> wrote:
>>
>>> The only missing link within the Flume architecture I see in this conversation is the actual channels and brokers themselves which orchestrate this lovely undertaking of data collection.
>>
>> Can you define what you mean by channels and brokers in this context? Channel is a synonym for a queueing event buffer in Flume parlance. Also, can you elaborate on what you mean by orchestration? I think I know where you're going but I don't want to put words in your mouth.
>>
>>> One opportunity I do see (and I may be wrong) is for the data to be offloaded into a system such as Apache Mahout before being sent to the sink. Perhaps the concept of a ChannelAdapter of sorts? I.e. a Mahout Adapter? Just thinking out loud, and it may be well out of the question.
>>
>> Why not a Mahout sink? Since Mahout often wants sequence files in a particular format to begin its MapReduce processing (e.g. its k-Means clustering implementation), Flume is already a good fit: its HDFS sink and EventSerializers allow for writing a plugin to format your data however it needs to go in. In fact that works today if you have a batch (even 5-minute batch) use case. With today's functionality, you could use Oozie to coordinate kicking off the Mahout M/R job periodically, as new data becomes available and the files are rolled.
>>
>> Perhaps even more interestingly, I can see a use case where you might want to use Mahout to do streaming / realtime updates driven by Flume, in the form of an interceptor or a Mahout sink. If online machine learning (e.g. stochastic gradient descent or something else online) was what you were thinking, I wonder if there are any folks on this list who might have an interest in helping to put such a thing together.
>>
>> In any case, I'd like to hear more about specific use cases for streaming analytics. :)
>>
>> Regards,
>> Mike
>
> --
> Nitin Pawar

--
- Inder
"You are average of the 5 people you spend the most time with"
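For reference, a minimal sketch of the custom sink idea floated above (a "Mahout sink", or a Storm-fed persistent sink), assuming the standard Flume 1.x AbstractSink/Transaction contract. AnalyticsForwardingSink, the serviceUrl property and forwardToService are hypothetical names used only for illustration, not existing Flume, Mahout or Storm components.

// Sketch of a custom Flume sink skeleton (Flume 1.x SDK); the service call
// is a hypothetical placeholder for whatever external processor (Mahout,
// Storm, etc.) would consume the events before persistence.
import org.apache.flume.Channel;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.Transaction;
import org.apache.flume.conf.Configurable;
import org.apache.flume.sink.AbstractSink;

public class AnalyticsForwardingSink extends AbstractSink implements Configurable {

  private String serviceUrl; // hypothetical endpoint of the processing service

  @Override
  public void configure(Context context) {
    // Read sink properties from the agent configuration file.
    serviceUrl = context.getString("serviceUrl", "http://localhost:8080/score");
  }

  @Override
  public Status process() throws EventDeliveryException {
    Status status;
    Channel channel = getChannel();
    Transaction txn = channel.getTransaction();
    txn.begin();
    try {
      Event event = channel.take();
      if (event != null) {
        // Hand the event body to the external processor; a real sink would
        // batch events and handle service failures here.
        forwardToService(serviceUrl, event.getBody());
      }
      txn.commit();
      status = Status.READY;
    } catch (Throwable t) {
      txn.rollback();
      status = Status.BACKOFF;
      if (t instanceof Error) {
        throw (Error) t;
      }
    } finally {
      txn.close();
    }
    return status;
  }

  private void forwardToService(String url, byte[] body) {
    // Placeholder: deliver the payload to the downstream service.
  }
}

The transaction around take/commit/rollback is what keeps delivery reliable: if the downstream service call fails, the event stays in the channel and is retried.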
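And a similar sketch for the interceptor route Mike mentions for streaming / realtime scoring, assuming the standard Flume 1.x Interceptor interface. ScoringInterceptor, scoreEvent and the "model.score" header name are hypothetical placeholders, not an existing Mahout integration.

// Sketch of a Flume interceptor that tags each event with the output of an
// online model before it reaches the sink; the scoring logic is a stub.
import java.util.List;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

public class ScoringInterceptor implements Interceptor {

  @Override
  public void initialize() {
    // Load or initialize the online model here.
  }

  @Override
  public Event intercept(Event event) {
    // Attach the model output as an event header; the body is left untouched.
    event.getHeaders().put("model.score", Double.toString(scoreEvent(event.getBody())));
    return event;
  }

  @Override
  public List<Event> intercept(List<Event> events) {
    for (Event event : events) {
      intercept(event);
    }
    return events;
  }

  @Override
  public void close() {
    // Release model resources.
  }

  private double scoreEvent(byte[] body) {
    // Placeholder for an online learner (e.g. SGD) scoring the payload.
    return 0.0;
  }

  /** Builder required by Flume to construct the interceptor from configuration. */
  public static class Builder implements Interceptor.Builder {
    @Override
    public Interceptor build() {
      return new ScoringInterceptor();
    }

    @Override
    public void configure(Context context) {
      // No configuration needed for this sketch.
    }
  }
}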