Another thought - for streaming analytics you'd need a system which scales, so in retrospect how about something like a Storm sink, which internally could use Flume again to write the processed events to a persistent sink.
- Inder

On Fri, Feb 8, 2013 at 10:25 AM, Nitin Pawar <nitinpawar...@gmail.com> wrote:

> Hi Steve,
>
> I can understand the idea of having data processed inside Flume by streaming it to another Flume agent. But do we really need to re-engineer something inside Flume is what I am thinking? The core Flume dev team may have better ideas on this, but currently Storm is a huge candidate for streaming data processing.
> Flume does have an open JIRA on this integration: FLUME-1286 <https://issues.apache.org/jira/browse/FLUME-1286>
>
> It will be interesting to draw up the comparisons in performance if the data processing logic is added to Flume. We do currently see people doing a little bit of pre-processing of their data (they have their own custom channel types where they modify the data and sink it).
>
> On Fri, Feb 8, 2013 at 8:52 AM, Steve Yates <sya...@stevendyates.com> wrote:
>
>> Thanks for your feedback Mike. I have been thinking about this a little more, and just using Mahout as an example, I was considering the concept of somehow developing an enriched 'sink', so to speak, which would accept input streams / msgs from a Flume channel and forward them on to a specific 'service', i.e. a Mahout service, which would subsequently deliver the results to the configured sink. So yes, it would behave as an intercept->filter->process->sink for applicable data items.
>>
>> I apologise if that is still vague. It would be great to receive further feedback from the user group.
>>
>> -Steve
>>
>> Mike Percy <mpe...@apache.org> wrote:
>>
>> Hi Steven,
>> Thanks for chiming in! Please see my responses inline:
>>
>> On Thu, Feb 7, 2013 at 3:04 PM, Steven Yates <sya...@stevendyates.com> wrote:
>>
>>> The only missing link within the Flume architecture I see in this conversation is the actual channels and brokers themselves which orchestrate this lovely undertaking of data collection.
>>
>> Can you define what you mean by channels and brokers in this context? Channel is a synonym for a queueing event buffer in Flume parlance. Also, can you elaborate on what you mean by orchestration? I think I know where you're going but I don't want to put words in your mouth.
>>
>>> One opportunity I do see (and I may be wrong) is for the data to be offloaded into a system such as Apache Mahout before being sent to the sink. Perhaps the concept of a ChannelAdapter of sorts? I.e. a Mahout Adapter? Just thinking out loud, and it may be well out of the question.
>>
>> Why not a Mahout sink? Since Mahout often wants sequence files in a particular format to begin its MapReduce processing (e.g. its k-Means clustering implementation), Flume is already a good fit: its HDFS sink and EventSerializers allow for writing a plugin to format your data however it needs to go in. In fact that works today if you have a batch (even 5-minute batch) use case. With today's functionality, you could use Oozie to coordinate kicking off the Mahout M/R job periodically, as new data becomes available and the files are rolled.
>>
>> Perhaps even more interestingly, I can see a use case where you might want to use Mahout to do streaming / realtime updates driven by Flume, in the form of an interceptor or a Mahout sink. If online machine learning (e.g. stochastic gradient descent or something else online) was what you were thinking, I wonder if there are any folks on this list who might have an interest in helping to put such a thing together.
>>
>> In any case, I'd like to hear more about specific use cases for streaming analytics. :)
>>
>> Regards,
>> Mike
>
> --
> Nitin Pawar

--
- Inder
"You are average of the 5 people you spend the most time with"
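For reference, a minimal sketch of the custom sink idea floated above (a "Mahout sink", or a Storm-fed persistent sink), assuming the standard Flume 1.x AbstractSink/Transaction contract. AnalyticsForwardingSink, the serviceUrl property and forwardToService are hypothetical names used only for illustration, not existing Flume, Mahout or Storm components.

// Sketch of a custom Flume sink skeleton (Flume 1.x SDK); the service call
// is a hypothetical placeholder for whatever external processor (Mahout,
// Storm, etc.) would consume the events before persistence.
import org.apache.flume.Channel;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.Transaction;
import org.apache.flume.conf.Configurable;
import org.apache.flume.sink.AbstractSink;

public class AnalyticsForwardingSink extends AbstractSink implements Configurable {

  private String serviceUrl; // hypothetical endpoint of the processing service

  @Override
  public void configure(Context context) {
    // Read sink properties from the agent configuration file.
    serviceUrl = context.getString("serviceUrl", "http://localhost:8080/score");
  }

  @Override
  public Status process() throws EventDeliveryException {
    Status status;
    Channel channel = getChannel();
    Transaction txn = channel.getTransaction();
    txn.begin();
    try {
      Event event = channel.take();
      if (event != null) {
        // Hand the event body to the external processor; a real sink would
        // batch events and handle service failures here.
        forwardToService(serviceUrl, event.getBody());
      }
      txn.commit();
      status = Status.READY;
    } catch (Throwable t) {
      txn.rollback();
      status = Status.BACKOFF;
      if (t instanceof Error) {
        throw (Error) t;
      }
    } finally {
      txn.close();
    }
    return status;
  }

  private void forwardToService(String url, byte[] body) {
    // Placeholder: deliver the payload to the downstream service.
  }
}

The transaction around take/commit/rollback is what keeps delivery reliable: if the downstream service call fails, the event stays in the channel and is retried.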
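And a similar sketch for the interceptor route Mike mentions for streaming / realtime scoring, assuming the standard Flume 1.x Interceptor interface. ScoringInterceptor, scoreEvent and the "model.score" header name are hypothetical placeholders, not an existing Mahout integration.

// Sketch of a Flume interceptor that tags each event with the output of an
// online model before it reaches the sink; the scoring logic is a stub.
import java.util.List;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

public class ScoringInterceptor implements Interceptor {

  @Override
  public void initialize() {
    // Load or initialize the online model here.
  }

  @Override
  public Event intercept(Event event) {
    // Attach the model output as an event header; the body is left untouched.
    event.getHeaders().put("model.score", Double.toString(scoreEvent(event.getBody())));
    return event;
  }

  @Override
  public List<Event> intercept(List<Event> events) {
    for (Event event : events) {
      intercept(event);
    }
    return events;
  }

  @Override
  public void close() {
    // Release model resources.
  }

  private double scoreEvent(byte[] body) {
    // Placeholder for an online learner (e.g. SGD) scoring the payload.
    return 0.0;
  }

  /** Builder required by Flume to construct the interceptor from configuration. */
  public static class Builder implements Interceptor.Builder {
    @Override
    public Interceptor build() {
      return new ScoringInterceptor();
    }

    @Override
    public void configure(Context context) {
      // No configuration needed for this sketch.
    }
  }
}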