Hi,

 

We actually attach the Interceptor to the source, as you suggested. Sorry for 
the confusion.

 

(I also noticed that I wrote “other streaming processing frameworks such as 
Spark of Kafka”, which should read “other streaming processing frameworks 
such as Spark or Storm”.)

 

Thanks for the suggestion about ZooKeeper. We are aware of ZooKeeper's 
configuration storage functionality, but we don't have much experience using it. 
Would storing around 5,000 formulas (usually simple ones, less than 100 bytes 
each) affect the overall performance of ZooKeeper? To detect updates, there are 
two approaches: poll all the formulas periodically, or use watchers. Which 
approach would be better?
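
For the watcher option, we are thinking of something along the lines of the 
sketch below (the /formulas path and znode layout are only placeholders, and 
error handling is omitted; ZooKeeper watches are one-shot, so they are 
re-registered on every read):

import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

// Sketch of the watcher approach: formulas kept as small znodes under /formulas.
public class ZkFormulaCache implements Watcher {

    private final ZooKeeper zk;
    private final String root = "/formulas";  // assumed layout
    private final Map<String, String> formulas = new ConcurrentHashMap<>();

    public ZkFormulaCache(String connectString) throws Exception {
        zk = new ZooKeeper(connectString, 30000, this);
        reload();
    }

    @Override
    public void process(WatchedEvent event) {
        if (zk == null) {
            return;  // connection event delivered before construction finished
        }
        try {
            reload();  // any child or data change triggers a reload
        } catch (Exception e) {
            // in real code: log and schedule a retry
        }
    }

    private synchronized void reload() throws Exception {
        List<String> names = zk.getChildren(root, this);              // re-arm child watch
        for (String name : names) {
            byte[] data = zk.getData(root + "/" + name, this, null);  // re-arm data watch
            if (data != null) {
                formulas.put(name, new String(data, StandardCharsets.UTF_8));
            }
        }
    }

    public String get(String name) {
        return formulas.get(name);
    }
}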

 

The monitoring data is not latency sensitive: the process that puts the data of 
the last hour into Kafka only runs at the 5th or 10th minute of the hour. We are 
allowed to take one more hour to process the data (which means that we can see 
the 8 AM data at 10 AM at the latest).

 

Best regards,

Thanh Hong.

 

From: Chris Horrocks [mailto:[email protected]] 
Sent: Wednesday, 27 July, 2016 7:28 PM
To: [email protected]
Subject: Re: Is it a good idea to use Flume Interceptor to process data?

 

Some rough initial thoughts:

 

This is interesting, but you might need to elaborate on how you've attached an 
interceptor to a channel (and why, in lieu of attaching it to the source):

we attach the Interceptor to the channel 

Personally I'd have done this by feeding data into Spark Streaming and keeping 
Flume as low-overhead as possible, particularly if it's monitoring data that's 
latency sensitive. For storing the calculation variables for consumption by 
the interceptor I'd go with something like ZooKeeper. 

 

 

-- 

Chris Horrocks

 

 

On Wed, Jul 27, 2016 at 12:39 pm, Thanh Hong Dai <'[email protected]'> wrote:

Hi, 

  

To give some background: We are currently buffering monitoring data into Kafka, 
where each message in Kafka records several metrics at a point in time. 

For each record, we need to perform some calculations based on the metrics in 
the record, append the results (there can be several) to the record, and send 
the resulting record to a data store (let’s call it DS1). All data required 
for the calculation is encapsulated in the record, essentially making this an 
embarrassingly parallel problem. 

The formulas for the calculations are stored in a different data store (let’s 
call it DS2) and can be changed (added/deleted/modified by the user). We are not 
required to react to changes immediately, but we should do so within a 
reasonable time (e.g. 5 minutes). 

  

Currently, we have prototyped an implementation of the data processing described 
above in an Interceptor. We define the source as Kafka, the Sink as the sink for 
DS1, and we attach the Interceptor to the channel. As described above, the 
Interceptor will regularly re-read the formulas from DS2 to pick up any changes, 
and will be responsible for processing the data as it comes in from Kafka. 
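
(For illustration, the prototype interceptor is roughly shaped like the sketch 
below; class and method names are simplified, and the formula loading and 
evaluation specific to DS2 and our metric format are stubbed out.)

import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

// Sketch of the prototype: enrich each event with calculated values and
// refresh the formulas from DS2 every few minutes on a background thread.
public class CalculationInterceptor implements Interceptor {

    private final AtomicReference<Map<String, String>> formulas = new AtomicReference<>();
    private ScheduledExecutorService refresher;

    @Override
    public void initialize() {
        formulas.set(loadFormulasFromDs2());  // initial load before events flow
        refresher = Executors.newSingleThreadScheduledExecutor();
        refresher.scheduleAtFixedRate(
                () -> formulas.set(loadFormulasFromDs2()), 5, 5, TimeUnit.MINUTES);
    }

    @Override
    public Event intercept(Event event) {
        String record = new String(event.getBody(), StandardCharsets.UTF_8);
        String enriched = applyFormulas(record, formulas.get());
        event.setBody(enriched.getBytes(StandardCharsets.UTF_8));
        return event;
    }

    @Override
    public List<Event> intercept(List<Event> events) {
        for (Event event : events) {
            intercept(event);
        }
        return events;
    }

    @Override
    public void close() {
        refresher.shutdown();
    }

    // Stubs: formula loading and evaluation depend on DS2 and our metric format.
    private Map<String, String> loadFormulasFromDs2() { return Collections.emptyMap(); }
    private String applyFormulas(String record, Map<String, String> formulas) { return record; }

    public static class Builder implements Interceptor.Builder {
        @Override
        public Interceptor build() { return new CalculationInterceptor(); }
        @Override
        public void configure(Context context) { }
    }
}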

  

We are aware of other streaming processing frameworks such as Spark of Kafka. 
However, the implementation above is motivated by the fact that Flume already 
provides reliable streaming, and we want to reuse as much code as possible. 

  

Is this usage of Flume a good idea in terms of performance and scalability? 

  

Best regards, 

Hong Dai Thanh. 
