Sounds reasonable, and agreed that this seems well below the ZK limits.

Best Regards,
Iain

Sent from my iPhone
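[A back-of-the-envelope check on the sizing discussed below, using the
figures quoted in the thread: ~500 message types x ~10 formulae x ~100
bytes each comes to roughly 500 KB across the whole /formula tree, and
only about 1 KB per message-type subtree, far below ZooKeeper's default
~1 MB response/znode limit (jute.maxbuffer) and nowhere near the >200k
children-per-znode regime mentioned below.]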
> On Jul 28, 2016, at 2:00 AM, Thanh Hong Dai <[email protected]> wrote:
>
> If we are to use ZK, the formulae will be stored in a tree like this:
>
> /formula/<message_type>[/<field>]
>
> I'm not sure whether each formula should go into its own node or not, but
> each message type only has about 10 formulae on average, and there are
> around 400-500 message types. Correct me if I'm wrong, but from my reading
> on ZK, this shouldn't run into the ZK node size problem.
>
> The formulae are rarely changed, but we need to support that use case when
> they do change. We plan to cache the formulae on heap and poll the primary
> source once in a while for updates.
>
> Best regards,
> Thanh Hong.
>
> From: iain wright [mailto:[email protected]]
> Sent: Thursday, 28 July, 2016 12:40 PM
> To: [email protected]
> Cc: Chris Horrocks <[email protected]>
> Subject: Re: Is it a good idea to use Flume Interceptor to process data?
>
> You likely want to pose the ZK questions on the zookeeper list. I know
> I've seen folks have problems when receiving >1 MB of data in a response,
> and definitely problems with >200k children of a znode.
>
> That said, I've used it with HBase 0.94-0.98 with ~20k regions without
> issue; I believe region servers use watchers vs polling.
>
> How often do the formulas change? The doc below states there is a
> potential race condition, or gap in events, with watchers, in that you
> need to set an additional watcher after receiving an event.
>
> Maybe it would be possible to use an on-heap cache, a pub/sub queue, and a
> DB as the source of truth? It's a pattern that has worked for us, although
> not in the context of Flume.
>
> I.e.:
> If you don't have the formula in cache, go to the DB (then cache it).
> If you do have the formula in cache, use it.
> If something changes the formula, it writes to the DB and publishes a
> message to a topic that all agents listen on, and the agents update their
> formulae based on the published message.
>
> The caveat being: if an agent ever disconnects from the pub/sub topic, it
> should either terminate itself or go to the DB on every read.
>
> Relevant:
> https://zookeeper.apache.org/doc/r3.1.2/zookeeperProgrammers.html
> https://cwiki.apache.org/confluence/display/CURATOR/TN4
>
> Sent from my iPhone
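[A minimal sketch of the on-heap cache + DB + pub/sub pattern described
above, using only the JDK. FormulaStore is a hypothetical stand-in for the
real database client, and invalidate()/invalidateAll() would be wired to
the pub/sub listener; none of these names come from the thread.]

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;

    public class FormulaCache {

        /** Hypothetical source of truth, e.g. backed by an RDBMS. */
        public interface FormulaStore {
            String load(String messageType); // fetch formula text by key
        }

        private final ConcurrentMap<String, String> cache =
                new ConcurrentHashMap<>();
        private final FormulaStore store;

        public FormulaCache(FormulaStore store) {
            this.store = store;
        }

        /** Cache-aside read: serve from heap if present, else hit the DB. */
        public String get(String messageType) {
            return cache.computeIfAbsent(messageType, store::load);
        }

        /**
         * Called by the pub/sub listener when an update message arrives.
         * Dropping the entry forces the next read to reload from the DB.
         */
        public void invalidate(String messageType) {
            cache.remove(messageType);
        }

        /** The caveat above: on pub/sub disconnect, stop trusting the cache. */
        public void invalidateAll() {
            cache.clear();
        }
    }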
> On Jul 27, 2016, at 8:57 PM, Thanh Hong Dai <[email protected]> wrote:
>
> Hi,
>
> We actually attach the Interceptor to the source, as you have said. Sorry
> for the confusion.
>
> (I also found out that I wrote "other streaming processing frameworks such
> as Spark of Kafka", which should be read as "other streaming processing
> frameworks such as Spark or Storm".)
>
> Thanks for the suggestion about ZooKeeper. We are aware of the
> configuration storage functionality of ZooKeeper, but we don't have much
> experience using it. Would storing around 5,000 formulae (usually simple
> ones, less than 100 bytes each) affect the overall performance of
> ZooKeeper? To detect updates, there are two approaches: poll all the
> formulae, or use watchers. Which approach would be better?
>
> The monitoring data is not latency sensitive: the process that puts the
> data of the last hour into Kafka only runs at the 5th or 10th minute of
> the hour, and we are allowed to take one more hour to process the data
> (which means that we can see the 8 AM data at 10 AM at the latest).
>
> Best regards,
> Thanh Hong.
>
> From: Chris Horrocks [mailto:[email protected]]
> Sent: Wednesday, 27 July, 2016 7:28 PM
> To: [email protected]
> Subject: Re: Is it a good idea to use Flume Interceptor to process data?
>
> Some rough initial thoughts:
>
> This is interesting, but you might need to elaborate on how you've
> achieved attaching an interceptor to a channel (and why, in lieu of
> attaching it to the source):
>
>     we attach the Interceptor to the channel
>
> Personally I'd have done this by feeding the data into Spark Streaming and
> keeping Flume as low-overhead as possible, particularly if it's monitoring
> data that's latency sensitive. For storing the calculation variables for
> consumption by the interceptor, I'd go with something like ZooKeeper.
>
> --
> Chris Horrocks
>
> On Wed, Jul 27, 2016 at 12:39 pm, Thanh Hong Dai <[email protected]> wrote:
>
> Hi,
>
> To give some background: we are currently buffering monitoring data into
> Kafka, where each message in Kafka records several metrics at a point in
> time. For each record, we need to perform some calculations based on the
> metrics in the record, append the results (multiple of them) to the
> record, and send the resulting record into a data store (let's call it
> DS1). All the data required for the calculations is encapsulated in the
> record, essentially making this an embarrassingly parallel problem.
>
> The formulae for the calculations are stored in a different data store
> (let's call it DS2) and can be changed (added/deleted/modified) by users.
> We are not required to react to changes immediately, but we should do so
> within a reasonable time (e.g. 5 minutes).
>
> Currently, we have prototyped an implementation of the data processing
> described above as an Interceptor. We define the source as Kafka and the
> sink as the sink for DS1, and we attach the Interceptor to the channel. As
> described above, the Interceptor will poll DS2 regularly for formula
> changes, and will be responsible for processing the data as it comes in
> from Kafka.
>
> We are aware of other streaming processing frameworks such as Spark of
> Kafka. However, the implementation above is motivated by the fact that
> Flume already provides reliable streaming, and we want to reuse as much
> code as possible.
>
> Is this usage of Flume a good idea in terms of performance and
> scalability?
>
> Best regards,
> Hong Dai Thanh.
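[A minimal sketch of the interceptor arrangement described in the original
mail, against Flume's Interceptor API. evaluate() is a hypothetical
placeholder for the real formula engine, and the formula refresh is assumed
to run in the background as discussed earlier in the thread.]

    import java.nio.charset.StandardCharsets;
    import java.util.List;

    import org.apache.flume.Context;
    import org.apache.flume.Event;
    import org.apache.flume.interceptor.Interceptor;

    public class FormulaInterceptor implements Interceptor {

        @Override
        public void initialize() {
            // Start the poller / pub-sub listener that keeps the formula
            // cache fresh (see the cache sketch earlier in the thread).
        }

        @Override
        public Event intercept(Event event) {
            String record = new String(event.getBody(), StandardCharsets.UTF_8);
            // Hypothetical: compute the derived values for this record and
            // append them before the event travels on to the sink.
            String enriched = record + "," + evaluate(record);
            event.setBody(enriched.getBytes(StandardCharsets.UTF_8));
            return event;
        }

        @Override
        public List<Event> intercept(List<Event> events) {
            for (Event e : events) {
                intercept(e);
            }
            return events;
        }

        @Override
        public void close() {
            // Stop the poller / listener started in initialize().
        }

        private String evaluate(String record) {
            return "0"; // placeholder for the real calculation
        }

        /** Flume instantiates interceptors through a Builder. */
        public static class Builder implements Interceptor.Builder {
            @Override
            public Interceptor build() {
                return new FormulaInterceptor();
            }

            @Override
            public void configure(Context context) {
                // Read e.g. DB connection settings from the agent config.
            }
        }
    }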
