Sounds reasonable, and agreed that this seems well below the ZK limits.

Best Regards,
Iain

Sent from my iPhone
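[A back-of-the-envelope check on the sizing discussed below, using the
figures quoted in the thread: ~500 message types x ~10 formulae x ~100
bytes each comes to roughly 500 KB across the whole /formula tree, and
only about 1 KB per message-type subtree, far below ZooKeeper's default
~1 MB response/znode limit (jute.maxbuffer) and nowhere near the >200k
children-per-znode regime mentioned below.]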
> On Jul 28, 2016, at 2:00 AM, Thanh Hong Dai <[email protected]> wrote:
>
> If we are to use ZK, the formulae will be stored in a tree like this:
>
> /formula/<message_type>[/<field>]
>
> I'm not sure whether each formula should go into its own node or not, but
> each message type only has about 10 formulae on average, and there are
> around 400-500 message types. Correct me if I'm wrong, but from my reading
> on ZK, this shouldn't run into the ZK node size problem.
>
> The formulae are rarely changed, but we need to support that use case when
> they do change. We plan to cache the formulae on heap and poll the primary
> source once in a while for updates.
>
> Best regards,
> Thanh Hong.
>
> From: iain wright [mailto:[email protected]]
> Sent: Thursday, 28 July, 2016 12:40 PM
> To: [email protected]
> Cc: Chris Horrocks <[email protected]>
> Subject: Re: Is it a good idea to use Flume Interceptor to process data?
>
> You likely want to pose the ZK questions on the zookeeper list. I know
> I've seen folks have problems when receiving >1 MB of data in a response,
> and definitely problems with >200k children of a znode.
>
> That said, I've used it with HBase 0.94-0.98 with ~20k regions without
> issue; I believe region servers use watchers vs polling.
>
> How often do the formulas change? The doc below states there is a
> potential race condition, or gap in events, with watchers, in that you
> need to set an additional watcher after receiving an event.
>
> Maybe it would be possible to use an on-heap cache, a pub/sub queue, and a
> DB as the source of truth? It's a pattern that has worked for us, although
> not in the context of Flume.
>
> I.e.:
> If you don't have the formula in cache, go to the DB (then cache it).
> If you do have the formula in cache, use it.
> If something changes the formula, it writes to the DB and publishes a
> message to a topic that all agents listen on, and the agents update their
> formulae based on the published message.
>
> The caveat being: if an agent ever disconnects from the pub/sub topic, it
> should either terminate itself or go to the DB on every read.
>
> Relevant:
> https://zookeeper.apache.org/doc/r3.1.2/zookeeperProgrammers.html
> https://cwiki.apache.org/confluence/display/CURATOR/TN4
>
> Sent from my iPhone
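[A minimal sketch of the on-heap cache + DB + pub/sub pattern described
above, using only the JDK. FormulaStore is a hypothetical stand-in for the
real database client, and invalidate()/invalidateAll() would be wired to
the pub/sub listener; none of these names come from the thread.]

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;

    public class FormulaCache {

        /** Hypothetical source of truth, e.g. backed by an RDBMS. */
        public interface FormulaStore {
            String load(String messageType); // fetch formula text by key
        }

        private final ConcurrentMap<String, String> cache =
                new ConcurrentHashMap<>();
        private final FormulaStore store;

        public FormulaCache(FormulaStore store) {
            this.store = store;
        }

        /** Cache-aside read: serve from heap if present, else hit the DB. */
        public String get(String messageType) {
            return cache.computeIfAbsent(messageType, store::load);
        }

        /**
         * Called by the pub/sub listener when an update message arrives.
         * Dropping the entry forces the next read to reload from the DB.
         */
        public void invalidate(String messageType) {
            cache.remove(messageType);
        }

        /** The caveat above: on pub/sub disconnect, stop trusting the cache. */
        public void invalidateAll() {
            cache.clear();
        }
    }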
> On Jul 27, 2016, at 8:57 PM, Thanh Hong Dai <[email protected]> wrote:
>
> Hi,
>
> We actually attach the Interceptor to the source, as you have said. Sorry
> for the confusion.
>
> (I also found out that I wrote "other streaming processing frameworks such
> as Spark of Kafka", which should be read as "other streaming processing
> frameworks such as Spark or Storm".)
>
> Thanks for the suggestion about ZooKeeper. We are aware of the
> configuration storage functionality of ZooKeeper, but we don't have much
> experience using it. Would storing around 5,000 formulae (usually simple
> ones, less than 100 bytes each) affect the overall performance of
> ZooKeeper? To detect updates, there are two approaches: poll all the
> formulae, or use watchers. Which approach would be better?
>
> The monitoring data is not latency sensitive: the process that puts the
> data of the last hour into Kafka only runs at the 5th or 10th minute of
> the hour, and we are allowed to take one more hour to process the data
> (which means that we can see the 8 AM data at 10 AM at the latest).
>
> Best regards,
> Thanh Hong.
>
> From: Chris Horrocks [mailto:[email protected]]
> Sent: Wednesday, 27 July, 2016 7:28 PM
> To: [email protected]
> Subject: Re: Is it a good idea to use Flume Interceptor to process data?
>
> Some rough initial thoughts:
>
> This is interesting, but you might need to elaborate on how you've
> achieved attaching an interceptor to a channel (and why, in lieu of
> attaching it to the source):
>
>     we attach the Interceptor to the channel
>
> Personally I'd have done this by feeding the data into Spark Streaming and
> keeping Flume as low-overhead as possible, particularly if it's monitoring
> data that's latency sensitive. For storing the calculation variables for
> consumption by the interceptor, I'd go with something like ZooKeeper.
>
> --
> Chris Horrocks
>
> On Wed, Jul 27, 2016 at 12:39 pm, Thanh Hong Dai <[email protected]> wrote:
>
> Hi,
>
> To give some background: we are currently buffering monitoring data into
> Kafka, where each message in Kafka records several metrics at a point in
> time. For each record, we need to perform some calculations based on the
> metrics in the record, append the results (multiple of them) to the
> record, and send the resulting record into a data store (let's call it
> DS1). All the data required for the calculations is encapsulated in the
> record, essentially making this an embarrassingly parallel problem.
>
> The formulae for the calculations are stored in a different data store
> (let's call it DS2) and can be changed (added/deleted/modified) by users.
> We are not required to react to changes immediately, but we should do so
> within a reasonable time (e.g. 5 minutes).
>
> Currently, we have prototyped an implementation of the data processing
> described above as an Interceptor. We define the source as Kafka and the
> sink as the sink for DS1, and we attach the Interceptor to the channel. As
> described above, the Interceptor will poll DS2 regularly for formula
> changes, and will be responsible for processing the data as it comes in
> from Kafka.
>
> We are aware of other streaming processing frameworks such as Spark of
> Kafka. However, the implementation above is motivated by the fact that
> Flume already provides reliable streaming, and we want to reuse as much
> code as possible.
>
> Is this usage of Flume a good idea in terms of performance and
> scalability?
>
> Best regards,
> Hong Dai Thanh.
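[A minimal sketch of the interceptor arrangement described in the original
mail, against Flume's Interceptor API. evaluate() is a hypothetical
placeholder for the real formula engine, and the formula refresh is assumed
to run in the background as discussed earlier in the thread.]

    import java.nio.charset.StandardCharsets;
    import java.util.List;

    import org.apache.flume.Context;
    import org.apache.flume.Event;
    import org.apache.flume.interceptor.Interceptor;

    public class FormulaInterceptor implements Interceptor {

        @Override
        public void initialize() {
            // Start the poller / pub-sub listener that keeps the formula
            // cache fresh (see the cache sketch earlier in the thread).
        }

        @Override
        public Event intercept(Event event) {
            String record = new String(event.getBody(), StandardCharsets.UTF_8);
            // Hypothetical: compute the derived values for this record and
            // append them before the event travels on to the sink.
            String enriched = record + "," + evaluate(record);
            event.setBody(enriched.getBytes(StandardCharsets.UTF_8));
            return event;
        }

        @Override
        public List<Event> intercept(List<Event> events) {
            for (Event e : events) {
                intercept(e);
            }
            return events;
        }

        @Override
        public void close() {
            // Stop the poller / listener started in initialize().
        }

        private String evaluate(String record) {
            return "0"; // placeholder for the real calculation
        }

        /** Flume instantiates interceptors through a Builder. */
        public static class Builder implements Interceptor.Builder {
            @Override
            public Interceptor build() {
                return new FormulaInterceptor();
            }

            @Override
            public void configure(Context context) {
                // Read e.g. DB connection settings from the agent config.
            }
        }
    }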
