Sorry... I probably didn't word that well. When I say each message, that's assuming you're going out to poll for new messages on the topic. I.e., if you have a latency tolerance of 1 second, then you'd never want to go more than 1 second without polling for new messages.
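In loop form, that polling contract might look something like this (a minimal sketch only; `fetch_new_messages` and `handle` are hypothetical stand-ins for a real consumer fetch and your processing code):

```python
import time

def poll_loop(fetch_new_messages, handle, latency_tolerance_s=1.0, max_polls=3):
    """Poll at least once per latency-tolerance window.

    fetch_new_messages and handle are hypothetical callables standing in
    for a real consumer fetch and your processing logic.
    """
    poll_times = []
    for _ in range(max_polls):
        start = time.monotonic()
        poll_times.append(start)
        for msg in fetch_new_messages():
            handle(msg)
        # Sleep only for whatever is left of the tolerance window, so the
        # gap between consecutive polls never exceeds the tolerance.
        elapsed = time.monotonic() - start
        time.sleep(max(0.0, latency_tolerance_s - elapsed))
    return poll_times
```

A real consumer would loop forever rather than `max_polls` times; the bound is just there to make the sketch finite.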
On Thu, Feb 12, 2015 at 12:15 PM, Gary Ogden <gog...@gmail.com> wrote:

> Thanks.
>
> I was under the impression that you can't process the data as each record
> comes into the topic? That Kafka doesn't support pushing messages to a
> consumer like a traditional message queue?
>
> On 12 February 2015 at 11:06, David McNelis <dmcne...@emergingthreats.net>
> wrote:
>
> > I'm going to go a bit in reverse for your questions. We built a RESTful
> > API to push data to so that we could submit things from multiple
> > sources that aren't necessarily things that our team would maintain, as
> > well as validate that data before we send it off to a topic.
> >
> > As for consumers... we expect that a consumer will receive everything
> > in a particular partition, but definitely not everything in a topic.
> > For example, for DNS information, a consumer expects to see everything
> > for a domain (our partition key there), but not for all domains... it
> > would be too much data to handle with a low enough latency.
> >
> > For us, our latencies vary by what we're processing at a given moment
> > or other large jobs that might be running outside of the stream
> > processing. Generally speaking, we keep latency within our limits by
> > adjusting the number of partitions for a topic, and subsequently the
> > number of consumers... which works until you hit a wall with your data
> > store, in which case tweaking must happen there too.
> >
> > Your flow of data will be the biggest impact on latency (along with
> > batch size)... If you see 100 events a minute, and have a batch size of
> > 1k, you'll have 10 minutes of latency... if you process your data as
> > each record comes in, then as long as your consumer can keep up with
> > that load, you'll have very low latency.
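The back-of-the-envelope math quoted above (100 events a minute against a 1k batch) is just the time it takes a full batch to accumulate:

```python
def worst_case_latency_minutes(batch_size, events_per_minute):
    """A full batch must accumulate before it is processed, so the oldest
    record in the batch waits roughly batch_size / rate minutes."""
    return batch_size / events_per_minute

# The thread's numbers: a 1k batch at 100 events/minute -> 10 minutes.
print(worst_case_latency_minutes(1000, 100))  # 10.0
```

Shrinking the batch (or processing per record) drives this toward zero, at the cost of more, smaller writes downstream.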
> > Kafka's latency has never been an issue for us, and we've always found
> > there's something that we're doing that is affecting the overall
> > throughput of the systems, be it needing to play with the number of
> > partitions, adjusting batch size, or resizing the Hadoop cluster to
> > meet increased need there.
> >
> > On Thu, Feb 12, 2015 at 9:54 AM, Gary Ogden <gog...@gmail.com> wrote:
> >
> > > Thanks again, David. So what kind of latencies are you experiencing
> > > with this? If I wanted to act upon certain events in this and send
> > > out alarms (email, SMS, etc.), what kind of delays are you seeing by
> > > the time you're able to process them?
> > >
> > > It seems if you were to create an alarm topic, dump alerts on there
> > > to be processed, and have a consumer then process those alerts (save
> > > to Cassandra and send out the notification), you could see some sort
> > > of delay.
> > >
> > > But none of your consumers are expecting that they will be getting
> > > all the data for that topic, are they? That's the hitch for me. I
> > > think I could rework the design so that no consumer would assume it's
> > > getting all the messages in the topic.
> > >
> > > When you say central producer stack, this is something you built
> > > outside of Kafka, I assume.
> > >
> > > On 12 February 2015 at 09:40, David McNelis
> > > <dmcne...@emergingthreats.net> wrote:
> > >
> > > > In our setup we deal with a similar situation (lots of time-series
> > > > data that we have to aggregate in a number of different ways).
> > > >
> > > > Our approach is to push all of our data to a central producer
> > > > stack; that stack then submits data to different topics, depending
> > > > on a set of predetermined rules.
> > > >
> > > > For argument's sake, let's say we see 100 events/second... Once the
> > > > data is pushed to topics, we have several consumer applications.
> > > > The first pushes the lowest-level raw data into our cluster (in
> > > > our case HBase, but could just as easily be C*). For this first
> > > > consumer we use a small batch size... say 100 records.
> > > >
> > > > Our second consumer does some aggregation and summarization and
> > > > needs to reference additional resources. This batch size is
> > > > considerably larger, say 1000 records in size.
> > > >
> > > > Our third consumer does a much larger aggregation where we're
> > > > reducing it down to a much smaller dataset, by orders of magnitude,
> > > > and we'll run that batch size at 10k records.
> > > >
> > > > Our aggregations are oftentimes time-based, like rolling events /
> > > > counters to the hour or day level... so while we may occasionally
> > > > re-process some of the same information throughout the day, it lets
> > > > us maintain a system where we have near-real-time access to most of
> > > > the data we're ingesting.
> > > >
> > > > This certainly is something we've had to tweak in terms of the
> > > > numbers of consumers / partitions and batch sizes to get to work
> > > > optimally, but we're pretty happy with it at the moment.
> > > >
> > > > On Thu, Feb 12, 2015 at 8:22 AM, Gary Ogden <gog...@gmail.com>
> > > > wrote:
> > > >
> > > > > Thanks, David. Whether Kafka is the right choice is exactly what
> > > > > I'm trying to determine. Everything I want to do with these
> > > > > events is time-based. Store them in the topic for 24 hours. Read
> > > > > from the topics and get data for a time period (last hour, last
> > > > > 8 hours, etc.). This reading from the topics could happen
> > > > > numerous times for the same dataset. At the end of the 24 hours,
> > > > > we would then want to store the events in long-term storage in
> > > > > case we need them in the future.
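The hour/day rollup that kind of aggregating consumer performs can be sketched as time-bucketed counting (the `(epoch_seconds, event_type)` input shape is an assumption for illustration; the real consumer would read these from a topic):

```python
from collections import Counter
from datetime import datetime, timezone

def rollup_counts(events, bucket="hour"):
    """Roll per-event timestamps up to hour- or day-level counters.

    events is a hypothetical iterable of (epoch_seconds, event_type)
    pairs; a real pipeline would consume these from a partition.
    """
    counts = Counter()
    for epoch_s, event_type in events:
        ts = datetime.fromtimestamp(epoch_s, tz=timezone.utc)
        if bucket == "hour":
            key = ts.strftime("%Y-%m-%d %H:00")
        else:  # day-level rollup
            key = ts.strftime("%Y-%m-%d")
        counts[(key, event_type)] += 1
    return counts
```

Because the buckets are idempotent keys, re-processing some of the same records during the day (as described above) just recomputes the same counters.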
> > > > > I'm thinking Kafka isn't the right fit for this use case,
> > > > > though. However, we have Cassandra already installed and running.
> > > > > Maybe we use Kafka as the in-between... So dump the events onto
> > > > > Kafka, have consumers that process the events and dump them into
> > > > > Cassandra, and then we read the time-series data we need from
> > > > > Cassandra and not Kafka.
> > > > >
> > > > > My concern with this method would be two things:
> > > > > 1 - the time delay of dumping them on the queue and then
> > > > > processing them into Cassandra before the events are available.
> > > > > 2 - Is there any need for Kafka now? Why can't the code that puts
> > > > > the messages on the topic just directly insert into Cassandra
> > > > > then and avoid this extra hop?
> > > > >
> > > > > On 12 February 2015 at 09:01, David McNelis
> > > > > <dmcne...@emergingthreats.net> wrote:
> > > > >
> > > > > > Gary,
> > > > > >
> > > > > > That is certainly a valid use case. What Zijing was saying is
> > > > > > that you can only have 1 consumer per consumer application per
> > > > > > partition.
> > > > > >
> > > > > > I think that what it boils down to is how you want your
> > > > > > information grouped inside your timeframes. For example, if
> > > > > > you want to have everything for a specific user, then you
> > > > > > could use that as your partition key, ensuring that any data
> > > > > > for that user is processed by the same consumer (in however
> > > > > > many consumer applications you opt to run).
> > > > > >
> > > > > > The second point that I think Zijing was getting at was
> > > > > > whether or not your proposed use case makes sense for Kafka.
> > > > > > If your goal is to do time-interval batch processing (versus
> > > > > > N-record batches), then why use Kafka for it?
> > > > > > Why not use something more adept at batch processing? For
> > > > > > example, if you're using HBase you can use Pig jobs that would
> > > > > > read only the records created between specific timestamps.
> > > > > >
> > > > > > David
> > > > > >
> > > > > > On Thu, Feb 12, 2015 at 7:44 AM, Gary Ogden <gog...@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > So it's not possible to have 1 topic with 1 partition and
> > > > > > > many consumers of that topic?
> > > > > > >
> > > > > > > My intention is to have a topic with many consumers, but
> > > > > > > each consumer needs to be able to have access to all the
> > > > > > > messages in that topic.
> > > > > > >
> > > > > > > On 11 February 2015 at 20:42, Zijing Guo
> > > > > > > <alter...@yahoo.com.invalid> wrote:
> > > > > > >
> > > > > > > > The partition key is at the producer level: if you have
> > > > > > > > multiple partitions for a single topic, then you can pass
> > > > > > > > in a key for the KeyedMessage object, and based on the
> > > > > > > > partition.class in use, it will return a partition number
> > > > > > > > for the producer, and the producer will find the leader
> > > > > > > > for that partition. I don't know how Kafka could handle
> > > > > > > > the time-series case, but it depends on how many
> > > > > > > > partitions that topic has. If you only have 1 partition,
> > > > > > > > then you don't need to worry about order at all, since
> > > > > > > > each consumer group can only allow 1 consumer instance to
> > > > > > > > consume that data. If you have multiple partitions (say 3,
> > > > > > > > for example), then you can fire up 3 consumer instances
> > > > > > > > under the same consumer group, and each will only consume
> > > > > > > > 1 partition's data.
> > > > > > > > If order in each partition matters, then you need to do
> > > > > > > > some work on the producer side.
> > > > > > > >
> > > > > > > > Hope this helps,
> > > > > > > > Edwin
> > > > > > > >
> > > > > > > > On Wednesday, February 11, 2015 3:14 PM, Gary Ogden
> > > > > > > > <gog...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > I'm trying to understand how the partition key works and
> > > > > > > > whether I need to specify a partition key for my topics or
> > > > > > > > not. What happens if I don't specify a PK and I have more
> > > > > > > > than one consumer that wants all messages in a topic for a
> > > > > > > > certain period of time? Will those consumers get all the
> > > > > > > > messages, but they just may not be ordered correctly?
> > > > > > > >
> > > > > > > > The current scenario is that we will have events going
> > > > > > > > into a topic based on customer, and the data will remain
> > > > > > > > in the topic for 24 hours. We will then have multiple
> > > > > > > > consumers reading messages from that topic. They will want
> > > > > > > > to be able to get them out over a time range (could be
> > > > > > > > last hour, last 8 hours, etc.).
> > > > > > > >
> > > > > > > > So if I specify the same PK for each subscriber, then each
> > > > > > > > consumer will get all messages in the correct order? If I
> > > > > > > > don't specify the PK or use a random one, will each
> > > > > > > > consumer still get all the messages, but they just won't
> > > > > > > > be ordered correctly?
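The key-to-partition mapping described in the thread can be illustrated with a toy partitioner (a sketch only: Kafka's actual default partitioner hashes the key bytes with murmur2, not MD5, and a missing key gets a round-robin/sticky partition instead):

```python
import hashlib

def partition_for(key, num_partitions):
    """Mimic key-based partitioning: hash the key and take it modulo the
    partition count, so every message with the same key always lands in
    the same partition (and is therefore ordered relative to other
    messages with that key)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions
```

This is why keying by customer keeps each customer's events in order: the same key always maps to the same partition, and a partition is consumed by exactly one consumer instance per group.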