Sorry... I probably didn't word that well. When I say each message, that's assuming you're going out to poll for new messages on the topic. I.e., if you have a latency tolerance of 1 second, then you'd never want to go more than 1 second without polling for new messages.
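In loop form, that polling contract might look something like this (a minimal sketch only; `fetch_new_messages` and `handle` are hypothetical stand-ins for a real consumer fetch and your processing code):

```python
import time

def poll_loop(fetch_new_messages, handle, latency_tolerance_s=1.0, max_polls=3):
    """Poll at least once per latency-tolerance window.

    fetch_new_messages and handle are hypothetical callables standing in
    for a real consumer fetch and your processing logic.
    """
    poll_times = []
    for _ in range(max_polls):
        start = time.monotonic()
        poll_times.append(start)
        for msg in fetch_new_messages():
            handle(msg)
        # Sleep only for whatever is left of the tolerance window, so the
        # gap between consecutive polls never exceeds the tolerance.
        elapsed = time.monotonic() - start
        time.sleep(max(0.0, latency_tolerance_s - elapsed))
    return poll_times
```

A real consumer would loop forever rather than `max_polls` times; the bound is just there to make the sketch finite.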
On Thu, Feb 12, 2015 at 12:15 PM, Gary Ogden <gog...@gmail.com> wrote:

> Thanks.
>
> I was under the impression that you can't process the data as each record
> comes into the topic? That Kafka doesn't support pushing messages to a
> consumer like a traditional message queue?
>
> On 12 February 2015 at 11:06, David McNelis <dmcne...@emergingthreats.net>
> wrote:
>
> > I'm going to go a bit in reverse for your questions. We built a RESTful
> > API to push data to so that we could submit things from multiple
> > sources that aren't necessarily things that our team would maintain, as
> > well as validate that data before we send it off to a topic.
> >
> > As for consumers... we expect that a consumer will receive everything
> > in a particular partition, but definitely not everything in a topic.
> > For example, for DNS information, a consumer expects to see everything
> > for a domain (our partition key there), but not for all domains... it
> > would be too much data to handle with a low enough latency.
> >
> > For us, our latencies vary by what we're processing at a given moment
> > or other large jobs that might be running outside of the stream
> > processing. Generally speaking, we keep latency within our limits by
> > adjusting the number of partitions for a topic, and subsequently the
> > number of consumers... which works until you hit a wall with your data
> > store, in which case tweaking must happen there too.
> >
> > Your flow of data will be the biggest impact on latency (along with
> > batch size)... If you see 100 events a minute, and have a batch size of
> > 1k, you'll have 10 minutes of latency... if you process your data as
> > each record comes in, then as long as your consumer can keep up with
> > that load, you'll have very low latency.
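The back-of-the-envelope math quoted above (100 events a minute against a 1k batch) is just the time it takes a full batch to accumulate:

```python
def worst_case_latency_minutes(batch_size, events_per_minute):
    """A full batch must accumulate before it is processed, so the oldest
    record in the batch waits roughly batch_size / rate minutes."""
    return batch_size / events_per_minute

# The thread's numbers: a 1k batch at 100 events/minute -> 10 minutes.
print(worst_case_latency_minutes(1000, 100))  # 10.0
```

Shrinking the batch (or processing per record) drives this toward zero, at the cost of more, smaller writes downstream.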
> > Kafka's latency has never been an issue for us, and we've always found
> > there's something that we're doing that is affecting the overall
> > throughput of the systems, be it needing to play with the number of
> > partitions, adjusting batch size, or resizing the Hadoop cluster to
> > meet increased need there.
> >
> > On Thu, Feb 12, 2015 at 9:54 AM, Gary Ogden <gog...@gmail.com> wrote:
> >
> > > Thanks again, David. So what kind of latencies are you experiencing
> > > with this? If I wanted to act upon certain events in this and send
> > > out alarms (email, SMS, etc.), what kind of delays are you seeing by
> > > the time you're able to process them?
> > >
> > > It seems if you were to create an alarm topic, dump alerts on there
> > > to be processed, and have a consumer then process those alerts (save
> > > to Cassandra and send out the notification), you could see some sort
> > > of delay.
> > >
> > > But none of your consumers are expecting that they will be getting
> > > all the data for that topic, are they? That's the hitch for me. I
> > > think I could rework the design so that no consumer would assume it's
> > > getting all the messages in the topic.
> > >
> > > When you say central producer stack, this is something you built
> > > outside of Kafka, I assume.
> > >
> > > On 12 February 2015 at 09:40, David McNelis
> > > <dmcne...@emergingthreats.net> wrote:
> > >
> > > > In our setup we deal with a similar situation (lots of time-series
> > > > data that we have to aggregate in a number of different ways).
> > > >
> > > > Our approach is to push all of our data to a central producer
> > > > stack; that stack then submits data to different topics, depending
> > > > on a set of predetermined rules.
> > > >
> > > > For argument's sake, let's say we see 100 events/second... Once the
> > > > data is pushed to topics, we have several consumer applications.
> > > > The first pushes the lowest-level raw data into our cluster (in
> > > > our case HBase, but could just as easily be C*). For this first
> > > > consumer we use a small batch size... say 100 records.
> > > >
> > > > Our second consumer does some aggregation and summarization and
> > > > needs to reference additional resources. This batch size is
> > > > considerably larger, say 1000 records in size.
> > > >
> > > > Our third consumer does a much larger aggregation where we're
> > > > reducing it down to a much smaller dataset, by orders of magnitude,
> > > > and we'll run that batch size at 10k records.
> > > >
> > > > Our aggregations are oftentimes time-based, like rolling events /
> > > > counters to the hour or day level... so while we may occasionally
> > > > re-process some of the same information throughout the day, it lets
> > > > us maintain a system where we have near-real-time access to most of
> > > > the data we're ingesting.
> > > >
> > > > This certainly is something we've had to tweak in terms of the
> > > > numbers of consumers / partitions and batch sizes to get to work
> > > > optimally, but we're pretty happy with it at the moment.
> > > >
> > > > On Thu, Feb 12, 2015 at 8:22 AM, Gary Ogden <gog...@gmail.com>
> > > > wrote:
> > > >
> > > > > Thanks, David. Whether Kafka is the right choice is exactly what
> > > > > I'm trying to determine. Everything I want to do with these
> > > > > events is time-based. Store them in the topic for 24 hours. Read
> > > > > from the topics and get data for a time period (last hour, last
> > > > > 8 hours, etc.). This reading from the topics could happen
> > > > > numerous times for the same dataset. At the end of the 24 hours,
> > > > > we would then want to store the events in long-term storage in
> > > > > case we need them in the future.
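The hour/day rollup that kind of aggregating consumer performs can be sketched as time-bucketed counting (the `(epoch_seconds, event_type)` input shape is an assumption for illustration; the real consumer would read these from a topic):

```python
from collections import Counter
from datetime import datetime, timezone

def rollup_counts(events, bucket="hour"):
    """Roll per-event timestamps up to hour- or day-level counters.

    events is a hypothetical iterable of (epoch_seconds, event_type)
    pairs; a real pipeline would consume these from a partition.
    """
    counts = Counter()
    for epoch_s, event_type in events:
        ts = datetime.fromtimestamp(epoch_s, tz=timezone.utc)
        if bucket == "hour":
            key = ts.strftime("%Y-%m-%d %H:00")
        else:  # day-level rollup
            key = ts.strftime("%Y-%m-%d")
        counts[(key, event_type)] += 1
    return counts
```

Because the buckets are idempotent keys, re-processing some of the same records during the day (as described above) just recomputes the same counters.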
> > > > > I'm thinking Kafka isn't the right fit for this use case,
> > > > > though. However, we have Cassandra already installed and running.
> > > > > Maybe we use Kafka as the in-between... So dump the events onto
> > > > > Kafka, have consumers that process the events and dump them into
> > > > > Cassandra, and then we read the time-series data we need from
> > > > > Cassandra and not Kafka.
> > > > >
> > > > > My concern with this method would be two things:
> > > > > 1 - the time delay of dumping them on the queue and then
> > > > > processing them into Cassandra before the events are available.
> > > > > 2 - Is there any need for Kafka now? Why can't the code that puts
> > > > > the messages on the topic just directly insert into Cassandra
> > > > > then and avoid this extra hop?
> > > > >
> > > > > On 12 February 2015 at 09:01, David McNelis
> > > > > <dmcne...@emergingthreats.net> wrote:
> > > > >
> > > > > > Gary,
> > > > > >
> > > > > > That is certainly a valid use case. What Zijing was saying is
> > > > > > that you can only have 1 consumer per consumer application per
> > > > > > partition.
> > > > > >
> > > > > > I think that what it boils down to is how you want your
> > > > > > information grouped inside your timeframes. For example, if
> > > > > > you want to have everything for a specific user, then you
> > > > > > could use that as your partition key, ensuring that any data
> > > > > > for that user is processed by the same consumer (in however
> > > > > > many consumer applications you opt to run).
> > > > > >
> > > > > > The second point that I think Zijing was getting at was
> > > > > > whether or not your proposed use case makes sense for Kafka.
> > > > > > If your goal is to do time-interval batch processing (versus
> > > > > > N-record batches), then why use Kafka for it?
> > > > > > Why not use something more adept at batch processing? For
> > > > > > example, if you're using HBase you can use Pig jobs that would
> > > > > > read only the records created between specific timestamps.
> > > > > >
> > > > > > David
> > > > > >
> > > > > > On Thu, Feb 12, 2015 at 7:44 AM, Gary Ogden <gog...@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > So it's not possible to have 1 topic with 1 partition and
> > > > > > > many consumers of that topic?
> > > > > > >
> > > > > > > My intention is to have a topic with many consumers, but
> > > > > > > each consumer needs to be able to have access to all the
> > > > > > > messages in that topic.
> > > > > > >
> > > > > > > On 11 February 2015 at 20:42, Zijing Guo
> > > > > > > <alter...@yahoo.com.invalid> wrote:
> > > > > > >
> > > > > > > > The partition key is at the producer level: if you have
> > > > > > > > multiple partitions for a single topic, then you can pass
> > > > > > > > in a key for the KeyedMessage object, and based on the
> > > > > > > > partition.class in use, it will return a partition number
> > > > > > > > for the producer, and the producer will find the leader
> > > > > > > > for that partition. I don't know how Kafka could handle
> > > > > > > > the time-series case, but it depends on how many
> > > > > > > > partitions that topic has. If you only have 1 partition,
> > > > > > > > then you don't need to worry about order at all, since
> > > > > > > > each consumer group can only allow 1 consumer instance to
> > > > > > > > consume that data. If you have multiple partitions (say 3,
> > > > > > > > for example), then you can fire up 3 consumer instances
> > > > > > > > under the same consumer group, and each will only consume
> > > > > > > > 1 partition's data.
> > > > > > > > If order in each partition matters, then you need to do
> > > > > > > > some work on the producer side.
> > > > > > > >
> > > > > > > > Hope this helps,
> > > > > > > > Edwin
> > > > > > > >
> > > > > > > > On Wednesday, February 11, 2015 3:14 PM, Gary Ogden
> > > > > > > > <gog...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > I'm trying to understand how the partition key works and
> > > > > > > > whether I need to specify a partition key for my topics or
> > > > > > > > not. What happens if I don't specify a PK and I have more
> > > > > > > > than one consumer that wants all messages in a topic for a
> > > > > > > > certain period of time? Will those consumers get all the
> > > > > > > > messages, but they just may not be ordered correctly?
> > > > > > > >
> > > > > > > > The current scenario is that we will have events going
> > > > > > > > into a topic based on customer, and the data will remain
> > > > > > > > in the topic for 24 hours. We will then have multiple
> > > > > > > > consumers reading messages from that topic. They will want
> > > > > > > > to be able to get them out over a time range (could be
> > > > > > > > last hour, last 8 hours, etc.).
> > > > > > > >
> > > > > > > > So if I specify the same PK for each subscriber, then each
> > > > > > > > consumer will get all messages in the correct order? If I
> > > > > > > > don't specify the PK or use a random one, will each
> > > > > > > > consumer still get all the messages, but they just won't
> > > > > > > > be ordered correctly?
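The key-to-partition mapping described in the thread can be illustrated with a toy partitioner (a sketch only: Kafka's actual default partitioner hashes the key bytes with murmur2, not MD5, and a missing key gets a round-robin/sticky partition instead):

```python
import hashlib

def partition_for(key, num_partitions):
    """Mimic key-based partitioning: hash the key and take it modulo the
    partition count, so every message with the same key always lands in
    the same partition (and is therefore ordered relative to other
    messages with that key)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions
```

This is why keying by customer keeps each customer's events in order: the same key always maps to the same partition, and a partition is consumed by exactly one consumer instance per group.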