In our setup we deal with a similar situation (lots of time-series data
that we have to aggregate in a number of different ways).

Our approach is to push all of our data to a central producer stack; that
stack then submits the data to different topics, depending on a set of
predetermined rules.
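
To make that concrete, here is a rough sketch of the routing layer against
the 0.8-era Java producer API (the topic names, the routing rule, and the
broker address are all invented for illustration):

    // Rough sketch: one ingest point fans events out to per-purpose topics
    // according to a predetermined rule.  Names are illustrative only.
    import java.util.Properties;
    import kafka.javaapi.producer.Producer;
    import kafka.producer.KeyedMessage;
    import kafka.producer.ProducerConfig;

    public class RoutingProducer {
        private final Producer<String, String> producer;

        public RoutingProducer() {
            Properties props = new Properties();
            props.put("metadata.broker.list", "broker1:9092");
            props.put("serializer.class", "kafka.serializer.StringEncoder");
            producer = new Producer<>(new ProducerConfig(props));
        }

        public void route(String customerId, String eventType, String payload) {
            // The rule picks the topic; keying by customer keeps each
            // customer's events ordered within a single partition.
            String topic = eventType.startsWith("metric.")
                ? "raw-metrics" : "raw-events";
            producer.send(new KeyedMessage<>(topic, customerId, payload));
        }
    }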

For argument's sake, let's say we see 100 events/second... Once the data
is pushed to topics, we have several consumer applications.  The first
pushes the lowest-level raw data into our cluster (in our case HBase, but
it could just as easily be C*).  For this first consumer we use a small
batch size, say 100 records.

Our second consumer does some aggregation and summarization and needs to
reference additional resources.  Its batch size is considerably larger,
say 1000 records.

Our third consumer does a much larger aggregation, reducing the data down
to a dataset that is smaller by orders of magnitude, and we run that batch
size at 10k records.
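
In case it helps, here is roughly what one of these consumers looks like,
assuming the 0.8-era high-level consumer API.  The batch size is purely
application-level (accumulate N messages, then flush in one write); the
group id, topic, and ZooKeeper address are invented:

    import java.util.*;
    import kafka.consumer.Consumer;
    import kafka.consumer.ConsumerConfig;
    import kafka.consumer.ConsumerIterator;
    import kafka.consumer.KafkaStream;
    import kafka.javaapi.consumer.ConsumerConnector;

    public class RawDataConsumer {
        static final int BATCH_SIZE = 100;  // 1000 / 10000 for the others

        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("zookeeper.connect", "localhost:2181");
            props.put("group.id", "raw-writer");  // one group per application
            ConsumerConnector connector =
                Consumer.createJavaConsumerConnector(new ConsumerConfig(props));

            KafkaStream<byte[], byte[]> stream = connector
                .createMessageStreams(Collections.singletonMap("raw-events", 1))
                .get("raw-events").get(0);

            List<byte[]> batch = new ArrayList<>();
            ConsumerIterator<byte[], byte[]> it = stream.iterator();
            while (it.hasNext()) {
                batch.add(it.next().message());
                if (batch.size() >= BATCH_SIZE) {
                    // write the batch to HBase / C* here, then:
                    connector.commitOffsets();
                    batch.clear();
                }
            }
        }
    }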

Our aggregations are oftentimes time-based, like rolling events/counters
up to the hour or day level... so while we may occasionally re-process
some of the same information throughout the day, it lets us maintain a
system with near-real-time access to most of the data we're ingesting.
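
Conceptually the rollup is just timestamp truncation plus a counter.  A toy
illustration (re-processed events would double-count here; real code needs
idempotence or dedup on top):

    import java.time.Instant;
    import java.time.temporal.ChronoUnit;
    import java.util.HashMap;
    import java.util.Map;

    public class HourlyRollup {
        private final Map<String, Long> counts = new HashMap<>();

        // Truncate the event time to the hour and bump the counter for
        // that (customer, hour) bucket.
        public void add(String customerId, long epochMillis) {
            Instant hour = Instant.ofEpochMilli(epochMillis)
                                  .truncatedTo(ChronoUnit.HOURS);
            counts.merge(customerId + "@" + hour, 1L, Long::sum);
        }
    }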

This is certainly something we've had to tweak in terms of the number of
consumers/partitions and the batch sizes to get working optimally, but
we're pretty happy with it at the moment.

On Thu, Feb 12, 2015 at 8:22 AM, Gary Ogden <gog...@gmail.com> wrote:

> Thanks David. Whether Kafka is the right choice is exactly what I'm trying
> to determine.  Everything I want to do with these events is time-based:
> store them in the topic for 24 hours, read from the topics and get data
> for a time period (last hour, last 8 hours, etc.).  This reading from the
> topics could happen numerous times for the same dataset.  At the end of
> the 24 hours, we would then want to move the events into long-term storage
> in case we need them in the future.
>
> I'm thinking Kafka isn't the right fit for this use case though.  However,
> we have Cassandra already installed and running.  Maybe we use Kafka as
> the in-between... so dump the events onto Kafka, have consumers that
> process the events and dump them into Cassandra, and then we read the
> time-series data we need from Cassandra and not Kafka.
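>
> Roughly, the Cassandra side of that might look like the following (DataStax
> Java driver; the keyspace, table, and column names are invented, and the
> 24-hour TTL just mirrors the topic retention):
>
>     import com.datastax.driver.core.Cluster;
>     import com.datastax.driver.core.PreparedStatement;
>     import com.datastax.driver.core.Session;
>
>     public class EventsToCassandra {
>         public static void main(String[] args) {
>             Cluster cluster = Cluster.builder()
>                 .addContactPoint("127.0.0.1").build();
>             Session session = cluster.connect("events_ks");
>             // Time-series table keyed by (customer, day), clustered by
>             // event time, expiring rows after 24 hours.
>             PreparedStatement insert = session.prepare(
>                 "INSERT INTO events (customer_id, day, event_time, payload) "
>                 + "VALUES (?, ?, ?, ?) USING TTL 86400");
>             // ...consume from Kafka in a loop and execute
>             // session.execute(insert.bind(...)) for each event...
>             cluster.close();
>         }
>     }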
>
> My concern with this method would be two things:
> 1 - The time delay of dumping them on the queue and then processing them
> into Cassandra before the events are available.
> 2 - Is there any need for Kafka now? Why can't the code that puts the
> messages on the topic just insert directly into Cassandra and avoid this
> extra hop?
>
>
>
> On 12 February 2015 at 09:01, David McNelis <dmcne...@emergingthreats.net>
> wrote:
>
> > Gary,
> >
> > That is certainly a valid use case.  What Zijing was saying is that you
> > can only have 1 consumer per consumer group per partition.
> >
> > I think that what it boils down to is how you want your information
> > grouped inside your timeframes.  For example, if you want to have
> > everything for a specific user, then you could use the user id as your
> > partition key, ensuring that any data for that user is processed by the
> > same consumer (in however many consumer applications you opt to run).
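> >
> > In code terms that's just a keyed send (0.8 producer API; the topic name
> > is invented, and the producer is assumed to be configured as usual):
> >
> >     public class UserKeyedSend {
> >         static void sendForUser(kafka.javaapi.producer.Producer<String, String> p,
> >                                 String userId, String payload) {
> >             // Same key -> same partition, so one consumer in the group
> >             // sees all of this user's messages, in order.
> >             p.send(new kafka.producer.KeyedMessage<>("user-events", userId, payload));
> >         }
> >     }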
> >
> > The second point that I think Zijing was getting at was whether or not
> > your proposed use case makes sense for Kafka.  If your goal is to do
> > time-interval batch processing (versus N-record batches), then why use
> > Kafka for it?  Why not use something more adept at batch processing?
> > For example, if you're using HBase you can use Pig jobs that read only
> > the records created between specific timestamps.
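> >
> > (The same time-window read can also be expressed directly against the
> > HBase Java client instead of Pig; the table name and window below are
> > invented:)
> >
> >     import java.io.IOException;
> >     import org.apache.hadoop.hbase.HBaseConfiguration;
> >     import org.apache.hadoop.hbase.TableName;
> >     import org.apache.hadoop.hbase.client.*;
> >
> >     public class TimeWindowScan {
> >         public static void main(String[] args) throws IOException {
> >             try (Connection conn = ConnectionFactory
> >                      .createConnection(HBaseConfiguration.create());
> >                  Table table = conn.getTable(TableName.valueOf("events"))) {
> >                 long to = System.currentTimeMillis();
> >                 long from = to - 8 * 3600 * 1000L;  // last 8 hours
> >                 Scan scan = new Scan();
> >                 scan.setTimeRange(from, to);  // half-open: [from, to)
> >                 try (ResultScanner rs = table.getScanner(scan)) {
> >                     for (Result r : rs) {
> >                         // process each row in the window
> >                     }
> >                 }
> >             }
> >         }
> >     }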
> >
> > David
> >
> > On Thu, Feb 12, 2015 at 7:44 AM, Gary Ogden <gog...@gmail.com> wrote:
> >
> > > So it's not possible to have 1 topic with 1 partition and many
> > > consumers of that topic?
> > >
> > > My intention is to have a topic with many consumers, but each consumer
> > > needs to be able to have access to all the messages in that topic.
> > >
> > > On 11 February 2015 at 20:42, Zijing Guo <alter...@yahoo.com.invalid>
> > > wrote:
> > >
> > > > Partition key applies at the producer level: if you have multiple
> > > > partitions for a single topic, then you can pass a key into the
> > > > KeyedMessage object, and based on the configured partitioner.class it
> > > > will return a partition number for the producer, and the producer
> > > > will find the leader for that partition.  I don't know how Kafka
> > > > handles the time-series case, but it depends on how many partitions
> > > > that topic has.  If you only have 1 partition, then you don't need to
> > > > worry about order at all, since each consumer group can only allow 1
> > > > consumer instance to consume that data.  If you have multiple
> > > > partitions (say 3 for example), then you can fire up 3 consumer
> > > > instances under the same consumer group, and each will consume only 1
> > > > partition's data.  If order within each partition matters, then you
> > > > need to do some work on the producer side.
> > > > Hope this helps,
> > > > Edwin
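> > > >
> > > > P.S. A custom partitioner under that 0.8 API looks roughly like this
> > > > (illustrative only; it would be enabled by setting the producer's
> > > > "partitioner.class" property to this class's fully qualified name):
> > > >
> > > >     import kafka.producer.Partitioner;
> > > >     import kafka.utils.VerifiableProperties;
> > > >
> > > >     public class CustomerPartitioner implements Partitioner {
> > > >         // The 0.8 API instantiates partitioners reflectively and
> > > >         // requires this constructor signature.
> > > >         public CustomerPartitioner(VerifiableProperties props) {}
> > > >
> > > >         @Override
> > > >         public int partition(Object key, int numPartitions) {
> > > >             // Mask the sign bit so extreme hash codes stay positive.
> > > >             return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
> > > >         }
> > > >     }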
> > > >
> > > >      On Wednesday, February 11, 2015 3:14 PM, Gary Ogden <gog...@gmail.com> wrote:
> > > >
> > > >
> > > > I'm trying to understand how the partition key works and whether I
> > > > need to specify a partition key for my topics or not.  What happens
> > > > if I don't specify a PK and I have more than one consumer that wants
> > > > all messages in a topic for a certain period of time?  Will those
> > > > consumers get all the messages, but they just may not be ordered
> > > > correctly?
> > > >
> > > > The current scenario is that we will have events going into a topic
> > > > based on customer, and the data will remain in the topic for 24
> > > > hours.  We will then have multiple consumers reading messages from
> > > > that topic.  They will want to be able to get them out over a time
> > > > range (could be the last hour, last 8 hours, etc.).
> > > >
> > > > So if I specify the same PK for each subscriber, then each consumer
> > > > will get all messages in the correct order?  If I don't specify the
> > > > PK or use a random one, will each consumer still get all the messages
> > > > but they just won't be ordered correctly?
> > > >
> > > >
> > > >
> > > >
> > >
> >
>
