Re: Questions about partitioning

2015-04-24 Thread Yi Pan
Hi, Susan,

Welcome to Samza!

First I will try to answer your question about partition assignment in
Samza. The assignment of stream partitions to Samza tasks is determined by
the SystemStreamPartitionGrouper. The default implementations include two
assignment methods: 1 task per system stream partition, and 1 task per
partition. Example: you have 2 streams (e.g. 2 Kafka topics), each with 2
partitions: s0.{p0,p1} and s1.{p0,p1}. In the 1 task per system stream
partition assignment, you will have 4 tasks corresponding to the following
4 system stream partitions: {s0.p0, s0.p1, s1.p0, s1.p1}; in the 1 task per
partition assignment, you will have 2 tasks with the following partition
groups: {(s0.p0, s1.p0), (s0.p1, s1.p1)}.
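
The two assignment methods described above can be sketched in a few lines
of Python. This is only an illustration of the grouping logic, not Samza's
own code; the function names and the (stream, partition) tuples are
hypothetical and chosen for this example.

```python
from collections import defaultdict

# Each system stream partition is modelled as a (stream, partition) tuple.
# These names are illustrative only; they are not Samza classes.
ssps = [("s0", 0), ("s0", 1), ("s1", 0), ("s1", 1)]


def group_by_system_stream_partition(ssps):
    """1 task per system stream partition: every pair gets its own task."""
    return {f"{stream}.p{partition}": [(stream, partition)]
            for stream, partition in ssps}


def group_by_partition(ssps):
    """1 task per partition: same-numbered partitions of all streams share a task."""
    groups = defaultdict(list)
    for stream, partition in ssps:
        groups[partition].append((stream, partition))
    return dict(groups)


per_ssp = group_by_system_stream_partition(ssps)   # 4 tasks
per_partition = group_by_partition(ssps)           # 2 tasks
```

With the 2 streams of 2 partitions each from the example, the first
function yields 4 single-element groups and the second yields 2 groups of
2 system stream partitions each.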

In your example, if your messages are partitioned by client ID, the
messages w/ the same client ID will be in a single partition. If you
follow Naveen's suggestion, each system stream partition will be assigned
to a dedicated Samza task, no matter whether you choose to have 10K topics
or 10K partitions. Note that if you choose 1 topic w/ 10K partitions,
whether clientID A and clientID B are handled by the same Samza task or not
depends on whether clientID A and B are placed in two different Kafka
partitions or not.
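
To illustrate the last point: with key-based partitioning, the partition
(and therefore the task) is a function of the client ID and the partition
count. The sketch below uses CRC32 purely as a stand-in hash; it is an
assumption for illustration and not the hash that Kafka's partitioner
actually uses. The function names are hypothetical.

```python
import zlib


def partition_for(client_id: str, num_partitions: int) -> int:
    """Map a client ID to a partition with a stand-in hash (not Kafka's)."""
    return zlib.crc32(client_id.encode("utf-8")) % num_partitions


def same_task(client_a: str, client_b: str, num_partitions: int) -> bool:
    """With one task per partition, two clients share a task exactly when
    their IDs map to the same partition."""
    return partition_for(client_a, num_partitions) == partition_for(client_b, num_partitions)
```

The same client ID always lands in the same partition, but two different
client IDs may or may not collide, depending on the hash and on the number
of partitions.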

As for Kafka performance, if you choose to use 10K topics, you will need to
be careful about ZooKeeper capacity, since that many topics creates a large
number of watches on the znodes.

Lastly, one comment on your use case: why 10K partitions? If you only want
to have per-clientID state in your stream processing, you can have 10
partitions, each holding 1K clientIDs, and your Samza local KV-store can
keep state keyed by clientID to satisfy your use case. (I saw that Jakob
made the same comment while I was typing. :) )
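
A minimal sketch of that idea, assuming a plain Python dict as a stand-in
for the task's local KV-store (the class and field names here are
hypothetical, not Samza's API): one task serves many client IDs, yet each
client's state stays separate because it is looked up by clientID.

```python
class ClientStateTask:
    """One task handling a partition that carries many client IDs."""

    def __init__(self):
        # Stand-in for the task-local KV-store, keyed by clientID.
        self.store = {}

    def process(self, client_id, amount):
        # Fetch (or create) this client's state, then mutate only that entry.
        state = self.store.setdefault(client_id, {"count": 0, "total": 0})
        state["count"] += 1
        state["total"] += amount
        return state


task = ClientStateTask()
task.process("A", 5)
task.process("B", 7)
task.process("A", 1)
```

After these three messages the entry for client A reflects only A's two
messages and the entry for client B only B's one, even though both were
processed by the same task.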

Thanks!

-Yi

On Fri, Apr 24, 2015 at 3:29 PM, Susan Luong wrote:

> Hi there, I'm new to Samza/Kafka and we're evaluating Samza to see whether
> it would be a good fit for our application. I just had a few questions
> about how partitioning works.
>
> I understand there is a limitation on the number of topics we can create
> [1], and I was wondering, if we need more than, say, 10K topics, would it be
> a better idea to use partitioning instead? Or would the same limits apply?
> I.e., would having 1 topic with 10K partitions produce the same performance
> issues as having 10K topics with 1 partition each?
>
> If we can overcome the topics limitation by creating more partitions, we'd
> like to be able to divide up our stream messages by client ID. Is it
> possible to group partitions so that we have a set of partitions that
> contain data from a certain client and another set of partitions for
> another client, within the same topic?
>
> For example, we might have a stream partition 'A' (for clientID A) and a
> corresponding task 'a' that processes messages from partition 'A', and a
> partition 'B' (for clientID B) and a corresponding task 'b' that processes
> messages from stream partition 'B'. Our problem, though, is that we'd like
> task 'a' to only process messages from stream A and never from stream
> B, since task 'a' may contain local state that applies specifically to
> stream A. Would this be possible?
>
> Maybe I'm not understanding how Samza works, but I'm hoping someone can
> help me clarify. Thanks in advance for your help.
>
> Susan
>
>
>
> [1]
> http://grokbase.com/t/kafka/users/133v60ng6v/limit-on-number-of-kafka-topic
>


Re: Questions about partitioning

2015-04-24 Thread Jakob Homan
Hey Susan-
  That volume of topics (or partitions) would be a significant burden
on both the Kafka cluster and the underlying YARN cluster (for the Samza
job).  A 'large number of partitions', even at places with huge Kafka
clusters, is on the order of 512 or so.  It sounds like you're trying
to use partitions as a means of isolation, rather than as a means of
load balancing.  In your example, for instance, if the clientID == the
partition, then when no requests from a specific client are coming in, the
Kafka partition won't be written to and the Samza task will be doing
nothing.  This would lead to poor utilization and lots of idle Samza
tasks.

Partitions are instead generally meant (this is all pluggable) for
dividing up the work evenly, so that one partition may
indeed handle multiple client ids in your example.  Is there any
specific reason you don't want to commingle the clientids?  The local
state can be keyed off the clientid and retrieved/mutated using this
value.

Thanks,
Jakob


On 24 April 2015 at 16:40, Naveen S wrote:
> Hey Susan,
>  As far as I know, there are only minimal differences
> between the Partition and Topic strategies in terms of performance - in
> terms of how they are allocated in memory they should be very similar, but
> I'll get some Kafka experts to comment on that.
>
> From Samza's perspective, if you choose to go with multiple partitions, you
> can write a Samza job which will repartition the stream exactly as you
> described: peek into the clientID of each stream event and send it to the
> corresponding partition [1]. You can have a second job, with each Task
> processing information from one partition (which will correspond to events
> from one clientID). In the implementation, there will be a one-to-one
> mapping between the Task and the Partition.
>
>
> [1]
> http://samza.apache.org/learn/documentation/0.9/api/javadocs/org/apache/samza/system/OutgoingMessageEnvelope.html
>
> Thanks,
> Naveen


Re: Questions about partitioning

2015-04-24 Thread Naveen S
Hey Susan,
 As far as I know, there are only minimal differences
between the Partition and Topic strategies in terms of performance - in
terms of how they are allocated in memory they should be very similar, but
I'll get some Kafka experts to comment on that.

From Samza's perspective, if you choose to go with multiple partitions, you
can write a Samza job which will repartition the stream exactly as you
described: peek into the clientID of each stream event and send it to the
corresponding partition [1]. You can have a second job, with each Task
processing information from one partition (which will correspond to events
from one clientID). In the implementation, there will be a one-to-one
mapping between the Task and the Partition.
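
As a rough sketch of the routing step in that first job (plain Python, not
the Samza API; the event layout, the `client_to_partition` table, and the
function name are all assumptions made for illustration), each event is
inspected for its clientID and appended to the output partition reserved
for that client:

```python
from collections import defaultdict


def repartition(events, client_to_partition):
    """Route each event to the output partition dedicated to its clientID."""
    output = defaultdict(list)
    for event in events:
        # Peek into the event for the clientID, then look up its partition.
        partition = client_to_partition[event["clientID"]]
        output[partition].append(event)
    return dict(output)


# Hypothetical one-to-one mapping: one output partition per client.
client_to_partition = {"A": 0, "B": 1}
events = [
    {"clientID": "A", "payload": 1},
    {"clientID": "B", "payload": 2},
    {"clientID": "A", "payload": 3},
]
routed = repartition(events, client_to_partition)
```

A second job with one task per partition would then see only client A's
events in partition 0 and only client B's events in partition 1.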


[1]
http://samza.apache.org/learn/documentation/0.9/api/javadocs/org/apache/samza/system/OutgoingMessageEnvelope.html

Thanks,
Naveen
