Re: KIP-28 does not allow Processor to specify partition of output message

Guozhang Wang Wed, 14 Oct 2015 13:00:22 -0700

I agree that cluster-wide partitioning would be preferable in cases where
multiple producers from different services sharing the same topics, and
this may well be resolved by the same manner like the schema registry
service. On the other hand, I think this is not a problem that would be
solved at the KStream level which only considers per-job configs.


I think a KStream-layer partitioner in addition to the Producer partitioner
may not be a bad thing in that:

1) like Randall mentioned, with the producer Partitioner we basically need
to switch on ALL possible topics, and casting the Object key / value into
the proper types, and then do the partitioning logic; this is awkward
especially for cases that users only want to have customized partitioning
for a small subset of topics while being OK to leave the others to default
behavior (i.e. murmur hash on key). Extending the DefaultPartitioner may
partially solve this problem but not all.

2) with KStream-layer partitioner, it basically allows us to "overwrite"
the partitioning for probably a subset of partitions while leaving others
to the producer's default partitioner. This partitioner would be exposed to
users working on the lower-level processor API layer, while on the KStream
DSL layer we may well "infer" the partitioning scheme for most cases based
on join / aggregation specs, hence would not incur much burden for users.

Guozhang

On Wed, Oct 14, 2015 at 10:57 AM, Yasuhiro Matsuda <
[email protected]> wrote:

> A partitioning scheme should be a cluster wide thing. Letting each sink
> have a different partitioning scheme does not make sense to me. A
> partitioning scheme is not specific to a stream job, each task or a sink. I
> think specifying it at sink level is more error prone.
>
> If a user wants to customize a partitioning scheme, he/she also want to
> manage it at some central place, maybe a code repo, or a jar file. All
> application must use the same logic, otherwise data will be messed up.
> Thus, a single class representing all partitioning logic is not a bad thing
> at all. (The code organization wise, all logic does not necessarily in the
> single class, of course.)
>
>
> On Wed, Oct 14, 2015 at 8:47 AM, Randall Hauch <[email protected]> wrote:
>
>> Created https://issues.apache.org/jira/browse/KAFKA-2649 and attached a
>> PR with the proposed change.
>>
>> Thanks!
>>
>>
>> On October 14, 2015 at 3:12:34 AM, Guozhang Wang ([email protected])
>> wrote:
>>
>> Thanks!
>>
>> On Tue, Oct 13, 2015 at 9:34 PM, Randall Hauch <[email protected]> wrote:
>> Ok, cool. I agree we want something simple.  I'll create an issue and
>> create a pull request with a proposal. Look for it tomorrow.
>>
>> On Oct 13, 2015, at 10:25 PM, Guozhang Wang <[email protected]> wrote:
>>
>> I see your point. Yeah I think it is a good way to add a Partitioner into
>> addSink(...) but the Partitioner interface in producer is a bit overkill:
>>
>> "partition(String topic, Object key, byte[] keyBytes, Object value,
>> byte[] valueBytes, Cluster cluster)"
>>
>> whereas for us we only want to partition on (K key, V value).
>>
>> Perhaps we should add a new Partitioner interface in Kafka Streams?
>>
>> Guozhang
>>
>> On Tue, Oct 13, 2015 at 6:38 PM, Randall Hauch <[email protected]> wrote:
>> This overrides the partitioning logic for all topics, right? That means I
>> have to explicitly call the default partitioning logic for all topics
>> except those that my Producer forwards. I’m guess the best way to do by
>> extending org.apache.kafka.clients.producer.DefaultProducer. Of course,
>> with multiple sinks in my topology, I have to put all of the partitioning
>> logic inside a single class.
>>
>> What would you think about adding an overloaded
>> TopologyBuilder.addSink(…) method that takes a Partitioner (or better yet a
>> smaller functional interface). The resulting SinkProcessor could use that
>> Partitioner instance to set the partition number? That’d be super
>> convenient for users, would keep the logic where it belongs (where the
>> topology defines the sinks), and best of all the implementations won't have
>> to worry about any other topics, such as those used by stores, metrics, or
>> other sinks.
>>
>> Best regards,
>>
>> Randall
>>
>>
>> On October 13, 2015 at 8:09:41 PM, Guozhang Wang ([email protected])
>> wrote:
>>
>> Hi Randall,
>>
>> You can try to set the partitioner class as
>> ProducerConfig.PARTITIONER_CLASS_CONFIG in the StreamsConfig, its
>> interface
>> can be found in
>>
>> org.apache.kafka.clients.producer.Partitioner
>>
>> Let me know if it works for you.
>>
>> Guozhang
>>
>> On Tue, Oct 13, 2015 at 10:59 AM, Randall Hauch <[email protected]> wrote:
>>
>> > The new streams API added with KIP-28 is great. I’ve been using it on a
>> > prototype for a few weeks, and I’m looking forward to it being included
>> in
>> > 0.9.0. However, at the moment, a Processor implementation is not able to
>> > specify the partition number when it outputs messages.
>> >
>> > I’d be happy to log a JIRA and create a PR to add it to the API, but
>> > without knowing all of the history I’m wondering if leaving it out of
>> the
>> > API was intentional.
>> >
>> > Thoughts?
>> >
>> > Best regards,
>> >
>> > Randall Hauch
>> >
>>
>>
>>
>> --
>> -- Guozhang
>>
>>
>>
>> --
>> -- Guozhang
>>
>>
>>
>> --
>> -- Guozhang
>>
>
>


-- 
-- Guozhang

Re: KIP-28 does not allow Processor to specify partition of output message

Reply via email to