Re: [DISCUSS] KIP-221: Repartition Topic Hints in Streams

Jan Filipiak Sun, 12 Nov 2017 13:23:13 -0800

Hi Guozhang,

It felt like these questions were meant for me to answer.
I do not understand the first one. I also don't see why the user
shouldn't be able to specify a suffix for the topic name.

For the third question, I am not 100% sure whether the Produced class came
into existence at all. I remember proposing it somewhere in our DSL redo
discussion, which I later dropped out of. Finally, any call that does:

1. create the internal topic
2. register sink
3. register source

will always get the work done. If we have a Produced-like class, putting all the parameters in there makes sense (Partitioner, serde, PartitionHint, internal, name...).
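
To make the three steps and the parameter list concrete, here is a minimal Java sketch. It does not use the real Kafka Streams API: `RepartitionParams`, `ToyTopology`, and the `<application.id>-<name>-repartition` naming are assumptions made up for illustration, and the partitioner and serde are left out to keep it self-contained.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical "Produced-like" parameter object; not the real Kafka Streams class.
final class RepartitionParams {
    final String name;        // used as a suffix of the internal topic name
    final int partitionHint;  // desired number of partitions
    final boolean internal;   // true: Streams owns and creates the topic

    RepartitionParams(String name, int partitionHint, boolean internal) {
        this.name = name;
        this.partitionHint = partitionHint;
        this.internal = internal;
    }
}

// Toy topology that only records the actions a real builder would perform.
final class ToyTopology {
    final List<String> actions = new ArrayList<>();

    // Runs the three steps in order and returns the topic name that was used.
    String repartition(String applicationId, RepartitionParams p) {
        String topic = p.internal
                ? applicationId + "-" + p.name + "-repartition" // assumed naming scheme
                : p.name;                                        // user topic: name as is
        if (p.internal) {
            actions.add("create " + topic + " partitions=" + p.partitionHint); // step 1
        }
        actions.add("sink -> " + topic);   // step 2
        actions.add("source <- " + topic); // step 3
        return topic;
    }

    public static void main(String[] args) {
        ToyTopology topology = new ToyTopology();
        topology.repartition("my-app", new RepartitionParams("by-user", 5, true));
        topology.actions.forEach(System.out::println);
    }
}
```

With the name carried as a suffix, a topic such as the one built from "by-user" stays recognizable even when an application has many internal repartition topics.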


Hope this helps?


On 10.11.2017 07:54, Guozhang Wang wrote:

A few clarification questions on the proposal details.

1. API: although the repartition only happens at the final stateful
operations like agg / join, the repartition flag info was actually passed
from an earlier operator like map / groupBy. So what should the new API
look like? For example, if we do

stream.groupBy().through("topic-name", Produced..).aggregate

This would add a bunch of APIs to GroupedKStream/KTable.

2. Semantics: as Matthias mentioned, today any topic defined in a
"through()" call is considered a user topic, and hence users are
responsible for managing it, including the topic name. For this KIP's
purpose, though, users would not care about the topic name. I.e., as a user
I still want it to be an internal topic so that I do not need to worry
about it at all, but only specify num.partitions.

3. Details: Produced has no way to specify num.partitions or whether we
should repartition at all. So it is still not clear to
me how we would make use of it to achieve what's in the old
proposal's RepartitionHint class.



Guozhang


On Mon, Nov 6, 2017 at 1:21 PM, Ted Yu wrote:

bq. enlarge the score of through()

I guess you meant scope.

On Mon, Nov 6, 2017 at 1:15 PM, Jeyhun Karimov wrote:

Hi,

Sorry for the late reply. I am convinced that we should enlarge the score
of through() (add more overloads) instead of introducing a separate set of
overloads to other methods.
I will update the KIP soon based on the discussion and let you know.


Cheers,
Jeyhun

On Mon, Nov 6, 2017 at 9:18 PM Jan Filipiak wrote:

Sorry for not being 100% up to date.
Back then we discussed that whenever an operation puts a >Sink<
into the topology, a >Produced< parameter is added. This Produced
parameter could be marked internal or external. If internal, I think
the name would still make a great suffix for the topic name.

Is this plan still around? Otherwise, having the name as a suffix is
probably always good: it helps users identify more quickly the hot topics
that need more partitions when they have many of these internal
repartition topics.

Best Jan


On 06.11.2017 20:13, Matthias J. Sax wrote:

I absolutely agree with what you say. It's not a requirement to specify a
topic name -- and this was the idea -- if the user does specify a name, we
treat it as is -- if the user does not specify a name, Streams creates an
internal topic.

The goal of the Jira is to allow a simplified way to control
repartitioning (atm, the user needs to manually create a topic and use it
via through()).

Thus, the idea is to make the topic name parameter of through optional.

It's of course just an idea. Happy to have another API design. The goal
was to avoid too many new overloads.

Could you clarify exactly what you mean by keeping the current distinction?

Current distinction is: user topics are created manually and the user
specifies the name -- internal topics are created by Kafka Streams and
a name is generated automatically.

-> through("user-topic")
-> through(TopicConfig.withNumberOfPartitions(5)) // Streams creates an internal topic
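
A rough Java sketch of that distinction, with the topic name effectively optional. `ToyStream`, `ToyTopicConfig`, and the generated name format are hypothetical; `TopicConfig.withNumberOfPartitions` above is itself only a proposal in this thread, not an existing Streams API.

```java
// Hypothetical config holder mirroring the proposed TopicConfig.withNumberOfPartitions(5).
final class ToyTopicConfig {
    final int numberOfPartitions;

    private ToyTopicConfig(int numberOfPartitions) {
        this.numberOfPartitions = numberOfPartitions;
    }

    static ToyTopicConfig withNumberOfPartitions(int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("numberOfPartitions must be positive");
        }
        return new ToyTopicConfig(n);
    }
}

// Toy stand-in for a stream; each overload returns the topic name it would use.
final class ToyStream {
    private final String applicationId;
    private int nextIndex = 0; // counter for generated names

    ToyStream(String applicationId) {
        this.applicationId = applicationId;
    }

    // User topic: the name is taken as is and the user must create the topic.
    String through(String userTopic) {
        return userTopic;
    }

    // Internal topic: Streams would create it and generate the name.
    // The config only carries the partition count; it does not affect the name here.
    String through(ToyTopicConfig config) {
        // Made-up format that keeps an application-id prefix and a -repartition suffix.
        return String.format("%s-through-%04d-repartition", applicationId, nextIndex++);
    }

    public static void main(String[] args) {
        ToyStream stream = new ToyStream("my-app");
        System.out.println(stream.through("user-topic"));
        System.out.println(stream.through(ToyTopicConfig.withNumberOfPartitions(5)));
    }
}
```

Overloading on the argument type keeps the two cases apart without adding a flag: a String always means a user topic, a config object always means an internal one.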


-Matthias


On 11/6/17 6:56 PM, Thomas Becker wrote:

Could you clarify exactly what you mean by keeping the current distinction?

Actually, re-reading the KIP and JIRA, it's not clear that being able
to specify a custom name is actually a requirement. If the goal is to
control repartitioning and tune parallelism, maybe we can just sidestep
this issue altogether by removing the ability to set a different name.

On Mon, 2017-11-06 at 16:51 +0100, Matthias J. Sax wrote:

That's a good point. In the current design, we strictly distinguish both.
For example, the reset tool deletes internal topics (starting with
prefix `<application.id>-` and ending with either `-repartition` or
`-changelog`).

Thus, from my point of view, it would make sense to keep the current
distinction.
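
As a sketch of the naming rule just described, the check can be written as a small Java predicate. This is not the reset tool's actual code; `InternalTopics` and `isInternal` are hypothetical names, and only the prefix and suffix rule stated above is assumed.

```java
// Toy classifier for the naming rule above; not the actual reset tool code.
final class InternalTopics {
    private InternalTopics() {}

    // True if the topic name carries the application-id prefix and an internal suffix.
    static boolean isInternal(String applicationId, String topic) {
        return topic.startsWith(applicationId + "-")
                && (topic.endsWith("-repartition") || topic.endsWith("-changelog"));
    }

    public static void main(String[] args) {
        System.out.println(isInternal("my-app", "my-app-counts-changelog"));
        System.out.println(isInternal("my-app", "user-topic"));
    }
}
```

Because the rule looks only at the name, a user-named topic that happens to match the pattern would be treated as internal, which is why the distinction matters for this KIP.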

-Matthias

On 11/6/17 4:45 PM, Thomas Becker wrote:


I think this sounds good as well. It's worth clarifying whether topics
that are named by the user but created by streams are considered "internal"
topics also.

On Sun, 2017-11-05 at 23:02 +0100, Matthias J. Sax wrote:

My idea was to relax the requirement for through() that a topic must be
created manually before startup.

Thus, if no through() call is made, an (internal) topic is created the
same way we do it currently.

If one uses `through(String topicName)` we keep the current behavior and
require users to create the topic manually.

The reasoning is as follows: if a user creates a topic manually, the user
can just use it for repartitioning. As the topic is already there, there
is no need to specify any topic configs.

We add a new `through()` overload (details TBD) that allows users to specify
topic configs, and Streams creates the topic with those configs.

Reasoning: users don't want to manage the topic manually; thus, it's still an
internal topic and Streams creates the topic name automatically, as for
all other internal topics. However, users get some more control over
topic parameters like the number of partitions (we should discuss what other
configs would be useful).


Does this make sense?


-Matthias


On 11/5/17 1:21 AM, Jan Filipiak wrote:


Hi.


I'm not 100% up to date on what the version 1.0 DSL looks like ATM.
I would just argue that repartitioning should be its own API call, like
through or something.
One can already use through or to to get this. I would argue one should
look there instead of adding overloads.

Best Jan

On 04.11.2017 16:01, Jeyhun Karimov wrote:


Dear community,

I would like to initiate discussion on KIP-221 [1] based on issue [2].

Please feel free to comment.

[1] https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Repartition+Topic+Hints+in+Streams

[2] https://issues.apache.org/jira/browse/KAFKA-6037



Cheers,
Jeyhun









________________________________

This email and any attachments may contain confidential and privileged
material for the sole use of the intended recipient. Any review, copying,
or distribution of this email (or any attachments) by others is prohibited.

If you are not the intended recipient, please contact the sender
immediately and permanently delete this email and any attachments. No
employee or agent of TiVo Inc. is authorized to conclude any binding
agreement on behalf of TiVo Inc. by email. Binding agreements with TiVo
Inc. may only be made by a signed written agreement.






