Re: [VOTE] KIP-159: Introducing Rich functions to Streams

Matthias J. Sax Tue, 07 Nov 2017 02:21:00 -0800

About implementation if we do the KIP as proposed: I agree with Guozhang
that we would need to use the currently processed record's metadata in
the context. This does leak some implementation details, but I
personally don't see a big issue here (at the same time, I am also fine
to remove the RecordContext for joins if people think it's an issue).


About the API: while I agree with Jan, that having two APIs for input
streams/tables and "derived" streams/table (ie, result of
KStream-KStream join or an aggregation) would be a way to avoid some
semantic issue, I am not sure if it is worth the effort. IMHO, it would
make the API more convoluted and if users access the RecordContext on a
derived stream/table it's a "user error" -- it's not really wrong as
users still get the current records context, but of course, we would
leak implementation details (as above, I don't see a bit issue here though).

At the same time, I disagree with Jan that "its not to hard to have a
user keeping track" -- if we apply this argument, we could even argue
that it's not to hard to use a Transformer instead of a map/filter etc.
We want to add "syntactic sugar" with this change and thus should really
provide value and not introduce a half-baked solution for which users
still need to do manual customizing.


-Matthias


On 11/7/17 5:54 AM, Jan Filipiak wrote:
> I Aggree completely.
> 
> Exposing this information in a place where it has no _natural_ belonging
> might really be a bad blocker in the long run.
> 
> Concerning your first point. I would argue its not to hard to have a
> user keep track of these. If we still don't want the user
> to keep track of these I would argue that all > projection only <
> transformations on a Source-backed KTable/KStream
> could also return a Ktable/KStream instance of the type we return from
> the topology builder.
> Only after any operation that exceeds projection or filter one would
> return a KTable not granting access to this any longer.
> 
> Even then its difficult already: I never ran a topology with caching but
> I am not even 100% sure what the record Context means behind
> a materialized KTable with Caching? Topic and Partition are probably
> with some reasoning but offset is probably only the offset causing the
> flush?
> So one might aswell think to drop offsets from this RecordContext.
> 
> Best Jan
> 
> 
> 
> 
> 
> 
> 
> On 07.11.2017 03:18, Guozhang Wang wrote:
>> Regarding the API design (the proposed set of overloads v.s. one overload
>> on #map to enrich the record), I think what we have represents a good
>> trade-off between API succinctness and user convenience: on one hand we
>> definitely want to keep as fewer overloaded functions as possible. But on
>> the other hand if we only do that in, say, the #map() function then this
>> enrichment could be an overkill: think of a topology that has 7 operators
>> in a chain, where users want to access the record context on operator #2
>> and #6 only, with the "enrichment" manner they need to do the
>> enrichment on
>> operator #2 and keep it that way until #6. In addition, the RecordContext
>> fields (topic, offset, etc) are really orthogonal to the key-value
>> payloads
>> themselves, so I think separating them into this object is a cleaner way.
>>
>> Regarding the RecordContext inheritance, this is actually a good point
>> that
>> have not been discussed thoroughly before. Here are my my two cents: one
>> natural way would be to inherit the record context from the "triggering"
>> record, for example in a join operator, if the record from stream A
>> triggers the join then the record context is inherited from with that
>> record. This is also aligned with the lower-level PAPI interface. A
>> counter
>> argument, though, would be that this is sort of leaking the internal
>> implementations of the DSL, so that moving forward if we did some
>> refactoring to our join implementations so that the triggering record can
>> change, the RecordContext would also be different. I do not know how much
>> it would really affect end users, but would like to hear your opinions.
> Agreed to 100% exposing this information
>>
>>
>> Guozhang
>>
>>
>> On Mon, Nov 6, 2017 at 1:00 PM, Jeyhun Karimov <[email protected]>
>> wrote:
>>
>>> Hi Jan,
>>>
>>> Sorry for late reply.
>>>
>>>
>>> The API Design doesn't look appealing
>>>
>>>
>>> In terms of API design we tried to preserve the java functional
>>> interfaces.
>>> We applied the same set of rich methods for KTable to make it compatible
>>> with the rest of overloaded APIs.
>>>
>>> It should be 100% sufficient to offer a KTable + KStream that is
>>> directly
>>>> feed from a topic with 1 additional overload for the #map() methods to
>>>> cover every usecase while keeping the API in a way better state.
>>>
>>> - IMO this seems a workaround, rather than a direct solution.
>>>
>>> Perhaps we should continue this discussion in DISCUSS thread.
>>>
>>>
>>> Cheers,
>>> Jeyhun
>>>
>>>
>>> On Mon, Nov 6, 2017 at 9:14 PM Jan Filipiak <[email protected]>
>>> wrote:
>>>
>>>> Hi.
>>>>
>>>> I do understand that it might come in Handy.
>>>>   From my POV in any relational algebra this is only a projection.
>>>> Currently we hide these "fields" that come with the input record.
>>>> It should be 100% sufficient to offer a KTable + KStream that is
>>>> directly
>>>> feed from a topic with 1 additional overload for the #map() methods to
>>>> cover every usecase while keeping the API in a way better state.
>>>>
>>>> best Jan
>>>>
>>>> On 06.11.2017 17:52, Matthias J. Sax wrote:
>>>>> Jan,
>>>>>
>>>>> I understand what you are saying. However, having a RecordContext is
>>>>> super useful for operations that are applied to input topic. Many
>>>>> users
>>>>> requested this feature -- it's much more convenient that falling back
>>> to
>>>>> transform() to implement a a filter() for example that want to access
>>>>> some meta data.
>>>>>
>>>>> Because we cannot distinguish different "origins" of a KStream/KTable,
>>> I
>>>>> am not sure if there would be a better way to do this. The only
>>>>> "workaround" I see, is to have two KStream/KTable interfaces each and
>>> we
>>>>> would use the first one for KStream/KTable with "proper"
>>>>> RecordContext.
>>>>> But this does not seem to be a good solution either.
>>>>>
>>>>> Note, a KTable can also be read directly from a topic, I agree that
>>>>> using RecordContext on a KTable that is the result of an
>>>>> aggregation is
>>>>> questionable. But I don't see a reason to down vote the KIP for this
>>>> reason.
>>>>> WDYT about this?
>>>>>
>>>>>
>>>>> -Matthias
>>>>>
>>>>> On 11/1/17 10:19 PM, Jan Filipiak wrote:
>>>>>> -1 non binding
>>>>>>
>>>>>> I don't get the motivation.
>>>>>> In 80% of my DSL processors there is no such thing as a reasonable
>>>>>> RecordContext.
>>>>>> After a join  the record I am processing belongs to at least 2
>>>>>> topics.
>>>>>> After a Group by the record I am processing was created from multiple
>>>>>> offsets.
>>>>>>
>>>>>> The API Design doesn't look appealing
>>>>>>
>>>>>> Best Jan
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 01.11.2017 22:02, Jeyhun Karimov wrote:
>>>>>>> Dear community,
>>>>>>>
>>>>>>> It seems the discussion for KIP-159 [1] converged finally. I would
>>>>>>> like to
>>>>>>> initiate voting for the particular KIP.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> [1]
>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-
>>>> 159%3A+Introducing+Rich+functions+to+Streams
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Jeyhun
>>>>>>>
>>>>
>>
>>
>

signature.asc
Description: OpenPGP digital signature

Re: [VOTE] KIP-159: Introducing Rich functions to Streams

Reply via email to