About implementation if we do the KIP as proposed: I agree with Guozhang that we would need to use the currently processed record's metadata in the context. This does leak some implementation details, but I personally don't see a big issue here (at the same time, I am also fine to remove the RecordContext for joins if people think it's an issue).
About the API: while I agree with Jan, that having two APIs for input streams/tables and "derived" streams/table (ie, result of KStream-KStream join or an aggregation) would be a way to avoid some semantic issue, I am not sure if it is worth the effort. IMHO, it would make the API more convoluted and if users access the RecordContext on a derived stream/table it's a "user error" -- it's not really wrong as users still get the current records context, but of course, we would leak implementation details (as above, I don't see a bit issue here though). At the same time, I disagree with Jan that "its not to hard to have a user keeping track" -- if we apply this argument, we could even argue that it's not to hard to use a Transformer instead of a map/filter etc. We want to add "syntactic sugar" with this change and thus should really provide value and not introduce a half-baked solution for which users still need to do manual customizing. -Matthias On 11/7/17 5:54 AM, Jan Filipiak wrote: > I Aggree completely. > > Exposing this information in a place where it has no _natural_ belonging > might really be a bad blocker in the long run. > > Concerning your first point. I would argue its not to hard to have a > user keep track of these. If we still don't want the user > to keep track of these I would argue that all > projection only < > transformations on a Source-backed KTable/KStream > could also return a Ktable/KStream instance of the type we return from > the topology builder. > Only after any operation that exceeds projection or filter one would > return a KTable not granting access to this any longer. > > Even then its difficult already: I never ran a topology with caching but > I am not even 100% sure what the record Context means behind > a materialized KTable with Caching? Topic and Partition are probably > with some reasoning but offset is probably only the offset causing the > flush? > So one might aswell think to drop offsets from this RecordContext. > > Best Jan > > > > > > > > On 07.11.2017 03:18, Guozhang Wang wrote: >> Regarding the API design (the proposed set of overloads v.s. one overload >> on #map to enrich the record), I think what we have represents a good >> trade-off between API succinctness and user convenience: on one hand we >> definitely want to keep as fewer overloaded functions as possible. But on >> the other hand if we only do that in, say, the #map() function then this >> enrichment could be an overkill: think of a topology that has 7 operators >> in a chain, where users want to access the record context on operator #2 >> and #6 only, with the "enrichment" manner they need to do the >> enrichment on >> operator #2 and keep it that way until #6. In addition, the RecordContext >> fields (topic, offset, etc) are really orthogonal to the key-value >> payloads >> themselves, so I think separating them into this object is a cleaner way. >> >> Regarding the RecordContext inheritance, this is actually a good point >> that >> have not been discussed thoroughly before. Here are my my two cents: one >> natural way would be to inherit the record context from the "triggering" >> record, for example in a join operator, if the record from stream A >> triggers the join then the record context is inherited from with that >> record. This is also aligned with the lower-level PAPI interface. A >> counter >> argument, though, would be that this is sort of leaking the internal >> implementations of the DSL, so that moving forward if we did some >> refactoring to our join implementations so that the triggering record can >> change, the RecordContext would also be different. I do not know how much >> it would really affect end users, but would like to hear your opinions. > Agreed to 100% exposing this information >> >> >> Guozhang >> >> >> On Mon, Nov 6, 2017 at 1:00 PM, Jeyhun Karimov <je.kari...@gmail.com> >> wrote: >> >>> Hi Jan, >>> >>> Sorry for late reply. >>> >>> >>> The API Design doesn't look appealing >>> >>> >>> In terms of API design we tried to preserve the java functional >>> interfaces. >>> We applied the same set of rich methods for KTable to make it compatible >>> with the rest of overloaded APIs. >>> >>> It should be 100% sufficient to offer a KTable + KStream that is >>> directly >>>> feed from a topic with 1 additional overload for the #map() methods to >>>> cover every usecase while keeping the API in a way better state. >>> >>> - IMO this seems a workaround, rather than a direct solution. >>> >>> Perhaps we should continue this discussion in DISCUSS thread. >>> >>> >>> Cheers, >>> Jeyhun >>> >>> >>> On Mon, Nov 6, 2017 at 9:14 PM Jan Filipiak <jan.filip...@trivago.com> >>> wrote: >>> >>>> Hi. >>>> >>>> I do understand that it might come in Handy. >>>> From my POV in any relational algebra this is only a projection. >>>> Currently we hide these "fields" that come with the input record. >>>> It should be 100% sufficient to offer a KTable + KStream that is >>>> directly >>>> feed from a topic with 1 additional overload for the #map() methods to >>>> cover every usecase while keeping the API in a way better state. >>>> >>>> best Jan >>>> >>>> On 06.11.2017 17:52, Matthias J. Sax wrote: >>>>> Jan, >>>>> >>>>> I understand what you are saying. However, having a RecordContext is >>>>> super useful for operations that are applied to input topic. Many >>>>> users >>>>> requested this feature -- it's much more convenient that falling back >>> to >>>>> transform() to implement a a filter() for example that want to access >>>>> some meta data. >>>>> >>>>> Because we cannot distinguish different "origins" of a KStream/KTable, >>> I >>>>> am not sure if there would be a better way to do this. The only >>>>> "workaround" I see, is to have two KStream/KTable interfaces each and >>> we >>>>> would use the first one for KStream/KTable with "proper" >>>>> RecordContext. >>>>> But this does not seem to be a good solution either. >>>>> >>>>> Note, a KTable can also be read directly from a topic, I agree that >>>>> using RecordContext on a KTable that is the result of an >>>>> aggregation is >>>>> questionable. But I don't see a reason to down vote the KIP for this >>>> reason. >>>>> WDYT about this? >>>>> >>>>> >>>>> -Matthias >>>>> >>>>> On 11/1/17 10:19 PM, Jan Filipiak wrote: >>>>>> -1 non binding >>>>>> >>>>>> I don't get the motivation. >>>>>> In 80% of my DSL processors there is no such thing as a reasonable >>>>>> RecordContext. >>>>>> After a join the record I am processing belongs to at least 2 >>>>>> topics. >>>>>> After a Group by the record I am processing was created from multiple >>>>>> offsets. >>>>>> >>>>>> The API Design doesn't look appealing >>>>>> >>>>>> Best Jan >>>>>> >>>>>> >>>>>> >>>>>> On 01.11.2017 22:02, Jeyhun Karimov wrote: >>>>>>> Dear community, >>>>>>> >>>>>>> It seems the discussion for KIP-159 [1] converged finally. I would >>>>>>> like to >>>>>>> initiate voting for the particular KIP. >>>>>>> >>>>>>> >>>>>>> >>>>>>> [1] >>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP- >>>> 159%3A+Introducing+Rich+functions+to+Streams >>>>>>> >>>>>>> Cheers, >>>>>>> Jeyhun >>>>>>> >>>> >> >> >
signature.asc
Description: OpenPGP digital signature