Re: Do we want to add more SMTs to Apache Kafka?

Joshua Grisham Sun, 21 Nov 2021 09:05:30 -0800

Hi all,

From my perspective, the types of transformations already covered by the
existing SMTs are quite good (but anyone else, please say if you feel that
something "standard" is missing). The biggest issue is the limitations
that many of them have, which make them very hard to use in a real
production scenario.

In my mind, the single biggest gap is the inability to handle nested fields,
or anything more than records that essentially look like simple key-value
pairs. (One exception: if you chain the Flatten transform first, you can
apply others to the flattened result, but this assumes that Flatten can
actually handle the message in the first place! If you have nested arrays
then you are toast ;) And wait, maybe you didn't actually want to flatten
anyway?)

I am not sure of the best way to approach this (e.g. allow some kind of
path notation so users can address nested fields directly, vs. allow
recursion to match a field name at any level, or both, or something else?),
but I would say that a standardized approach implemented in all of the SMTs
(where it makes sense) would certainly be best, at least from a user
perspective, so that the configuration for addressing nested fields is
consistent across each transform that allows it. I did this one way in a
proposed change for KIP-683, but that is only one of the possible ways (
https://cwiki.apache.org/confluence/display/KAFKA/KIP-683%3A+Add+recursive+support+to+Connect+Cast+and+ReplaceField+transforms%2C+and+support+for+casting+complex+types+to+either+a+native+or+JSON+string
)
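
To make the two addressing options above concrete, here is a minimal
sketch, assuming schemaless records whose values are plain nested Maps and
Lists. The class and method names (NestedFieldLookup, getByPath,
findRecursive) are hypothetical and are not part of Kafka Connect or of
KIP-683; a dot is assumed as the path separator, which would not work for
field names that themselves contain dots.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Kafka Connect: two ways to address
// nested fields in a schemaless (Map-based) record value.
final class NestedFieldLookup {

    // Path notation: "address.city" walks one level per dot-separated segment.
    // Returns null if any segment is missing or is not a nested map.
    static Object getByPath(Map<String, Object> value, String path) {
        Object current = value;
        for (String segment : path.split("\\.")) {
            if (!(current instanceof Map)) {
                return null;
            }
            current = ((Map<?, ?>) current).get(segment);
        }
        return current;
    }

    // Recursive matching: collect every value whose field name equals `name`,
    // at any depth, including inside lists of nested maps.
    static List<Object> findRecursive(Object node, String name) {
        List<Object> matches = new ArrayList<>();
        collect(node, name, matches);
        return matches;
    }

    private static void collect(Object node, String name, List<Object> matches) {
        if (node instanceof Map) {
            for (Map.Entry<?, ?> e : ((Map<?, ?>) node).entrySet()) {
                if (name.equals(e.getKey())) {
                    matches.add(e.getValue());
                }
                collect(e.getValue(), name, matches);
            }
        } else if (node instanceof List) {
            for (Object item : (List<?>) node) {
                collect(item, name, matches);
            }
        }
    }
}
```

Path notation picks exactly one field and is explicit about where it
lives; recursive matching needs no knowledge of the structure but may hit
more fields than intended. Supporting both would let users choose per
transform.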

Past that, there are a few tweaks or enhancements that could be made to
some of the existing SMTs to help prevent them from blocking or failing in
most general scenarios (some of the changes I proposed in the past, but
haven't since had the time to follow up on, fit in this category I think).
For example, the ability to "cast" a more complicated structure (such as an
array) to a string (Connect API or JSON), so that the record can then be
flattened and inserted into a database table or something similar, would
remove a lot of what are IMO currently roadblocks that users often hit in
Sink scenarios.
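
As an illustration of the "cast to a JSON string" idea, here is a minimal
sketch, again assuming schemaless values made of Maps, Lists, Strings,
Numbers, Booleans and nulls. ComplexToJsonString is a hypothetical name,
not an existing Connect class, and a real implementation would more likely
reuse an existing JSON library or converter than hand-roll the
serialization.

```java
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Kafka Connect: renders a nested
// schemaless value as a JSON string so it can be stored in a single column.
final class ComplexToJsonString {

    static String toJson(Object node) {
        StringBuilder sb = new StringBuilder();
        write(node, sb);
        return sb.toString();
    }

    private static void write(Object node, StringBuilder sb) {
        if (node == null) {
            sb.append("null");
        } else if (node instanceof Map) {
            sb.append('{');
            boolean first = true;
            for (Map.Entry<?, ?> e : ((Map<?, ?>) node).entrySet()) {
                if (!first) {
                    sb.append(',');
                }
                first = false;
                writeString(String.valueOf(e.getKey()), sb);
                sb.append(':');
                write(e.getValue(), sb);
            }
            sb.append('}');
        } else if (node instanceof List) {
            sb.append('[');
            boolean first = true;
            for (Object item : (List<?>) node) {
                if (!first) {
                    sb.append(',');
                }
                first = false;
                write(item, sb);
            }
            sb.append(']');
        } else if (node instanceof Number || node instanceof Boolean) {
            // Assumption: no NaN/Infinity values, which JSON cannot represent.
            sb.append(node);
        } else {
            // Anything else is emitted as a JSON string via toString().
            writeString(node.toString(), sb);
        }
    }

    private static void writeString(String s, StringBuilder sb) {
        sb.append('"');
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"': sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        sb.append('"');
    }
}
```

With something like this applied to an array-valued field first, the rest
of the record contains only scalar values and can be flattened and written
by a sink that expects flat rows.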

Then there are some small tweaks which maybe can be made for specific
cases, some of which Randall already mentioned, such as:

* The Filter implementation is of very limited use, mostly due to the lack
of some "standard-feeling" predicates (field value filtering is very often
what I think people are looking for), so the Confluent one or another is
often used instead.
* A bit more can be done with InsertField IMO (e.g. giving a wallclock
timestamp instead of the record's produced timestamp is one example that
often seems to pop up).
* Some standardized way to "move" one field to another place e.g. to move
it out of or into a nested record.
* Limitations on only processing one field per transformation, e.g. with
the TimestampConverter, which I had proposed addressing with KIP-682 (
https://cwiki.apache.org/confluence/display/KAFKA/KIP-682%3A+Connect+TimestampConverter+support+for+multiple+fields+and+multiple+input+formats),
feel a little annoying and can add to processing time in high-volume
scenarios.
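
For the first bullet, a field-value predicate could look roughly like the
sketch below. It is written against java.util.function.Predicate and plain
Map values to stay self-contained; FieldValuePredicates and fieldEquals
are hypothetical names, and a real version would presumably implement the
Connect predicate interface introduced with KIP-585 and handle records
with schemas as well.

```java
import java.util.Map;
import java.util.Objects;
import java.util.function.Predicate;

// Hypothetical helper, not part of Kafka Connect: a "field value" predicate
// of the kind that could be paired with the Filter transform.
final class FieldValuePredicates {

    // True when the top-level field `field` is present and equals `expected`.
    // A null record value or an absent field never matches.
    static Predicate<Map<String, Object>> fieldEquals(String field, Object expected) {
        return value -> value != null
                && value.containsKey(field)
                && Objects.equals(value.get(field), expected);
    }
}
```

For example, fieldEquals("status", "deleted") would match records whose
value has status set to "deleted", which a filter could then drop.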

(By the way apologies to Randall that I have not had a chance to get back
yet on KIP-682 but will try to do so in the discussion thread in the coming
days if I can!)

And finally, I also feel that some of the SMTs are a bit disjointed from
each other in how the classes are designed and how the configuration works
(both from the perspective of a user applying the transform and that of a
transform developer). Some of the class design differences might be
necessary due to the nature of the transformation itself, but I wonder if
in the future some kind of standardization could be built into a base class
or something similar, or into the requirements specified by the interface,
which would help to drive a more standardized approach? Or maybe at least
a once-through on the code for all of them to align things like how Config
string constants/enums etc. are handled, method names and their position
within the code, that they are all refactored in a similar way, etc.

In the end, I do feel it makes sense to try and sort of aim for the 80/20
rule with the standard SMTs to be able to support "real world" scenarios,
but some of these limitations cause them to fall a bit short today.

Hope this is helpful at least to spark other ideas anyway!

Have a nice (rest of the) weekend!
Joshua Grisham

On Sat, 20 Nov 2021 at 01:16, Brandon Brown <bran...@bbrownsound.com> wrote:

> I agree, if the desire is to keep the internal SMT collection small then
> providing ease of discovery, as Gunnar suggests, would be extremely
> helpful.
>
> Brandon Brown
>
> > On Nov 19, 2021, at 6:13 PM, Gunnar Morling <gunnar.morl...@googlemail.com.invalid> wrote:
> >
> > Hi all,
> >
> > Just came across this thread, I hope the late reply is ok.
> >
> > FWIW, we're in a similar situation in Debezium, where users often request
> > new (Debezium-specific) SMTs, and we generally tend to recommend them to
> > be maintained by users themselves, unless they are truly generic. This
> > excludes a share of users though who aren't Java developers.
> >
> > What might help is having means of simple discoverability of externally
> > hosted SMTs, e.g. via some kind of catalog hosted on kafka.apache.org.
> > That way, people would have it easier to find and obtain SMTs from other
> > places, reducing the pressure to get them added to Apache Kafka proper.
> >
> > Best,
> >
> > --Gunnar
> >
> >
> >
> >
> >> On Sun, 7 Nov 2021 at 21:49, Brandon Brown <bran...@bbrownsound.com> wrote:
> >>
> >> I like the idea of a select number of SMTs being offered and supported
> >> out of the box. The addition of SMTs via this process is nice because it
> >> allows for a rich set to be supported out of the box and without the
> >> need for extra work to deploy.
> >>
> >> Perhaps this is a spot where the community could express the interest of
> >> additional SMTs which maybe are available via an open source library,
> >> and if enough usage occurs there could be a path to fold into the Kafka
> >> project at large?
> >>
> >> Brandon Brown
> >>
> >>
> >>>> On Nov 7, 2021, at 1:19 PM, Randall Hauch <rha...@gmail.com> wrote:
> >>>
> >>> We have had several requests to add more Connect Single Message
> >>> Transforms (SMTs) to the project. When SMTs were first introduced with
> >>> KIP-66 (ref 1) in Jun 2017, the KIP mentioned the following:
> >>>
> >>>> Criteria: SMTs that are shipped with Kafka Connect should be general
> >>>> enough to apply to many data sources & serialization formats. They
> >>>> should also be simple enough to not cause any additional library
> >>>> dependency to be introduced.
> >>>> Beyond those being initially included with this KIP, transformations
> >>>> can be adopted for inclusion in future with JIRA/ML discussion to weigh
> >>>> the tradeoffs.
> >>>
> >>> In the 4+ years that we've had SMTs in the project, we've only
> >>> enhanced the framework with KIP-585 (ref 2), and fixed the initial
> >>> SMTs (including KIP-437, ref 3). We recently have had quite a few
> >>> requests to add new SMTs; a few samples of these include:
> >>> * https://issues.apache.org/jira/browse/KAFKA-10299
> >>> * https://issues.apache.org/jira/browse/KAFKA-9436
> >>> * https://issues.apache.org/jira/browse/KAFKA-9318
> >>> * https://issues.apache.org/jira/browse/KAFKA-12443
> >>>
> >>> Adding new or changing existing SMTs in the Apache Kafka project comes
> >>> with requirements. First, AK releases are infrequent and necessarily
> >>> involve the entire project. Second, adding an SMT is an API change and
> >>> therefore requires a KIP. Third, all changes in behavior to SMTs
> >>> included in a prior AK release must be backward compatible, and
> >>> adding or changing an SMT's configuration requires a KIP. This last
> >>> one is also challenging if we're limiting ourselves to truly general
> >>> SMTs, since these are notoriously difficult to get right the first
> >>> time. All of these aspects mean that it's difficult to add, maintain,
> >>> and evolve/improve SMTs in AK. And unless a bug fix is critical, we're
> >>> likely not to create a patch release for AK just to fix a bug in an
> >>> SMT, simply because of the effort involved.
> >>>
> >>> On the other hand, anyone can easily implement their own SMT and
> >>> deploy them as a Connect plugin, whether that's part of a connector
> >>> plugin or a separate plugin dedicated for one or more SMTs. Interestingly,
> >>> it's far simpler to implement and maintain custom SMTs outside of AK,
> >>> especially since those plugins can be released and deployed in any
> >>> Connect runtime version since at least 0.11.0. And if custom SMTs are
> >>> maintained in a relatively small project, they can be released often.
> >>>
> >>> Finally, KIP-26 (ref 4) specifically rejected maintaining connector
> >>> implementations in the AK project. So we have precedent for choosing
> >>> not to accept implementations.
> >>>
> >>> Given the above, I wonder if the time has come for us to prefer only
> >>> maintaining the SMT framework and existing SMTs, and to decline adding
> >>> new SMTs.
> >>>
> >>> Thoughts?
> >>>
> >>> Best regards,
> >>>
> >>> Randall Hauch
> >>>
> >>> (1) https://cwiki.apache.org/confluence/display/KAFKA/KIP-66%3A+Single+Message+Transforms+for+Kafka+Connect
> >>> (2) https://cwiki.apache.org/confluence/display/KAFKA/KIP-585%3A+Filter+and+Conditional+SMTs
> >>> (3) https://cwiki.apache.org/confluence/display/KAFKA/KIP-437%3A+Custom+replacement+for+MaskField+SMT
> >>> (4) https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=58851767
> >>
>
