Hi Jorge,

Thanks for addressing my comments; the KIP looks up-to-date and pretty
readable now, and the rejected alternatives section does a great job of
outlining the discussion so far and providing context for anyone else who
might want to join in.

1. Thoughts on choice of delimiter:
- I like the optimization for simple cases, but I think the new proposal is
a little too restrictive. What if there's a field whose name contains all
of the permitted options (currently just ".", ",", and "/")?
- If we expand the set of permitted delimiters to allow for any
single-character string, configuration complexity will increase and
readability may decrease
- Also worth pointing out that there is some convention for doubling a
delimiter character as an escape mechanism with formats like CSV [1]
- Overall I think we may be approaching the saturation point for productive
discussion on delimiter syntax so I don't want to spend too much more of
your time on it. I think the one point I'd like to leave for now is that it
shouldn't be impossible to use this new feature for any field name, no
matter how convoluted. It's fine if edge cases introduce difficulty (such
as less-readable configurations), but it's not fine if they can't be
addressed at all.

2.
The configuration style where you define "transforms.field.style" in the
connector config, and then this applies to all SMTs for the connector, is
very interesting. However, it doesn't follow convention for existing SMTs.
Right now, if you want to configure an SMT, you define its name in the
connector config (for example, "transforms": "smt1"), and then define all
of the properties for that SMT in the connector config using a namespacing
mechanism specific to that SMT (for example, "transforms.smt1.prop1":
"val1"). That SMT then sees only the properties defined in that namespace,
with the prefix stripped (for example, "prop1": "val1") in its configure
[2] [3] method.
If we want to continue to follow this convention, then instead of
specifying "transforms.field.style" in a connector config, we would expect
users to configure "transforms.<name>.field.style", for each SMT that they
want to configure a field style for. This would require more work on the
part of the user, but would be simpler to reason about and easier to
implement.
If we want to explore an alternative where users can specify global
properties that apply to all transforms in a connector config, then the
semantics for this need to be defined in the KIP. This would have to
include whether this will apply only for the special case of the
"field.style" and possibly "field.separator" properties or if it would be
available more generally for other properties, whether it will apply only
for the SMTs outlined in the KIP or if the "field.style" and possibly
"field.separator" properties would also be passed into custom SMTs so that
they could choose to act on them if applicable, how edge cases like having
an SMT named "field" in your connector config would be handled, etc.
Either way, it might help to have an example in the KIP outlining how one
of the to-be-augmented SMTs can be configured with this new feature and a
before/after of how a record value would be transformed with that
configuration.

3. The docstring for the "transforms.field.style" property mentions that
the permitted values are "plain" and "nested", but then describes behavior
with a value of "root". Should that be "plain" instead?

4. The docstring for the "transforms.field.separator" property exclusively
mentions structs, but the feature is intended to work with maps as well.
Can we update it to reflect this?

References:

[1] - https://stackoverflow.com/a/17808731
[2] -
https://github.com/apache/kafka/blob/7243facb8d69a7252e6b9556b5eaee13e41bab7f/connect/api/src/main/java/org/apache/kafka/connect/transforms/Transformation.java#L30
[3] -
https://github.com/apache/kafka/blob/7243facb8d69a7252e6b9556b5eaee13e41bab7f/clients/src/main/java/org/apache/kafka/common/Configurable.java#L26-L29

Cheers,

Chris

On Mon, Mar 28, 2022 at 1:32 PM Jorge Esteban Quilcate Otoya <
quilcate.jo...@gmail.com> wrote:

> Thanks, Chris!
>
> 1. I'd argue "this..field.child" could be harder to grasp than
> "this.field/child" + separator: "/".
> Even though this represents additional information, it follows a similar
> approach as the "Flatten#delimeter" configuration.
> I want to give the separator approach another try, so I have updated the
> KIP with the separator proposal, sticking to only 2 alternatives that
> should hopefully cover most scenarios.
>
> 2. Agree. KIP has been updated with this improvement.
>
> 3. You're right. I have updated this section accordingly.
>
> 4. Good catch! I've replaced it with `DropHeaders`.
>
> Looking forward to your feedback.
>
> Thanks,
> Jorge.
>
> On Wed, 9 Mar 2022 at 21:33, Chris Egerton <fearthecel...@gmail.com>
> wrote:
>
> > Hi Jorge,
> >
> > Looking good! Got a few more thoughts.
> >
> > 1. Sorry to revisit this, but I think we may want to adopt a slightly
> > different escape syntax style. Backslashes are great, but since they're
> > already used by JSON, using them as an escape sequence in field notation
> > would also lead to some pretty ugly connector configs. Anyone who's had
> to
> > write regular expressions with backslashes in Java is probably already
> > familiar with this: "this\\\\.is\\\\.not\\\\.very\\\\.readable". What do
> > you think about using the dot character to escape itself? In other words,
> > to access a single field named "this.field", instead of using the syntax
> > "this\.field" (which in JSON would have to be expressed as
> "this\\.field"),
> > we could use "this..field", and for a single field named "this\field",
> > instead of using the syntax "this\\field" (or, in JSON, "this\\\\field"),
> > we could use "this\field" (or, in JSON, "this\\field").
> >
> > 2. Could you flesh out the details on the new "field.style" property,
> > including the type, default value, importance, and a preliminary
> docstring?
> > See
> >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-618%3A+Exactly-Once+Support+for+Source+Connectors#KIP618:ExactlyOnceSupportforSourceConnectors-Newproperties
> > for an example.
> >
> > 3. Is the "Compatibility, Deprecation, and Migration Plan" section still
> > accurate after the latest update? Seems like it's still written with the
> > assumption that nested field syntax will be hardcoded or opt-in, which
> IIUC
> > isn't the case anymore.
> >
> > 4. Nit: The "These SMTs do not require nested structure support" section
> > mentions a "Drop" SMT. I think this may be referring to the Confluent
> Drop
> > SMT, which isn't a part of Apache Kafka. Should we drop (heh) that SMT
> from
> > the list? Or perhaps just replace it with "DropHeaders", which is
> currently
> > missing from the list and shouldn't require any nested-field related
> > updates?
> >
>
> > Cheers,
> >
> > Chris
> >
> > On Mon, Feb 28, 2022 at 2:12 PM Jorge Esteban Quilcate Otoya <
> > quilcate.jo...@gmail.com> wrote:
> >
> > > Thank you, Chris! and sorry for the delayed response.
> > >
> > > Please, find my comments below:
> > >
> > > On Mon, 14 Feb 2022 at 17:34, Chris Egerton
> <chr...@confluent.io.invalid
> > >
> > > wrote:
> > >
> > > > Hi Jorge,
> > > >
> > > > Thanks for the KIP! I'd love to see support for nested fields added
> to
> > > the
> > > > out-of-the-box SMTs provided with Connect. Here are my initial
> > thoughts:
> > > >
> > > > 1. I agree that there's a case to be made for expanding HoistField
> > with a
> > > > new config property for identifying a nested, to-be-hoisted field,
> but
> > > the
> > > > example in the KIP doesn't really demonstrate why this would be
> > > valuable. I
> > > > think it'd be helpful to expand the example to add other fields in
> > order
> > > to
> > > > show how adding nested field support enables users to hoist a nested
> > > field
> > > > without dropping other fields from the value. Maybe something like
> > this:
> > > >
> > > >     source = nested.val
> > > >     field = line
> > > >
> > > >     value (before):
> > > >         {
> > > >             "nested": {
> > > >                 "val": 42,
> > > >                 "other val": 96
> > > >             }
> > > >         }
> > > >
> > > >     value (after):
> > > >         {
> > > >             "nested": {
> > > >                 "line": {
> > > >                     "val": 42,
> > > >                 }
> > > >                 "other val": 96
> > > >             }
> > > >         }
> > > >
> > > > 2. Nit: I think "source" is a little strange for the new HoistField
> > > > property name. Maybe "hoisted" or "hoisted.field" would be more
> > > > descriptive?
> > > >
> > > >
> > > About 1. and 2.:
> > > Agree. The example for this SMT is updated and have added the `hoisted`
> > > configuration.
> > >
> > >
> > > > 3. Is there a reasonable use case for expanding Flatten to be able to
> > > > flatten specific fields? My understanding is that it's mostly useful
> > for
> > > > writing to systems like databases that don't support nested values
> and
> > > > require everything to be a flat list of key-value pairs. Being able
> to
> > > > flatten a nested field wouldn't provide any advantage for that use
> > case.
> > > > Are there other cases where it would?
> > > >
> > > > 4. I don't think we should unconditionally change the default
> delimiter
> > > for
> > > > Flatten. It's a backwards-incompatible, breaking change that could
> > cause
> > > > headaches for users. It might be reasonable to change the default
> value
> > > > dynamically based on whether the user has specified a value for the
> > > "field"
> > > > property, but considering the motivation for changing the default is
> > that
> > > > it creates conflicts with the to-be-introduced nested field syntax
> > (which
> > > > could arise with downstream SMTs regardless of whether the user has
> > > > explicitly configured Flatten with the "field" property), I don't
> know
> > > that
> > > > this would be too useful either. I have some thoughts below on how to
> > > > handle possible conflicts between names with dots in their names and
> > > dotted
> > > > syntax for nested field references that should hopefully make either
> > > change
> > > > unnecessary.
> > > >
> > > >
> > > Fair enough. With the support for nested fields in other SMTs, Flatten
> > > could stay as it is.
> > > This removes the need for (4) changing Flatten config as well.
> > >
> > >
> > > > 5. I think it's fine to expand ExtractField to support nested
> notation,
> > > but
> > > > it might be worth noting in the rejected alternatives section that
> this
> > > > isn't strictly necessary since you can replace any single invocation
> of
> > > > that SMT that uses nested field notation with multiple invocations of
> > it
> > > > that use non-nested notation.
> > > >
> > >
> > > Agree. Adding it.
> > >
> > >
> > > >
> > > > 6. Nit: "RegerRouter" should be "RegexRouter" in the list of SMTs
> that
> > do
> > > > not require nested structure support.
> > > >
> > > >
> > > Ack. Fixing it.
> > >
> > >
> > > > 7. It may be rare for dots in field names to occur in the wild
> > (although
> > > I
> > > > wouldn't be so certain of this), but unless we want to inflict
> > headaches
> > > on
> > > > users of Flatten, I think we're going to have to think about
> conflicts
> > > > between dotted notation and non-nested fields whose names contain
> > dots. I
> > > > don't think this is actually such a bad thing, though. I agree that
> > > dotted
> > > > notation is intuitive and pretty commonplace (in tools like jq, for
> > > > example), so I'd like it if we could stick to it. What about
> > introducing
> > > > escape syntax, using a backslash? That way, users could disambiguate
> > > > between "this.field" (which would refer to the nested field "field"
> > under
> > > > the top-level "this" field), and "this\.field" (which would refer to
> > the
> > > > field named "this.field"). Like with most languages that use the
> > > backslash
> > > > for escape sequences, it could also be used to escape itself, in the
> > > event
> > > > that a field name contains a backslash. I think this is more
> intuitive
> > > and
> > > > simpler than, e.g., adding a new config property to toggle the
> > delimiter
> > > to
> > > > be used when parsing nested field references.
> > > >
> > >
> > > I like this approach. Adding to the KIP.
> > >
> > >
> > > >
> > > > 8. I don't think we can unconditionally turn this feature on. The
> risk
> > of
> > > > breaking existing pipelines (especially ones that involve, for
> > example, a
> > > > combination of the Flatten and Cast SMTs) is pretty high. I think
> this
> > > > should be an opt-in feature, at least until the next major release.
> One
> > > way
> > > > we could accomplish this is by introducing a new "field.style" (name
> > > > obviously subject to change) property with values of "plain"
> (default)
> > > and
> > > > "nested". If set to "plain" then the current non-nested behavior is
> > used,
> > > > and if set to "nested", then the proposed nested behavior is. We can
> > > > consider updating the default value to "nested" in a future major
> > release
> > > > (or even codify that plan in this KIP, if there's enough support for
> > it).
> > > > This would also leave the door open for adding recursive support to
> > SMTs
> > > in
> > > > the future by adding a permitted value of "recursive".
> > > >
> > > > 9. One of the linked tickets in the "Motivation" section,
> > > > https://issues.apache.org/jira/browse/KAFKA-10640, has an open PR
> and
> > > KIP
> > > > that propose adding recursive support to some SMTs. Have you
> considered
> > > > this type of functionality for your KIP? Or is your aim to stick
> solely
> > > to
> > > > nested field support?
> > > >
> > >
> > > I like the `field.style` configuration flag approach.
> > > Thanks for pointing out the `recursive` approach. Will add `nested` at
> > the
> > > moment, let's check the demand for `recursive` to consider it as part
> of
> > > this or another KIP.
> > >
> > > I have added the following on the KIP:
> > >
> > > ```
> > > Future KIPs could extend this support for:
> > >
> > > - Recursive notation: name a field and apply it to all fields across
> the
> > > schema matching that name.
> > > - Access to arrays: Adding `[]` notation to represent access to arrays
> > and
> > > applying SMTs to fields within an array.
> > > ```
> > >
> > >
> > > >
> > > > Cheers,
> > > >
> > > > Chris
> > > >
> > > > On Tue, Feb 8, 2022 at 1:23 PM Jorge Esteban Quilcate Otoya <
> > > > quilcate.jo...@gmail.com> wrote:
> > > >
> > > > > Hi Dev team,
> > > > >
> > > > > I'd like to start a new discussion thread on Kafka Connect KIP-821:
> > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-821%3A+Connect+Transforms+support+for+nested+structures
> > > > >
> > > > > This KIP is aimed to include support for nested structures on the
> > > > existing
> > > > > SMTs — where this make sense.
> > > > >
> > > > > Looking forward to your feedback.
> > > > >
> > > > > Regards,
> > > > > Jorge.
> > > > >
> > > >
> > >
> >
>

Reply via email to