Hi Jorge,

Looking good! I have a few comments left but all but one or two are minor.

1. The motivation section states "This KIP is aimed to include support for
nested structures on the existing SMTs... and to include the abstractions
to reuse this in future SMTs". A good implementation of this KIP will
definitely isolate reusable logic into a separate abstraction that can be
easily pulled in to the SMTs we want to add nested field support to, but
unless we plan on making this kind of abstraction publicly available as
some kind of utility method or class that external SMT developers can
leverage, we can probably leave this part out as it's more of an
implementation detail.

2. The Cast example is a little misleading, isn't it? It demonstrates the
escape syntax for fields with dot literals in their names, but it doesn't
demonstrate a way to actually use the Cast (or any other) SMT to access a
nested field in a record, which is the whole point of the KIP. I like the
example of escape syntax but we should probably also add one for nested
field access.

3. With the InsertField SMT, I'm wondering what the specific behavior will
be when some or all of the "middle layer" nested fields are missing. For
example, if I have a record with a value of { "k1": "v1 } and I apply
InsertField with topic.field = n1.n2.n3.topic, what will happen? Will the
updated value become { "k1": "v1", "n1": { "n2": { "n3": "topic" }}}, will
an exception be thrown, or something else? This seems analogous to the
command line mkdir command, which (at least on some operating systems)
fails by default if you try to create a new nested directory where anything
but the last element in the path doesn't exist, but can be invoked with a
flag that instructs it to go ahead and create all levels of nested
directory instead. I'm leaning on the side of "just create everything" but
would be interested in your thoughts, and either way, we should probably
make sure the intended behavior is well-defined before voting.

4. Similarly, what will the behavior be if any of the field elements
specified with InsertField already exist in the record value? Will we just
overwrite them? What's the behavior of InsertField today under similar
circumstances?

Cheers,

Chris

On Thu, Mar 31, 2022 at 4:15 PM Jorge Esteban Quilcate Otoya <
quilcate.jo...@gmail.com> wrote:

> Thanks, Chris! Much appreciated all the feedback here.
>
> 1. You nailed it setting the design goal here: "it shouldn't be impossible
> to use this new feature for any field name, no matter how convoluted. It's
> fine if edge cases introduce difficulty (such as less-readable
> configurations), but it's not fine if they can't be addressed at all."
> Back to the previous proposals (using only dots as separators) we have 2
> alternatives:
> 1. escaping with backslashes
> 2. escaping with dots itself
>
> I'll lean for alternative 2, as you proposed before. Feels to me that
> backslashes have more potential to lead to confusion in JSON configs, and
> CSV seems like a good precedent to use the same character to escape itself.
> KIP is updated to reflect this.
>
> 2. Thanks! I'll add an example, and stick with the current approach
> defining the style per individual transform configuration.
>
> 3. Yes, thanks! KIP updated.
>
> 4. Of course. KIP updated.
>
> On Mon, 28 Mar 2022 at 21:59, Chris Egerton <fearthecel...@gmail.com>
> wrote:
>
> > Hi Jorge,
> >
> > Thanks for addressing my comments; the KIP looks up-to-date and pretty
> > readable now, and the rejected alternatives section does a great job of
> > outlining the discussion so far and providing context for anyone else who
> > might want to join in.
> >
> > 1. Thoughts on choice of delimiter:
> > - I like the optimization for simple cases, but I think the new proposal
> is
> > a little too restrictive. What if there's a field whose name contains all
> > of the permitted options (currently just ".", ",", and "/")?
> > - If we expand the set of permitted delimiters to allow for any
> > single-character string, configuration complexity will increase and
> > readability may decrease
> > - Also worth pointing out that there is some convention for doubling a
> > delimiter character as an escape mechanism with formats like CSV [1]
> > - Overall I think we may be approaching the saturation point for
> productive
> > discussion on delimiter syntax so I don't want to spend too much more of
> > your time on it. I think the one point I'd like to leave for now is that
> it
> > shouldn't be impossible to use this new feature for any field name, no
> > matter how convoluted. It's fine if edge cases introduce difficulty (such
> > as less-readable configurations), but it's not fine if they can't be
> > addressed at all.
> >
> > 2.
> > The configuration style where you define "transforms.field.style" in the
> > connector config, and then this applies to all SMTs for the connector, is
> > very interesting. However, it doesn't follow convention for existing
> SMTs.
> > Right now, if you want to configure an SMT, you define its name in the
> > connector config (for example, "transforms": "smt1"), and then define all
> > of the properties for that SMT in the connector config using a
> namespacing
> > mechanism specific to that SMT (for example, "transforms.smt1.prop1":
> > "val1"). That SMT then sees only the properties defined in that
> namespace,
> > with the prefix stripped (for example, "prop1": "val1") in its configure
> > [2] [3] method.
> > If we want to continue to follow this convention, then instead of
> > specifying "transforms.field.style" in a connector config, we would
> expect
> > users to configure "transforms.<name>.field.style", for each SMT that
> they
> > want to configure a field style for. This would require more work on the
> > part of the user, but would be simpler to reason about and easier to
> > implement.
> > If we want to explore an alternative where users can specify global
> > properties that apply to all transforms in a connector config, then the
> > semantics for this need to be defined in the KIP. This would have to
> > include whether this will apply only for the special case of the
> > "field.style" and possibly "field.separator" properties or if it would be
> > available more generally for other properties, whether it will apply only
> > for the SMTs outlined in the KIP or if the "field.style" and possibly
> > "field.separator" properties would also be passed into custom SMTs so
> that
> > they could choose to act on them if applicable, how edge cases like
> having
> > an SMT named "field" in your connector config would be handled, etc.
> > Either way, it might help to have an example in the KIP outlining how one
> > of the to-be-augmented SMTs can be configured with this new feature and a
> > before/after of how a record value would be transformed with that
> > configuration.
> >
> > 3. The docstring for the "transforms.field.style" property mentions that
> > the permitted values are "plain" and "nested", but then describes
> behavior
> > with a value of "root". Should that be "plain" instead?
> >
> > 4. The docstring for the "transforms.field.separator" property
> exclusively
> > mentions structs, but the feature is intended to work with maps as well.
> > Can we update it to reflect this?
> >
> > References:
> >
> > [1] - https://stackoverflow.com/a/17808731
> > [2] -
> >
> >
> https://github.com/apache/kafka/blob/7243facb8d69a7252e6b9556b5eaee13e41bab7f/connect/api/src/main/java/org/apache/kafka/connect/transforms/Transformation.java#L30
> > [3] -
> >
> >
> https://github.com/apache/kafka/blob/7243facb8d69a7252e6b9556b5eaee13e41bab7f/clients/src/main/java/org/apache/kafka/common/Configurable.java#L26-L29
> >
> > Cheers,
> >
> > Chris
> >
> > On Mon, Mar 28, 2022 at 1:32 PM Jorge Esteban Quilcate Otoya <
> > quilcate.jo...@gmail.com> wrote:
> >
> > > Thanks, Chris!
> > >
> > > 1. I'd argue "this..field.child" could be harder to grasp than
> > > "this.field/child" + separator: "/".
> > > Even though this represents additional information, it follows a
> similar
> > > approach as the "Flatten#delimeter" configuration.
> > > I want to give the separator approach another try, so I have updated
> the
> > > KIP with the separator proposal, sticking to only 2 alternatives that
> > > should hopefully cover most scenarios.
> > >
> > > 2. Agree. KIP has been updated with this improvement.
> > >
> > > 3. You're right. I have updated this section accordingly.
> > >
> > > 4. Good catch! I've replaced it with `DropHeaders`.
> > >
> > > Looking forward to your feedback.
> > >
> > > Thanks,
> > > Jorge.
> > >
> > > On Wed, 9 Mar 2022 at 21:33, Chris Egerton <fearthecel...@gmail.com>
> > > wrote:
> > >
> > > > Hi Jorge,
> > > >
> > > > Looking good! Got a few more thoughts.
> > > >
> > > > 1. Sorry to revisit this, but I think we may want to adopt a slightly
> > > > different escape syntax style. Backslashes are great, but since
> they're
> > > > already used by JSON, using them as an escape sequence in field
> > notation
> > > > would also lead to some pretty ugly connector configs. Anyone who's
> had
> > > to
> > > > write regular expressions with backslashes in Java is probably
> already
> > > > familiar with this: "this\\\\.is\\\\.not\\\\.very\\\\.readable". What
> > do
> > > > you think about using the dot character to escape itself? In other
> > words,
> > > > to access a single field named "this.field", instead of using the
> > syntax
> > > > "this\.field" (which in JSON would have to be expressed as
> > > "this\\.field"),
> > > > we could use "this..field", and for a single field named
> "this\field",
> > > > instead of using the syntax "this\\field" (or, in JSON,
> > "this\\\\field"),
> > > > we could use "this\field" (or, in JSON, "this\\field").
> > > >
> > > > 2. Could you flesh out the details on the new "field.style" property,
> > > > including the type, default value, importance, and a preliminary
> > > docstring?
> > > > See
> > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-618%3A+Exactly-Once+Support+for+Source+Connectors#KIP618:ExactlyOnceSupportforSourceConnectors-Newproperties
> > > > for an example.
> > > >
> > > > 3. Is the "Compatibility, Deprecation, and Migration Plan" section
> > still
> > > > accurate after the latest update? Seems like it's still written with
> > the
> > > > assumption that nested field syntax will be hardcoded or opt-in,
> which
> > > IIUC
> > > > isn't the case anymore.
> > > >
> > > > 4. Nit: The "These SMTs do not require nested structure support"
> > section
> > > > mentions a "Drop" SMT. I think this may be referring to the Confluent
> > > Drop
> > > > SMT, which isn't a part of Apache Kafka. Should we drop (heh) that
> SMT
> > > from
> > > > the list? Or perhaps just replace it with "DropHeaders", which is
> > > currently
> > > > missing from the list and shouldn't require any nested-field related
> > > > updates?
> > > >
> > >
> > > > Cheers,
> > > >
> > > > Chris
> > > >
> > > > On Mon, Feb 28, 2022 at 2:12 PM Jorge Esteban Quilcate Otoya <
> > > > quilcate.jo...@gmail.com> wrote:
> > > >
> > > > > Thank you, Chris! and sorry for the delayed response.
> > > > >
> > > > > Please, find my comments below:
> > > > >
> > > > > On Mon, 14 Feb 2022 at 17:34, Chris Egerton
> > > <chr...@confluent.io.invalid
> > > > >
> > > > > wrote:
> > > > >
> > > > > > Hi Jorge,
> > > > > >
> > > > > > Thanks for the KIP! I'd love to see support for nested fields
> added
> > > to
> > > > > the
> > > > > > out-of-the-box SMTs provided with Connect. Here are my initial
> > > > thoughts:
> > > > > >
> > > > > > 1. I agree that there's a case to be made for expanding
> HoistField
> > > > with a
> > > > > > new config property for identifying a nested, to-be-hoisted
> field,
> > > but
> > > > > the
> > > > > > example in the KIP doesn't really demonstrate why this would be
> > > > > valuable. I
> > > > > > think it'd be helpful to expand the example to add other fields
> in
> > > > order
> > > > > to
> > > > > > show how adding nested field support enables users to hoist a
> > nested
> > > > > field
> > > > > > without dropping other fields from the value. Maybe something
> like
> > > > this:
> > > > > >
> > > > > >     source = nested.val
> > > > > >     field = line
> > > > > >
> > > > > >     value (before):
> > > > > >         {
> > > > > >             "nested": {
> > > > > >                 "val": 42,
> > > > > >                 "other val": 96
> > > > > >             }
> > > > > >         }
> > > > > >
> > > > > >     value (after):
> > > > > >         {
> > > > > >             "nested": {
> > > > > >                 "line": {
> > > > > >                     "val": 42,
> > > > > >                 }
> > > > > >                 "other val": 96
> > > > > >             }
> > > > > >         }
> > > > > >
> > > > > > 2. Nit: I think "source" is a little strange for the new
> HoistField
> > > > > > property name. Maybe "hoisted" or "hoisted.field" would be more
> > > > > > descriptive?
> > > > > >
> > > > > >
> > > > > About 1. and 2.:
> > > > > Agree. The example for this SMT is updated and have added the
> > `hoisted`
> > > > > configuration.
> > > > >
> > > > >
> > > > > > 3. Is there a reasonable use case for expanding Flatten to be
> able
> > to
> > > > > > flatten specific fields? My understanding is that it's mostly
> > useful
> > > > for
> > > > > > writing to systems like databases that don't support nested
> values
> > > and
> > > > > > require everything to be a flat list of key-value pairs. Being
> able
> > > to
> > > > > > flatten a nested field wouldn't provide any advantage for that
> use
> > > > case.
> > > > > > Are there other cases where it would?
> > > > > >
> > > > > > 4. I don't think we should unconditionally change the default
> > > delimiter
> > > > > for
> > > > > > Flatten. It's a backwards-incompatible, breaking change that
> could
> > > > cause
> > > > > > headaches for users. It might be reasonable to change the default
> > > value
> > > > > > dynamically based on whether the user has specified a value for
> the
> > > > > "field"
> > > > > > property, but considering the motivation for changing the default
> > is
> > > > that
> > > > > > it creates conflicts with the to-be-introduced nested field
> syntax
> > > > (which
> > > > > > could arise with downstream SMTs regardless of whether the user
> has
> > > > > > explicitly configured Flatten with the "field" property), I don't
> > > know
> > > > > that
> > > > > > this would be too useful either. I have some thoughts below on
> how
> > to
> > > > > > handle possible conflicts between names with dots in their names
> > and
> > > > > dotted
> > > > > > syntax for nested field references that should hopefully make
> > either
> > > > > change
> > > > > > unnecessary.
> > > > > >
> > > > > >
> > > > > Fair enough. With the support for nested fields in other SMTs,
> > Flatten
> > > > > could stay as it is.
> > > > > This removes the need for (4) changing Flatten config as well.
> > > > >
> > > > >
> > > > > > 5. I think it's fine to expand ExtractField to support nested
> > > notation,
> > > > > but
> > > > > > it might be worth noting in the rejected alternatives section
> that
> > > this
> > > > > > isn't strictly necessary since you can replace any single
> > invocation
> > > of
> > > > > > that SMT that uses nested field notation with multiple
> invocations
> > of
> > > > it
> > > > > > that use non-nested notation.
> > > > > >
> > > > >
> > > > > Agree. Adding it.
> > > > >
> > > > >
> > > > > >
> > > > > > 6. Nit: "RegerRouter" should be "RegexRouter" in the list of SMTs
> > > that
> > > > do
> > > > > > not require nested structure support.
> > > > > >
> > > > > >
> > > > > Ack. Fixing it.
> > > > >
> > > > >
> > > > > > 7. It may be rare for dots in field names to occur in the wild
> > > > (although
> > > > > I
> > > > > > wouldn't be so certain of this), but unless we want to inflict
> > > > headaches
> > > > > on
> > > > > > users of Flatten, I think we're going to have to think about
> > > conflicts
> > > > > > between dotted notation and non-nested fields whose names contain
> > > > dots. I
> > > > > > don't think this is actually such a bad thing, though. I agree
> that
> > > > > dotted
> > > > > > notation is intuitive and pretty commonplace (in tools like jq,
> for
> > > > > > example), so I'd like it if we could stick to it. What about
> > > > introducing
> > > > > > escape syntax, using a backslash? That way, users could
> > disambiguate
> > > > > > between "this.field" (which would refer to the nested field
> "field"
> > > > under
> > > > > > the top-level "this" field), and "this\.field" (which would refer
> > to
> > > > the
> > > > > > field named "this.field"). Like with most languages that use the
> > > > > backslash
> > > > > > for escape sequences, it could also be used to escape itself, in
> > the
> > > > > event
> > > > > > that a field name contains a backslash. I think this is more
> > > intuitive
> > > > > and
> > > > > > simpler than, e.g., adding a new config property to toggle the
> > > > delimiter
> > > > > to
> > > > > > be used when parsing nested field references.
> > > > > >
> > > > >
> > > > > I like this approach. Adding to the KIP.
> > > > >
> > > > >
> > > > > >
> > > > > > 8. I don't think we can unconditionally turn this feature on. The
> > > risk
> > > > of
> > > > > > breaking existing pipelines (especially ones that involve, for
> > > > example, a
> > > > > > combination of the Flatten and Cast SMTs) is pretty high. I think
> > > this
> > > > > > should be an opt-in feature, at least until the next major
> release.
> > > One
> > > > > way
> > > > > > we could accomplish this is by introducing a new "field.style"
> > (name
> > > > > > obviously subject to change) property with values of "plain"
> > > (default)
> > > > > and
> > > > > > "nested". If set to "plain" then the current non-nested behavior
> is
> > > > used,
> > > > > > and if set to "nested", then the proposed nested behavior is. We
> > can
> > > > > > consider updating the default value to "nested" in a future major
> > > > release
> > > > > > (or even codify that plan in this KIP, if there's enough support
> > for
> > > > it).
> > > > > > This would also leave the door open for adding recursive support
> to
> > > > SMTs
> > > > > in
> > > > > > the future by adding a permitted value of "recursive".
> > > > > >
> > > > > > 9. One of the linked tickets in the "Motivation" section,
> > > > > > https://issues.apache.org/jira/browse/KAFKA-10640, has an open
> PR
> > > and
> > > > > KIP
> > > > > > that propose adding recursive support to some SMTs. Have you
> > > considered
> > > > > > this type of functionality for your KIP? Or is your aim to stick
> > > solely
> > > > > to
> > > > > > nested field support?
> > > > > >
> > > > >
> > > > > I like the `field.style` configuration flag approach.
> > > > > Thanks for pointing out the `recursive` approach. Will add `nested`
> > at
> > > > the
> > > > > moment, let's check the demand for `recursive` to consider it as
> part
> > > of
> > > > > this or another KIP.
> > > > >
> > > > > I have added the following on the KIP:
> > > > >
> > > > > ```
> > > > > Future KIPs could extend this support for:
> > > > >
> > > > > - Recursive notation: name a field and apply it to all fields
> across
> > > the
> > > > > schema matching that name.
> > > > > - Access to arrays: Adding `[]` notation to represent access to
> > arrays
> > > > and
> > > > > applying SMTs to fields within an array.
> > > > > ```
> > > > >
> > > > >
> > > > > >
> > > > > > Cheers,
> > > > > >
> > > > > > Chris
> > > > > >
> > > > > > On Tue, Feb 8, 2022 at 1:23 PM Jorge Esteban Quilcate Otoya <
> > > > > > quilcate.jo...@gmail.com> wrote:
> > > > > >
> > > > > > > Hi Dev team,
> > > > > > >
> > > > > > > I'd like to start a new discussion thread on Kafka Connect
> > KIP-821:
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-821%3A+Connect+Transforms+support+for+nested+structures
> > > > > > >
> > > > > > > This KIP is aimed to include support for nested structures on
> the
> > > > > > existing
> > > > > > > SMTs — where this make sense.
> > > > > > >
> > > > > > > Looking forward to your feedback.
> > > > > > >
> > > > > > > Regards,
> > > > > > > Jorge.
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Reply via email to