Hi whsoul,

Would you mind updating the KIP document (
https://cwiki.apache.org/confluence/display/KAFKA/KIP+678%3A+New+Kafka+Connect+SMT+for+plainText+%3D%3E+Struct%28or+Map%29+with+Regex)
with all of these changes? We tend to discuss and vote on what's included
in the Wiki as our source of truth, as opposed to pull requests.

RE 4:
> it seems the chained transform with extractField SMT + plainText value
ParseStructByRegex make the same result with struct value
ParseSturctByRegex, but it will drop collector meta data during
extractField ( I think.. )
This is exactly right--using ExtractField + ParseStructByRegex will replace
the record key/value with the parsed struct, and drop everything else. Do
you think it'd increase the implementation complexity significantly to add
support for users to specify a field to operate on, which would preserve
the rest of the record key/value as-is? I'm not so sure that everyone is
going to want to drop all other metadata from their messages, but if
there's something that makes this particularly difficult to implement (as
compared to the other SMTs that are already included out of the box with
Kafka Connect), then we could probably leave it out for now.

RE 6:
It's a tricky distinction, but what I mean is that, although we might add a
class named GroupRegexValidator and use that class in our SMT library,
unless it's part of the public interface we're trying to change, it's an
implementation detail and we don't have to call it out in the KIP. The
advantage to leaving it out is that it makes the KIP more concise and
therefore easier to review, you can choose to rename, remove, decompose,
etc. the class in your PR without having to worry about sticking to the
plan in the KIP that everyone reviewed and voted on. It's a minor detail
though, I'm noting it here more because it may be useful when writing
future KIPs than because it's necessary to adhere to strictly in this one.

Cheers,

Chris

On Fri, Jun 10, 2022 at 8:13 AM gyejun choi <whsou...@gmail.com> wrote:

> Hi Chris,
>
> I applied some code fix according your second reviews.
>
> https://github.com/apache/kafka/pull/12219/commits/f673ea2eae0d907502e44c0ecd53b616386627bf
>
>
> 1. [applied] changed name as ParseStructByRegex
>
> 2. [applied] throw DataException, when a line that the SMT sees doesn't
> match the regex...
> originally, it will be skipped if no data match with regex,
> but change code to throw DataException according to your review
>
> 3. [already applied]
> I already delete code about desc ":{TYPE}" with commit below
>
> https://github.com/apache/kafka/pull/12219/commits/534d995b3e6371c37443eb72eee03884cb23c85d
>
> 4. [need discuss]
> In my use case about log data collection,
> I configured the pipeline below
> nginx => filebeat => kafka => kafka connect es connector => es
>
> filebeat ( or most other log collector ) usually send value as struct ( not
> plaintext ) with collector meta data,
> and the key name as "message" ( in case filebeat )
>
> I think there are more use cases log message wrapped struct value than
> plain text value,
> and it seems the chained transform with extractField SMT + plainText value
> ParseStructByRegex make the same result with struct value
> ParseSturctByRegex,
> but it will drop collector meta data during extractField ( I think.. )
> and also in almost case, users will use ParseStructByRegex SMT with
> extractField SMT
>
> 5. [applied] throw DataException, when there are difference between group
> size and mapping names size
>
>
> 6. [ question ]
> you mean, use RegexValidator class already exist?
> without group "(.*)" pattern check, it will not provide early detection
> about regex config mistake,
> If you think it is enough as runtime DataException detection?
>
> always thanks.
>
> whsoul
>
> 2022년 6월 8일 (수) 오전 9:46, Chris Egerton <fearthecel...@gmail.com>님이 작성:
>
> > Hi whsoul,
> >
> > Thanks for the updates. I have a few more thoughts but this is looking
> > pretty good:
> >
> > 1. The name "ToStructByRegexTransform" is a little unwieldy. What do you
> > think about something shorter like ParseStruct, ParseRegex, or
> > ParseStructByRegex?
> >
> > 2. What happens if a line that the SMT sees doesn't match the regex
> > supplied by the user? Will it throw an exception, silently skip the
> message
> > and return it as-is, allow the user to configure it to do either,
> something
> > else entirely? (I'd personally lean towards just throwing an exception
> > since users can configure regexes to be lenient via the optional
> > quantifier, i.e. "?")
> >
> > 3. The description for the "regex" property still includes the "( with
> > :{TYPE} )" snippet; should that be removed?
> >
> > 4. Is it worth adding support to this SMT to operate on an individual
> field
> > of a message? I.e., you specify a "field" of "log_line" and the SMT
> expects
> > to see a struct or a map with a field/key of "log_line" and parses that
> > instead of the entire message. If so, it might be worth specifying that
> > this property would follow any precedents set by KIP-821 [1] so that
> nested
> > fields could be accessed instead of just top-level fields.
> >
> > 5. What happens if the user specifies more names in the "mapping"
> property
> > than there are groups in the "regex" property?
> >
> > 6. (Nit) I don't think the GroupRegexValidator class needs to be called
> out
> > as part of the changes to public interface if it's just going to be used
> by
> > this transform.
> >
> > [1] -
> >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-821%3A+Connect+Transforms+support+for+nested+structures
> >
> > Cheers,
> >
> > Chris
> >
> > On Tue, Jun 7, 2022 at 4:47 AM gyejun choi <whsou...@gmail.com> wrote:
> >
> > > Hi Chris,
> > >
> > > Thank you for your positive feed back,
> > > And sorry about late reply ( I didn’t recognize your reply email… TT )
> > >
> > > And I update KIP and PR with your review
> > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP+678%3A+New+Kafka+Connect+SMT+for+plainText+%3D%3E+Struct%28or+Map%29+with+Regex
> > >
> > > 1. typecast function removed
> > > 2. struct.field removed
> > > 3. Proposed Interface changed
> > >
> > >
> > > I will wait for you second feed back
> > >
> > > Thanks~
> > >
> > > On 2021/07/15 15:57:14 Chris Egerton wrote:
> > > > Hi whsoul,
> > > >
> > > > Thanks for the KIP. The two use cases you identified seem quite
> > > appealing,
> > > > especially the first one: parsing structured log messages.
> > > >
> > > > Here are my initial thoughts on the proposed design:
> > > >
> > > > 1. I wonder if it's necessary to include support for type casting
> with
> > > this
> > > > SMT. We already have a Cast SMT (
> > > >
> > >
> > >
> >
> https://github.com/apache/kafka/blob/trunk/connect/transforms/src/main/java/org/apache/kafka/connect/transforms/Cast.java
> > > )
> > > > that can parse multiple fields of a structured record value with
> > > differing
> > > > types. Would it be enough for your new SMT to only produce string
> > values
> > > > for its structured data, and then allow users to perform casting
> logic
> > > > using the Cast SMT afterward?
> > > >
> > > > 2. It seems like the "struct.field" property is similar; based on the
> > > > examples, it looks like when the SMT is configured with a value for
> > that
> > > > property, it will first pull out a field from a structured record
> value
> > > > (for example, it would pull out the value "
> > > > https://kafka.apache.org/documentation/#connect"; from a map of
> > {"url": "
> > > > https://kafka.apache.org/documentation/#connect"}), then parse that
> > > field's
> > > > value, and replace the entire record value (or key) with the result
> of
> > > the
> > > > parsing stage. It seems like this could be accomplished using the
> > > > ExtractField SMT (
> > > >
> > >
> > >
> >
> https://github.com/apache/kafka/blob/trunk/connect/transforms/src/main/java/org/apache/kafka/connect/transforms/ExtractField.java
> > > )
> > > > as a preliminary step before passing it to your new SMT. Is this
> > correct?
> > > > And if so, could we simplify the interface for your SMT by removing
> the
> > > > "struct.field" property in favor of the existing ExtractField SMT?
> > > >
> > > > 3. (Nit) I believe the property names in the table beneath the
> > "Proposed
> > > > Interfaces" section should have the "transforms.RegexTransform."
> prefix
> > > > removed, since users will be able to configure the SMT with any name,
> > not
> > > > just "RegexTransform". This keeps in line with similar KIPs such as
> > > > KIP-437, which added a new property to the MaskField SMT (
> > > >
> > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-437%3A+Custom+replacement+for+MaskField+SMT
> > > > ).
> > > >
> > > > Cheers,
> > > >
> > > > Chris
> > > >
> > > > On Thu, Jul 15, 2021 at 7:56 AM gyejun choi <wh...@gmail.com> wrote:
> > > >
> > > > > is there someone who any feed back about this?
> > > > >
> > > > > 2020년 10월 23일 (금) 오후 2:56, gyejun choi <wh...@gmail.com>님이 작성:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > I've opened KIP-678 which is intended to provide a new SMT in
> Kafka
> > > > > Connect.
> > > > > > I'd be grateful for any
> > > > > > feedback:
> > > > >
> > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP+678%3A+New+Kafka+Connect+SMT+for+plainText+%3D%3E+Struct%28or+Map%29+with+Regex==
> > > > > >
> > > > > > thanks,
> > > > > >
> > > > > > whsoul
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Reply via email to