[VOTE] KIP-678: New Kafka Connect SMT for plainText => struct with Regex
Hi all, I'd like to start a vote for KIP-678: New Kafka Connect SMT for plainText => struct with Regex KIP: https://cwiki.apache.org/confluence/display/KAFKA/KIP-678%3A+New+Kafka+Connect+SMT+for+plainText+%3D%3E+Struct%28or+Map%29+with+Regex JIRA: https://github.com/apache/kafka/pull/12219 Discussion thread: https://lists.apache.org/thread/xb57l7j953k8dfgqvktb09y31vzpm1xx https://lists.apache.org/thread/7t1k0ko8l973v4oj3l983j7qpwolhyzf Thanks, whsoul
Re: Re: KIP 678: New Kafka Connect SMT for plainText => Struct(or Map) with Regex
Hello Chris, Long time no feed back about this, do I have any work left to do to proceed with discuss and vote? I will waiting for your guide, thanks, always whsoul 2022년 6월 14일 (화) 오후 9:29, gyejun choi 님이 작성: > Hi Chris > > I updated KIP document > > https://cwiki.apache.org/confluence/display/KAFKA/KIP+678%3A+New+Kafka+Connect+SMT+for+plainText+%3D%3E+Struct%28or+Map%29+with+Regex > ) > about discussion and changes before, > > and commited some code according to your review. > > https://github.com/apache/kafka/pull/12219/commits/1541a538eaff676441f125c400a7807d17f2e138 > > I understand you mean the simplicity is more important with SMT ( because > it is Simple Message Trasform.. ) > so I removed about "message" structured format input support > > and also move GroupRegexValidator to inner private class. > > thanks, always > > whsoul > > 2022년 6월 14일 (화) 오전 11:30, Chris Egerton 님이 작성: > >> Hi whsoul, >> >> Would you mind updating the KIP document ( >> >> https://cwiki.apache.org/confluence/display/KAFKA/KIP+678%3A+New+Kafka+Connect+SMT+for+plainText+%3D%3E+Struct%28or+Map%29+with+Regex >> ) >> with all of these changes? We tend to discuss and vote on what's included >> in the Wiki as our source of truth, as opposed to pull requests. >> >> RE 4: >> > it seems the chained transform with extractField SMT + plainText value >> ParseStructByRegex make the same result with struct value >> ParseSturctByRegex, but it will drop collector meta data during >> extractField ( I think.. ) >> This is exactly right--using ExtractField + ParseStructByRegex will >> replace >> the record key/value with the parsed struct, and drop everything else. Do >> you think it'd increase the implementation complexity significantly to add >> support for users to specify a field to operate on, which would preserve >> the rest of the record key/value as-is? I'm not so sure that everyone is >> going to want to drop all other metadata from their messages, but if >> there's something that makes this particularly difficult to implement (as >> compared to the other SMTs that are already included out of the box with >> Kafka Connect), then we could probably leave it out for now. >> >> RE 6: >> It's a tricky distinction, but what I mean is that, although we might add >> a >> class named GroupRegexValidator and use that class in our SMT library, >> unless it's part of the public interface we're trying to change, it's an >> implementation detail and we don't have to call it out in the KIP. The >> advantage to leaving it out is that it makes the KIP more concise and >> therefore easier to review, you can choose to rename, remove, decompose, >> etc. the class in your PR without having to worry about sticking to the >> plan in the KIP that everyone reviewed and voted on. It's a minor detail >> though, I'm noting it here more because it may be useful when writing >> future KIPs than because it's necessary to adhere to strictly in this one. >> >> Cheers, >> >> Chris >> >> On Fri, Jun 10, 2022 at 8:13 AM gyejun choi wrote: >> >> > Hi Chris, >> > >> > I applied some code fix according your second reviews. >> > >> > >> https://github.com/apache/kafka/pull/12219/commits/f673ea2eae0d907502e44c0ecd53b616386627bf >> > >> > >> > 1. [applied] changed name as ParseStructByRegex >> > >> > 2. [applied] throw DataException, when a line that the SMT sees doesn't >> > match the regex... >> > originally, it will be skipped if no data match with regex, >> > but change code to throw DataException according to your review >> > >> > 3. [already applied] >> > I already delete code about desc ":{TYPE}" with commit below >> > >> > >> https://github.com/apache/kafka/pull/12219/commits/534d995b3e6371c37443eb72eee03884cb23c85d >> > >> > 4. [need discuss] >> > In my use case about log data collection, >> > I configured the pipeline below >> > nginx => filebeat => kafka => kafka connect es connector => es >> > >> > filebeat ( or most other log collector ) usually send value as struct ( >> not >> > plaintext ) with collector meta data, >> > and the key name as "message" ( in case filebeat ) >> > >> > I think there are more use cases log message wrapped struct value than >> > plain text value, >> > and it seems the chained transform with extract
Re: Re: KIP 678: New Kafka Connect SMT for plainText => Struct(or Map) with Regex
Hi Chris I updated KIP document https://cwiki.apache.org/confluence/display/KAFKA/KIP+678%3A+New+Kafka+Connect+SMT+for+plainText+%3D%3E+Struct%28or+Map%29+with+Regex ) about discussion and changes before, and commited some code according to your review. https://github.com/apache/kafka/pull/12219/commits/1541a538eaff676441f125c400a7807d17f2e138 I understand you mean the simplicity is more important with SMT ( because it is Simple Message Trasform.. ) so I removed about "message" structured format input support and also move GroupRegexValidator to inner private class. thanks, always whsoul 2022년 6월 14일 (화) 오전 11:30, Chris Egerton 님이 작성: > Hi whsoul, > > Would you mind updating the KIP document ( > > https://cwiki.apache.org/confluence/display/KAFKA/KIP+678%3A+New+Kafka+Connect+SMT+for+plainText+%3D%3E+Struct%28or+Map%29+with+Regex > ) > with all of these changes? We tend to discuss and vote on what's included > in the Wiki as our source of truth, as opposed to pull requests. > > RE 4: > > it seems the chained transform with extractField SMT + plainText value > ParseStructByRegex make the same result with struct value > ParseSturctByRegex, but it will drop collector meta data during > extractField ( I think.. ) > This is exactly right--using ExtractField + ParseStructByRegex will replace > the record key/value with the parsed struct, and drop everything else. Do > you think it'd increase the implementation complexity significantly to add > support for users to specify a field to operate on, which would preserve > the rest of the record key/value as-is? I'm not so sure that everyone is > going to want to drop all other metadata from their messages, but if > there's something that makes this particularly difficult to implement (as > compared to the other SMTs that are already included out of the box with > Kafka Connect), then we could probably leave it out for now. > > RE 6: > It's a tricky distinction, but what I mean is that, although we might add a > class named GroupRegexValidator and use that class in our SMT library, > unless it's part of the public interface we're trying to change, it's an > implementation detail and we don't have to call it out in the KIP. The > advantage to leaving it out is that it makes the KIP more concise and > therefore easier to review, you can choose to rename, remove, decompose, > etc. the class in your PR without having to worry about sticking to the > plan in the KIP that everyone reviewed and voted on. It's a minor detail > though, I'm noting it here more because it may be useful when writing > future KIPs than because it's necessary to adhere to strictly in this one. > > Cheers, > > Chris > > On Fri, Jun 10, 2022 at 8:13 AM gyejun choi wrote: > > > Hi Chris, > > > > I applied some code fix according your second reviews. > > > > > https://github.com/apache/kafka/pull/12219/commits/f673ea2eae0d907502e44c0ecd53b616386627bf > > > > > > 1. [applied] changed name as ParseStructByRegex > > > > 2. [applied] throw DataException, when a line that the SMT sees doesn't > > match the regex... > > originally, it will be skipped if no data match with regex, > > but change code to throw DataException according to your review > > > > 3. [already applied] > > I already delete code about desc ":{TYPE}" with commit below > > > > > https://github.com/apache/kafka/pull/12219/commits/534d995b3e6371c37443eb72eee03884cb23c85d > > > > 4. [need discuss] > > In my use case about log data collection, > > I configured the pipeline below > > nginx => filebeat => kafka => kafka connect es connector => es > > > > filebeat ( or most other log collector ) usually send value as struct ( > not > > plaintext ) with collector meta data, > > and the key name as "message" ( in case filebeat ) > > > > I think there are more use cases log message wrapped struct value than > > plain text value, > > and it seems the chained transform with extractField SMT + plainText > value > > ParseStructByRegex make the same result with struct value > > ParseSturctByRegex, > > but it will drop collector meta data during extractField ( I think.. ) > > and also in almost case, users will use ParseStructByRegex SMT with > > extractField SMT > > > > 5. [applied] throw DataException, when there are difference between group > > size and mapping names size > > > > > > 6. [ question ] > > you mean, use RegexValidator class already exist? > > without group "(.*)" pattern check, it will not provide early detection >
Re: Re: KIP 678: New Kafka Connect SMT for plainText => Struct(or Map) with Regex
Hi Chris, I applied some code fix according your second reviews. https://github.com/apache/kafka/pull/12219/commits/f673ea2eae0d907502e44c0ecd53b616386627bf 1. [applied] changed name as ParseStructByRegex 2. [applied] throw DataException, when a line that the SMT sees doesn't match the regex... originally, it will be skipped if no data match with regex, but change code to throw DataException according to your review 3. [already applied] I already delete code about desc ":{TYPE}" with commit below https://github.com/apache/kafka/pull/12219/commits/534d995b3e6371c37443eb72eee03884cb23c85d 4. [need discuss] In my use case about log data collection, I configured the pipeline below nginx => filebeat => kafka => kafka connect es connector => es filebeat ( or most other log collector ) usually send value as struct ( not plaintext ) with collector meta data, and the key name as "message" ( in case filebeat ) I think there are more use cases log message wrapped struct value than plain text value, and it seems the chained transform with extractField SMT + plainText value ParseStructByRegex make the same result with struct value ParseSturctByRegex, but it will drop collector meta data during extractField ( I think.. ) and also in almost case, users will use ParseStructByRegex SMT with extractField SMT 5. [applied] throw DataException, when there are difference between group size and mapping names size 6. [ question ] you mean, use RegexValidator class already exist? without group "(.*)" pattern check, it will not provide early detection about regex config mistake, If you think it is enough as runtime DataException detection? always thanks. whsoul 2022년 6월 8일 (수) 오전 9:46, Chris Egerton 님이 작성: > Hi whsoul, > > Thanks for the updates. I have a few more thoughts but this is looking > pretty good: > > 1. The name "ToStructByRegexTransform" is a little unwieldy. What do you > think about something shorter like ParseStruct, ParseRegex, or > ParseStructByRegex? > > 2. What happens if a line that the SMT sees doesn't match the regex > supplied by the user? Will it throw an exception, silently skip the message > and return it as-is, allow the user to configure it to do either, something > else entirely? (I'd personally lean towards just throwing an exception > since users can configure regexes to be lenient via the optional > quantifier, i.e. "?") > > 3. The description for the "regex" property still includes the "( with > :{TYPE} )" snippet; should that be removed? > > 4. Is it worth adding support to this SMT to operate on an individual field > of a message? I.e., you specify a "field" of "log_line" and the SMT expects > to see a struct or a map with a field/key of "log_line" and parses that > instead of the entire message. If so, it might be worth specifying that > this property would follow any precedents set by KIP-821 [1] so that nested > fields could be accessed instead of just top-level fields. > > 5. What happens if the user specifies more names in the "mapping" property > than there are groups in the "regex" property? > > 6. (Nit) I don't think the GroupRegexValidator class needs to be called out > as part of the changes to public interface if it's just going to be used by > this transform. > > [1] - > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-821%3A+Connect+Transforms+support+for+nested+structures > > Cheers, > > Chris > > On Tue, Jun 7, 2022 at 4:47 AM gyejun choi wrote: > > > Hi Chris, > > > > Thank you for your positive feed back, > > And sorry about late reply ( I didn’t recognize your reply email… TT ) > > > > And I update KIP and PR with your review > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP+678%3A+New+Kafka+Connect+SMT+for+plainText+%3D%3E+Struct%28or+Map%29+with+Regex > > > > 1. typecast function removed > > 2. struct.field removed > > 3. Proposed Interface changed > > > > > > I will wait for you second feed back > > > > Thanks~ > > > > On 2021/07/15 15:57:14 Chris Egerton wrote: > > > Hi whsoul, > > > > > > Thanks for the KIP. The two use cases you identified seem quite > > appealing, > > > especially the first one: parsing structured log messages. > > > > > > Here are my initial thoughts on the proposed design: > > > > > > 1. I wonder if it's necessary to include support for type casting with > > this > > > SMT. We already have a Cast SMT ( > > > > > > > > https://github.com/apache/kafka/blob/trunk/connect/transforms/src/main/java/org/apach
RE: Re: KIP 678: New Kafka Connect SMT for plainText => Struct(or Map) with Regex
Hi Chris, Thank you for your positive feed back, And sorry about late reply ( I didn’t recognize your reply email… TT ) And I update KIP and PR with your review https://cwiki.apache.org/confluence/display/KAFKA/KIP+678%3A+New+Kafka+Connect+SMT+for+plainText+%3D%3E+Struct%28or+Map%29+with+Regex 1. typecast function removed 2. struct.field removed 3. Proposed Interface changed I will wait for you second feed back Thanks~ On 2021/07/15 15:57:14 Chris Egerton wrote: > Hi whsoul, > > Thanks for the KIP. The two use cases you identified seem quite appealing, > especially the first one: parsing structured log messages. > > Here are my initial thoughts on the proposed design: > > 1. I wonder if it's necessary to include support for type casting with this > SMT. We already have a Cast SMT ( > https://github.com/apache/kafka/blob/trunk/connect/transforms/src/main/java/org/apache/kafka/connect/transforms/Cast.java ) > that can parse multiple fields of a structured record value with differing > types. Would it be enough for your new SMT to only produce string values > for its structured data, and then allow users to perform casting logic > using the Cast SMT afterward? > > 2. It seems like the "struct.field" property is similar; based on the > examples, it looks like when the SMT is configured with a value for that > property, it will first pull out a field from a structured record value > (for example, it would pull out the value " > https://kafka.apache.org/documentation/#connect"; from a map of {"url": " > https://kafka.apache.org/documentation/#connect"}), then parse that field's > value, and replace the entire record value (or key) with the result of the > parsing stage. It seems like this could be accomplished using the > ExtractField SMT ( > https://github.com/apache/kafka/blob/trunk/connect/transforms/src/main/java/org/apache/kafka/connect/transforms/ExtractField.java ) > as a preliminary step before passing it to your new SMT. Is this correct? > And if so, could we simplify the interface for your SMT by removing the > "struct.field" property in favor of the existing ExtractField SMT? > > 3. (Nit) I believe the property names in the table beneath the "Proposed > Interfaces" section should have the "transforms.RegexTransform." prefix > removed, since users will be able to configure the SMT with any name, not > just "RegexTransform". This keeps in line with similar KIPs such as > KIP-437, which added a new property to the MaskField SMT ( > https://cwiki.apache.org/confluence/display/KAFKA/KIP-437%3A+Custom+replacement+for+MaskField+SMT > ). > > Cheers, > > Chris > > On Thu, Jul 15, 2021 at 7:56 AM gyejun choi wrote: > > > is there someone who any feed back about this? > > > > 2020년 10월 23일 (금) 오후 2:56, gyejun choi 님이 작성: > > > > > Hi, > > > > > > I've opened KIP-678 which is intended to provide a new SMT in Kafka > > Connect. > > > I'd be grateful for any > > > feedback: > > https://cwiki.apache.org/confluence/display/KAFKA/KIP+678%3A+New+Kafka+Connect+SMT+for+plainText+%3D%3E+Struct%28or+Map%29+with+Regex== > > > > > > thanks, > > > > > > whsoul > > > > > > > > >
Re: KIP 678: New Kafka Connect SMT for plainText => Struct(or Map) with Regex
is there someone who any feed back about this? 2020년 10월 23일 (금) 오후 2:56, gyejun choi 님이 작성: > Hi, > > I've opened KIP-678 which is intended to provide a new SMT in Kafka Connect. > I'd be grateful for any > feedback:https://cwiki.apache.org/confluence/display/KAFKA/KIP+678%3A+New+Kafka+Connect+SMT+for+plainText+%3D%3E+Struct%28or+Map%29+with+Regex== > > thanks, > > whsoul > >
KIP 678: New Kafka Connect SMT for plainText => Struct(or Map) with Regex
Hi, I've opened KIP-678 which is intended to provide a new SMT in Kafka Connect. I'd be grateful for any feedback:https://cwiki.apache.org/confluence/display/KAFKA/KIP+678%3A+New+Kafka+Connect+SMT+for+plainText+%3D%3E+Struct%28or+Map%29+with+Regex== thanks, whsoul
need permission to create KIP
I need a permission with this page for "Create KIP" https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals wiki userId : whsoul82 thanks