[jira] [Commented] (KAFKA-15912) Parallelize conversion and transformation steps in Connect
[ https://issues.apache.org/jira/browse/KAFKA-15912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17813176#comment-17813176 ]

Mickael Maison commented on KAFKA-15912:
----------------------------------------

The only thread-safety requirement the Transformation interface documents is that [implementations of apply() must be thread safe|https://github.com/apache/kafka/blob/trunk/connect/api/src/main/java/org/apache/kafka/connect/transforms/Transformation.java#L46]. The Predicate interface documents no such requirement.

> Parallelize conversion and transformation steps in Connect
> ----------------------------------------------------------
>
>                 Key: KAFKA-15912
>                 URL: https://issues.apache.org/jira/browse/KAFKA-15912
>             Project: Kafka
>          Issue Type: Improvement
>          Components: connect
>            Reporter: Mickael Maison
>            Priority: Major
>
> In busy Connect pipelines, the conversion and transformation steps can
> sometimes have a very significant impact on performance. This is especially
> true with large records with complex schemas, for example with CDC connectors
> like Debezium.
> Today, in order to always preserve ordering, converters and transformations
> are called on one record at a time in a single thread in the Connect worker.
> As Connect usually handles records in batches (up to max.poll.records in sink
> pipelines; for source pipelines it depends on the connector, but most
> connectors I've seen still tend to return multiple records each loop),
> it could be highly beneficial to attempt running the converters and
> transformation chain in parallel on a pool of processing threads.
> It should be possible to do some of these steps in parallel and still keep
> exact ordering. I'm even considering whether an option to lose ordering but
> allow even faster processing would make sense.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
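The idea in the issue description, running the chain over a batch on a pool of threads while keeping exact ordering, could look roughly like the sketch below. It is not Connect code: the class `OrderedParallelBatch`, the method `processInOrder` and the use of a plain `Function` in place of the converter/transformation chain are all assumptions made for illustration, and the sketch is only valid if the chain is stateless and thread safe.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper, not part of Kafka Connect.
final class OrderedParallelBatch {

    // Applies `chain` to every record on a fixed-size pool and returns the
    // results in the same order as the input batch.
    // ASSUMPTION: `chain` is stateless and thread safe.
    static <R> List<R> processInOrder(List<R> batch, Function<R, R> chain, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<R>> tasks = new ArrayList<>();
            for (R record : batch) {
                tasks.add(() -> chain.apply(record));
            }
            // invokeAll returns the futures in the same order as `tasks`,
            // regardless of which task finishes first.
            List<R> results = new ArrayList<>(batch.size());
            for (Future<R> future : pool.invokeAll(tasks)) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while processing batch", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("record processing failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> batch = List.of("a", "b", "c", "d");
        // Prints [A, B, C, D]
        System.out.println(processInOrder(batch, String::toUpperCase, 3));
    }
}
```

Ordering is restored at the point where results are collected, so records may finish out of order internally. A real implementation would presumably reuse one pool per task rather than create one per batch.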
[jira] [Commented] (KAFKA-15912) Parallelize conversion and transformation steps in Connect
[ https://issues.apache.org/jira/browse/KAFKA-15912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17795926#comment-17795926 ]

Vojtech Juranek commented on KAFKA-15912:
-----------------------------------------

I'd be careful about parallelizing per SMT/converter, as moving data between threads may end up being more expensive, as you already mentioned. Also, if there is a bottleneck, e.g. the value converter, running it in a single thread won't give any significant speed-up. This is actually what we (the Debezium project) observe, both in our performance tests and in user reports (e.g. KAFKA-15996 and [DBZ-7240|https://issues.redhat.com/browse/DBZ-7240]).

It would IMHO be more useful, if possible, to form record pipelines and run records through these pipelines in parallel. With this approach, thread safety can be solved e.g. by creating copies of the SMTs/converters for each processing pipeline. The issue with stateful transformations remains, however. There is also an issue with record ordering, although it is quite easily solvable when processing is done in batches.

I'm currently running some experiments with Debezium Server to see whether such pipelines are possible there and whether they give any significant performance boost (still WIP, no results yet). So I'm wondering whether doing something similar for Kafka Connect makes sense to you, or whether it seems too complicated to be worth the effort, the possible backward-compatibility issues, etc.
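The per-pipeline-copies approach described in this comment could be sketched roughly as follows. This is an illustration under assumptions, not Debezium or Connect code: `PerPipelineCopies`, `process` and `chainFactory` are hypothetical names, and a `Supplier` of plain functions stands in for "create a fresh copy of the SMT/converter chain". Each pipeline receives a contiguous slice of the batch and its own chain instance, and the slices are concatenated back in input order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;
import java.util.function.Supplier;

// Hypothetical helper, not part of Kafka Connect or Debezium.
final class PerPipelineCopies {

    // Splits the batch into at most `pipelines` contiguous slices, runs each
    // slice through its own chain instance, and joins the slices in input order.
    static <R> List<R> process(List<R> batch, Supplier<Function<R, R>> chainFactory, int pipelines) {
        ExecutorService pool = Executors.newFixedThreadPool(pipelines);
        try {
            int sliceSize = Math.max(1, (batch.size() + pipelines - 1) / pipelines);
            List<Future<List<R>>> slices = new ArrayList<>();
            for (int start = 0; start < batch.size(); start += sliceSize) {
                List<R> slice = batch.subList(start, Math.min(start + sliceSize, batch.size()));
                Function<R, R> chain = chainFactory.get(); // private copy, never shared
                slices.add(pool.submit(() -> {
                    List<R> out = new ArrayList<>(slice.size());
                    for (R record : slice) {
                        out.add(chain.apply(record));
                    }
                    return out;
                }));
            }
            List<R> results = new ArrayList<>(batch.size());
            for (Future<List<R>> slice : slices) {
                results.addAll(slice.get()); // joined in submission order
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while processing batch", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("record processing failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Each chain copy owns a scratch buffer that is not thread safe,
        // which is fine because no two pipelines share a copy.
        Supplier<Function<String, String>> factory = () -> {
            StringBuilder scratch = new StringBuilder();
            return value -> {
                scratch.setLength(0);
                return scratch.append(value).reverse().toString();
            };
        };
        // Prints [ba, dc, fe, hg, ji]
        System.out.println(process(List.of("ab", "cd", "ef", "gh", "ij"), factory, 2));
    }
}
```

As the comment notes, this does not help a transformation whose state must span the whole record stream: each copy only sees its own slice.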
[jira] [Commented] (KAFKA-15912) Parallelize conversion and transformation steps in Connect
[ https://issues.apache.org/jira/browse/KAFKA-15912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17792933#comment-17792933 ]

Chris Egerton commented on KAFKA-15912:
---------------------------------------

+1 for the concerns about lack of thread safety in existing SMTs, and for not breaking stateful SMTs that rely on in-order record delivery.

I suppose we could still give each SMT/converter plugin (or a subset of them) a dedicated thread to work on. For example, in a source connector pipeline with two SMTs called "ValueToKey" and "ExtractField", and three converters for record keys, values, and headers, we could have something like this:

Thread 1: ValueToKey, ExtractField (in that order)
Thread 2: Header converter, key converter (in any order)
Thread 3: Value converter

Records would be delivered initially to the first thread, then passed to the second thread, then passed to the third, then back to the task thread (or, if we really want to get fancy, possibly dispatched directly to the producer).

This would allow up to three records to be processed at a time, though it would still be susceptible to hotspots (e.g., if there are no headers involved, the header converter step is basically a no-op, and traversing the entire record value for value conversion is likely to be the most CPU-intensive step). It's also unclear if this kind of limited parallelism would lead to much performance improvement on workers running multiple tasks; my suspicion is that the CPU would be pretty well-saturated on many of these already.
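The three-thread layout in this comment amounts to a staged pipeline in which each stage owns one thread and hands records to the next stage. A minimal sketch of that shape is below, assuming plain string-to-string functions in place of the real SMTs and converters. `StagedPipeline` and `Stage` are hypothetical names and do not exist in Connect. Because each stage is single-threaded and submits to the next stage before picking up its next record, every stage sees records in their original order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.UnaryOperator;

// Hypothetical sketch, not part of Kafka Connect.
final class StagedPipeline implements AutoCloseable {

    // One stage owns one thread, so its plugins only ever handle one record
    // at a time and receive records in arrival order.
    private static final class Stage {
        final ExecutorService thread = Executors.newSingleThreadExecutor();
        final UnaryOperator<String> work;
        Stage next; // null for the last stage

        Stage(UnaryOperator<String> work) {
            this.work = work;
        }

        void accept(String record, CompletableFuture<String> done) {
            thread.execute(() -> {
                try {
                    String out = work.apply(record);
                    if (next == null) {
                        done.complete(out);
                    } else {
                        next.accept(out, done); // hand off before taking the next record
                    }
                } catch (RuntimeException e) {
                    done.completeExceptionally(e);
                }
            });
        }
    }

    private final List<Stage> stages = new ArrayList<>();

    StagedPipeline(List<UnaryOperator<String>> stageWork) {
        if (stageWork.isEmpty()) {
            throw new IllegalArgumentException("at least one stage is required");
        }
        for (UnaryOperator<String> work : stageWork) {
            Stage stage = new Stage(work);
            if (!stages.isEmpty()) {
                stages.get(stages.size() - 1).next = stage;
            }
            stages.add(stage);
        }
    }

    // Feeds the batch into the first stage and waits for results in input order.
    List<String> process(List<String> batch) {
        List<CompletableFuture<String>> pending = new ArrayList<>();
        for (String record : batch) {
            CompletableFuture<String> done = new CompletableFuture<>();
            stages.get(0).accept(record, done);
            pending.add(done);
        }
        List<String> results = new ArrayList<>();
        for (CompletableFuture<String> done : pending) {
            results.add(done.join());
        }
        return results;
    }

    @Override
    public void close() {
        stages.forEach(stage -> stage.thread.shutdown());
    }

    public static void main(String[] args) {
        UnaryOperator<String> transforms = r -> r + ">smts";          // stands in for ValueToKey, ExtractField
        UnaryOperator<String> headerAndKey = r -> r + ">header+key";  // stands in for header and key converters
        UnaryOperator<String> value = r -> r + ">value";              // stands in for the value converter
        try (StagedPipeline pipeline = new StagedPipeline(List.of(transforms, headerAndKey, value))) {
            // Prints [r1>smts>header+key>value, r2>smts>header+key>value]
            System.out.println(pipeline.process(List.of("r1", "r2")));
        }
    }
}
```

With three stages, at most three records are in flight at once, and throughput is bounded by the slowest stage, which matches the hotspot concern raised in the comment.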
[jira] [Commented] (KAFKA-15912) Parallelize conversion and transformation steps in Connect
[ https://issues.apache.org/jira/browse/KAFKA-15912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17790672#comment-17790672 ]

Greg Harris commented on KAFKA-15912:
-------------------------------------

Hey [~mimaison] thanks for the ticket. I've only thought briefly about this and haven't found any obvious blockers, but there are some design restrictions:

# Since the javadocs for Transformation and Predicate don't mention thread-safety, I think we have to assume that they are not thread-safe.
# There is room in the API for a Transformation to be stateful and order-sensitive (such as packing records together), so I think we would be unable to instantiate multiple copies of a single transform stage, and all records would have to pass serially through a stage.

If Transformations and Predicates could declare themselves thread-safe, then we would be able to do some finer-grained parallelism, or fall back to actor-style parallelism (a single thread with message queue input).

I think it would be ineffective/undesirable for Transformations to take this performance optimization burden upon themselves completely like the Task implementations do, so we should certainly improve the framework in this area.
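One way to read the "declare themselves thread-safe, otherwise fall back to actor-style" idea is sketched below. Everything here is hypothetical: the marker interface `DeclaredThreadSafe` does not exist in the Connect API, and `DeclaredThreadSafety`, `threadsFor` and `runStage` are illustrative names. A plugin that opts in is run on several threads; one that does not is run on a single thread fed by the executor's queue, so it sees records serially and in order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical sketch, not part of Kafka Connect.
final class DeclaredThreadSafety {

    // HYPOTHETICAL opt-in marker; the Connect API has no such interface.
    interface DeclaredThreadSafe { }

    // Stateless, so it opts in.
    static final class Upper implements Function<String, String>, DeclaredThreadSafe {
        @Override
        public String apply(String value) {
            return value.toUpperCase();
        }
    }

    // Stateful and order-sensitive, so it does not opt in.
    static final class Numbering implements Function<String, String> {
        private int next = 0; // deliberately unsynchronized

        @Override
        public String apply(String value) {
            return (next++) + ":" + value;
        }
    }

    // Plugins that did not opt in get exactly one thread (actor-style).
    static int threadsFor(Object plugin, int maxThreads) {
        return plugin instanceof DeclaredThreadSafe ? maxThreads : 1;
    }

    // Runs one stage over a batch and returns results in input order.
    static <R> List<R> runStage(Function<R, R> plugin, List<R> batch, int maxThreads) {
        ExecutorService executor = Executors.newFixedThreadPool(threadsFor(plugin, maxThreads));
        try {
            List<Future<R>> pending = new ArrayList<>();
            for (R record : batch) {
                pending.add(executor.submit(() -> plugin.apply(record)));
            }
            List<R> results = new ArrayList<>(batch.size());
            for (Future<R> future : pending) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while running stage", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("record processing failed", e.getCause());
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> batch = List.of("a", "b", "c");
        System.out.println(runStage(new Upper(), batch, 4));     // prints [A, B, C]
        System.out.println(runStage(new Numbering(), batch, 4)); // prints [0:a, 1:b, 2:c]
    }
}
```

An opt-in marker keeps existing plugins on the safe serial path by default. Whether the real API would use an interface, an annotation or a configuration flag is left open in the discussion.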
[jira] [Commented] (KAFKA-15912) Parallelize conversion and transformation steps in Connect
[ https://issues.apache.org/jira/browse/KAFKA-15912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17790505#comment-17790505 ]

Mickael Maison commented on KAFKA-15912:
----------------------------------------

[~ChrisEgerton] [~gharris] Is this something you've considered?