[jira] [Commented] (KAFKA-15912) Parallelize conversion and transformation steps in Connect

2024-02-01 Thread Mickael Maison (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-15912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17813176#comment-17813176
 ] 

Mickael Maison commented on KAFKA-15912:


The Transformation interface only mentions that [implementations of apply() 
must be thread 
safe|https://github.com/apache/kafka/blob/trunk/connect/api/src/main/java/org/apache/kafka/connect/transforms/Transformation.java#L46].
 This is not the case for Predicate.

> Parallelize conversion and transformation steps in Connect
> --
>
> Key: KAFKA-15912
> URL: https://issues.apache.org/jira/browse/KAFKA-15912
> Project: Kafka
>  Issue Type: Improvement
>  Components: connect
>Reporter: Mickael Maison
>Priority: Major
>
> In busy Connect pipelines, the conversion and transformation steps can 
> sometimes have a very significant impact on performance. This is especially 
> true with large records with complex schemas, for example with CDC connectors 
> like Debezium.
> Today in order to always preserve ordering, converters and transformations 
> are called on one record at a time in a single thread in the Connect worker. 
> As Connect usually handles records in batches (up to max.poll.records in sink 
> pipelines, for source pipelines while it really depends on the connector, 
> most connectors I've seen still tend to return multiple records each loop), 
> it could be highly beneficial to attempt running the converters and 
> transformation chain in parallel by a pool a processing threads.
> It should be possible to do some of these steps in parallel and still keep 
> exact ordering. I'm even considering whether an option to lose ordering but 
> allow even faster processing would make sense.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-15912) Parallelize conversion and transformation steps in Connect

2023-12-12 Thread Vojtech Juranek (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-15912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17795926#comment-17795926
 ] 

Vojtech Juranek commented on KAFKA-15912:
-

I'd be careful to do the parallelization per SMT/converter as moving data 
between threads maybe be in result more expensive, as you already mentioned. 
Also, if there is some bottleneck, e.g. value converter, running it in a single 
thread won't give any significant speed up. And this is actually what we 
(Debezium project) observe, either in our perf. tests or reported by the users 
(e.g. KAFKA-15996, resp. [DBZ-7240|https://issues.redhat.com/browse/DBZ-7240]). 
It would be IMHO more useful, if possible, to form record pipelines and run in 
parallel records through these pipelines in parallel. With this approach, 
thread safety can be solved e.g. by creating SMT/convertors copies for each 
processing pipeline. The issue with stateful transformation however remains. 
Also there is an issue with records ordering, however quite easily solvable 
when processing is done in batches. I'm currently doing some experiments with 
Debezium server, if such pipelines are possible there and if it gives any 
significant performance boost (still WIP, no results yet). So I'm wondering if 
doing something similar for Kafka Connect make sense for you or this seems to 
be too much complicated to worth the effort/possible backward compatibility 
issues/etc?

> Parallelize conversion and transformation steps in Connect
> --
>
> Key: KAFKA-15912
> URL: https://issues.apache.org/jira/browse/KAFKA-15912
> Project: Kafka
>  Issue Type: Improvement
>  Components: connect
>Reporter: Mickael Maison
>Priority: Major
>
> In busy Connect pipelines, the conversion and transformation steps can 
> sometimes have a very significant impact on performance. This is especially 
> true with large records with complex schemas, for example with CDC connectors 
> like Debezium.
> Today in order to always preserve ordering, converters and transformations 
> are called on one record at a time in a single thread in the Connect worker. 
> As Connect usually handles records in batches (up to max.poll.records in sink 
> pipelines, for source pipelines while it really depends on the connector, 
> most connectors I've seen still tend to return multiple records each loop), 
> it could be highly beneficial to attempt running the converters and 
> transformation chain in parallel by a pool a processing threads.
> It should be possible to do some of these steps in parallel and still keep 
> exact ordering. I'm even considering whether an option to lose ordering but 
> allow even faster processing would make sense.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-15912) Parallelize conversion and transformation steps in Connect

2023-12-04 Thread Chris Egerton (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-15912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17792933#comment-17792933
 ] 

Chris Egerton commented on KAFKA-15912:
---

+1 for the concerns about lack of thread safety in existing SMTs, and for not 
breaking stateful SMTs that rely on in-order record delivery.

I suppose we could still give each SMT/converter plugin (or a subset of them) a 
dedicated thread to work on. For example, in a source connector pipeline with 
two SMTs called "ValueToKey" and "ExtractField", and three converters for 
record keys, values, and headers, we could have something like this:

 

Thread 1: ValueToKey, ExtractField (in that order)

Thread 2: Header converter, key converter (in any order)

Thread 3: Value converter

 

Records would be delivered initially to the first thread, then passed to the 
second thread, then passed to the third, then back to the task thread (or, if 
we really want to get fancy, possibly dispatched directly to the producer).

This would allow up to three records to be processed at a time, though it would 
still be susceptible to hotspots (e.g., if there are no headers involved, the 
header converter step is basically a no-op, and traversing the entire record 
value for value conversion is likely to be the most CPU-intensive step). It's 
also unclear if this kind of limited parallelism would lead to much performance 
improvement on workers running multiple tasks; my suspicion is that the CPU 
would be pretty well-saturated on many of these already.

> Parallelize conversion and transformation steps in Connect
> --
>
> Key: KAFKA-15912
> URL: https://issues.apache.org/jira/browse/KAFKA-15912
> Project: Kafka
>  Issue Type: Improvement
>  Components: connect
>Reporter: Mickael Maison
>Priority: Major
>
> In busy Connect pipelines, the conversion and transformation steps can 
> sometimes have a very significant impact on performance. This is especially 
> true with large records with complex schemas, for example with CDC connectors 
> like Debezium.
> Today in order to always preserve ordering, converters and transformations 
> are called on one record at a time in a single thread in the Connect worker. 
> As Connect usually handles records in batches (up to max.poll.records in sink 
> pipelines, for source pipelines while it really depends on the connector, 
> most connectors I've seen still tend to return multiple records each loop), 
> it could be highly beneficial to attempt running the converters and 
> transformation chain in parallel by a pool a processing threads.
> It should be possible to do some of these steps in parallel and still keep 
> exact ordering. I'm even considering whether an option to lose ordering but 
> allow even faster processing would make sense.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-15912) Parallelize conversion and transformation steps in Connect

2023-11-28 Thread Greg Harris (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-15912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17790672#comment-17790672
 ] 

Greg Harris commented on KAFKA-15912:
-

Hey [~mimaison] thanks for the ticket. I've only thought briefly about this and 
haven't found any obvious blockers, but there are some design restrictions:
 # Since the javadocs for Transformation and Predicate don't mention 
thread-safety, I think we have to assume that they are not thread-safe
 # There is room in the API for a Transformation to be stateful and 
order-sensitive, (such as packing records together) so I think we would be 
unable to instantiate multiple copies of a single transform stage, and all 
records would have to pass serially through a stage.

If Transformations and Predicates could declare themselves thread-safe, then we 
would be able to do some finer-grained parallelism, or fallback to actor-style 
parallelism (a single thread with message queue input).

I think it would be ineffective/undesirable for Transformations to take this 
performance optimization burden upon themselves completely like the Task 
implementations do, so we should certainly improve the framework in this area.

> Parallelize conversion and transformation steps in Connect
> --
>
> Key: KAFKA-15912
> URL: https://issues.apache.org/jira/browse/KAFKA-15912
> Project: Kafka
>  Issue Type: Improvement
>  Components: connect
>Reporter: Mickael Maison
>Priority: Major
>
> In busy Connect pipelines, the conversion and transformation steps can 
> sometimes have a very significant impact on performance. This is especially 
> true with large records with complex schemas, for example with CDC connectors 
> like Debezium.
> Today in order to always preserve ordering, converters and transformations 
> are called on one record at a time in a single thread in the Connect worker. 
> As Connect usually handles records in batches (up to max.poll.records in sink 
> pipelines, for source pipelines while it really depends on the connector, 
> most connectors I've seen still tend to return multiple records each loop), 
> it could be highly beneficial to attempt running the converters and 
> transformation chain in parallel by a pool a processing threads.
> It should be possible to do some of these steps in parallel and still keep 
> exact ordering. I'm even considering whether an option to lose ordering but 
> allow even faster processing would make sense.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-15912) Parallelize conversion and transformation steps in Connect

2023-11-28 Thread Mickael Maison (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-15912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17790505#comment-17790505
 ] 

Mickael Maison commented on KAFKA-15912:


[~ChrisEgerton] [~gharris] Is this something you've considered?

> Parallelize conversion and transformation steps in Connect
> --
>
> Key: KAFKA-15912
> URL: https://issues.apache.org/jira/browse/KAFKA-15912
> Project: Kafka
>  Issue Type: Improvement
>  Components: connect
>Reporter: Mickael Maison
>Priority: Major
>
> In busy Connect pipelines, the conversion and transformation steps can 
> sometimes have a very significant impact on performance. This is especially 
> true with large records with complex schemas, for example with CDC connectors 
> like Debezium.
> Today in order to always preserve ordering, converters and transformations 
> are called on one record at a time in a single thread in the Connect worker. 
> As Connect usually handles records in batches (up to max.poll.records in sink 
> pipelines, for source pipelines while it really depends on the connector, 
> most connectors I've seen still tend to return multiple records each loop), 
> it could be highly beneficial to attempt running the converters and 
> transformation chain in parallel by a pool a processing threads.
> It should be possible to do some of these steps in parallel and still keep 
> exact ordering. I'm even considering whether an option to lose ordering but 
> allow even faster processing would make sense.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)