[ 
https://issues.apache.org/jira/browse/KAFKA-15912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17790672#comment-17790672
 ] 

Greg Harris commented on KAFKA-15912:
-------------------------------------

Hey [~mimaison] thanks for the ticket. I've only thought briefly about this and 
haven't found any obvious blockers, but there are some design restrictions:
 # Since the javadocs for Transformation and Predicate don't mention 
thread-safety, I think we have to assume that they are not thread-safe.
 # There is room in the API for a Transformation to be stateful and 
order-sensitive (such as packing records together), so I think we would be 
unable to instantiate multiple copies of a single transform stage, and all 
records would have to pass serially through a stage.
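
To make the second restriction concrete, here is a minimal sketch of a stateful, order-sensitive stage. PairPacker and PackingDemo are hypothetical names, and a plain java.util.function.Function stands in for the real Transformation interface, so this only illustrates the problem and is not Connect code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical stand-in for a stateful Transformation: it buffers one record
// and emits records joined in pairs. This is NOT the Connect Transformation API.
class PairPacker implements Function<String, String> {
    private String pending; // state carried between calls

    @Override
    public String apply(String record) {
        if (pending == null) {
            pending = record;
            return null; // nothing to emit yet
        }
        String packed = pending + "+" + record;
        pending = null;
        return packed;
    }
}

class PackingDemo {
    // Round-robins records over the given instances and collects non-null outputs.
    static List<String> run(List<String> records, List<PairPacker> instances) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < records.size(); i++) {
            String result = instances.get(i % instances.size()).apply(records.get(i));
            if (result != null) {
                out.add(result);
            }
        }
        return out;
    }
}
```

With records a, b, c, d, a single PairPacker emits a+b and c+d, while two instances fed round-robin emit a+c and b+d. The output changes as soon as the stage is duplicated, which is why each stage would need exactly one instance seeing every record in order.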

If Transformations and Predicates could declare themselves thread-safe, then we 
would be able to do some finer-grained parallelism, or fall back to actor-style 
parallelism (a single thread with message queue input).
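
A rough sketch of what that could look like, under some assumptions: ThreadSafe is a hypothetical marker interface (nothing like it exists in the Connect API today), Stage is a hypothetical wrapper, and a plain Function again stands in for a Transformation or Predicate. Stages that opt in run on a shared pool; everything else gets a dedicated single thread whose task queue acts as the actor's mailbox:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

// Hypothetical marker; no such interface exists in the Connect API today.
interface ThreadSafe {}

// Wraps one stage of the chain. A stage marked ThreadSafe runs on the shared
// pool; any other stage gets its own single thread, so calls to it are
// serialized in submission order (actor-style).
final class Stage<A, B> implements AutoCloseable {
    private final Function<A, B> fn;
    private final ExecutorService executor;
    private final boolean ownsExecutor;

    Stage(Function<A, B> fn, ExecutorService sharedPool) {
        this.fn = fn;
        this.ownsExecutor = !(fn instanceof ThreadSafe);
        this.executor = ownsExecutor ? Executors.newSingleThreadExecutor() : sharedPool;
    }

    // Enqueues one record; the future completes once the stage has processed it.
    CompletableFuture<B> submit(A record) {
        return CompletableFuture.supplyAsync(() -> fn.apply(record), executor);
    }

    @Override
    public void close() {
        if (ownsExecutor) {
            executor.shutdown(); // the shared pool is left for its owner to shut down
        }
    }
}
```

The single-thread executor runs tasks in FIFO order, so as long as one thread does the submitting, a stateful stage still sees records in their original order while other stages work concurrently on other records.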

I think it would be ineffective/undesirable for Transformations to take this 
performance optimization burden upon themselves completely like the Task 
implementations do, so we should certainly improve the framework in this area.

> Parallelize conversion and transformation steps in Connect
> ----------------------------------------------------------
>
>                 Key: KAFKA-15912
>                 URL: https://issues.apache.org/jira/browse/KAFKA-15912
>             Project: Kafka
>          Issue Type: Improvement
>          Components: connect
>            Reporter: Mickael Maison
>            Priority: Major
>
> In busy Connect pipelines, the conversion and transformation steps can 
> sometimes have a very significant impact on performance. This is especially 
> true with large records with complex schemas, for example with CDC connectors 
> like Debezium.
> Today, in order to always preserve ordering, converters and transformations 
> are called on one record at a time in a single thread in the Connect worker. 
> As Connect usually handles records in batches (up to max.poll.records in sink 
> pipelines; for source pipelines it depends on the connector, but most 
> connectors I've seen still tend to return multiple records each loop), 
> it could be highly beneficial to attempt running the converters and 
> transformation chain in parallel using a pool of processing threads.
> It should be possible to do some of these steps in parallel and still keep 
> exact ordering. I'm even considering whether an option to lose ordering but 
> allow even faster processing would make sense.
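
One way to run a step in parallel and still keep exact ordering, sketched under the assumption that the step is stateless and thread-safe (which Connect does not currently guarantee for converters or transformations). OrderedParallelMap is a hypothetical helper, not part of Connect: it fans a batch out to a pool and collects results in submission order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.function.Function;

final class OrderedParallelMap {
    // Applies fn to every record on the pool and returns results in input order.
    // Assumes fn is stateless and thread-safe.
    static <A, B> List<B> apply(List<A> batch, Function<A, B> fn, Executor pool) {
        List<CompletableFuture<B>> futures = new ArrayList<>(batch.size());
        for (A record : batch) {
            futures.add(CompletableFuture.supplyAsync(() -> fn.apply(record), pool));
        }
        List<B> results = new ArrayList<>(batch.size());
        for (CompletableFuture<B> future : futures) {
            // Joining in submission order keeps the output order equal to the input
            // order, regardless of which record finishes first.
            results.add(future.join());
        }
        return results;
    }
}
```

The records may complete in any order on the pool; only the collection step is ordered, so the batch takes as long as its slowest record rather than the sum of all records.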



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
