[ 
https://issues.apache.org/jira/browse/SPARK-54039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim updated SPARK-54039:
---------------------------------
    Fix Version/s:     (was: 4.1.0)

> SS | `Add TaskID to KafkaDataConsumer logs for better correlation between 
> Spark tasks and Kafka operations`
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-54039
>                 URL: https://issues.apache.org/jira/browse/SPARK-54039
>             Project: Spark
>          Issue Type: Improvement
>          Components: Structured Streaming
>    Affects Versions: 3.4.1, 3.5.0, 3.5.2, 4.0.0
>            Reporter: Nishanth
>            Priority: Major
>              Labels: pull-request-available
>
> Kafka consumer logs generated by {{KafkaDataConsumer}} do not include the 
> Spark {{TaskID}} or task context information.
> This makes it difficult to correlate specific Kafka fetch or poll operations 
> with the corresponding Spark tasks during debugging and performance 
> investigations, especially when multiple tasks consume from the same Kafka 
> topic partitions concurrently on different executors.
> h3. *Current Behavior*
> The Kafka source's consumer logging does not attach {{TaskID}} information 
> to the messages emitted by the {{KafkaDataConsumer}} class.
> *Example log output:*
> {code:java}
> 14:30:16 INFO Executor: Running task 83.0 in stage 32.0 (TID 4188)
> 14:30:16 INFO KafkaBatchReaderFactoryWithRowBytesAccumulator: Creating Kafka 
> reader... taskId=4188 partitionId=83
> 14:35:19 INFO KafkaDataConsumer: From Kafka topicPartition=******* 
> groupId=******* read 12922 records through 80 polls (polled out 12979 
> records), taking 299911.706198 ms, during time span of 303617.473973 ms  <-- 
> No Task Context Here
> 14:35:19 INFO Executor: Finished task 83.0 in stage 32.0 (TID 4188){code}
> Although the Executor logs include the task context (TID), the 
> KafkaDataConsumer logs do not, making it difficult to correlate Kafka read 
> durations or latency with the specific Spark tasks that executed them. This 
> limits visibility into which part of a job spends the most time consuming 
> from Kafka.
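> A minimal sketch of one way to achieve this, assuming the consumer-side 
> logging can read the running task's context via {{TaskContext.get()}}; the 
> object name {{TaskAwareKafkaLogging}}, its methods, and the exact prefix 
> format below are illustrative assumptions, not the actual patch:
> {code:scala}
> import org.apache.spark.TaskContext
> import org.slf4j.LoggerFactory
>
> // Illustrative sketch: prefixes consumer log lines with the Spark task
> // context when one is available. All names here are hypothetical.
> object TaskAwareKafkaLogging {
>   private val log = LoggerFactory.getLogger(getClass)
>
>   // TaskContext.get() returns null outside a running task (e.g. on the
>   // driver), so guard with Option before reading the task attempt id (TID).
>   private def taskInfo: String =
>     Option(TaskContext.get())
>       .map(tc => s"TID=${tc.taskAttemptId()} partitionId=${tc.partitionId()} ")
>       .getOrElse("")
>
>   def logRead(topicPartition: String, records: Long, polls: Long): Unit =
>     log.info(s"${taskInfo}From Kafka topicPartition=$topicPartition " +
>       s"read $records records through $polls polls")
> }
> {code}
> With a prefix like this, the {{KafkaDataConsumer}} line in the example above 
> would carry the same TID (4188) that the surrounding Executor lines already 
> log, so Kafka read durations could be matched to tasks with a simple grep.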



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
