Nishanth created SPARK-54039:
--------------------------------

             Summary: Add TaskID to KafkaDataConsumer logs for better 
correlation between Spark tasks and Kafka operations
                 Key: SPARK-54039
                 URL: https://issues.apache.org/jira/browse/SPARK-54039
             Project: Spark
          Issue Type: Improvement
          Components: Structured Streaming
    Affects Versions: 3.5.6
            Reporter: Nishanth


Currently, the Kafka consumer logs generated by {{KafkaDataConsumer}} (in 
{{org.apache.spark.sql.kafka010}}) do not include the Spark {{TaskID}} or 
task context information.

This makes it difficult to correlate specific Kafka fetch or poll operations 
with the corresponding Spark tasks during debugging and performance 
investigations—especially when multiple tasks consume from the same Kafka topic 
partitions concurrently on different executors.

Adding the Spark {{TaskID}} to the Kafka consumer log statements would 
significantly improve traceability and observability. It would allow engineers 
to:
 * Identify which Spark task triggered a specific Kafka poll or fetch

 * Diagnose task-level Kafka latency or deserialization bottlenecks

 * Simplify debugging of offset commit or consumer lag anomalies when multiple 
concurrent consumers are running
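
As a rough sketch of one possible approach (the helper name and log wiring below 
are illustrative, not the actual {{KafkaDataConsumer}} internals), the consumer 
could read the task attempt ID from {{TaskContext.get()}} at the point where it 
emits its summary line:
{code:scala}
import org.apache.spark.TaskContext

// Illustrative helper (hypothetical name): build a log prefix from the current
// task context. TaskContext.get() returns null outside of a running task
// (e.g. on the driver), so the prefix falls back to an empty string there.
object TaskLogPrefix {
  def apply(): String =
    Option(TaskContext.get())
      .map(tc => s"TID=${tc.taskAttemptId()} stage=${tc.stageId()} partition=${tc.partitionId()} ")
      .getOrElse("")
}

// Hypothetical use in the existing KafkaDataConsumer summary message
// (identifiers in the interpolation are placeholders, not the real field names):
// logInfo(s"${TaskLogPrefix()}From Kafka topicPartition=$topicPartition groupId=$groupId ...")
{code}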

h3. *Current Behavior*

The current logging in the {{KafkaDataConsumer}} class does not include 
{{TaskID}} or task context information.

*Example log output:*
{code:java}
14:30:16 INFO Executor: Running task 83.0 in stage 32.0 (TID 4188)
14:30:16 INFO KafkaBatchReaderFactoryWithRowBytesAccumulator: Creating Kafka 
reader... taskId=4188 partitionId=83
14:35:19 INFO KafkaDataConsumer: From Kafka topicPartition=******* 
groupId=******* read 12922 records through 80 polls (polled out 12979 records), 
taking 299911.706198 ms, during time span of 303617.473973 ms  <-- No Task 
Context Here
14:35:19 INFO Executor: Finished task 83.0 in stage 32.0 (TID 4188){code}
Although the Executor logs include the task context (TID), the 
{{KafkaDataConsumer}} logs do not, making it difficult to correlate Kafka read 
durations or latency with the specific Spark tasks that executed them. This 
limits visibility into which part of a job spent the most time consuming from 
Kafka.
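
For illustration only, the same summary line with a task-context prefix could 
look like the following (the exact field names and format would be settled in 
the actual change):
{code:java}
14:35:19 INFO KafkaDataConsumer: TID=4188 stage=32 partition=83 From Kafka topicPartition=******* groupId=******* read 12922 records through 80 polls (polled out 12979 records), taking 299911.706198 ms, during time span of 303617.473973 ms
{code}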
h3. *Impact*

The absence of task context in {{KafkaDataConsumer}} logs creates several 
challenges:
 * *Performance Debugging:* Unable to pinpoint which specific Spark tasks 
experience slow Kafka consumption.

 * *Support/Troubleshooting:* Difficult to correlate customer incidents or 
partition-level delays with specific task executions.

 * *Metrics Analysis:* Cannot easily aggregate Kafka consumption metrics per 
task for performance profiling or benchmarking.

 * *Log Analysis:* Requires complex log correlation or pattern matching across 
multiple executors to map Kafka operations to specific tasks.


