[ https://issues.apache.org/jira/browse/KAFKA-19238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
karthickthavasiraj09 updated KAFKA-19238:
-----------------------------------------
Description:

We had an issue while reading data from Azure Event Hubs through Azure Databricks. After working with the Microsoft team, they confirmed that the problem is on the Kafka side. I'm sharing the debug logs provided by the Microsoft team below.

The job took 49m, and we see that task 143 takes 46 mins _(out of the job duration of 49m 30s)_:

_25/04/15 14:21:44 INFO KafkaBatchReaderFactoryWithRowBytesAccumulator: Creating Kafka reader topicPartition=<topic-name>-0 fromOffset=16511904 untilOffset=16658164, for query queryId=dd660d4d-05cc-4a8e-8f93-d202ec78fec3 runId=af7eb711-7310-4788-85b7-0977fc0756b7 batchId=73 taskId=143 partitionId=0_

_25/04/15 15:07:21 INFO KafkaDataConsumer: From Kafka topicPartition=<topic-name>-0 groupId=spark-kafka-source-da79e0fc-8ee5-40f5-a127-7b31766b3022--1737876659-executor read 146260 records through 4314 polls (polled out 146265 records), taking 2526471821132 nanos, during time span of 2736294068630 nanos._
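To put those nanosecond figures in more familiar units (this is just a unit conversion of the numbers in the log line above, nothing beyond them): the consumer spent roughly 42.1 minutes inside Kafka polls out of a roughly 45.6-minute task span, i.e. about 92% of the task was spent waiting on Kafka.

{code:scala}
// Unit conversion of the figures reported in the KafkaDataConsumer log line above.
object PollTimeShare extends App {
  val pollNanos = 2526471821132L  // "taking ... nanos"              -> time spent inside Kafka polls
  val spanNanos = 2736294068630L  // "during time span of ... nanos" -> total task span
  val toMinutes = (n: Long) => n / 1e9 / 60
  println(f"polls: ${toMinutes(pollNanos)}%.1f min of ${toMinutes(spanNanos)}%.1f min total")
  println(f"share of task waiting on Kafka: ${100.0 * pollNanos / spanNanos}%.1f%%")
}
// prints: polls: 42.1 min of 45.6 min total
//         share of task waiting on Kafka: 92.3%
{code}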
And this task is waiting for Kafka to respond for most of the time, as we can see from the threads:

_Executor task launch worker for task 0.0 in stage 147.0 (TID 143)_
_sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)_
_sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)_
_sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)_
_sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86) - locked sun.nio.ch.EPollSelectorImpl@54f8f9b6_
_sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)_
_kafkashaded.org.apache.kafka.common.network.Selector.select(Selector.java:874)_
_kafkashaded.org.apache.kafka.common.network.Selector.poll(Selector.java:465)_
_kafkashaded.org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:560)_
_kafkashaded.org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:280)_
_kafkashaded.org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:251)_
_kafkashaded.org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:242)_
_kafkashaded.org.apache.kafka.clients.consumer.KafkaConsumer.position(KafkaConsumer.java:1759)_
_kafkashaded.org.apache.kafka.clients.consumer.KafkaConsumer.position(KafkaConsumer.java:1717)_
_org.apache.spark.sql.kafka010.consumer.InternalKafkaConsumer.getAvailableOffsetRange(KafkaDataConsumer.scala:110)_
_org.apache.spark.sql.kafka010.consumer.InternalKafkaConsumer.fetch(KafkaDataConsumer.scala:84)_
_org.apache.spark.sql.kafka010.consumer.KafkaDataConsumer.$anonfun$fetchData$1(KafkaDataConsumer.scala:593)_
_org.apache.spark.sql.kafka010.consumer.KafkaDataConsumer$$Lambda$4556/228899458.apply(Unknown Source)_
_org.apache.spark.sql.kafka010.consumer.KafkaDataConsumer.timeNanos(KafkaDataConsumer.scala:696)_
_org.apache.spark.sql.kafka010.consumer.KafkaDataConsumer.fetchData(KafkaDataConsumer.scala:593)_
_org.apache.spark.sql.kafka010.consumer.KafkaDataConsumer.fetchRecord(KafkaDataConsumer.scala:517)_
_org.apache.spark.sql.kafka010.consumer.KafkaDataConsumer.$anonfun$get$1(KafkaDataConsumer.scala:325)_
_org.apache.spark.sql.kafka010.consumer.KafkaDataConsumer$$Lambda$4491/342980175.apply(Unknown Source)_
_org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77)_
_org.apache.spark.sql.kafka010.consumer.KafkaDataConsumer.runUninterruptiblyIfPossible(KafkaDataConsumer.scala:686)_
_org.apache.spark.sql.kafka010.consumer.KafkaDataConsumer.get(KafkaDataConsumer.scala:301)_
_org.apache.spark.sql.kafka010.KafkaBatchPartitionReader.next(KafkaBatchPartitionReader.scala:106)_
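The frames above show the task blocked inside KafkaConsumer.position() (reached via InternalKafkaConsumer.getAvailableOffsetRange), i.e. the Spark connector is waiting for the broker to answer rather than doing work locally. A bare KafkaConsumer against the same topic-partition can help tell whether that latency comes from the Event Hubs Kafka endpoint itself or from the Databricks job. The sketch below is only illustrative: the namespace, connection string and topic are placeholders, and the SASL_SSL/PLAIN settings are the standard ones documented for Event Hubs' Kafka surface, not values taken from our job configuration.

{code:scala}
import java.time.Duration
import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition

object PositionLatencyCheck extends App {
  // Placeholders only - fill in the real namespace, connection string and topic.
  val props = new Properties()
  props.put("bootstrap.servers", "<namespace>.servicebus.windows.net:9093")
  props.put("security.protocol", "SASL_SSL")
  props.put("sasl.mechanism", "PLAIN")
  props.put("sasl.jaas.config",
    """org.apache.kafka.common.security.plain.PlainLoginModule required username="$ConnectionString" password="<event-hubs-connection-string>";""")
  props.put("group.id", "kafka-19238-latency-check")
  props.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer")
  props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer")

  val consumer = new KafkaConsumer[Array[Byte], Array[Byte]](props)
  val tp = new TopicPartition("<topic-name>", 0)
  consumer.assign(Collections.singletonList(tp))
  consumer.seek(tp, 16511904L) // fromOffset reported in the reader log above

  // Time the same calls the stalled task is blocked in: position() and poll().
  val t0 = System.nanoTime()
  val pos = consumer.position(tp)
  val t1 = System.nanoTime()
  val records = consumer.poll(Duration.ofSeconds(10))
  val t2 = System.nanoTime()
  println(f"position=$pos took ${(t1 - t0) / 1e6}%.1f ms; poll returned ${records.count()} records in ${(t2 - t1) / 1e6}%.1f ms")
  consumer.close()
}
{code}

If position() and poll() are consistently slow here as well, the stall is on the Event Hubs/network side; if they are fast, the problem is more likely in how the job's consumer is configured.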
was:

We had an issue while reading data from Azure Event Hubs through Azure Databricks. After working with the Microsoft team, they confirmed that the problem is on the Kafka side. I'm sharing the debug logs provided by the Microsoft team below.

The good job was shared on March 20th, so we would not be able to download the backend logs _(as it's > 20 days old)_. But for the bad job: [https://adb-2632737963103362.2.azuredatabricks.net/jobs/911028616577296/runs/939144212532710?o=2632737963103362] that took 49m, we see that task 143 takes 46 mins _(out of the job duration of 49m 30s)_:

_25/04/15 14:21:44 INFO KafkaBatchReaderFactoryWithRowBytesAccumulator: Creating Kafka reader topicPartition=voyager-prod-managedsql-cus.order.orders.orderitem-0 fromOffset=16511904 untilOffset=16658164, for query queryId=dd660d4d-05cc-4a8e-8f93-d202ec78fec3 runId=af7eb711-7310-4788-85b7-0977fc0756b7 batchId=73 taskId=143 partitionId=0_

_25/04/15 15:07:21 INFO KafkaDataConsumer: From Kafka topicPartition=voyager-prod-managedsql-cus.order.orders.orderitem-0 groupId=spark-kafka-source-da79e0fc-8ee5-40f5-a127-7b31766b3022--1737876659-executor read 146260 records through 4314 polls (polled out 146265 records), taking 2526471821132 nanos, during time span of 2736294068630 nanos._

And this task is waiting for Kafka to respond for most of the time, as we can see from the threads:

_Executor task launch worker for task 0.0 in stage 147.0 (TID 143)_
_sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)_
_sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)_
_sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)_
_sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86) - locked sun.nio.ch.EPollSelectorImpl@54f8f9b6_
_sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)_
_kafkashaded.org.apache.kafka.common.network.Selector.select(Selector.java:874)_
_kafkashaded.org.apache.kafka.common.network.Selector.poll(Selector.java:465)_
_kafkashaded.org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:560)_
_kafkashaded.org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:280)_
_kafkashaded.org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:251)_
_kafkashaded.org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:242)_
_kafkashaded.org.apache.kafka.clients.consumer.KafkaConsumer.position(KafkaConsumer.java:1759)_
_kafkashaded.org.apache.kafka.clients.consumer.KafkaConsumer.position(KafkaConsumer.java:1717)_
_org.apache.spark.sql.kafka010.consumer.InternalKafkaConsumer.getAvailableOffsetRange(KafkaDataConsumer.scala:110)_
_org.apache.spark.sql.kafka010.consumer.InternalKafkaConsumer.fetch(KafkaDataConsumer.scala:84)_
_org.apache.spark.sql.kafka010.consumer.KafkaDataConsumer.$anonfun$fetchData$1(KafkaDataConsumer.scala:593)_
_org.apache.spark.sql.kafka010.consumer.KafkaDataConsumer$$Lambda$4556/228899458.apply(Unknown Source)_
_org.apache.spark.sql.kafka010.consumer.KafkaDataConsumer.timeNanos(KafkaDataConsumer.scala:696)_
_org.apache.spark.sql.kafka010.consumer.KafkaDataConsumer.fetchData(KafkaDataConsumer.scala:593)_
_org.apache.spark.sql.kafka010.consumer.KafkaDataConsumer.fetchRecord(KafkaDataConsumer.scala:517)_
_org.apache.spark.sql.kafka010.consumer.KafkaDataConsumer.$anonfun$get$1(KafkaDataConsumer.scala:325)_
_org.apache.spark.sql.kafka010.consumer.KafkaDataConsumer$$Lambda$4491/342980175.apply(Unknown Source)_
_org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77)_
_org.apache.spark.sql.kafka010.consumer.KafkaDataConsumer.runUninterruptiblyIfPossible(KafkaDataConsumer.scala:686)_
_org.apache.spark.sql.kafka010.consumer.KafkaDataConsumer.get(KafkaDataConsumer.scala:301)_
_org.apache.spark.sql.kafka010.KafkaBatchPartitionReader.next(KafkaBatchPartitionReader.scala:106)_


> We're facing issue in Kafka while reading data from Azure event hubs through Azure Databricks
> ---------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-19238
>                 URL: https://issues.apache.org/jira/browse/KAFKA-19238
>             Project: Kafka
>          Issue Type: Test
>          Components: connect, consumer, network
>    Affects Versions: 3.3.1
>        Environment: Production
>            Reporter: karthickthavasiraj09
>            Priority: Blocker
>              Labels: BLOCKER, important
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)