Hi: I am working with Spark Structured Streaming (2.2.1), reading data from Kafka (0.11).
I need to aggregate the data ingested every minute, and I am using spark-shell at the moment. The message ingestion rate is approximately 500k records/second. During some trigger intervals (1 minute), especially when the streaming query first starts, all tasks finish within 20 seconds, but during other triggers it takes up to 90 seconds.

I have tried reducing the number of partitions (from 300 to approximately 100) to reduce the number of Kafka consumers, but that has not helped. I also tried setting kafkaConsumer.pollTimeoutMs to 30 seconds, but then I see a lot of `java.util.concurrent.TimeoutException: Cannot fetch record for offset` errors. So I wanted to see if anyone has any thoughts/recommendations. Thanks
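For context, the query looks roughly like the sketch below (broker and topic names are placeholders, and `maxOffsetsPerTrigger` is an extra knob I have been considering to cap the batch size per trigger; my actual aggregation is more involved than a simple count):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder.appName("kafka-minute-agg").getOrCreate()
import spark.implicits._

// Read from Kafka; broker and topic names here are placeholders.
val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")   // placeholder
  .option("subscribe", "events")                        // placeholder
  .option("kafkaConsumer.pollTimeoutMs", "30000")       // raised poll timeout (triggers the TimeoutException for me)
  .option("maxOffsetsPerTrigger", "30000000")           // optional cap: ~500k/s * 60s per trigger
  .load()

// One-minute tumbling-window count as a stand-in for my real aggregation.
val counts = events
  .select($"timestamp")
  .withWatermark("timestamp", "2 minutes")
  .groupBy(window($"timestamp", "1 minute"))
  .count()

val query = counts.writeStream
  .outputMode("update")
  .format("console")
  .trigger(Trigger.ProcessingTime("1 minute"))
  .start()

query.awaitTermination()
```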