Re: Spark stuck at removing broadcast variable

2020-04-18 Thread Waleed Fateem
This might be obvious, but just checking anyway: did you confirm whether all of the messages have already been consumed by Spark? If so, I wouldn't expect much to happen unless new data comes into your Kafka topic. If you're a hundred percent sure that there's still plenty

Re: Spark stuck at removing broadcast variable

2020-04-18 Thread Sean Owen
I don't think that means it's stuck on removing something; it was removed. Not sure what it is waiting on - more data perhaps?

On Sat, Apr 18, 2020 at 2:22 PM Alchemist wrote:
>
> I am running a simple Spark structured streaming application that is pulling
> data from a Kafka Topic. I have a

Spark stuck at removing broadcast variable

2020-04-18 Thread Alchemist
I am running a simple Spark structured streaming application that is pulling data from a Kafka Topic. I have a Kafka Topic with nearly 1000 partitions. I am running this app on a 6-node EMR cluster with 4 cores and 16GB RAM. I observed that Spark is trying to pull data from all 1024 Kafka

Re: Spark structured streaming - performance tuning

2020-04-18 Thread Alex Ott
Just to clarify, since I didn't write this explicitly in my answer: when you're working with Kafka, every Kafka partition is mapped to a Spark partition, and in Spark every partition is mapped to a task. But you can use `coalesce` to decrease the number of Spark partitions, so you'll have less
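
To make the partition-to-task mapping concrete, here is a rough back-of-the-envelope sketch in plain Python. It is only a toy model, not Spark code: the helper names `tasks_per_batch` and `scheduling_waves` are made up for illustration, the 1024-partition and 6-node figures are borrowed from the "Spark stuck at removing broadcast variable" post, and it assumes 4 cores per node with every core available to executors, which the post does not confirm.

```python
import math


def tasks_per_batch(kafka_partitions, coalesce_to=None):
    """Toy model: one Spark partition per Kafka partition, one task per Spark partition."""
    spark_partitions = kafka_partitions
    if coalesce_to is not None:
        # coalesce can only reduce the partition count; asking for more is a no-op
        spark_partitions = min(spark_partitions, coalesce_to)
    return spark_partitions


def scheduling_waves(tasks, total_cores):
    """Rounds of task scheduling needed if each core runs one task at a time."""
    return math.ceil(tasks / total_cores)


# Assumption: 6 nodes x 4 cores each, all cores usable by Spark executors.
total_cores = 6 * 4

default_tasks = tasks_per_batch(1024)
coalesced_tasks = tasks_per_batch(1024, coalesce_to=total_cores)

print(default_tasks, scheduling_waves(default_tasks, total_cores))      # 1024 43
print(coalesced_tasks, scheduling_waves(coalesced_tasks, total_cores))  # 24 1
```

In a real job the equivalent step would be calling `coalesce(n)` on the streaming DataFrame after the Kafka read. Whether fewer, larger tasks actually help depends on per-task overhead and data volume, so treat the numbers above as an illustration rather than a tuning recommendation.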