Handling load distribution and addressing data skew.

2024-08-16 Thread Karthick
Hi Team, I'm using repartition and sortWithinPartitions to maintain field-based ordering across partitions, but I'm facing data skewness among the partitions. I have 96 partitions, and I'm working with 500 distinct keys. While reviewing the Spark UI, I noticed that a few partitions are underutiliz
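One possible contributor to the uneven partitions can be sketched without Spark. Because 500 is not a multiple of 96, hashing 500 distinct keys into 96 partitions cannot give every partition the same number of keys, even before per-key row counts are considered. The sketch below is plain Python, not Spark: `zlib.crc32` is only a deterministic stand-in for Spark's own partitioning hash, and the key names are hypothetical.

```python
import zlib
from collections import Counter

NUM_PARTITIONS = 96  # partition count from the post
NUM_KEYS = 500       # distinct key count from the post


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition index with a deterministic hash.

    Assumption: crc32 stands in for Spark's real hash function; the
    exact assignment in Spark will differ, the unevenness will not.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# Hypothetical key names; the real keys are not shown in the thread.
keys = [f"key_{i}" for i in range(NUM_KEYS)]

# Count how many distinct keys land in each partition.
keys_per_partition = Counter(partition_for(k, NUM_PARTITIONS) for k in keys)
counts = [keys_per_partition.get(p, 0) for p in range(NUM_PARTITIONS)]

print("fewest keys in a partition:", min(counts))
print("most keys in a partition:", max(counts))
```

If the rows per key are also unequal, the imbalance in rows per partition can be larger than the imbalance in keys per partition shown here.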

Re: Redundant(?) shuffle after join

2024-08-16 Thread Mich Talebzadeh
Hi Shay, Let me address the points you raised using the STAR methodology. I apologize if it sounds a bit formal, but I find it effective for clarity. *Situation* You encountered an issue while working with a Spark DataFrame where a shuffle was unexpectedly triggered during the application of a w

Re: Redundant(?) shuffle after join

2024-08-16 Thread Shay Elbaz
Hi Mich, thank you for answering - much appreciated. "This can cause uneven distribution of data, triggering a shuffle for the window function." Could you elaborate on the mechanism that can "trigger a shuffle for the window function"? I'm not familiar with it. (or are you referring to AQE?) In an

Issue with pyspark : Add custom shutdown hook

2024-08-16 Thread aarushi agarwal
Hi Team, I am trying to add a shutdown hook to a PySpark script using `atexit`. However, it seems that whenever I send a SIGTERM to the spark-submit process, it triggers the JVM shutdown hook first, which results in terminating the Spark context. I didn't understand in what order the Python
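For the Python side only: `atexit` handlers are not called when the interpreter is killed by a signal that Python does not handle, so a common pattern is to convert SIGTERM into a normal exit. The sketch below is plain Python with hypothetical function names (`cleanup`, `handle_sigterm`, `install_hooks`). It does not use Spark and does not settle the order in which the JVM shutdown hook runs under spark-submit, which is the open question in this thread.

```python
import atexit
import signal
import sys


def cleanup() -> None:
    # Hypothetical cleanup work; replace with the script's own teardown.
    print("atexit cleanup ran")


def handle_sigterm(signum, frame) -> None:
    # Raise SystemExit so the interpreter shuts down normally and
    # runs the handlers registered with atexit.
    sys.exit(128 + signum)


def install_hooks() -> None:
    # signal.signal must be called from the main thread.
    atexit.register(cleanup)
    signal.signal(signal.SIGTERM, handle_sigterm)


if __name__ == "__main__":
    install_hooks()
    # The script's real work would go here.
```

This only helps if the SIGTERM actually reaches the Python process; whether it does when the signal is sent to spark-submit depends on how the job is launched.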