Hamel Ajay Kothari created SPARK-18479:
------------------------------------------
Summary: spark.sql.shuffle.partitions defaults should be a prime number
Key: SPARK-18479
URL: https://issues.apache.org/jira/browse/SPARK-18479
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.0.2
Reporter: Hamel Ajay Kothari

For most hash-bucketing use cases, it is my understanding that a prime value such as 199 would be a safer choice than the existing value of 200. A non-prime partition count makes collisions much more likely when the hash function isn't great.

Consider the case where you've got a Timestamp or Long column holding millisecond timestamps at midnight each day. With the default value for spark.sql.shuffle.partitions, you'll end up with 120 of the 200 partitions being completely empty.

Looking around, there doesn't seem to be a documented reason why we chose 200, so I don't see a huge risk in changing it.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
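The collision arithmetic behind the report can be sketched with a plain modulo partitioner. This is a simplification: Spark SQL's shuffle actually hashes rows with Murmur3 (and RDD HashPartitioner uses the key's hashCode), so the real skew differs from this model, but the divisibility issue is the same. A day is 86,400,000 ms, which shares the factor 200 with the default partition count, while 199 is prime and coprime with it:

```python
# Hypothetical sketch, NOT Spark's actual partitioner: assumes the
# partition is simply `key % num_partitions`. It illustrates why keys
# that are multiples of 86,400,000 ms (midnight timestamps) collide
# badly under 200 partitions but spread out under a prime like 199.

MS_PER_DAY = 86_400_000  # midnight timestamps are exact multiples of this

def partitions_used(num_partitions: int, days: int = 365) -> int:
    """Count how many partitions receive at least one key when the keys
    are midnight timestamps for `days` consecutive days."""
    timestamps = (d * MS_PER_DAY for d in range(days))
    return len({t % num_partitions for t in timestamps})

# 200 divides 86,400,000, so every midnight timestamp lands in partition 0.
print(partitions_used(200))  # -> 1
# 199 is coprime with 86,400,000, so the keys cycle through all partitions.
print(partitions_used(199))  # -> 199
```

Under this model the skew is governed by gcd(86,400,000, num_partitions): only num_partitions / gcd partitions can ever be hit, which is why a prime count is robust against periodic keys regardless of their period.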