[ https://issues.apache.org/jira/browse/SPARK-18479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon resolved SPARK-18479.
----------------------------------
    Resolution: Won't Fix

I am resolving this assuming there is no explicit objection on ^.

> spark.sql.shuffle.partitions defaults should be a prime number
> --------------------------------------------------------------
>
>                 Key: SPARK-18479
>                 URL: https://issues.apache.org/jira/browse/SPARK-18479
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.0.2
>            Reporter: Hamel Ajay Kothari
>            Priority: Minor
>
> For most hash bucketing use cases it is my understanding that a prime value,
> such as 199, would be a safer value than the existing value of 200. Using a
> non-prime value makes the likelihood of collisions much higher when the hash
> function isn't great.
> Consider the case where you've got a Timestamp or Long column with
> millisecond times at midnight each day. With the default value for
> spark.sql.shuffle.partitions, you'll end up with 120/200 partitions being
> completely empty.
> Looking around there doesn't seem to be a good reason why we chose 200 so I
> don't see a huge risk in changing it.
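
The collision argument in the quoted description can be illustrated with a small sketch. It assumes the weakest possible hash, the identity function, so a key's bucket is simply the key modulo the bucket count. That is an assumption made for illustration and not Spark SQL's actual shuffle hashing, so the numbers will not match the 120/200 figure reported above. The helper names `midnight_timestamps` and `used_buckets` are hypothetical.

```python
# Illustrative sketch only: an identity "hash" stands in for a weak hash function.
# Spark SQL's real shuffle hashing differs, so these counts will not match Spark.

MS_PER_DAY = 24 * 60 * 60 * 1000  # 86,400,000 milliseconds


def midnight_timestamps(num_days):
    """Epoch-millisecond timestamps at midnight UTC for consecutive days."""
    return [day * MS_PER_DAY for day in range(num_days)]


def used_buckets(keys, num_buckets):
    """Count distinct buckets hit when bucket = key mod num_buckets."""
    return len({key % num_buckets for key in keys})


if __name__ == "__main__":
    keys = midnight_timestamps(365)
    for n in (200, 199):
        print(f"{n} buckets: {used_buckets(keys, n)} non-empty")
```

Under this assumption every key falls into a single bucket when there are 200 buckets, because 86,400,000 is a multiple of 200, while 199 shares no factor with it and all 199 buckets are hit. A hash function that mixes its input well makes the result depend far less on whether the bucket count is prime.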