[ https://issues.apache.org/jira/browse/SPARK-9872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen updated SPARK-9872:
-----------------------------
    Component/s: SQL

> Allow passing of 'numPartitions' to DataFrame joins
> ---------------------------------------------------
>
>                 Key: SPARK-9872
>                 URL: https://issues.apache.org/jira/browse/SPARK-9872
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.4.1
>            Reporter: Al M
>            Priority: Minor
>
> When I join two normal RDDs, I can set the number of shuffle partitions via
> the 'numPartitions' argument. When I join two DataFrames, I do not have this
> option.
> My Spark job loads 2 large files and 2 small files. When I perform a join,
> Spark uses "spark.sql.shuffle.partitions" to determine the number of
> partitions, so the join of my small files uses exactly the same number of
> partitions as the join of my large files.
> I can either use a low number of partitions and run out of memory on my
> large join, or use a high number of partitions and have my small join take
> far too long.
> If we could specify the number of shuffle partitions in a DataFrame join,
> as in an RDD join, this would not be an issue.
> My ideal long-term solution would be dynamic partition determination, as
> described in SPARK-4630; however, I appreciate that it is not particularly
> easy to do.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
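To illustrate why a per-join partition count matters, here is a minimal, self-contained sketch (plain Python, not Spark itself) of a hash-partitioned shuffle join. The `num_partitions` argument plays the role of RDD `join`'s `numPartitions` or the `spark.sql.shuffle.partitions` setting: it controls how many buckets (and hence parallel tasks) the join keys are spread across, while the join result itself is independent of the value chosen. The function and variable names are illustrative, not Spark API.

```python
def shuffle_join(left, right, num_partitions):
    """Join two lists of (key, value) pairs using hash partitioning.

    num_partitions controls how many independent buckets the keys are
    routed to, analogous to numPartitions in an RDD join.
    """
    # Shuffle phase: route each record to a partition by hashing its key.
    left_parts = [[] for _ in range(num_partitions)]
    right_parts = [[] for _ in range(num_partitions)]
    for k, v in left:
        left_parts[hash(k) % num_partitions].append((k, v))
    for k, v in right:
        right_parts[hash(k) % num_partitions].append((k, v))

    # Join phase: each partition pair is joined independently
    # (in Spark, each would be one shuffle task).
    result = []
    for lp, rp in zip(left_parts, right_parts):
        rmap = {}
        for k, v in rp:
            rmap.setdefault(k, []).append(v)
        for k, v in lp:
            for w in rmap.get(k, []):
                result.append((k, (v, w)))
    return result

left = [("a", 1), ("b", 2), ("c", 3)]
right = [("a", 10), ("c", 30)]
# A small join needs only a few partitions; a huge one may need thousands --
# which is exactly why a single global setting is a poor fit for both.
print(sorted(shuffle_join(left, right, num_partitions=4)))
```

The output is `[('a', (1, 10)), ('c', (3, 30))]` regardless of `num_partitions`; only the degree of parallelism (and per-partition memory footprint) changes, which is the knob the reporter is asking to expose per DataFrame join.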