[ https://issues.apache.org/jira/browse/SPARK-9872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14701158#comment-14701158 ]

Al M commented on SPARK-9872:
-----------------------------

I would also be happy if we just got the partition count from the parents.
That would be even better than setting it manually, since that is all my code
would be doing anyway.

Right now every join uses the fixed "spark.sql.shuffle.partitions" default,
which isn't much use when my Spark application works with lots of files of
very different sizes.
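
Roughly what my workaround amounts to today is the sketch below (file paths,
column names and the read format are illustrative, not from my real job; the
sum of the parents' partition counts is just one possible heuristic):

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("join-partition-workaround"))
val sqlContext = new SQLContext(sc)

// Illustrative inputs; in practice these are the large / small files the job reads.
val left  = sqlContext.read.parquet("/data/large1")
val right = sqlContext.read.parquet("/data/large2")

// Derive the shuffle partition count from the parents instead of relying on
// the fixed spark.sql.shuffle.partitions default, and set it before the join
// is executed so the shuffle picks it up.
val parentPartitions = left.rdd.partitions.length + right.rdd.partitions.length
sqlContext.setConf("spark.sql.shuffle.partitions", parentPartitions.toString)

val joined = left.join(right, left("id") === right("id"))
{code}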

> Allow passing of 'numPartitions' to DataFrame joins
> ---------------------------------------------------
>
>                 Key: SPARK-9872
>                 URL: https://issues.apache.org/jira/browse/SPARK-9872
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.4.1
>            Reporter: Al M
>            Priority: Minor
>
> When I join two normal RDDs, I can set the number of shuffle partitions via
> the 'numPartitions' argument.  When I join two DataFrames, I do not have this
> option.
>
> My Spark job loads 2 large files and 2 small files.  When I perform a join,
> Spark uses "spark.sql.shuffle.partitions" to determine the number of
> partitions.  This means the join on my small files uses exactly the same
> number of partitions as the join on my large files.
>
> I can either use a low number of partitions and run out of memory on my large
> join, or use a high number of partitions and have my small join take far too
> long.
>
> If we were able to specify the number of shuffle partitions in a DataFrame
> join, as we can in an RDD join, this would not be an issue.
>
> My long-term ideal solution would be dynamic partition determination as
> described in SPARK-4630.  However, I appreciate that it is not particularly
> easy to do.
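
For reference, the asymmetry described above looks roughly like this against
the 1.4 APIs (a sketch only; the sample data and column names are
illustrative):

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("join-partition-asymmetry"))
val sqlContext = new SQLContext(sc)

// Pair-RDD join: the shuffle partition count can be chosen per join.
val rddA = sc.parallelize(Seq(1 -> "a", 2 -> "b"))
val rddB = sc.parallelize(Seq(1 -> "x", 2 -> "y"))
val rddJoined = rddA.join(rddB, 2000)   // 2000 partitions for this join only

// DataFrame join: no per-join argument; every join falls back to the global
// spark.sql.shuffle.partitions setting (default 200).
val dfA = sqlContext.createDataFrame(Seq((1, "a"), (2, "b"))).toDF("id", "va")
val dfB = sqlContext.createDataFrame(Seq((1, "x"), (2, "y"))).toDF("id", "vb")
val dfJoined = dfA.join(dfB, "id")
{code}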


