[ 
https://issues.apache.org/jira/browse/SPARK-5997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16792964#comment-16792964
 ] 

nirav patel edited comment on SPARK-5997 at 3/14/19 6:56 PM:
-------------------------------------------------------------

Adding another possible use case for this ask - I am hitting an 
IllegalArgumentException: Size exceeds Integer.MAX_VALUE when trying to 
write an unpartitioned DataFrame to Parquet. The error occurs because a data 
block exceeds 2 GB in size before being written to disk. The solution is to 
repartition the DataFrame (Dataset). I can do that, but I don't want to cause 
a shuffle when I increase the number of partitions with the repartition API.
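A minimal sketch of the arithmetic behind this use case (helper name and example sizes are illustrative, not Spark API): given per-partition sizes, compute how many pieces each partition must be split into so every resulting block stays under the 2 GB (Integer.MAX_VALUE) limit that triggers the exception above.

```python
import math

# Spark blocks are backed by structures indexed by a Java int, so any
# single block must stay under Integer.MAX_VALUE (~2 GB) bytes.
MAX_BLOCK_BYTES = 2**31 - 1

def splits_needed(partition_sizes_bytes):
    """For each partition, the number of pieces it must be cut into so
    that every resulting block fits under the 2 GB limit. A shuffle-free
    repartition-up could split each oversized partition in place instead
    of redistributing all rows across the cluster.
    """
    return [max(1, math.ceil(size / MAX_BLOCK_BYTES))
            for size in partition_sizes_bytes]

# e.g. a 5 GB partition needs 3 pieces; a 1 GB partition stays whole
sizes = [5 * 2**30, 1 * 2**30]
print(splits_needed(sizes))  # [3, 1]
```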


was (Author: tenstriker):
Adding another possible use case for this ask - I am hitting an 
IllegalArgumentException: Size exceeds Integer.MAX_VALUE when trying to 
write an unpartitioned DataFrame to Parquet. The error occurs because a 
shuffle block exceeds 2 GB in size. The solution is to repartition the 
DataFrame (Dataset). I can do that, but I don't want to cause a shuffle when 
I increase the number of partitions with the repartition API.

> Increase partition count without performing a shuffle
> -----------------------------------------------------
>
>                 Key: SPARK-5997
>                 URL: https://issues.apache.org/jira/browse/SPARK-5997
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>            Reporter: Andrew Ash
>            Priority: Major
>
> When decreasing partition count with rdd.repartition() or rdd.coalesce(), the 
> user has the ability to choose whether or not to perform a shuffle.  However 
> when increasing partition count there is no option of whether to perform a 
> shuffle or not -- a shuffle always occurs.
> This Jira is to create a {{rdd.repartition(largeNum, shuffle=false)}} call 
> that performs a repartition to a higher partition count without a shuffle.
> The motivating use case is to decrease the size of an individual partition 
> enough that .toLocalIterator puts significantly less memory pressure on the 
> driver, since it loads one partition at a time into the driver.
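The request above, increasing partition count without moving data between executors, amounts to splitting each parent partition locally. A hedged pure-Python sketch (illustrative names, not the Spark API) of the index mapping such a shuffle-free repartition could use:

```python
def split_partitions(partitions, factor):
    """Split each parent partition into `factor` child partitions by
    slicing it locally. No element crosses a parent-partition boundary,
    so no network shuffle would be required: parent i's data becomes
    children i*factor .. i*factor + factor - 1.
    """
    children = []
    for part in partitions:
        n = len(part)
        chunk = max(1, -(-n // factor))  # ceil(n / factor)
        # contiguous slices keep each child's data on the parent's node
        pieces = [part[j:j + chunk] for j in range(0, n, chunk)]
        # pad with empty children so every parent yields exactly `factor`
        pieces += [[]] * (factor - len(pieces))
        children.extend(pieces)
    return children

parts = [[1, 2, 3, 4], [5, 6]]
print(split_partitions(parts, 2))  # [[1, 2], [3, 4], [5], [6]]
```

Because each child is a slice of exactly one parent, smaller partitions come at the cost of possible skew (empty children) rather than a shuffle, which is the same trade-off coalesce makes in the opposite direction.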



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
