[ https://issues.apache.org/jira/browse/SPARK-5997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16792964#comment-16792964 ]
nirav patel edited comment on SPARK-5997 at 3/14/19 6:56 PM:
-------------------------------------------------------------

Adding another possible use case for this ask - I am hitting an IllegalArgumentException: Size exceeds Integer.MAX_VALUE error when trying to write an unpartitioned DataFrame to Parquet. The error occurs because a data block exceeds 2 GB in size before being written to disk. The solution is to repartition the DataFrame (Dataset). I can do that, but I don't want to cause a shuffle when I increase the number of partitions with the repartition API.

was (Author: tenstriker):
Adding another possible use case for this ask - I am hitting an IllegalArgumentException: Size exceeds Integer.MAX_VALUE error when trying to write an unpartitioned DataFrame to Parquet. The error occurs because a shuffle block exceeds 2 GB in size. The solution is to repartition the DataFrame (Dataset). I can do that, but I don't want to cause a shuffle when I increase the number of partitions with the repartition API.

> Increase partition count without performing a shuffle
> -----------------------------------------------------
>
>                 Key: SPARK-5997
>                 URL: https://issues.apache.org/jira/browse/SPARK-5997
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>            Reporter: Andrew Ash
>            Priority: Major
>
> When decreasing the partition count with rdd.repartition() or rdd.coalesce(), the
> user has the ability to choose whether or not to perform a shuffle. However,
> when increasing the partition count there is no option of whether to perform a
> shuffle or not -- a shuffle always occurs.
> This Jira is to create a {{rdd.repartition(largeNum, shuffle=false)}} call
> that repartitions to a higher partition count without a shuffle.
> The motivating use case is to decrease the size of an individual partition
> enough that .toLocalIterator has significantly reduced memory pressure on
> the driver, as it loads one partition at a time into the driver.
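The asymmetry described in the issue can be seen in a short Scala sketch (a minimal example, assuming a local SparkSession; partition counts other than the hard-coded 8/2/16 are illustrative choices, not from the issue):

```scala
import org.apache.spark.sql.SparkSession

object RepartitionDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[4]")
      .appName("repartition-demo")
      .getOrCreate()

    // Start with 8 partitions.
    val rdd = spark.sparkContext.parallelize(1 to 1000, numSlices = 8)

    // Decreasing the partition count: coalesce(n) with shuffle = false
    // (the default) merges partitions without moving data between executors.
    val fewer = rdd.coalesce(2)
    println(fewer.getNumPartitions) // 2, no shuffle

    // Increasing the partition count: coalesce(n) with shuffle = false
    // cannot grow the count -- the partition count stays unchanged at 8.
    val unchanged = rdd.coalesce(16)
    println(unchanged.getNumPartitions) // still 8

    // Today the only way to increase the count is repartition(n), which
    // always performs a full shuffle -- the cost SPARK-5997 asks to avoid.
    val more = rdd.repartition(16)
    println(more.getNumPartitions) // 16, but at the cost of a shuffle

    spark.stop()
  }
}
```

The same constraint drives the Parquet-write use case above: a `df.repartition(n)` before `df.write.parquet(...)` shrinks each output block below the 2 GB limit, but only by paying for a full shuffle.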
-- This message was sent by Atlassian JIRA (v7.6.3#76005)