[jira] [Commented] (SPARK-22584) dataframe write partitionBy out of disk/java heap issues

Sean Owen (JIRA) Thu, 23 Nov 2017 07:11:37 -0800

    [ 
https://issues.apache.org/jira/browse/SPARK-22584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16264448#comment-16264448
 ]


Sean Owen commented on SPARK-22584:
-----------------------------------

It depends on too many things: what did you transform the data into? did you 
cache it? how much memory is actually allocated to Spark? driver vs executor? 
what ran out of memory, where? This is too open ended for a JIRA.

> dataframe write partitionBy out of disk/java heap issues
> --------------------------------------------------------
>
>                 Key: SPARK-22584
>                 URL: https://issues.apache.org/jira/browse/SPARK-22584
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: Derek M Miller
>
> I have been seeing some issues with partitionBy for the dataframe writer. I 
> currently have a file that is 6mb, just for testing, and it has around 1487 
> rows and 21 columns. There is nothing out of the ordinary with the columns, 
> having either a DoubleType or StringType. The partitionBy calls two different 
> partitions with verified low cardinality. One partition has 30 unique values 
> and the other one has 2 unique values.
> ```scala
>         df
>             .write.partitionBy("first", "second")
>             .mode(SaveMode.Overwrite)
>             .parquet(s"$location$example/$corrId/")
> ```
> When running this example on Amazon's EMR with 5 r4.xlarges (30 gb of memory 
> each), I am getting a java heap out of memory error. I have 
> maximizeResourceAllocation set, and verified on the instances. I have even 
> set it to false, explicitly set the driver and executor memory to 16g, but 
> still had the same issue. Occasionally I get an error about disk space, and 
> the job seems to work if I use an r3.xlarge (that has the ssd). But that 
> seems weird that 6mb of data needs to spill to disk.
> The problem mainly seems to be centered around two + partitions vs 1. If I 
> just use either of the partitions only, I have no problems. It's also worth 
> noting that each of the partitions are evenly distributed.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-22584) dataframe write partitionBy out of disk/java heap issues

Reply via email to