[ https://issues.apache.org/jira/browse/SPARK-21140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16054055#comment-16054055 ]

Sean Owen commented on SPARK-21140:
-----------------------------------

Yes, it's possible the executor makes a copy of some data during processing.
Given the overhead of serializing data and merging intermediate buffers, that
copy could be fairly large.
This isn't a very minimal example, and it doesn't establish that anything
actually runs out of memory.
There's also no proposal here about what could be done differently, and no
leads on where the bulk of the memory is being allocated: serialization, perhaps?
I don't think this is actionable as is.
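
For what it's worth, one way to get a lead would be to compare the on-heap
size of a sample of the data against its serialized size. A minimal sketch;
the sample size and the use of plain Java serialization here are assumptions,
not what the reporter ran:

    import java.io.{ByteArrayOutputStream, ObjectOutputStream}
    import org.apache.spark.util.SizeEstimator

    object SerializationOverhead {
      def main(args: Array[String]): Unit = {
        // 100k records x 256 bytes: a small sample of the reported workload
        val sample: Array[Array[Byte]] = Array.fill(100 * 1024)(new Array[Byte](256))

        // Serialize the whole sample, roughly as a task result would be
        val bytes = new ByteArrayOutputStream()
        val out = new ObjectOutputStream(bytes)
        out.writeObject(sample)
        out.close()

        // SizeEstimator counts object headers and array overhead, not just payload
        println(s"on-heap estimate : ${SizeEstimator.estimate(sample)} bytes")
        println(s"serialized size  : ${bytes.size()} bytes")
      }
    }

If the on-heap copy, the serialized copy, and any intermediate buffers are all
live at once during collect, a multiple of the raw payload size is plausible.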

> Reduce collect's high memory requirements
> -----------------------------------------
>
>                 Key: SPARK-21140
>                 URL: https://issues.apache.org/jira/browse/SPARK-21140
>             Project: Spark
>          Issue Type: Improvement
>          Components: Input/Output
>    Affects Versions: 2.1.1
>         Environment: Linux Debian 8 using hadoop 2.7.2.
>            Reporter: michael procopio
>
> I wrote a very simple Scala application that used flatMap to create an RDD 
> containing a single 512 MB partition of 256-byte arrays.  Experimentally, I 
> determined that spark.executor.memory had to be set to 3 GB in order to 
> collect the data.  This seems extremely high.
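
For reference, a minimal sketch of the kind of program described above. The
record count, single partition, and flatMap shape match the report; everything
else (names, submit settings) is assumed, since the original code is not
attached:

    import org.apache.spark.{SparkConf, SparkContext}

    object CollectRepro {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("collect-repro")
        val sc = new SparkContext(conf)

        // 2M x 256-byte arrays: ~512 MB of raw payload in a single partition
        val recordsPerPartition = 2 * 1024 * 1024
        val rdd = sc.parallelize(Seq(0), numSlices = 1).flatMap { _ =>
          Iterator.fill(recordsPerPartition)(new Array[Byte](256))
        }

        // Reportedly requires spark.executor.memory=3g despite ~512 MB of payload
        val result = rdd.collect()
        println(s"collected ${result.length} records")

        sc.stop()
      }
    }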


