I tried it without the cache, but it didn't change anything. The reason for the 
cache is that other actions will be performed on this RDD, even though it never 
gets that far. 

I can make it work by just increasing the number of partitions, but I was 
hoping to get a better understanding of how Spark works rather than just using 
trial and error every time I hit this issue. 


----- Original Message ----- 
From: silvio.fior...@granturing.com 
To: christopher.br...@oracle.com, ko...@tresata.com 
Cc: user@spark.apache.org 
Sent: Sunday, February 14, 2016 8:27:09 AM GMT -05:00 US/Canada Eastern 
Subject: RE: coalesce and executor memory 





Actually, rereading your email I see you're caching. But ‘cache’ uses 
MEMORY_ONLY. Do you see errors about losing partitions as your job is running? 



Are you sure you need to cache if you're just saving to disk? Can you try the 
coalesce without cache? 
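
If you do need the cache, one option is to persist with a storage level that 
can spill. A minimal sketch, assuming a SparkContext sc (as in spark-shell); 
the path, sample fraction, and partition count are all made up:

import org.apache.spark.storage.StorageLevel

// textFile stands in for your hadoopFile call; path is hypothetical
val sampled = sc.textFile("hdfs:///data/in")
  .sample(withReplacement = false, fraction = 0.1)
  .coalesce(100)

// MEMORY_AND_DISK spills partitions that don't fit in memory to local
// disk, instead of dropping them the way plain cache() (MEMORY_ONLY) does
val cached = sampled.persist(StorageLevel.MEMORY_AND_DISK)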






From: Christopher Brady 
Sent: Friday, February 12, 2016 8:34 PM 
To: Koert Kuipers ; Silvio Fiorito 
Cc: user 
Subject: Re: coalesce and executor memory 


Thank you for the responses. The map function just changes the format of the 
record slightly, so I don't think that would be the cause of the memory 
problem. 

So if I have 3 cores per executor, I need to be able to fit 3 partitions per 
executor within whatever I specify for the executor memory? Is there a way I 
can programmatically find a number of partitions I can coalesce down to without 
running out of memory? Is there some documentation where this is explained? 
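
The best I've come up with so far is a back-of-the-envelope estimate like the 
one below, where every number is an assumption I'd have to tune, not a 
measured value:

// all figures are illustrative guesses, not measurements
val totalInputBytes     = 50L * 1024 * 1024 * 1024  // ~50 GB of sampled input
val deserializedFactor  = 3.0                       // cached Java objects often run 2-5x the on-disk size
val executorMemoryBytes = 8L * 1024 * 1024 * 1024   // spark.executor.memory = 8g
val coresPerExecutor    = 3                         // concurrent tasks sharing that heap
val usableFraction      = 0.5                       // rough share of heap available for cached blocks

val memPerTask    = executorMemoryBytes * usableFraction / coresPerExecutor
val minPartitions = math.ceil(totalInputBytes * deserializedFactor / memPerTask).toInt
// coalescing below minPartitions risks partitions too large to cache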



On 02/12/2016 05:10 PM, Koert Kuipers wrote: 




In Spark, every partition needs to fit in the memory available to the core 
processing it. 

As you coalesce, you reduce the number of partitions, increasing the size of 
each one. At some point a partition no longer fits in memory. 
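
A toy example makes the growth visible (numbers illustrative only):

val rdd = sc.parallelize(1 to 1000000, numSlices = 100)

// count records per partition before and after coalescing
val before = rdd.mapPartitions(it => Iterator(it.size)).collect()              // ~10,000 each
val after  = rdd.coalesce(4).mapPartitions(it => Iterator(it.size)).collect()  // ~250,000 each

// each of the 4 remaining tasks must now hold 25x more data once cached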



On Fri, Feb 12, 2016 at 4:50 PM, Silvio Fiorito <silvio.fior...@granturing.com> wrote: 


Coalesce essentially reduces parallelism, so fewer cores are getting more 
records. Be aware that it could also lead to loss of data locality, depending 
on how far you reduce. Depending on what you’re doing in the map operation, it 
could lead to OOM errors. Can you give more details as to what the code for the 
map looks like? 
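
One knob worth knowing about here (a sketch, with a hypothetical path): 
coalesce takes a shuffle flag. With the default shuffle = false, the upstream 
read and map work itself runs in only the reduced number of tasks; with 
shuffle = true (what repartition does), the upstream stage keeps its full 
parallelism and only the output is consolidated.

val mapped = sc.textFile("hdfs:///data/in")   // hypothetical input
  .map(_.toUpperCase)                         // stand-in for the format change

val narrow   = mapped.coalesce(10)                 // 10 tasks do all the upstream work
val shuffled = mapped.coalesce(10, shuffle = true) // full parallelism upstream, then a shuffle into 10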






On 2/12/16, 1:13 PM, "Christopher Brady" <christopher.br...@oracle.com> wrote: 

>Can anyone help me understand why using coalesce causes my executors to 
>crash with out of memory? What happens during coalesce that increases 
>memory usage so much? 
> 
>If I do: 
>hadoopFile -> sample -> cache -> map -> saveAsNewAPIHadoopFile 
> 
>everything works fine, but if I do: 
>hadoopFile -> sample -> coalesce -> cache -> map -> saveAsNewAPIHadoopFile 
> 
>my executors crash with out of memory exceptions. 
> 
>Is there any documentation that explains what causes the increased 
>memory requirements with coalesce? It seems to be less of a problem if I 
>coalesce into a larger number of partitions, but I'm not sure why this 
>is. How would I estimate how much additional memory the coalesce requires? 
> 
>Thanks. 
> 
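
For reference, the two pipelines above look roughly like this, with textFile 
and saveAsTextFile standing in for the Hadoop-format calls and all paths and 
counts made up:

val input   = sc.textFile("hdfs:///data/in")
val sampled = input.sample(withReplacement = false, fraction = 0.1)

// works: cached partitions match the size of the input splits
sampled.cache()
  .map(_.trim)                         // the slight format change
  .saveAsTextFile("hdfs:///data/out1")

// crashes with OOM: after coalesce(8) each cached (MEMORY_ONLY,
// deserialized) partition holds many splits' worth of records and must
// fit in a single task's share of executor memory
sampled.coalesce(8)
  .cache()
  .map(_.trim)
  .saveAsTextFile("hdfs:///data/out2")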

