I believe you will gain a better understanding if you look at (or use)
mapPartitions().
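
For example, a minimal Scala sketch (the path and the per-record change are
hypothetical stand-ins; assumes an existing SparkContext `sc`):

    val lines = sc.textFile("hdfs:///some/input")
    // mapPartitions exposes each partition as an Iterator, which makes the
    // per-partition unit of work (and its memory footprint) explicit.
    val formatted = lines.mapPartitions { records =>
      records.map(_.toUpperCase) // stand-in for the real format change
    }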

Regards
Sab
On 15-Feb-2016 8:38 am, "Christopher Brady" <christopher.br...@oracle.com>
wrote:

> I tried it without the cache, but it didn't change anything. The reason
> for the cache is that other actions will be performed on this RDD, even
> though it never gets that far.
>
> I can make it work by just increasing the number of partitions, but I was
> hoping to get a better understanding of how Spark works rather than just
> using trial and error every time I hit this issue.
>
>
> ----- Original Message -----
> From: silvio.fior...@granturing.com
> To: christopher.br...@oracle.com, ko...@tresata.com
> Cc: user@spark.apache.org
> Sent: Sunday, February 14, 2016 8:27:09 AM GMT -05:00 US/Canada Eastern
> Subject: RE: coalesce and executor memory
>
> Actually, rereading your email I see you're caching. But ‘cache’ uses
> MEMORY_ONLY. Do you see errors about losing partitions as your job is
> running?
>
>
>
> Are you sure you need to cache if you're just saving to disk? Can you try
> the coalesce without cache?
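>
> If the cache really is needed for later actions, one possible middle ground
> (a sketch, not something verified in this thread; `rdd` is a placeholder) is
> a storage level that can spill to disk:
>
>     import org.apache.spark.storage.StorageLevel
>
>     // Unlike cache() (MEMORY_ONLY, which skips partitions that don't fit
>     // and recomputes them later), this writes them to local disk instead.
>     val persisted = rdd.persist(StorageLevel.MEMORY_AND_DISK)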
>
>
>
>
>
> *From: *Christopher Brady <christopher.br...@oracle.com>
> *Sent: *Friday, February 12, 2016 8:34 PM
> *To: *Koert Kuipers <ko...@tresata.com>; Silvio Fiorito
> <silvio.fior...@granturing.com>
> *Cc: *user <user@spark.apache.org>
> *Subject: *Re: coalesce and executor memory
>
>
> Thank you for the responses. The map function just changes the format of
> the record slightly, so I don't think that would be the cause of the memory
> problem.
>
> So if I have 3 cores per executor, I need to be able to fit 3 partitions
> per executor within whatever I specify for the executor memory? Is there a
> way I can programmatically find a number of partitions I can coalesce down
> to without running out of memory? Is there some documentation where this is
> explained?
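>
> One empirical way to check partition sizes (a sketch only; record counts as
> a rough proxy for bytes, with `rdd` standing in for the RDD in question):
>
>     // Count records per partition without collecting the data itself.
>     val perPartition = rdd
>       .mapPartitionsWithIndex((i, it) => Iterator((i, it.size)))
>       .collect()
>     perPartition.foreach { case (i, n) => println(s"partition $i: $n records") }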
>
>
> On 02/12/2016 05:10 PM, Koert Kuipers wrote:
>
> In Spark, every partition needs to fit in the memory available to the core
> processing it.
>
> As you coalesce, you reduce the number of partitions and increase the
> partition size; at some point a partition no longer fits in memory.
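>
> A back-of-envelope illustration (all numbers hypothetical): 90 GB of input
> in 900 partitions is roughly 100 MB per partition, which fits easily; after
> coalesce(9) each partition is roughly 10 GB, which is more than any single
> core's share of executor memory.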
>
> On Fri, Feb 12, 2016 at 4:50 PM, Silvio Fiorito <
> silvio.fior...@granturing.com> wrote:
>
>> Coalesce essentially reduces parallelism, so fewer cores are getting more
>> records. Be aware that it could also lead to loss of data locality,
>> depending on how far you reduce. Depending on what you’re doing in the map
>> operation, it could lead to OOM errors. Can you give more details as to
>> what the code for the map looks like?
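>>
>> One related option (a suggestion, not something tried in this thread):
>> coalesce also takes a shuffle flag. With shuffle = true it behaves like
>> repartition, keeping the upstream stage at full parallelism at the cost of
>> a shuffle:
>>
>>     // shuffle = true redistributes records instead of merging partitions
>>     // in place, so upstream tasks keep their original parallelism.
>>     val fewer = rdd.coalesce(10, shuffle = true) // same as repartition(10)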
>>
>>
>>
>>
>> On 2/12/16, 1:13 PM, "Christopher Brady" <christopher.br...@oracle.com>
>> wrote:
>>
>> >Can anyone help me understand why using coalesce causes my executors to
>> >crash with out-of-memory errors? What happens during coalesce that
>> >increases memory usage so much?
>> >
>> >If I do:
>> >hadoopFile -> sample -> cache -> map -> saveAsNewAPIHadoopFile
>> >
>> >everything works fine, but if I do:
>> >hadoopFile -> sample -> coalesce -> cache -> map -> saveAsNewAPIHadoopFile
>> >
>> >my executors crash with out of memory exceptions.
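>> >
>> >In rough Scala terms (paths, the sample fraction, the partition count, and
>> >the map body are hypothetical placeholders; the extra copy step guards
>> >against Hadoop's reuse of Writable objects when caching), the failing
>> >pipeline looks like:
>> >
>> >    import org.apache.hadoop.io.{LongWritable, Text}
>> >    import org.apache.hadoop.mapred.TextInputFormat
>> >    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
>> >
>> >    val data = sc.hadoopFile[LongWritable, Text, TextInputFormat]("hdfs:///in")
>> >      // Copy out of Hadoop's reused Writable instances before caching.
>> >      .map { case (k, v) => (new LongWritable(k.get), new Text(v)) }
>> >      .sample(withReplacement = false, fraction = 0.1)
>> >      .coalesce(50)
>> >      .cache()
>> >      .map { case (k, v) => (k, new Text(v.toString.trim)) } // format change
>> >    data.saveAsNewAPIHadoopFile("hdfs:///out",
>> >      classOf[LongWritable], classOf[Text],
>> >      classOf[TextOutputFormat[LongWritable, Text]])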
>> >
>> >Is there any documentation that explains what causes the increased
>> >memory requirements with coalesce? It seems to be less of a problem if I
>> >coalesce into a larger number of partitions, but I'm not sure why this
>> >is. How would I estimate how much additional memory the coalesce requires?
>> >
>> >Thanks.
>> >
>>
>
>
>
