I believe you will gain a better understanding of what is happening per partition if you look at or use mapPartitions().
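A minimal sketch of what I mean, in Scala (the input path, sample fraction, coalesce target, and the toUpperCase stand-in for your format change are all placeholders, not your actual job). mapPartitions() hands you a lazy iterator over the partition, so a record-by-record transformation streams through one element at a time instead of materializing the whole partition the way cache() must:

    import org.apache.spark.{SparkConf, SparkContext}

    object CoalesceSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("coalesce-sketch"))

        // Placeholder input; your job reads with hadoopFile instead.
        val records = sc.textFile("hdfs:///tmp/input")

        val out = records
          .sample(withReplacement = false, fraction = 0.1)
          .coalesce(100) // fewer, larger partitions
          .mapPartitions { iter =>
            // The iterator is consumed lazily, one record at a time, so the
            // whole partition never has to sit in memory at once (unlike
            // cache(), which materializes every record it stores).
            iter.map(line => line.toUpperCase) // stand-in for your map
          }

        out.saveAsTextFile("hdfs:///tmp/output")
        sc.stop()
      }
    }

Since you said removing the cache didn't change anything, mapPartitions() is also a convenient place to count or log how many records land in a single partition after the coalesce, which would confirm whether partition size is really what is blowing the heap.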
Regards,
Sab

On 15-Feb-2016 8:38 am, "Christopher Brady" <christopher.br...@oracle.com> wrote:

> I tried it without the cache, but it didn't change anything. The reason
> for the cache is that other actions will be performed on this RDD, even
> though it never gets that far.
>
> I can make it work by just increasing the number of partitions, but I was
> hoping to get a better understanding of how Spark works rather than just
> use trial and error every time I hit this issue.
>
> ----- Original Message -----
> From: silvio.fior...@granturing.com
> To: christopher.br...@oracle.com, ko...@tresata.com
> Cc: user@spark.apache.org
> Sent: Sunday, February 14, 2016 8:27:09 AM GMT -05:00 US/Canada Eastern
> Subject: RE: coalesce and executor memory
>
> Actually, rereading your email I see you're caching. But ‘cache’ uses
> MEMORY_ONLY. Do you see errors about losing partitions as your job is
> running?
>
> Are you sure you need to cache if you're just saving to disk? Can you try
> the coalesce without cache?
>
> From: Christopher Brady <christopher.br...@oracle.com>
> Sent: Friday, February 12, 2016 8:34 PM
> To: Koert Kuipers <ko...@tresata.com>; Silvio Fiorito <silvio.fior...@granturing.com>
> Cc: user <user@spark.apache.org>
> Subject: Re: coalesce and executor memory
>
> Thank you for the responses. The map function just changes the format of
> the record slightly, so I don't think that would be the cause of the
> memory problem.
>
> So if I have 3 cores per executor, I need to be able to fit 3 partitions
> per executor within whatever I specify for the executor memory? Is there
> a way I can programmatically find a number of partitions I can coalesce
> down to without running out of memory? Is there some documentation where
> this is explained?
>
> On 02/12/2016 05:10 PM, Koert Kuipers wrote:
>
> In Spark, every partition needs to fit in the memory available to the
> core processing it.
>
> As you coalesce, you reduce the number of partitions, increasing the
> partition size. At some point the partition no longer fits in memory.
>
> On Fri, Feb 12, 2016 at 4:50 PM, Silvio Fiorito
> <silvio.fior...@granturing.com> wrote:
>
>> Coalesce essentially reduces parallelism, so fewer cores are getting
>> more records. Be aware that it could also lead to loss of data locality,
>> depending on how far you reduce. Depending on what you’re doing in the
>> map operation, it could lead to OOM errors. Can you give more details as
>> to what the code for the map looks like?
>>
>> On 2/12/16, 1:13 PM, "Christopher Brady" <christopher.br...@oracle.com>
>> wrote:
>>
>>> Can anyone help me understand why using coalesce causes my executors
>>> to crash with out of memory? What happens during coalesce that
>>> increases memory usage so much?
>>>
>>> If I do:
>>>   hadoopFile -> sample -> cache -> map -> saveAsNewAPIHadoopFile
>>> everything works fine, but if I do:
>>>   hadoopFile -> sample -> coalesce -> cache -> map -> saveAsNewAPIHadoopFile
>>> my executors crash with out-of-memory exceptions.
>>>
>>> Is there any documentation that explains what causes the increased
>>> memory requirements with coalesce? It seems to be less of a problem if
>>> I coalesce into a larger number of partitions, but I'm not sure why
>>> this is. How would I estimate how much additional memory the coalesce
>>> requires?
>>>
>>> Thanks.
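P.S. On the question of programmatically finding a number of partitions to coalesce down to: I am not aware of an exact API for it, but a rough heuristic is to derive the target from the on-disk size of the input rather than from trial and error. A hedged sketch (the helper name, the 256 MB target, and the 3x deserialization factor are my assumptions, to be tuned against your records and executor memory):

    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.spark.SparkContext

    // Rough heuristic: size partitions from the input bytes on disk.
    def coalesceTarget(sc: SparkContext, inputPath: String,
                       targetPartitionBytes: Long = 256L * 1024 * 1024): Int = {
      val fs = FileSystem.get(sc.hadoopConfiguration)
      val inputBytes = fs.getContentSummary(new Path(inputPath)).getLength
      // Deserialized records are usually larger than their on-disk form,
      // so pad with a safety factor (an assumption; measure and adjust).
      val safetyFactor = 3L
      math.max(1, ((inputBytes * safetyFactor) / targetPartitionBytes).toInt)
    }

Since you are sampling first, you could also multiply the input bytes by the sample fraction. Either way it pays to verify: cache one partition, check its deserialized size on the Storage tab of the UI, and adjust the factor from there.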