Re: coalesce and executor memory

2016-02-14 Thread Sabarish Sasidharan
I believe you will gain more understanding if you look at or use
mapPartitions()

Regards
Sab
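For illustration, here is a minimal sketch of the difference between map() and
mapPartitions(); the input path and the transformation are placeholders, not
anything from Christopher's job:

    // In spark-shell, where sc is already defined.
    val records = sc.textFile("hdfs:///some/input")          // placeholder path

    // map() is applied one record at a time, streaming through each partition:
    val mapped = records.map(line => line.trim)

    // mapPartitions() hands you each partition's iterator explicitly. As long
    // as the function returns another lazy iterator (as below), nothing forces
    // the whole partition into memory; it only would if you materialized it,
    // e.g. with iter.toArray.
    val mappedByPartition = records.mapPartitions { iter =>
      iter.map(line => line.trim)
    }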
On 15-Feb-2016 8:38 am, "Christopher Brady" <christopher.br...@oracle.com>
wrote:

> I tried it without the cache, but it didn't change anything. The reason
> for the cache is that other actions will be performed on this RDD, even
> though it never gets that far.
>
> I can make it work by just increasing the number of partitions, but I was
> hoping to get a better understanding of how Spark works rather than just
> use trial and error every time I hit this issue.
>
>
> ----- Original Message -----
> From: silvio.fior...@granturing.com
> To: christopher.br...@oracle.com, ko...@tresata.com
> Cc: user@spark.apache.org
> Sent: Sunday, February 14, 2016 8:27:09 AM GMT -05:00 US/Canada Eastern
> Subject: RE: coalesce and executor memory
>
> Actually, rereading your email I see you're caching. But ‘cache’ uses
> MEMORY_ONLY. Do you see errors about losing partitions as your job is
> running?
>
>
>
> Are you sure you need to cache if you're just saving to disk? Can you try
> the coalesce without cache?
>
>
>
>
>
> *From: *Christopher Brady <christopher.br...@oracle.com>
> *Sent: *Friday, February 12, 2016 8:34 PM
> *To: *Koert Kuipers <ko...@tresata.com>; Silvio Fiorito
> <silvio.fior...@granturing.com>
> *Cc: *user <user@spark.apache.org>
> *Subject: *Re: coalesce and executor memory
>
>
> Thank you for the responses. The map function just changes the format of
> the record slightly, so I don't think that would be the cause of the memory
> problem.
>
> So if I have 3 cores per executor, I need to be able to fit 3 partitions
> per executor within whatever I specify for the executor memory? Is there a
> way I can programmatically find a number of partitions I can coalesce down
> to without running out of memory? Is there some documentation where this is
> explained?
>
>
> On 02/12/2016 05:10 PM, Koert Kuipers wrote:
>
> in spark, every partition needs to fit in the memory available to the core
> processing it.
>
> as you coalesce you reduce number of partitions, increasing partition
> size. at some point the partition no longer fits in memory.
>
> On Fri, Feb 12, 2016 at 4:50 PM, Silvio Fiorito <
> silvio.fior...@granturing.com> wrote:
>
>> Coalesce essentially reduces parallelism, so fewer cores are getting more
>> records. Be aware that it could also lead to loss of data locality,
>> depending on how far you reduce. Depending on what you’re doing in the map
>> operation, it could lead to OOM errors. Can you give more details as to
>> what the code for the map looks like?
>>
>>
>>
>>
>> On 2/12/16, 1:13 PM, "Christopher Brady" <christopher.br...@oracle.com> wrote:
>>
>> >Can anyone help me understand why using coalesce causes my executors to
>> >crash with out of memory? What happens during coalesce that increases
>> >memory usage so much?
>> >
>> >If I do:
>> >hadoopFile -> sample -> cache -> map -> saveAsNewAPIHadoopFile
>> >
>> >everything works fine, but if I do:
>> >hadoopFile -> sample -> coalesce -> cache -> map ->
>> saveAsNewAPIHadoopFile
>> >
>> >my executors crash with out of memory exceptions.
>> >
>> >Is there any documentation that explains what causes the increased
>> >memory requirements with coalesce? It seems to be less of a problem if I
>> >coalesce into a larger number of partitions, but I'm not sure why this
>> >is. How would I estimate how much additional memory the coalesce
>> requires?
>> >
>> >Thanks.
>> >
>> >-
>> >To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> >For additional commands, e-mail: user-h...@spark.apache.org
>> >
>>
>
>
>


Re: coalesce and executor memory

2016-02-14 Thread Christopher Brady


I tried it without the cache, but it didn't change anything. The reason for the 
cache is that other actions will be performed on this RDD, even though it never 
gets that far. 

I can make it work by just increasing the number of partitions, but I was 
hoping to get a better understanding of how Spark works rather than just use 
trial and error every time I hit this issue. 


----- Original Message -----
From: silvio.fior...@granturing.com 
To: christopher.br...@oracle.com, ko...@tresata.com 
Cc: user@spark.apache.org 
Sent: Sunday, February 14, 2016 8:27:09 AM GMT -05:00 US/Canada Eastern 
Subject: RE: coalesce and executor memory 





Actually, rereading your email I see you're caching. But ‘cache’ uses 
MEMORY_ONLY. Do you see errors about losing partitions as your job is running? 



Are you sure you need to cache if you're just saving to disk? Can you try the 
coalesce without cache? 






From: Christopher Brady 
Sent: Friday, February 12, 2016 8:34 PM 
To: Koert Kuipers ; Silvio Fiorito 
Cc: user 
Subject: Re: coalesce and executor memory 


Thank you for the responses. The map function just changes the format of the 
record slightly, so I don't think that would be the cause of the memory 
problem. 

So if I have 3 cores per executor, I need to be able to fit 3 partitions per 
executor within whatever I specify for the executor memory? Is there a way I 
can programmatically find a number of partitions I can coalesce down to without 
running out of memory? Is there some documentation where this is explained? 



On 02/12/2016 05:10 PM, Koert Kuipers wrote: 




in spark, every partition needs to fit in the memory available to the core 
processing it. 

as you coalesce you reduce number of partitions, increasing partition size. at 
some point the partition no longer fits in memory. 



On Fri, Feb 12, 2016 at 4:50 PM, Silvio Fiorito <silvio.fior...@granturing.com> wrote: 


Coalesce essentially reduces parallelism, so fewer cores are getting more 
records. Be aware that it could also lead to loss of data locality, depending 
on how far you reduce. Depending on what you’re doing in the map operation, it 
could lead to OOM errors. Can you give more details as to what the code for the 
map looks like? 






On 2/12/16, 1:13 PM, "Christopher Brady" <christopher.br...@oracle.com> wrote: 

>Can anyone help me understand why using coalesce causes my executors to 
>crash with out of memory? What happens during coalesce that increases 
>memory usage so much? 
> 
>If I do: 
>hadoopFile -> sample -> cache -> map -> saveAsNewAPIHadoopFile 
> 
>everything works fine, but if I do: 
>hadoopFile -> sample -> coalesce -> cache -> map -> saveAsNewAPIHadoopFile 
> 
>my executors crash with out of memory exceptions. 
> 
>Is there any documentation that explains what causes the increased 
>memory requirements with coalesce? It seems to be less of a problem if I 
>coalesce into a larger number of partitions, but I'm not sure why this 
>is. How would I estimate how much additional memory the coalesce requires? 
> 
>Thanks. 
> 
>- 
>To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
>For additional commands, e-mail: user-h...@spark.apache.org 
> 




RE: coalesce and executor memory

2016-02-14 Thread Silvio Fiorito
Actually, rereading your email I see you're caching. But ‘cache’ uses 
MEMORY_ONLY. Do you see errors about losing partitions as your job is running?

Are you sure you need to cache if you're just saving to disk? Can you try the 
coalesce without cache?
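If the cache really is needed for later actions, one thing worth trying (a
sketch only, with 'coalesced' standing in for the RDD in question) is
persisting at a storage level that can spill, instead of the MEMORY_ONLY
level that cache() implies:

    import org.apache.spark.storage.StorageLevel

    // cache() is shorthand for persist(StorageLevel.MEMORY_ONLY).
    // MEMORY_AND_DISK lets partitions that don't fit in memory spill to
    // local disk instead of being dropped and recomputed later.
    val persisted = coalesced.persist(StorageLevel.MEMORY_AND_DISK)

Whether that avoids the OOM depends on where the memory is actually going, so
it is a mitigation to try rather than a diagnosis.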


From: Christopher Brady <christopher.br...@oracle.com>
Sent: Friday, February 12, 2016 8:34 PM
To: Koert Kuipers <ko...@tresata.com>; Silvio Fiorito <silvio.fior...@granturing.com>
Cc: user <user@spark.apache.org>
Subject: Re: coalesce and executor memory

Thank you for the responses. The map function just changes the format of the 
record slightly, so I don't think that would be the cause of the memory problem.

So if I have 3 cores per executor, I need to be able to fit 3 partitions per 
executor within whatever I specify for the executor memory? Is there a way I 
can programmatically find a number of partitions I can coalesce down to without 
running out of memory? Is there some documentation where this is explained?


On 02/12/2016 05:10 PM, Koert Kuipers wrote:
in spark, every partition needs to fit in the memory available to the core 
processing it.

as you coalesce you reduce number of partitions, increasing partition size. at 
some point the partition no longer fits in memory.

On Fri, Feb 12, 2016 at 4:50 PM, Silvio Fiorito <silvio.fior...@granturing.com> wrote:
Coalesce essentially reduces parallelism, so fewer cores are getting more 
records. Be aware that it could also lead to loss of data locality, depending 
on how far you reduce. Depending on what you’re doing in the map operation, it 
could lead to OOM errors. Can you give more details as to what the code for the 
map looks like?




On 2/12/16, 1:13 PM, "Christopher Brady" <christopher.br...@oracle.com> wrote:

>Can anyone help me understand why using coalesce causes my executors to
>crash with out of memory? What happens during coalesce that increases
>memory usage so much?
>
>If I do:
>hadoopFile -> sample -> cache -> map -> saveAsNewAPIHadoopFile
>
>everything works fine, but if I do:
>hadoopFile -> sample -> coalesce -> cache -> map -> saveAsNewAPIHadoopFile
>
>my executors crash with out of memory exceptions.
>
>Is there any documentation that explains what causes the increased
>memory requirements with coalesce? It seems to be less of a problem if I
>coalesce into a larger number of partitions, but I'm not sure why this
>is. How would I estimate how much additional memory the coalesce requires?
>
>Thanks.
>
>-
>To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>For additional commands, e-mail: user-h...@spark.apache.org
>




Re: coalesce and executor memory

2016-02-13 Thread Daniel Darabos
On Fri, Feb 12, 2016 at 11:10 PM, Koert Kuipers  wrote:

> in spark, every partition needs to fit in the memory available to the core
> processing it.
>

That does not agree with my understanding of how it works. I think you
could do sc.textFile("input").coalesce(1).map(_.replace("A",
"B")).saveAsTextFile("output") on a 1 TB local file and it would work fine.
(I just tested this with a 3 GB file with a 1 GB executor.)

RDDs are mostly implemented using iterators. For example map() operates
line-by-line, never pulling in the whole partition into memory. coalesce()
also just concatenates the iterators of the parent partitions into a new
iterator.

Some operations, like reduceByKey(), need to have the whole contents of the
partition to work. But they typically use ExternalAppendOnlyMap, so they
spill to disk instead of filling up the memory.

I know I'm not helping to answer Christopher's issue. Christopher, can you
perhaps give us an example that we can easily try in spark-shell to see the
same problem?
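In that spirit, here is a sketch of the kind of synthetic spark-shell
experiment that might (or might not) show the same behaviour; the data, sizes
and partition counts are all made up:

    // e.g. launched with: spark-shell --master local[3] --driver-memory 1g
    // Synthetic records, very roughly 1-2 GB in total across 200 partitions.
    val data = sc.parallelize(1 to 2000000, 200).map(i => i.toString * 100)

    // Many small partitions: caching and counting should be fine.
    data.cache().count()
    data.unpersist()

    // Coalesce down hard, then cache the now much larger partitions.
    // If a partition no longer fits in the per-core memory, this is where
    // trouble would be expected to show up.
    val few = data.coalesce(2).cache()
    few.count()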


Re: coalesce and executor memory

2016-02-13 Thread Koert Kuipers
sorry i meant to say:
and my way to deal with OOMs is almost always simply to increase number of
partitions. maybe there is a better way that i am not aware of.
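In code that workaround is just picking a larger target. Note that coalesce()
without a shuffle can only reduce the partition count, so going upward means
repartition(), which is coalesce(n, shuffle = true). The RDD name and the
numbers below are placeholders:

    // Instead of coalescing down very aggressively...
    val narrow = someRdd.coalesce(10)
    // ...coalesce less aggressively, or repartition upward (incurs a shuffle):
    val wider  = someRdd.coalesce(100)
    val widest = someRdd.repartition(400)   // same as coalesce(400, shuffle = true)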

On Sat, Feb 13, 2016 at 11:38 PM, Koert Kuipers  wrote:

> thats right, its the reduce operation that makes the in-memory assumption,
> not the map (although i am still suspicious that the map actually streams
> from disk to disk record by record).
>
> in reality though my experience is that if spark can not fit partitions in
> memory it doesnt work well. i get OOMs. and my to OOMs is almost always
> simply to increase number of partitions. maybe there is a better way that i
> am not aware of.
>
> On Sat, Feb 13, 2016 at 6:32 PM, Daniel Darabos <
> daniel.dara...@lynxanalytics.com> wrote:
>
>>
>> On Fri, Feb 12, 2016 at 11:10 PM, Koert Kuipers 
>> wrote:
>>
>>> in spark, every partition needs to fit in the memory available to the
>>> core processing it.
>>>
>>
>> That does not agree with my understanding of how it works. I think you
>> could do sc.textFile("input").coalesce(1).map(_.replace("A",
>> "B")).saveAsTextFile("output") on a 1 TB local file and it would work fine.
>> (I just tested this with a 3 GB file with a 1 GB executor.)
>>
>> RDDs are mostly implemented using iterators. For example map() operates
>> line-by-line, never pulling in the whole partition into memory. coalesce()
>> also just concatenates the iterators of the parent partitions into a new
>> iterator.
>>
>> Some operations, like reduceByKey(), need to have the whole contents of
>> the partition to work. But they typically use ExternalAppendOnlyMap, so
>> they spill to disk instead of filling up the memory.
>>
>> I know I'm not helping to answer Christopher's issue. Christopher, can
>> you perhaps give us an example that we can easily try in spark-shell to see
>> the same problem?
>>
>
>


Re: coalesce and executor memory

2016-02-13 Thread Koert Kuipers
thats right, its the reduce operation that makes the in-memory assumption,
not the map (although i am still suspicious that the map actually streams
from disk to disk record by record).

in reality though my experience is that if spark can not fit partitions in
memory it doesnt work well. i get OOMs. and my to OOMs is almost always
simply to increase number of partitions. maybe there is a better way that i
am not aware of.

On Sat, Feb 13, 2016 at 6:32 PM, Daniel Darabos <
daniel.dara...@lynxanalytics.com> wrote:

>
> On Fri, Feb 12, 2016 at 11:10 PM, Koert Kuipers  wrote:
>
>> in spark, every partition needs to fit in the memory available to the
>> core processing it.
>>
>
> That does not agree with my understanding of how it works. I think you
> could do sc.textFile("input").coalesce(1).map(_.replace("A",
> "B")).saveAsTextFile("output") on a 1 TB local file and it would work fine.
> (I just tested this with a 3 GB file with a 1 GB executor.)
>
> RDDs are mostly implemented using iterators. For example map() operates
> line-by-line, never pulling in the whole partition into memory. coalesce()
> also just concatenates the iterators of the parent partitions into a new
> iterator.
>
> Some operations, like reduceByKey(), need to have the whole contents of
> the partition to work. But they typically use ExternalAppendOnlyMap, so
> they spill to disk instead of filling up the memory.
>
> I know I'm not helping to answer Christopher's issue. Christopher, can you
> perhaps give us an example that we can easily try in spark-shell to see the
> same problem?
>


coalesce and executor memory

2016-02-12 Thread Christopher Brady
Can anyone help me understand why using coalesce causes my executors to 
crash with out of memory? What happens during coalesce that increases 
memory usage so much?


If I do:
hadoopFile -> sample -> cache -> map -> saveAsNewAPIHadoopFile

everything works fine, but if I do:
hadoopFile -> sample -> coalesce -> cache -> map -> saveAsNewAPIHadoopFile

my executors crash with out of memory exceptions.

Is there any documentation that explains what causes the increased 
memory requirements with coalesce? It seems to be less of a problem if I 
coalesce into a larger number of partitions, but I'm not sure why this 
is. How would I estimate how much additional memory the coalesce requires?


Thanks.
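For anyone trying to reproduce this, a rough sketch of the pipeline being
described (paths, formats and the map body are placeholders, not the actual
job):

    import org.apache.hadoop.io.{LongWritable, NullWritable, Text}
    import org.apache.hadoop.mapred.TextInputFormat
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat

    val raw = sc.hadoopFile[LongWritable, Text, TextInputFormat]("hdfs:///input")
                .map { case (_, v) => v.toString }     // copy out of reused Writables
    val sampled = raw.sample(withReplacement = false, fraction = 0.1)

    // Without this coalesce the job runs; with it, executors run out of memory:
    val coalesced = sampled.coalesce(20)

    coalesced.cache()                                  // MEMORY_ONLY by default
    val reformatted = coalesced.map(line => line.trim) // "changes the format slightly"

    reformatted.map(line => (NullWritable.get(), new Text(line)))
      .saveAsNewAPIHadoopFile[TextOutputFormat[NullWritable, Text]]("hdfs:///output")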

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: coalesce and executor memory

2016-02-12 Thread Silvio Fiorito
Coalesce essentially reduces parallelism, so fewer cores are getting more 
records. Be aware that it could also lead to loss of data locality, depending 
on how far you reduce. Depending on what you’re doing in the map operation, it 
could lead to OOM errors. Can you give more details as to what the code for the 
map looks like?




On 2/12/16, 1:13 PM, "Christopher Brady"  wrote:

>Can anyone help me understand why using coalesce causes my executors to 
>crash with out of memory? What happens during coalesce that increases 
>memory usage so much?
>
>If I do:
>hadoopFile -> sample -> cache -> map -> saveAsNewAPIHadoopFile
>
>everything works fine, but if I do:
>hadoopFile -> sample -> coalesce -> cache -> map -> saveAsNewAPIHadoopFile
>
>my executors crash with out of memory exceptions.
>
>Is there any documentation that explains what causes the increased 
>memory requirements with coalesce? It seems to be less of a problem if I 
>coalesce into a larger number of partitions, but I'm not sure why this 
>is. How would I estimate how much additional memory the coalesce requires?
>
>Thanks.
>
>-
>To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>For additional commands, e-mail: user-h...@spark.apache.org
>


Re: coalesce and executor memory

2016-02-12 Thread Koert Kuipers
in spark, every partition needs to fit in the memory available to the core
processing it.

as you coalesce you reduce number of partitions, increasing partition size.
at some point the partition no longer fits in memory.
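With made-up numbers, the arithmetic behind that looks roughly like this:

    // Hypothetical figures, just to show the shape of the estimate.
    val totalDataGb    = 60.0                            // in-memory size of the sampled data
    val numPartitions  = 20                              // coalesce target
    val perPartitionGb = totalDataGb / numPartitions     // = 3 GB per partition
    val coresPerExec   = 3
    val neededPerExec  = perPartitionGb * coresPerExec   // ~9 GB of partition data at once
    // ...plus Spark's own overhead, all of which has to fit under spark.executor.memory.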

On Fri, Feb 12, 2016 at 4:50 PM, Silvio Fiorito <
silvio.fior...@granturing.com> wrote:

> Coalesce essentially reduces parallelism, so fewer cores are getting more
> records. Be aware that it could also lead to loss of data locality,
> depending on how far you reduce. Depending on what you’re doing in the map
> operation, it could lead to OOM errors. Can you give more details as to
> what the code for the map looks like?
>
>
>
>
> On 2/12/16, 1:13 PM, "Christopher Brady" 
> wrote:
>
> >Can anyone help me understand why using coalesce causes my executors to
> >crash with out of memory? What happens during coalesce that increases
> >memory usage so much?
> >
> >If I do:
> >hadoopFile -> sample -> cache -> map -> saveAsNewAPIHadoopFile
> >
> >everything works fine, but if I do:
> >hadoopFile -> sample -> coalesce -> cache -> map -> saveAsNewAPIHadoopFile
> >
> >my executors crash with out of memory exceptions.
> >
> >Is there any documentation that explains what causes the increased
> >memory requirements with coalesce? It seems to be less of a problem if I
> >coalesce into a larger number of partitions, but I'm not sure why this
> >is. How would I estimate how much additional memory the coalesce requires?
> >
> >Thanks.
> >
> >-
> >To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> >For additional commands, e-mail: user-h...@spark.apache.org
> >
>


Re: coalesce and executor memory

2016-02-12 Thread Christopher Brady
Thank you for the responses. The map function just changes the format of 
the record slightly, so I don't think that would be the cause of the 
memory problem.


So if I have 3 cores per executor, I need to be able to fit 3 partitions 
per executor within whatever I specify for the executor memory? Is there 
a way I can programmatically find a number of partitions I can coalesce 
down to without running out of memory? Is there some documentation where 
this is explained?
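There does not seem to be an exact formula for this, but one rough heuristic
(a sketch only: the expansion factor and per-partition budget below are
assumptions, not anything Spark guarantees) is to derive the coalesce target
from the input size:

    import org.apache.hadoop.fs.{FileSystem, Path}

    val fs = FileSystem.get(sc.hadoopConfiguration)
    val inputBytes = fs.getContentSummary(new Path("hdfs:///input")).getLength

    val sampleFraction     = 0.1                    // matches the sample() step
    val inMemoryExpansion  = 3.0                    // deserialized objects vs. on-disk bytes (a guess)
    val perPartitionBudget = 512L * 1024 * 1024     // aim for ~512 MB cached per partition

    val estBytes = inputBytes * sampleFraction * inMemoryExpansion
    val targetPartitions = math.max(1, math.ceil(estBytes / perPartitionBudget).toInt)

    val coalesced = sampled.coalesce(targetPartitions)   // 'sampled' as in the original job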



On 02/12/2016 05:10 PM, Koert Kuipers wrote:
in spark, every partition needs to fit in the memory available to the 
core processing it.


as you coalesce you reduce number of partitions, increasing partition 
size. at some point the partition no longer fits in memory.


On Fri, Feb 12, 2016 at 4:50 PM, Silvio Fiorito wrote:

Coalesce essentially reduces parallelism, so fewer cores are getting more
records. Be aware that it could also lead to loss of data locality,
depending on how far you reduce. Depending on what you’re doing in the map
operation, it could lead to OOM errors. Can you give more details as to
what the code for the map looks like?


On 2/12/16, 1:13 PM, "Christopher Brady" wrote:

>Can anyone help me understand why using coalesce causes my executors to
>crash with out of memory? What happens during coalesce that increases
>memory usage so much?
>
>If I do:
>hadoopFile -> sample -> cache -> map -> saveAsNewAPIHadoopFile
>
>everything works fine, but if I do:
>hadoopFile -> sample -> coalesce -> cache -> map -> saveAsNewAPIHadoopFile
>
>my executors crash with out of memory exceptions.
>
>Is there any documentation that explains what causes the increased
>memory requirements with coalesce? It seems to be less of a problem if I
>coalesce into a larger number of partitions, but I'm not sure why this
>is. How would I estimate how much additional memory the coalesce requires?
>
>Thanks.
>
>-
>To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>For additional commands, e-mail: user-h...@spark.apache.org
>