Re: Why dataframe.persist(StorageLevels.MEMORY_AND_DISK_SER) hangs for long time?

2015-10-10 Thread Ted Yu
bq. all sorts of optimizations like Tungsten

For Tungsten, please use the 1.5.1 release.
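
A minimal sketch of the relevant switch there (an assumption based on the
1.5.x configuration key spark.sql.tungsten.enabled, which defaults to true,
so setting it is only for explicitness; the app name is made up):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hedged sketch for Spark 1.5.x: Tungsten's code generation and binary
// memory management are on by default; the flag is set here explicitly.
val conf = new SparkConf()
  .setAppName("TungstenExample") // hypothetical app name
  .set("spark.sql.tungsten.enabled", "true") // default is true in 1.5.x

val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)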

On Sat, Oct 10, 2015 at 6:24 PM, Alex Rovner wrote:

> How many executors are you running with? How many nodes in your cluster?


Re: Why dataframe.persist(StorageLevels.MEMORY_AND_DISK_SER) hangs for long time?

2015-10-10 Thread Alex Rovner
How many executors are you running with? How many nodes in your cluster?


-- 
*Alex Rovner*
*Director, Data Engineering*
*o:* 646.759.0052



Re: Why dataframe.persist(StorageLevels.MEMORY_AND_DISK_SER) hangs for long time?

2015-10-10 Thread Umesh Kacha
Hi Alex, thanks for the response. I am using 40 executors with 30 GB each,
including 5 GB of memoryOverhead, and 4 cores. My cluster has around 100
nodes with 30 GB and 8 cores each.
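
For reference, a sketch of that sizing expressed as Spark 1.x properties
(the keys below are the standard YARN ones; 25g of heap plus the 5 GB
overhead gives the 30 GB per executor mentioned above, and all values are
assumptions drawn only from this message):

import org.apache.spark.SparkConf

// Hedged sketch of the executor sizing described above.
val conf = new SparkConf()
  .set("spark.executor.instances", "40")   // 40 executors
  .set("spark.executor.cores", "4")        // 4 cores per executor
  .set("spark.executor.memory", "25g")     // 30 GB total minus the 5 GB overhead
  .set("spark.yarn.executor.memoryOverhead", "5120") // in MB; off-heap overhead for YARN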
On Oct 11, 2015 06:54, "Alex Rovner" wrote:

> How many executors are you running with? How many nodes in your cluster?


Why dataframe.persist(StorageLevels.MEMORY_AND_DISK_SER) hangs for long time?

2015-10-08 Thread unk1102
Hi, as recommended I am caching my Spark job DataFrame with
dataframe.persist(StorageLevels.MEMORY_AND_DISK_SER), but what I see in the
Spark job UI is that this persist stage runs for a very long time, showing
10 GB of shuffle read and 5 GB of shuffle write. It takes too long to
finish, and because of that my Spark job sometimes throws a timeout or an
OOM, and the executors then get killed by YARN. I am using Spark 1.4.1. I
am using all sorts of optimizations like Tungsten and Kryo, and I have set
spark.storage.memoryFraction to 0.2 and spark.shuffle.memoryFraction to 0.2
as well. My data is huge, around 1 TB, and I am using the default 200
partitions for spark.sql.shuffle.partitions. Please help me, I am clueless;
please guide.
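
For anyone reproducing this, a minimal sketch of the setup described above
(Scala; StorageLevel.MEMORY_AND_DISK_SER is the Scala counterpart of the
Java StorageLevels constant, the input path and app name are hypothetical,
and the memory keys are the Spark 1.x legacy settings). Note that ~1 TB
shuffled across 200 partitions works out to roughly 5 GB per task:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.storage.StorageLevel

val conf = new SparkConf()
  .setAppName("PersistExample") // hypothetical app name
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.storage.memoryFraction", "0.2") // legacy (pre-1.6) cache fraction
  .set("spark.shuffle.memoryFraction", "0.2") // legacy (pre-1.6) shuffle fraction

val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
sqlContext.setConf("spark.sql.shuffle.partitions", "200") // the default; ~1 TB / 200 ≈ 5 GB per task

// Persist serialized in memory, spilling to disk when it does not fit.
val dataframe = sqlContext.read.parquet("/path/to/input") // hypothetical path
dataframe.persist(StorageLevel.MEMORY_AND_DISK_SER)
dataframe.count() // an action is needed to actually materialize the cache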



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Why-dataframe-persist-StorageLevels-MEMORY-AND-DISK-SER-hangs-for-long-time-tp24981.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
