Re: What should be the optimal value for spark.sql.shuffle.partitions?

2015-09-09 Thread Richard Marscher
I see you reposted with more details:

"I have 2 TB of
skewed data to process and then convert rdd into dataframe and use it as
table in hiveContext.sql(). I am using 60 executors with 20 GB memory and 4
cores."

If I'm reading that correctly, you have 2 TB of data and roughly 1.2 TB of
executor memory in the cluster (60 executors x 20 GB). I think that's a
fundamental problem up front, and skew will make the aggregation even worse.
To start, I think the data either needs to be broken down or the cluster
upgraded, unfortunately.
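
To put rough numbers on that, here is a back-of-envelope sketch in Scala; the
figures are the ones you quoted, and the per-partition values are plain
averages that ignore serialization, JVM overhead, and skew, so treat them as
illustrative only:

    // Back-of-envelope sizing using the figures from this thread (illustrative only).
    val dataBytes       = 2L * 1024 * 1024 * 1024 * 1024                     // ~2 TB of skewed input
    val numExecutors    = 60
    val executorMemGB   = 20L
    val clusterMemBytes = numExecutors * executorMemGB * 1024 * 1024 * 1024  // ~1.2 TB total

    println(f"Data is ${dataBytes / clusterMemBytes.toDouble}%.2fx the total executor memory")

    // Average bytes per shuffle partition at different values of
    // spark.sql.shuffle.partitions -- with skew, the largest partitions
    // can be far bigger than this average.
    Seq(200, 1000, 10000).foreach { parts =>
      println(s"$parts partitions -> ~${dataBytes / parts / (1024 * 1024)} MB per partition on average")
    }

Even at 1000 shuffle partitions, each of the 4 concurrent tasks on a 20 GB
executor would be averaging about 2 GB of raw input before deserialization
and aggregation overhead, and a skewed key could be far larger than that.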

On Wed, Sep 9, 2015 at 5:41 PM, Richard Marscher wrote:

> Do you have any details about the cluster you are running this against?
> The memory per executor/node, number of executors, and such? Even at a
> shuffle setting of 1000, that would be roughly 1 GB per partition, assuming
> the 1 TB of data already accounts for JVM overheads. Maybe try another
> order of magnitude more shuffle partitions and see where that gets you?
>
> On Tue, Sep 1, 2015 at 12:11 PM, unk1102  wrote:
>
>> Hi, I am using Spark SQL (specifically hiveContext.sql()) with group by
>> queries and I am running into OOM issues. I am thinking of increasing
>> spark.sql.shuffle.partitions from the default of 200 to 1000, but it is
>> not helping. Please correct me if I am wrong: these partitions share the
>> shuffle load, so the more partitions there are, the less data each one has
>> to hold. Please guide me; I am new to Spark. I am using Spark 1.4.0 and
>> have around 1 TB of uncompressed data to process with hiveContext.sql()
>> group by queries.
>>
>>
>>
>
>
> --
> *Richard Marscher*
> Software Engineer
> Localytics
>



-- 
*Richard Marscher*
Software Engineer
Localytics



Re: What should be the optimal value for spark.sql.shuffle.partitions?

2015-09-09 Thread Richard Marscher
Do you have any details about the cluster you are running this against? The
memory per executor/node, number of executors, and such? Even at a shuffle
setting of 1000, that would be roughly 1 GB per partition, assuming the 1 TB
of data already accounts for JVM overheads. Maybe try another order of
magnitude more shuffle partitions and see where that gets you?
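
In case it helps, a minimal Spark 1.4 Scala sketch of how that setting can be
changed before the query runs; the app name and the value 2000 are only
placeholders for "an order of magnitude higher", not recommendations:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    // In spark-shell, use the existing sc instead of creating a new context.
    val sc = new SparkContext(new SparkConf().setAppName("shuffle-partition-tuning"))
    val hiveContext = new HiveContext(sc)

    // Raise the number of shuffle partitions before the group-by runs so that
    // each reduce task handles less data; 2000 is just an example value.
    hiveContext.setConf("spark.sql.shuffle.partitions", "2000")

    // The equivalent SQL form also works:
    // hiveContext.sql("SET spark.sql.shuffle.partitions=2000")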

On Tue, Sep 1, 2015 at 12:11 PM, unk1102  wrote:

> Hi, I am using Spark SQL (specifically hiveContext.sql()) with group by
> queries and I am running into OOM issues. I am thinking of increasing
> spark.sql.shuffle.partitions from the default of 200 to 1000, but it is
> not helping. Please correct me if I am wrong: these partitions share the
> shuffle load, so the more partitions there are, the less data each one has
> to hold. Please guide me; I am new to Spark. I am using Spark 1.4.0 and
> have around 1 TB of uncompressed data to process with hiveContext.sql()
> group by queries.
>
>
>


-- 
*Richard Marscher*
Software Engineer
Localytics



What should be the optimal value for spark.sql.shuffle.partitions?

2015-09-01 Thread unk1102
Hi, I am using Spark SQL (specifically hiveContext.sql()) with group by
queries and I am running into OOM issues. I am thinking of increasing
spark.sql.shuffle.partitions from the default of 200 to 1000, but it is not
helping. Please correct me if I am wrong: these partitions share the shuffle
load, so the more partitions there are, the less data each one has to hold.
Please guide me; I am new to Spark. I am using Spark 1.4.0 and have around
1 TB of uncompressed data to process with hiveContext.sql() group by queries.
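
For reference, a stripped-down sketch of the kind of job I am describing
(Spark 1.4, Scala). The table name, columns, and sample data here are made
up; the flow is to register a DataFrame as a table, set
spark.sql.shuffle.partitions, and run the group-by through hiveContext.sql():

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)   // sc: the existing SparkContext, e.g. in spark-shell
    import hiveContext.implicits._

    // Hypothetical input: an RDD of (key, value) pairs turned into a DataFrame
    // and registered as a temporary table for SQL queries.
    val rdd = sc.parallelize(Seq(("a", 1L), ("b", 2L), ("a", 3L)))
    val df  = rdd.toDF("key", "value")
    df.registerTempTable("events")

    // More shuffle partitions means each reduce task holds less data during
    // the group-by; 1000 matches the value being tried here.
    hiveContext.setConf("spark.sql.shuffle.partitions", "1000")

    val result = hiveContext.sql(
      "SELECT key, SUM(value) AS total FROM events GROUP BY key")
    result.show()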



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/What-should-be-the-optimal-value-for-spark-sql-shuffle-partition-tp24543.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org