Re: What should be the optimal value for spark.sql.shuffle.partition?

Richard Marscher Wed, 09 Sep 2015 15:02:31 -0700

I see you reposted with more details:

"I have 2 TB of skewed data to process and then convert rdd into dataframe
and use it as table in hiveContext.sql(). I am using 60 executors with 20 GB
memory and 4 cores."


If I'm reading that correctly, you have 2 TB of data and 1.2 TB of memory in
the cluster (60 executors x 20 GB). I think that's a fundamental problem up
front. Skew will make it even worse for aggregation, since all rows for a hot
key end up in the same shuffle partition. To start, I think either the data
needs to be broken down into smaller pieces or the cluster needs to be
upgraded, unfortunately.
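
To make the arithmetic concrete, here is a rough back-of-the-envelope sketch in plain Python (no Spark required). The input figures are the ones quoted in this thread. The function names are made up for illustration, decimal units are assumed (1 TB = 1000 GB), and the calculation ignores JVM overhead and Spark's internal memory fractions, so treat the results as optimistic upper bounds rather than real measurements.

```python
# Back-of-the-envelope sizing; the input figures are the ones quoted in this thread.
# Assumption: decimal units (1 TB = 1000 GB) to match the rough numbers above.
GB = 1
TB = 1000 * GB

data_size = 2 * TB
num_executors = 60
executor_memory = 20 * GB
cores_per_executor = 4


def cluster_memory(executors, memory_per_executor):
    """Total executor heap in GB (ignores JVM overhead and Spark's memory fractions)."""
    return executors * memory_per_executor


def partition_size(total_data, num_partitions):
    """Average shuffle partition size in GB, assuming perfectly even keys.

    With skewed keys the largest partition will be bigger than this average.
    """
    return total_data / num_partitions


def memory_per_task(memory_per_executor, cores):
    """Upper bound on heap (GB) per concurrently running task on one executor."""
    return memory_per_executor / cores


total_memory = cluster_memory(num_executors, executor_memory)
print(f"cluster memory: {total_memory} GB for {data_size} GB of data")
print(f"heap per concurrent task: at most {memory_per_task(executor_memory, cores_per_executor)} GB")
for n in (200, 1000, 10000):
    print(f"{n} shuffle partitions -> about {partition_size(data_size, n):.1f} GB each")
```

Under these assumptions, each task has at most about 5 GB of heap to work with, so the average partition needs to be well below that, and a skewed key can still push a single partition far above the average. If you want to try a higher partition count, the property is spelled spark.sql.shuffle.partitions (plural) and can be set with a SQL statement such as `SET spark.sql.shuffle.partitions=10000`.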

On Wed, Sep 9, 2015 at 5:41 PM, Richard Marscher <rmarsc...@localytics.com>
wrote:

> Do you have any details about the cluster you are running this against,
> such as the memory per executor/node and the number of executors? Even at a
> shuffle setting of 1000, that would be roughly 1 GB per partition, assuming
> the 1 TB of data includes overheads in the JVM. Maybe try another order of
> magnitude higher for the number of shuffle partitions and see where that
> gets you?
>
> On Tue, Sep 1, 2015 at 12:11 PM, unk1102 <umesh.ka...@gmail.com> wrote:
>
>> Hi, I am using Spark SQL, specifically hiveContext.sql() with group by
>> queries, and I am running into OOM issues. I tried increasing
>> spark.sql.shuffle.partitions from the default of 200 to 1000, but it is
>> not helping. Please correct me if I am wrong: the partitions share the
>> shuffle load, so the more partitions there are, the less data each one has
>> to hold. Please guide me, I am new to Spark. I am using Spark 1.4.0 and
>> have around 1 TB of uncompressed data to process using hiveContext.sql()
>> group by queries.



-- 
*Richard Marscher*
Software Engineer
Localytics
