Hi, I have a couple of Spark jobs that run a GROUP BY query fired through hiveContext.sql(). I know GROUP BY can be expensive, but in my use case I can't avoid it: I have around 7-8 fields that I need to group by. I am also using df1.except(df2), which likewise seems to be a heavy operation and does a lot of shuffling. Please see my UI snapshot: <http://apache-spark-user-list.1001560.n3.nabble.com/file/n24914/IMG_20151003_151830218.jpg>
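For reference, here is roughly the shape of the job (a minimal sketch; the table and column names are placeholders, not my real schema):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf())
val hiveContext = new HiveContext(sc)

// GROUP BY on 7-8 fields, fired through hiveContext.sql()
val df1 = hiveContext.sql(
  """SELECT f1, f2, f3, f4, f5, f6, f7, f8, COUNT(*) AS cnt
    |FROM   table_a
    |GROUP BY f1, f2, f3, f4, f5, f6, f7, f8""".stripMargin)

val df2 = hiveContext.sql(
  """SELECT f1, f2, f3, f4, f5, f6, f7, f8, COUNT(*) AS cnt
    |FROM   table_b
    |GROUP BY f1, f2, f3, f4, f5, f6, f7, f8""".stripMargin)

// Set difference: de-duplicates and shuffles all columns of both sides
val rowsOnlyInA = df1.except(df2)
```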
I have tried almost every optimisation, including upgrading to Spark 1.5, but nothing seems to work: the job hangs or fails because an executor exceeds its physical memory limit and YARN kills it. I have around 1 TB of data to process, and it is skewed.
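For context, these are the kinds of knobs I have already been turning (the property names are real Spark/YARN confs, but the values are just examples of what I experimented with, not a recommendation):

```scala
import org.apache.spark.SparkConf

// Illustrative settings only; values are examples of what I tried.
val conf = new SparkConf()
  .set("spark.sql.shuffle.partitions", "2000")        // more, smaller shuffle partitions
  .set("spark.yarn.executor.memoryOverhead", "2048")  // headroom above the executor heap (MB)
  .set("spark.shuffle.memoryFraction", "0.5")         // give more of the heap to shuffles
```

Please guide.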