Hi, I have a couple of Spark jobs that run a GROUP BY query fired through hiveContext.sql(). I know GROUP BY can be expensive, but in my use case I can't avoid it: I have around 7-8 fields that I need to group by. I am also using df1.except(df2), which likewise seems to be a heavy operation and does a lot of shuffling. Please see my UI snapshot: <http://apache-spark-user-list.1001560.n3.nabble.com/file/n24914/IMG_20151003_151830218.jpg>
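For reference, here is roughly the shape of the job (a minimal sketch; the table and column names are placeholders, not my real schema):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf())
val hiveContext = new HiveContext(sc)

// GROUP BY on 7-8 fields, fired through hiveContext.sql()
val df1 = hiveContext.sql(
  """SELECT f1, f2, f3, f4, f5, f6, f7, f8, COUNT(*) AS cnt
    |FROM   table_a
    |GROUP BY f1, f2, f3, f4, f5, f6, f7, f8""".stripMargin)

val df2 = hiveContext.sql(
  """SELECT f1, f2, f3, f4, f5, f6, f7, f8, COUNT(*) AS cnt
    |FROM   table_b
    |GROUP BY f1, f2, f3, f4, f5, f6, f7, f8""".stripMargin)

// Set difference: de-duplicates and shuffles all columns of both sides
val rowsOnlyInA = df1.except(df2)
```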
I have tried almost every optimisation, including upgrading to Spark 1.5, but nothing seems to work: the job hangs or fails because an executor exceeds its physical memory limit and YARN kills it. I have around 1 TB of data to process, and it is skewed.
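For context, these are the kinds of knobs I have already been turning (the property names are real Spark/YARN confs, but the values are just examples of what I experimented with, not a recommendation):

```scala
import org.apache.spark.SparkConf

// Illustrative settings only; values are examples of what I tried.
val conf = new SparkConf()
  .set("spark.sql.shuffle.partitions", "2000")        // more, smaller shuffle partitions
  .set("spark.yarn.executor.memoryOverhead", "2048")  // headroom above the executor heap (MB)
  .set("spark.shuffle.memoryFraction", "0.5")         // give more of the heap to shuffles
```

Please guide.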