[ https://issues.apache.org/jira/browse/SPARK-21577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16106855#comment-16106855 ]
Hyukjin Kwon commented on SPARK-21577:
--------------------------------------

Please check out https://spark.apache.org/community.html.

> Issue is handling too many aggregations
> ----------------------------------------
>
> Key: SPARK-21577
> URL: https://issues.apache.org/jira/browse/SPARK-21577
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.6.0
> Environment: Cloudera CDH 1.8.3, Spark 1.6.0
> Reporter: Kannan Subramanian
>
> My requirement: read a table from Hive (size around 1.6 TB) and perform
> more than 200 aggregation operations, mostly avg, sum, and stddev. The
> Spark application's total execution time is more than 12 hours. To
> optimize the job I tuned shuffle partitioning and memory settings, but
> that did not help. Please note that the same query run in Hive on
> MapReduce completes in about 5 hours. Kindly let me know whether there is
> a way to optimize this, or a more efficient way of handling multiple
> aggregation operations.
>
> val inputDataDF = hiveContext.read.parquet("/inputparquetData")
> inputDataDF.groupBy("seq_no", "year", "month", "radius")
>   .agg(count($"Dseq"), avg($"Emp"), avg($"Ntw"), avg($"Age"),
>        avg($"DAll"), avg($"PAll"), avg($"DSum"), avg($"dol"),
>        sum($"sl"), sum($"PA"), sum($"DS")... like 200 columns)
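A list of ~200 aggregations like the one quoted above can be built programmatically and passed to a single `agg` call, so all aggregates are computed in one pass over the grouped data. A minimal sketch, assuming a Spark 1.6-style `hiveContext` and hypothetical column lists (`avgCols`, `sumCols`) standing in for the real ~200 columns:

{code}
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{avg, col, count, sum}

// Hypothetical column lists standing in for the ~200 real ones.
val avgCols = Seq("Emp", "Ntw", "Age", "DAll", "PAll", "DSum", "dol")
val sumCols = Seq("sl", "PA", "DS")

// Build every aggregate expression up front. Spark evaluates them all in a
// single shuffle/aggregation pass, so the number of expressions does not add
// extra passes over the data.
val aggExprs: Seq[Column] =
  count(col("Dseq")) +:
  (avgCols.map(c => avg(col(c))) ++ sumCols.map(c => sum(col(c))))

val inputDataDF = hiveContext.read.parquet("/inputparquetData")
val result = inputDataDF
  .groupBy("seq_no", "year", "month", "radius")
  .agg(aggExprs.head, aggExprs.tail: _*)
{code}

This only restructures the query; whether it closes the gap to the 5-hour MapReduce run depends on shuffle partitioning, input layout, and executor memory, which would need profiling.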