Mihir Sahu created SPARK-24650:
----------------------------------

             Summary: GroupingSet
                 Key: SPARK-24650
                 URL: https://issues.apache.org/jira/browse/SPARK-24650
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.3.1
         Environment: CDH 5.X, Spark 2.3
            Reporter: Mihir Sahu

If grouping sets are used in Spark SQL, the resulting plan does not perform optimally. If the input to the grouping sets is x rows and there are y grouping sets, the number of rows processed is currently x*y.

Example: let a DataFrame have columns col1, col2, col3 and col4, let the number of rows be rowNo, and let the grouping sets be:
(1) col1, col2, col3
(2) col2, col4
(3) col1, col2

The volume of data processed in this case is 3 * (rowNo * size of each row).

Is this the optimal way of processing the data? If some of the y grouping sets are derivable from each other (here, set (3) is a subset of set (1)), can we reduce the volume processed by dropping columns as we move down to the lower-dimensional groupings?

Currently, when computing percentiles over grouping sets, a lot of data seems to be processed, which causes performance issues. We need to look at whether this can be optimised.
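
To make the x*y arithmetic concrete, here is a minimal pure-Python sketch (not Spark code). It assumes the plan copies every input row once per grouping set and nulls out the columns that are not in that set, which is the behaviour described above. The helper names expand and derivable_pairs and the gid field are hypothetical, introduced only for illustration.

```python
from itertools import product

COLUMNS = ("col1", "col2", "col3", "col4")

# The three grouping sets from the example above.
GROUPING_SETS = [
    ("col1", "col2", "col3"),
    ("col2", "col4"),
    ("col1", "col2"),
]


def expand(rows, grouping_sets):
    """Emit one copy of every input row per grouping set.

    Columns outside the grouping set are replaced by None, and a
    hypothetical 'gid' field records which set the copy belongs to.
    """
    out = []
    for row in rows:
        for gid, gset in enumerate(grouping_sets):
            projected = {c: (row[c] if c in gset else None) for c in COLUMNS}
            projected["gid"] = gid
            out.append(projected)
    return out


def derivable_pairs(grouping_sets):
    """Return (child, parent) index pairs where the child's columns
    are a strict subset of the parent's columns."""
    return [
        (i, j)
        for i, child in enumerate(grouping_sets)
        for j, parent in enumerate(grouping_sets)
        if i != j and set(child) < set(parent)
    ]


# 16 toy input rows: every 0/1 combination of the four columns.
rows = [dict(zip(COLUMNS, values)) for values in product(range(2), repeat=4)]
expanded = expand(rows, GROUPING_SETS)

print(len(rows), len(expanded))        # 16 48
print(derivable_pairs(GROUPING_SETS))  # [(2, 0)]
```

With 16 input rows and 3 grouping sets, 48 rows flow into the aggregation. The only derivable pair is (3) from (1), using 0-based indices 2 and 0 in the output. Whether an aggregate such as percentile can actually be rolled up from the finer grouping is a separate question that this sketch does not address.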