lichenglin created SPARK-13999:
----------------------------------

             Summary: Run 'group by'  before building cube
                 Key: SPARK-13999
                 URL: https://issues.apache.org/jira/browse/SPARK-13999
             Project: Spark
          Issue Type: Improvement
            Reporter: lichenglin


I am trying to build a cube on a data set which has about 1 billion rows.
The cube has 7 dimensions.
It takes a whole day to finish the job with 16 cores.

Then I ran 'select count(1) from table group by A,B,C,D,E,F,G' first
and built the cube on the result of that 'group by'.
The cube uses the same dimensions as the 'group by' and does a sum on 'count'.
It needed just 45 minutes.

The 'group by' reduces the data set from billions of rows to millions.
How much it shrinks depends on the number of dimensions.
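
A small sketch of why the two approaches give the same counts. It uses plain
Python rather than Spark, the sample rows and the helper name cube_counts are
made up for illustration, and None stands for a rolled-up dimension (so it
assumes the data itself contains no None values).

```python
from collections import Counter
from itertools import combinations


def cube_counts(rows, weights=None):
    """Total the weight of the rows for every subset of dimensions.

    A dimension that is rolled up is represented by None in the key.
    Without weights, every row counts as 1.
    """
    out = Counter()
    for i, row in enumerate(rows):
        w = 1 if weights is None else weights[i]
        n = len(row)
        for k in range(n + 1):
            for keep in combinations(range(n), k):
                key = tuple(row[j] if j in keep else None for j in range(n))
                out[key] += w
    return out


# Hypothetical raw data with two dimensions (A, B).
raw = [("a", "x"), ("a", "x"), ("a", "y"), ("b", "x"), ("a", "x")]

# Step 1: the 'group by' on all dimensions, keeping count(1) per group.
grouped = Counter(raw)

# Step 2: build the cube on the grouped rows, summing the counts.
keys = list(grouped)
cube_from_grouped = cube_counts(keys, [grouped[k] for k in keys])

# Baseline: build the cube directly on the raw rows.
cube_from_raw = cube_counts(raw)

print(cube_from_grouped == cube_from_raw)
```

The cube expands every input row into 2^n keys (128 for 7 dimensions), so
shrinking the input before that expansion is where the saving comes from.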

Spark could try doing this in a new version.

Averaging may be more complex. The sum and the count should both be computed
during the 'group by', so that the average can be derived from them afterwards.
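
A minimal sketch of the averaging case, again in plain Python with made-up
sample rows and variable names. It keeps (sum, count) per group and only
divides at the end, and compares that with averaging the per-group averages.

```python
from collections import defaultdict

# Hypothetical raw rows: (A, B, value).
raw = [("a", "x", 10.0), ("a", "x", 20.0), ("a", "y", 60.0), ("b", "x", 5.0)]

# 'group by' A, B: keep the sum and the count, not the average.
partial = defaultdict(lambda: [0.0, 0])
for a, b, v in raw:
    partial[(a, b)][0] += v
    partial[(a, b)][1] += 1

# Roll up to A alone by adding sums and counts, then divide once.
rolled = defaultdict(lambda: [0.0, 0])
for (a, _b), (s, c) in partial.items():
    rolled[a][0] += s
    rolled[a][1] += c
avg_by_a = {a: s / c for a, (s, c) in rolled.items()}

# For comparison: averaging the per-group averages ignores group sizes.
per_group_avgs = defaultdict(list)
for (a, _b), (s, c) in partial.items():
    per_group_avgs[a].append(s / c)
naive_avg_by_a = {a: sum(v) / len(v) for a, v in per_group_avgs.items()}

print(avg_by_a["a"], naive_avg_by_a["a"])
```

For A = 'a' the first approach matches the average over the raw rows, while
the average of averages does not, because the groups have different sizes.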

  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

