[jira] [Updated] (HIVE-4676) Optimize COUNT(*) aggregate over vectorized ORC execution path

Eric Hanson (JIRA) Mon, 10 Jun 2013 16:01:43 -0700

     [ 
https://issues.apache.org/jira/browse/HIVE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Eric Hanson updated HIVE-4676:
------------------------------

    Description: 
The COUNT\(*\) aggregate with the vectorized execution path over ORC should be 
optimized because it is a very common case. The same problem described below 
also happens with COUNT(1) and that should be optimized as well.

Given a table factsqlengineam_vec_orc with about 25 columns and 218 million 
rows, this query

select count\(*\) from factsqlengineam_vec_orc;

runs in 2 minutes 15 seconds, with HDFS Read 2,000,078,555

and this query

select count(mrowflag) from factsqlengineam_vec_orc;

runs in 42 seconds, with HDFS Read 1,207,855

Because the column mrowflag is non-null, both queries return the same result.

We should optimize count\(\*\) so that it, say, chooses the most-compressed 
column from the ORC file (or even a single random column) and counts those 
values, but logically counts null values too so the meaning is the same as 
count\(*\). The vectorized iterator should not have to load all columns, just 
one column minimum, and any columns being filtered in the WHERE clause.

For scalar count\(*\) aggregates (i.e. without group-by) we can simply tally up 
the total number of remaining rows in each batch, without even looking at the 
data. Maybe we're already doing that but we are reading more data than 
necessary now.

The query 

select count\(\*\) from factsqlengineam_vec_orc where mrowflag > 0;

Takes only 41 seconds and also reads 1,207,855 bytes, so it appears that when 
there is no WHERE clause, more data is read.

  was:
The COUNT\(*\) aggregate with the vectorized execution path over ORC should be 
optimized because it is a very common case.

Given a table factsqlengineam_vec_orc with about 25 columns and 218 million 
rows, this query

select count\(*\) from factsqlengineam_vec_orc;

runs in 2 minutes 15 seconds, with HDFS Read 2,000,078,555

and this query

select count(mrowflag) from factsqlengineam_vec_orc;

runs in 42 seconds, with HDFS Read 1,207,855

Because the column mrowflag is non-null, both queries return the same result.

We should optimize count\(\*\) so that it, say, chooses the most-compressed 
column from the ORC file (or even a single random column) and counts those 
values, but logically counts null values too so the meaning is the same as 
count\(*\). The vectorized iterator should not have to load all columns, just 
one column minimum, and any columns being filtered in the WHERE clause.

For scalar count\(*\) aggregates (i.e. without group-by) we can simply tally up 
the total number of remaining rows in each batch, without even looking at the 
data. Maybe we're already doing that but we are reading more data than 
necessary now.

The query 

select count\(\*\) from factsqlengineam_vec_orc where mrowflag > 0;

Takes only 41 seconds and also reads 1,207,855 bytes, so it appears that when 
there is no WHERE clause, more data is read.

    
> Optimize COUNT(*) aggregate over vectorized ORC execution path
> --------------------------------------------------------------
>
>                 Key: HIVE-4676
>                 URL: https://issues.apache.org/jira/browse/HIVE-4676
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Query Processor
>    Affects Versions: vectorization-branch
>            Reporter: Eric Hanson
>
> The COUNT\(*\) aggregate with the vectorized execution path over ORC should 
> be optimized because it is a very common case. The same problem described 
> below also happens with COUNT(1) and that should be optimized as well.
> Given a table factsqlengineam_vec_orc with about 25 columns and 218 million 
> rows, this query
> select count\(*\) from factsqlengineam_vec_orc;
> runs in 2 minutes 15 seconds, with HDFS Read 2,000,078,555
> and this query
> select count(mrowflag) from factsqlengineam_vec_orc;
> runs in 42 seconds, with HDFS Read 1,207,855
> Because the column mrowflag is non-null, both queries return the same result.
> We should optimize count\(\*\) so that it, say, chooses the most-compressed 
> column from the ORC file (or even a single random column) and counts those 
> values, but logically counts null values too so the meaning is the same as 
> count\(*\). The vectorized iterator should not have to load all columns, just 
> one column minimum, and any columns being filtered in the WHERE clause.
> For scalar count\(*\) aggregates (i.e. without group-by) we can simply tally 
> up the total number of remaining rows in each batch, without even looking at 
> the data. Maybe we're already doing that but we are reading more data than 
> necessary now.
> The query 
> select count\(\*\) from factsqlengineam_vec_orc where mrowflag > 0;
> Takes only 41 seconds and also reads 1,207,855 bytes, so it appears that when 
> there is no WHERE clause, more data is read.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HIVE-4676) Optimize COUNT(*) aggregate over vectorized ORC execution path

Reply via email to