Eric Hanson created HIVE-4676:
---------------------------------
Summary: Optimize COUNT(*) aggregate over vectorized ORC execution path
Key: HIVE-4676
URL: https://issues.apache.org/jira/browse/HIVE-4676
Project: Hive
Issue Type: Sub-task
Components: Query Processor
Affects Versions: vectorization-branch
Reporter: Eric Hanson
The COUNT(*) aggregate with the vectorized execution path over ORC should be
optimized because it is a very common case.
Given a table factsqlengineam_vec_orc with about 25 columns and 218 million
rows, this query

    select count(*) from factsqlengineam_vec_orc;

runs in 2 minutes 15 seconds, and this query

    select count(mrowflag) from factsqlengineam_vec_orc;

runs in 42 seconds.
Because the column mrowflag is non-null, both queries return the same result.
We should optimize count(*) so that it chooses, say, the most-compressed column
from the ORC file (or even a single arbitrary column) and counts its entries,
nulls included, so the result is still the same as count(*). The vectorized
iterator should not have to load all columns: one column at minimum, plus any
columns that are filtered on in the WHERE clause.
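The column-selection idea above can be sketched as follows. This is only an illustration in Python, not Hive code: the function names, the dictionary of per-column compressed sizes, and the use of None to stand for NULL are all assumptions made for the example.

```python
def columns_to_read(compressed_sizes, filter_columns):
    """Pick the columns a COUNT(*) scan needs to load (hypothetical helper).

    compressed_sizes: dict mapping column name -> compressed size in bytes
    filter_columns:   set of column names referenced by the WHERE clause
    """
    if filter_columns:
        # Filtered columns must be read anyway, and any one of them is
        # enough to establish how many rows there are.
        return sorted(filter_columns)
    # No filter: read only the column with the smallest compressed size.
    cheapest = min(compressed_sizes, key=compressed_sizes.get)
    return [cheapest]


def count_star(column_values):
    """COUNT(*) computed from one column: NULLs (None) count as rows."""
    return len(column_values)


def count_column(column_values):
    """COUNT(col) for comparison: NULLs (None) are skipped."""
    return sum(1 for value in column_values if value is not None)
```

The two counting functions differ only when the chosen column contains nulls, which is why count(*) has to count entries rather than non-null values.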
For scalar count(*) aggregates (i.e. without group-by) we can simply tally up
the number of rows remaining in each batch after filtering, without even
looking at the data. Maybe we're already doing that, but something else is
taking up the extra time.
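A minimal sketch of the scalar case, again in Python rather than Hive's Java. The Batch class is a hypothetical stand-in for a vectorized row batch, and its size field is assumed to hold the number of rows that survived filtering.

```python
from dataclasses import dataclass, field


@dataclass
class Batch:
    """Hypothetical stand-in for a vectorized row batch."""
    size: int                                    # rows remaining after filtering
    columns: dict = field(default_factory=dict)  # column data, unused by the count


def scalar_count_star(batches):
    """Scalar COUNT(*): sum the batch sizes without reading any column data."""
    total = 0
    for batch in batches:
        total += batch.size
    return total
```

For example, scalar_count_star([Batch(1024), Batch(1024), Batch(300)]) adds up the three sizes and never touches the columns dictionary.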