Quanlong Huang created ORC-1144:
-----------------------------------
Summary: [C++] Provide options to trim duplicated values for reader
Key: ORC-1144
URL: https://issues.apache.org/jira/browse/ORC-1144
Project: ORC
Issue Type: New Feature
Components: C++
Reporter: Quanlong Huang
For count-distinct queries, clients only need the distinct values of a
column. E.g.
{code:sql}
select count(distinct shipmode) from tpch.lineitem
{code}
Column readers can make a best effort to trim duplicated values before
returning them. It'd be nice to have an option/indicator for this purpose.
For dictionary-encoded string columns, column readers just need to materialize
the dictionary.
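A minimal sketch of the dictionary case, to make the saving concrete. DictionaryColumn, readAll and readDistinct are hypothetical names for illustration, not part of the ORC C++ API. It assumes the dictionary and the index stream of one stripe are already decoded, and that every dictionary entry is referenced by at least one row.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical, simplified view of a dictionary-encoded string column
// in a single stripe (not an ORC C++ type).
struct DictionaryColumn {
  std::vector<std::string> dictionary;  // decoded dictionary entries
  std::vector<uint32_t> indices;        // one dictionary index per non-null row
};

// Regular read: expand every index into its string, one value per row.
std::vector<std::string> readAll(const DictionaryColumn& col) {
  std::vector<std::string> out;
  out.reserve(col.indices.size());
  for (uint32_t idx : col.indices) {
    out.push_back(col.dictionary.at(idx));
  }
  return out;
}

// Distinct read: the index stream is never touched, only the
// dictionary entries are returned.
std::vector<std::string> readDistinct(const DictionaryColumn& col) {
  return col.dictionary;
}
```

Since dictionaries are per stripe, the client would still have to deduplicate across stripes; the reader only reduces how many values it hands over.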
For numeric columns that only have PRESENT and DATA streams, the DATA stream
decoder can make good use of the encoding types, e.g. a SHORT_REPEAT run in
RLEv2 only needs to produce its value once. The PRESENT stream can be skipped
as well, since count(distinct) ignores NULLs.
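A rough sketch of how a decoder could exploit repeat runs. RepeatRun, decodeAll and decodeTrimmed are hypothetical names, not part of the ORC C++ RLE decoders. It models only runs of a single repeated value, which is what SHORT_REPEAT encodes, and ignores the other RLEv2 sub-encodings.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical model of one repeat run: `value` repeated `count` times.
struct RepeatRun {
  int64_t value;
  uint32_t count;
};

// Regular decode: expand each run into `count` copies of its value.
std::vector<int64_t> decodeAll(const std::vector<RepeatRun>& runs) {
  std::vector<int64_t> out;
  for (const RepeatRun& run : runs) {
    out.insert(out.end(), run.count, run.value);
  }
  return out;
}

// Trimmed decode: emit each run's value once, and also drop it if it
// equals the previously emitted value. This is best effort only:
// duplicates from non-adjacent runs are still returned.
std::vector<int64_t> decodeTrimmed(const std::vector<RepeatRun>& runs) {
  std::vector<int64_t> out;
  for (const RepeatRun& run : runs) {
    if (run.count == 0) continue;
    if (out.empty() || out.back() != run.value) {
      out.push_back(run.value);
    }
  }
  return out;
}
```

The trimmed output is a superset of the distinct values, so the client still performs the final deduplication, just on far fewer values.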
Similar to ORC-1143 and ORC-450, we can extend the ReadIntent to indicate what
the users want to do with the results. E.g. adding a ReadIntent_DISTINCT type.
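One possible shape for the extension, as a stand-alone sketch that does not include the ORC headers. The enum and map below are stand-ins assumed to resemble the existing ReadIntent definitions. ReadIntent_DISTINCT, its numeric value, and the mayTrimDuplicates helper are hypothetical.

```cpp
#include <cstdint>
#include <map>

// Stand-in for the reader's intent enum (assumed shape).
enum ReadIntent {
  ReadIntent_ALL = 0,
  ReadIntent_OFFSETS = 1,
  ReadIntent_DISTINCT = 2  // hypothetical: caller only needs distinct values
};

// Per-type intents keyed by type id (assumed shape).
using IdReadIntentMap = std::map<uint64_t, ReadIntent>;

// Hypothetical helper a column reader could consult before deciding
// whether it is allowed to drop duplicated values for a given type.
bool mayTrimDuplicates(const IdReadIntentMap& intents, uint64_t typeId) {
  auto it = intents.find(typeId);
  return it != intents.end() && it->second == ReadIntent_DISTINCT;
}
```

A type without an entry keeps the regular behavior, so existing callers would be unaffected.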