Quanlong Huang created ORC-1144:
-----------------------------------

             Summary: [C++] Provide options to trim duplicated values for reader
                 Key: ORC-1144
                 URL: https://issues.apache.org/jira/browse/ORC-1144
             Project: ORC
          Issue Type: New Feature
          Components: C++
            Reporter: Quanlong Huang


For count-distinct queries, clients only need the distinct values of a 
column. E.g.
{code:sql}
select count(distinct shipmode) from tpch.lineitem{code}
Column readers can make a best effort to trim duplicate values in advance. It'd 
be nice to have an option/indicator for this purpose.

For dictionary-encoded string columns, column readers just need to materialize 
the dictionary.
For numeric columns that only have PRESENT and DATA streams, the DATA stream 
decoder can make good use of the encoding types, e.g. SHORT_REPEAT in RLEv2. 
The PRESENT stream can be skipped as well.
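
To illustrate the numeric case, here is a minimal sketch of collecting distinct 
values run by run instead of value by value. RepeatRun and distinctFromRuns are 
hypothetical names made up for this example; they are not part of the ORC C++ 
library, and a real RLEv2 decoder also has to handle the non-repeating 
sub-encodings.

```cpp
#include <cstdint>
#include <unordered_set>
#include <vector>

// Hypothetical, simplified view of one decoded run: `value` repeated
// `length` times (roughly what a SHORT_REPEAT run describes).
// This is an assumption for illustration, not the library's internal type.
struct RepeatRun {
  int64_t value;
  uint64_t length;
};

// Collect distinct values by touching each run once, without expanding
// the run into `length` copies first.
std::unordered_set<int64_t> distinctFromRuns(const std::vector<RepeatRun>& runs) {
  std::unordered_set<int64_t> distinct;
  for (const RepeatRun& run : runs) {
    if (run.length > 0) {
      distinct.insert(run.value);
    }
  }
  return distinct;
}
```

With runs {7 x 5, 3 x 10, 7 x 4}, the sketch does three inserts and returns 
{3, 7}, rather than materializing 19 values and deduplicating them afterwards.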

Similar to ORC-1143 and ORC-450, we can extend the ReadIntent to indicate what 
users want to do with the results. E.g. adding a ReadIntent_DISTINCT type.
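
A rough sketch of how such an intent could drive a dictionary-encoded string 
column is below. Everything in the sketch namespace is an assumption for 
illustration: the enum is a local stand-in that only models the two cases 
needed here, and DictionaryColumn and readColumn are hypothetical, not existing 
ORC C++ APIs.

```cpp
#include <cstdint>
#include <string>
#include <vector>

namespace sketch {

// Local stand-in for the reader's intent enum. ReadIntent_DISTINCT is the
// value proposed in this issue; it does not exist in the library yet.
enum ReadIntent { ReadIntent_ALL = 0, ReadIntent_DISTINCT = 1 };

// Hypothetical in-memory form of a dictionary-encoded string column:
// each row stores an index into `dictionary`.
struct DictionaryColumn {
  std::vector<std::string> dictionary;
  std::vector<uint32_t> indexes;
};

// With the DISTINCT intent only the dictionary is returned; otherwise every
// row is expanded through its dictionary index.
std::vector<std::string> readColumn(const DictionaryColumn& col, ReadIntent intent) {
  if (intent == ReadIntent_DISTINCT) {
    return col.dictionary;
  }
  std::vector<std::string> rows;
  rows.reserve(col.indexes.size());
  for (uint32_t idx : col.indexes) {
    rows.push_back(col.dictionary.at(idx));
  }
  return rows;
}

}  // namespace sketch
```

For a column with dictionary {"AIR", "MAIL", "SHIP"} and five rows, the 
DISTINCT intent returns the three dictionary entries without reading the 
per-row indexes at all.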



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
