Quanlong Huang created ORC-1144:
-----------------------------------

             Summary: [C++] Provide options to trim duplicated values for reader
                 Key: ORC-1144
                 URL: https://issues.apache.org/jira/browse/ORC-1144
             Project: ORC
          Issue Type: New Feature
          Components: C++
            Reporter: Quanlong Huang

For count-distinct queries, clients only need the distinct values of a column. E.g.
{code:sql}
select count(distinct shipmode) from tpch.lineitem
{code}
Column readers can make a best effort to drop duplicated values in advance. It would be nice to have an option/indicator for this purpose.

For dictionary-encoded string columns, column readers just need to materialize the dictionary. For numeric columns that only have PRESENT and DATA streams, the DATA stream decoder can take advantage of the encoding types, e.g. SHORT_REPEAT in RLEv2. The PRESENT stream can be skipped as well.

Similar to ORC-1143 and ORC-450, we can extend ReadIntent to indicate what users want to do with the results, e.g. by adding a ReadIntent_DISTINCT type.
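
To make the idea concrete, here is a minimal, self-contained sketch of the two cases described above. It does not use the ORC C++ library. The `Run` struct and the functions `distinctFromRuns` and `distinctFromDictionary` are hypothetical names invented for illustration, and the run model is a simplification of a SHORT_REPEAT-style run, not the actual RLEv2 decoder. The dictionary case also assumes that every dictionary entry is referenced by at least one row.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <string>
#include <vector>

// Hypothetical, simplified model of one decoded run: `count` copies of `value`.
// Real RLEv2 has several sub-encodings; this only mimics a SHORT_REPEAT-like run.
struct Run {
  int64_t value;
  uint64_t count;
};

// Collect distinct values by visiting each run once instead of each row.
// A run of N repeated values contributes a single insert, not N.
std::set<int64_t> distinctFromRuns(const std::vector<Run>& runs) {
  std::set<int64_t> distinct;
  for (const Run& run : runs) {
    assert(run.count > 0);  // an empty run carries no value
    distinct.insert(run.value);
  }
  return distinct;
}

// For a dictionary-encoded string column, the dictionary entries themselves
// are the candidate distinct values, so the per-row index data is never read.
// Assumption: every entry is referenced by at least one row.
std::set<std::string> distinctFromDictionary(
    const std::vector<std::string>& dictionary) {
  return std::set<std::string>(dictionary.begin(), dictionary.end());
}
```

With runs such as `{7, 5}, {3, 10}, {7, 2}`, the first function touches three runs rather than seventeen rows. Results from different stripes or row groups would still have to be merged by the caller, since the trimming is only best effort.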