This seems quite interesting.  Drill already does row group pruning, but
page-level pruning based on these indexes would be a big win.
Also, as you may know, Drill recently added a feature to leverage secondary
indexes in NoSQL databases [1].  However, we have to see whether
that capability applies to the Parquet index since the Parquet index is
local to each file.

Please create a JIRA and add your input into it.  Thanks.

[1] https://issues.apache.org/jira/browse/DRILL-6381

On Wed, Dec 12, 2018 at 10:30 AM Lou kevin <lou.kev...@gmail.com> wrote:

> Hi, I am a Drill user and use Parquet as the storage format.
> I have learned that some new features have been added to the latest
> Parquet format. The new column index feature seems very attractive. Is
> there any plan to support it in Drill?
>
> Thanks very much!
>
> Feature details:
> https://github.com/apache/parquet-format/blob/master/CHANGES.md#version-250
> See https://issues.apache.org/jira/browse/PARQUET-1201
>
> And the goals: make both range scans and point lookups I/O efficient by
> allowing direct access to pages based on their min and max values. In
> particular:
> 1. A single-row lookup in a rowgroup based on the sort column of that
> rowgroup will only read one data page per retrieved column. Range scans on
> the sort column will only need to read the exact data pages that contain
> relevant data.
> 2. Make other selective scans I/O efficient: if we have a very selective
> predicate on a non-sorting column, for the other retrieved columns we
> should only need to access data pages that contain matching rows.
> 3. No additional decoding effort for scans without selective predicates,
> e.g., full-row group scans. If a reader determines that it does not need to
> read the index data, it does not incur any overhead.
> 4. Index pages for sorted columns use minimal storage by storing only the
> boundary elements between pages.
>
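
To make the page-level pruning idea in the goals above concrete, here is a
minimal Python sketch. It is not Drill or parquet-mr code: `PageStats`,
`pages_to_read`, and `row_ranges` are hypothetical names, and the per-page
min/max values and row counts are made-up sample data. It only illustrates
how a reader could use per-page bounds to skip pages, and how the matching
row ranges could then limit the pages read for other columns.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PageStats:
    """Hypothetical per-page index entry: value bounds plus the rows it covers."""
    min_value: int
    max_value: int
    first_row: int
    row_count: int


def pages_to_read(index: List[PageStats], lo: int, hi: int) -> List[int]:
    """Return positions of pages whose [min, max] range overlaps [lo, hi]."""
    return [i for i, p in enumerate(index)
            if p.max_value >= lo and p.min_value <= hi]


def row_ranges(index: List[PageStats], page_ids: List[int]) -> List[Tuple[int, int]]:
    """Half-open row ranges covered by the selected pages.

    A reader could use these ranges to fetch only the overlapping pages
    of the other retrieved columns (goal 2 in the list above).
    """
    return [(index[i].first_row, index[i].first_row + index[i].row_count)
            for i in page_ids]


# Made-up index for one column chunk with three pages of 1000 rows each.
index = [
    PageStats(min_value=1, max_value=100, first_row=0, row_count=1000),
    PageStats(min_value=101, max_value=200, first_row=1000, row_count=1000),
    PageStats(min_value=201, max_value=300, first_row=2000, row_count=1000),
]

hits = pages_to_read(index, 150, 150)   # point lookup on value 150
ranges = row_ranges(index, hits)        # rows the other columns must cover
```

With sorted data, as in this sample, a point lookup selects a single page,
which matches goal 1. On unsorted data the same overlap test still works but
may select more pages.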
