parquet-format git commit: PARQUET-922: Add column indexes to parquet.thrift

blue Mon, 16 Oct 2017 16:48:00 -0700

Repository: parquet-format
Updated Branches:
  refs/heads/master 65f105707 -> f1de77d31



PARQUET-922: Add column indexes to parquet.thrift

I moved the design doc to a .md file and addressed the first round of review 
comments.

closes #63

This is based on work done by @mkornacker and @lekv who wrote the initial 
proposal and @poojanilangekar who evolved the design, wrote a prototypical 
implementation, and evaluated its performance.

Author: Lars Volker <l...@cloudera.com>
Author: poojanilangekar <nilangekar.po...@gmail.com>
Author: Lars Volker <lvol...@gmail.com>

Closes #72 from lekv/index and squashes the following commits:

babb356 [Lars Volker] Address comments from Marcel and Zoltan.
6897c2b [Lars Volker] Address Marcel's comments.
bbb3670 [Lars Volker] Reinstate PageIndex.md
ebcb33f [Lars Volker] Revert "Extend comments in parquet.thrift, remove 
PageIndex.md"
877e14c [Lars Volker] Revert "Remove picture"
5df2bbc [Lars Volker] Remove picture
a39bf49 [Lars Volker] Extend comments in parquet.thrift, remove PageIndex.md
9ea100a [Lars Volker] Address comments from Zoltan.
9f79d72 [Lars Volker] Merge branch 'master' into index
5e8ea1c [Lars Volker] Fix Typo
da6f648 [Lars Volker] Addressing more comments
8541da7 [Lars Volker] Addressing review comments from the Parquet sync meeting
8e3c533 [Lars Volker] More review comments
109b20d [Lars Volker] Address more review comments, clarify the description of 
ColumnIndex
f5bfe55 [Lars Volker] Address review comments on parquet.thrift.
700cc00 [Lars Volker] PARQUET-922: Add documentation on page indexes
f983794 [poojanilangekar] PARQUET-922: ColumnIndex Layout to Support Page 
Skipping


Project: http://git-wip-us.apache.org/repos/asf/parquet-format/repo
Commit: http://git-wip-us.apache.org/repos/asf/parquet-format/commit/f1de77d3
Tree: http://git-wip-us.apache.org/repos/asf/parquet-format/tree/f1de77d3
Diff: http://git-wip-us.apache.org/repos/asf/parquet-format/diff/f1de77d3

Branch: refs/heads/master
Commit: f1de77d31936f4d50f1286676a0034b6339918ee
Parents: 65f1057
Author: Lars Volker <l...@cloudera.com>
Authored: Mon Oct 16 16:47:12 2017 -0700
Committer: Ryan Blue <b...@apache.org>
Committed: Mon Oct 16 16:47:12 2017 -0700

----------------------------------------------------------------------
 Makefile                       |   7 +++
 PageIndex.md                   | 101 ++++++++++++++++++++++++++++++++++++
 README.md                      |   4 ++
 doc/images/PageIndexLayout.png | Bin 0 -> 7442 bytes
 src/main/thrift/parquet.thrift |  85 ++++++++++++++++++++++++++++++
 5 files changed, 197 insertions(+)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/parquet-format/blob/f1de77d3/Makefile
----------------------------------------------------------------------
diff --git a/Makefile b/Makefile
index d4cbf83..17750c1 100644
--- a/Makefile
+++ b/Makefile
@@ -17,7 +17,14 @@
 # under the License.
 #
 
+.PHONY: doc
+
 thrift:
        mkdir -p generated
        thrift --gen cpp -o generated src/main/thrift/parquet.thrift
        thrift --gen java -o generated src/main/thrift/parquet.thrift
+
+%.html: %.md
+       pandoc -f markdown_github -t html -o $@ $<
+
+doc: README.html PageIndex.html LogicalTypes.html

http://git-wip-us.apache.org/repos/asf/parquet-format/blob/f1de77d3/PageIndex.md
----------------------------------------------------------------------
diff --git a/PageIndex.md b/PageIndex.md
new file mode 100644
index 0000000..7ac6e42
--- /dev/null
+++ b/PageIndex.md
@@ -0,0 +1,101 @@
+<!--
+  - Licensed to the Apache Software Foundation (ASF) under one
+  - or more contributor license agreements.  See the NOTICE file
+  - distributed with this work for additional information
+  - regarding copyright ownership.  The ASF licenses this file
+  - to you under the Apache License, Version 2.0 (the
+  - "License"); you may not use this file except in compliance
+  - with the License.  You may obtain a copy of the License at
+  -
+  -   http://www.apache.org/licenses/LICENSE-2.0
+  -
+  - Unless required by applicable law or agreed to in writing,
+  - software distributed under the License is distributed on an
+  - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  - KIND, either express or implied.  See the License for the
+  - specific language governing permissions and limitations
+  - under the License.
+  -->
+
+# ColumnIndex Layout to Support Page Skipping
+
+This documents describes the format for column index pages in the Parquet
+footer. These pages contain statistics for DataPages and can be used to skip
+pages when scanning data in ordered and unordered columns.
+
+## Problem Statement
+In previous versions of the format, Statistics are stored for ColumnChunks in
+ColumnMetaData and for individual pages inside DataPageHeader structs. When
+reading pages, a reader had to process the page header in order to determine
+whether the page could be skipped based on the statistics. This means the 
reader
+had to access all pages in a column, thus likely reading most of the column
+data from disk.
+
+## Goals
+1. Make both range scans and point lookups I/O efficient by allowing direct
+   access to pages based on their min and max values. In particular:
+2. A single-row lookup in a rowgroup based on the sort column of that rowgroup
+   will only read one data page per retrieved column.
+    * Range scans on the sort column will only need to read the exact data
+      pages that contain relevant data.
+    * Make other selective scans I/O efficient: if we have a very selective
+      predicate on a non-sorting column, for the other retrieved columns we
+      should only need to access data pages that contain matching rows.
+3. No additional decoding effort for scans without selective predicates, e.g.,
+   full-row group scans. If a reader determines that it does not need to read
+   the index data, it does not incur any overhead.
+4. Index pages for sorted columns use minimal storage by storing only the
+   boundary elements between pages.
+
+## Non-Goals
+* Support for the equivalent of secondary indices, ie, an index structure
+  sorted on the key values over non-sorted data.
+
+
+## Technical Approach
+
+We add two new per-column structures to the row group metadata:
+* ColumnIndex: this allows navigation to the pages of a column based on column
+  values and is used to locate data pages that contain matching values for a
+  scan predicate
+* OffsetIndex: this allows navigation by row index and is used to retrieve
+  values for rows identified as matches via the ColumnIndex. Once rows of a
+  column are skipped, the corresponding rows in the other columns have to be
+  skipped. Hence the OffsetIndexes for each column in a RowGroup are stored
+  together.
+
+The new index structures are stored separately from RowGroup, near the footer,
+so that a reader does not have to pay the I/O and deserialization cost for
+reading the them if it is not doing selective scans. The index structures'
+location and length are stored in ColumnChunk.
+
+ ![Page Index Layout](doc/images/PageIndexLayout.png)
+
+Some observations:
+* We don't need to record the lower bound for the first page and the upper
+  bound for the last page, because the row group Statistics can provide that.
+  We still include those for the sake of uniformity, and the overhead should be
+  negligible.
+* We store lower and upper bounds for the values of each page. These may be the
+  actual minimum and maximum values found on a page, but can also be (more
+  compact) values that do not exist on a page. For example, instead of storing
+  ""Blart Versenwald III", a writer may set `min_values[i]="B"`,
+  `max_values[i]="C"`. This allows writers to truncate large values and writers
+  should use this to enforce some reasonable bound on the size of the index
+  structures.
+* Readers that support ColumnIndex should not also use page statistics. The
+  only reason to write page-level statistics when writing ColumnIndex structs
+  is to support older readers (not recommended).
+
+For ordered columns, this allows a reader to find matching pages by performing
+a binary search in `min_values` and `max_values`. For unordered columns, a
+reader can find matching pages by sequentially reading `min_values` and
+`max_values`.
+
+For range scans this approach can be extended to return ranges of rows, page
+indices, and page offsets to scan in each column. The reader can then
+initialize a scanner for each column and fast forward them to the start row of
+the scan.
+
+
+

http://git-wip-us.apache.org/repos/asf/parquet-format/blob/f1de77d3/README.md
----------------------------------------------------------------------
diff --git a/README.md b/README.md
index c01aec3..c759be9 100644
--- a/README.md
+++ b/README.md
@@ -189,6 +189,10 @@ header and readers can skip over pages they are not 
interested in.  The data for
 page follows the header and can be compressed and/or encoded.  The compression 
and
 encoding is specified in the page metadata.
 
+Additionally, files can contain an optional column index to allow readers to
+skip pages more efficiently. See [PageIndex.md](PageIndex.md) for details and
+the reasoning behind adding these to the format.
+
 ## Checksumming
 Data pages can be individually checksummed.  This allows disabling of 
checksums at the
 HDFS file level, to better support single row lookups.

http://git-wip-us.apache.org/repos/asf/parquet-format/blob/f1de77d3/doc/images/PageIndexLayout.png
----------------------------------------------------------------------
diff --git a/doc/images/PageIndexLayout.png b/doc/images/PageIndexLayout.png
new file mode 100644
index 0000000..83c5b02
Binary files /dev/null and b/doc/images/PageIndexLayout.png differ

http://git-wip-us.apache.org/repos/asf/parquet-format/blob/f1de77d3/src/main/thrift/parquet.thrift
----------------------------------------------------------------------
diff --git a/src/main/thrift/parquet.thrift b/src/main/thrift/parquet.thrift
index f955347..fbca9b2 100644
--- a/src/main/thrift/parquet.thrift
+++ b/src/main/thrift/parquet.thrift
@@ -474,6 +474,16 @@ enum PageType {
   DATA_PAGE_V2 = 3;
 }
 
+/**
+ * Enum to annotate whether lists of min/max elements inside ColumnIndex
+ * are ordered and if so, in which direction.
+ */
+enum BoundaryOrder {
+  UNORDERED = 0;
+  ASCENDING = 1;
+  DESCENDING = 2;
+}
+
 /** Data page header */
 struct DataPageHeader {
   /** Number of values, including NULLs, in this data page. **/
@@ -663,6 +673,18 @@ struct ColumnChunk {
    * metadata.
    **/
   3: optional ColumnMetaData meta_data
+
+  /** File offset of ColumnChunk's OffsetIndex **/
+  4: optional i64 offset_index_offset
+
+  /** Size of ColumnChunk's OffsetIndex, in bytes **/
+  5: optional i32 offset_index_length
+
+  /** File offset of ColumnChunk's ColumnIndex **/
+  6: optional i64 column_index_offset
+
+  /** Size of ColumnChunk's ColumnIndex, in bytes **/
+  7: optional i32 column_index_length
 }
 
 struct RowGroup {
@@ -737,6 +759,69 @@ union ColumnOrder {
   1: TypeDefinedOrder TYPE_ORDER;
 }
 
+struct PageLocation {
+  /** Offset of the page in the file **/
+  1: required i64 offset
+
+  /**
+   * Size of the page, including header. Sum of compressed_page_size and header
+   * length
+   */
+  2: required i32 compressed_page_size
+
+  /**
+   * Index within the RowGroup of the first row of the page; this means pages
+   * change on record boundaries (r = 0).
+   */
+  3: required i64 first_row_index
+}
+
+struct OffsetIndex {
+  /**
+   * PageLocations, ordered by increasing PageLocation.offset. It is required
+   * that page_locations[i].first_row_index < 
page_locations[i+1].first_row_index.
+   */
+  1: required list<PageLocation> page_locations
+}
+
+/**
+ * Description for ColumnIndex.
+ * Each <array-field>[i] refers to the page at OffsetIndex.page_locations[i]
+ */
+struct ColumnIndex {
+  /**
+   * A list of Boolean values to determine the validity of the corresponding
+   * min and max values. If true, a page contains only null values, and writers
+   * have to set the corresponding entries in min_values and max_values to
+   * byte[0], so that all lists have the same length. If false, the
+   * corresponding entries in min_values and max_values must be valid.
+   */
+  1: required list<bool> null_pages
+
+  /**
+   * Two lists containing lower and upper bounds for the values of each page.
+   * These may be the actual minimum and maximum values found on a page, but
+   * can also be (more compact) values that do not exist on a page. For
+   * example, instead of storing ""Blart Versenwald III", a writer may set
+   * min_values[i]="B", max_values[i]="C". Such more compact values must still
+   * be valid values within the column's logical type. Readers must make sure
+   * that list entries are populated before using them by inspecting 
null_pages.
+   */
+  2: required list<binary> min_values
+  3: required list<binary> max_values
+
+  /**
+   * Stores whether both min_values and max_values are orderd and if so, in
+   * which direction. This allows readers to perform binary searches in both
+   * lists. Readers cannot assume that max_values[i] <= min_values[i+1], even
+   * if the lists are ordered.
+   */
+  4: required BoundaryOrder boundary_order
+
+  /** A list containing the number of null values for each page **/
+  5: optional list<i64> null_counts
+}
+
 /**
  * Description for file metadata
  */

parquet-format git commit: PARQUET-922: Add column indexes to parquet.thrift

Reply via email to