Repository: parquet-format Updated Branches: refs/heads/master 65f105707 -> f1de77d31
PARQUET-922: Add column indexes to parquet.thrift I moved the design doc to a .md file and addressed the first round of review comments. closes #63 This is based on work done by @mkornacker and @lekv who wrote the initial proposal and @poojanilangekar who evolved the design, wrote a prototypical implementation, and evaluated its performance. Author: Lars Volker <l...@cloudera.com> Author: poojanilangekar <nilangekar.po...@gmail.com> Author: Lars Volker <lvol...@gmail.com> Closes #72 from lekv/index and squashes the following commits: babb356 [Lars Volker] Address comments from Marcel and Zoltan. 6897c2b [Lars Volker] Address Marcel's comments. bbb3670 [Lars Volker] Reinstate PageIndex.md ebcb33f [Lars Volker] Revert "Extend comments in parquet.thrift, remove PageIndex.md" 877e14c [Lars Volker] Revert "Remove picture" 5df2bbc [Lars Volker] Remove picture a39bf49 [Lars Volker] Extend comments in parquet.thrift, remove PageIndex.md 9ea100a [Lars Volker] Address comments from Zoltan. 9f79d72 [Lars Volker] Merge branch 'master' into index 5e8ea1c [Lars Volker] Fix Typo da6f648 [Lars Volker] Addressing more comments 8541da7 [Lars Volker] Addressing review comments from the Parquet sync meeting 8e3c533 [Lars Volker] More review comments 109b20d [Lars Volker] Address more review comments, clarify the description of ColumnIndex f5bfe55 [Lars Volker] Address review comments on parquet.thrift. 700cc00 [Lars Volker] PARQUET-922: Add documentation on page indexes f983794 [poojanilangekar] PARQUET-922: ColumnIndex Layout to Support Page Skipping Project: http://git-wip-us.apache.org/repos/asf/parquet-format/repo Commit: http://git-wip-us.apache.org/repos/asf/parquet-format/commit/f1de77d3 Tree: http://git-wip-us.apache.org/repos/asf/parquet-format/tree/f1de77d3 Diff: http://git-wip-us.apache.org/repos/asf/parquet-format/diff/f1de77d3 Branch: refs/heads/master Commit: f1de77d31936f4d50f1286676a0034b6339918ee Parents: 65f1057 Author: Lars Volker <l...@cloudera.com> Authored: Mon Oct 16 16:47:12 2017 -0700 Committer: Ryan Blue <b...@apache.org> Committed: Mon Oct 16 16:47:12 2017 -0700 ---------------------------------------------------------------------- Makefile | 7 +++ PageIndex.md | 101 ++++++++++++++++++++++++++++++++++++ README.md | 4 ++ doc/images/PageIndexLayout.png | Bin 0 -> 7442 bytes src/main/thrift/parquet.thrift | 85 ++++++++++++++++++++++++++++++ 5 files changed, 197 insertions(+) ---------------------------------------------------------------------- http://git-wip-us.apache.org/repos/asf/parquet-format/blob/f1de77d3/Makefile ---------------------------------------------------------------------- diff --git a/Makefile b/Makefile index d4cbf83..17750c1 100644 --- a/Makefile +++ b/Makefile @@ -17,7 +17,14 @@ # under the License. # +.PHONY: doc + thrift: mkdir -p generated thrift --gen cpp -o generated src/main/thrift/parquet.thrift thrift --gen java -o generated src/main/thrift/parquet.thrift + +%.html: %.md + pandoc -f markdown_github -t html -o $@ $< + +doc: README.html PageIndex.html LogicalTypes.html http://git-wip-us.apache.org/repos/asf/parquet-format/blob/f1de77d3/PageIndex.md ---------------------------------------------------------------------- diff --git a/PageIndex.md b/PageIndex.md new file mode 100644 index 0000000..7ac6e42 --- /dev/null +++ b/PageIndex.md @@ -0,0 +1,101 @@ +<!-- + - Licensed to the Apache Software Foundation (ASF) under one + - or more contributor license agreements. See the NOTICE file + - distributed with this work for additional information + - regarding copyright ownership. The ASF licenses this file + - to you under the Apache License, Version 2.0 (the + - "License"); you may not use this file except in compliance + - with the License. You may obtain a copy of the License at + - + - http://www.apache.org/licenses/LICENSE-2.0 + - + - Unless required by applicable law or agreed to in writing, + - software distributed under the License is distributed on an + - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + - KIND, either express or implied. See the License for the + - specific language governing permissions and limitations + - under the License. + --> + +# ColumnIndex Layout to Support Page Skipping + +This documents describes the format for column index pages in the Parquet +footer. These pages contain statistics for DataPages and can be used to skip +pages when scanning data in ordered and unordered columns. + +## Problem Statement +In previous versions of the format, Statistics are stored for ColumnChunks in +ColumnMetaData and for individual pages inside DataPageHeader structs. When +reading pages, a reader had to process the page header in order to determine +whether the page could be skipped based on the statistics. This means the reader +had to access all pages in a column, thus likely reading most of the column +data from disk. + +## Goals +1. Make both range scans and point lookups I/O efficient by allowing direct + access to pages based on their min and max values. In particular: +2. A single-row lookup in a rowgroup based on the sort column of that rowgroup + will only read one data page per retrieved column. + * Range scans on the sort column will only need to read the exact data + pages that contain relevant data. + * Make other selective scans I/O efficient: if we have a very selective + predicate on a non-sorting column, for the other retrieved columns we + should only need to access data pages that contain matching rows. +3. No additional decoding effort for scans without selective predicates, e.g., + full-row group scans. If a reader determines that it does not need to read + the index data, it does not incur any overhead. +4. Index pages for sorted columns use minimal storage by storing only the + boundary elements between pages. + +## Non-Goals +* Support for the equivalent of secondary indices, ie, an index structure + sorted on the key values over non-sorted data. + + +## Technical Approach + +We add two new per-column structures to the row group metadata: +* ColumnIndex: this allows navigation to the pages of a column based on column + values and is used to locate data pages that contain matching values for a + scan predicate +* OffsetIndex: this allows navigation by row index and is used to retrieve + values for rows identified as matches via the ColumnIndex. Once rows of a + column are skipped, the corresponding rows in the other columns have to be + skipped. Hence the OffsetIndexes for each column in a RowGroup are stored + together. + +The new index structures are stored separately from RowGroup, near the footer, +so that a reader does not have to pay the I/O and deserialization cost for +reading the them if it is not doing selective scans. The index structures' +location and length are stored in ColumnChunk. + + ![Page Index Layout](doc/images/PageIndexLayout.png) + +Some observations: +* We don't need to record the lower bound for the first page and the upper + bound for the last page, because the row group Statistics can provide that. + We still include those for the sake of uniformity, and the overhead should be + negligible. +* We store lower and upper bounds for the values of each page. These may be the + actual minimum and maximum values found on a page, but can also be (more + compact) values that do not exist on a page. For example, instead of storing + ""Blart Versenwald III", a writer may set `min_values[i]="B"`, + `max_values[i]="C"`. This allows writers to truncate large values and writers + should use this to enforce some reasonable bound on the size of the index + structures. +* Readers that support ColumnIndex should not also use page statistics. The + only reason to write page-level statistics when writing ColumnIndex structs + is to support older readers (not recommended). + +For ordered columns, this allows a reader to find matching pages by performing +a binary search in `min_values` and `max_values`. For unordered columns, a +reader can find matching pages by sequentially reading `min_values` and +`max_values`. + +For range scans this approach can be extended to return ranges of rows, page +indices, and page offsets to scan in each column. The reader can then +initialize a scanner for each column and fast forward them to the start row of +the scan. + + + http://git-wip-us.apache.org/repos/asf/parquet-format/blob/f1de77d3/README.md ---------------------------------------------------------------------- diff --git a/README.md b/README.md index c01aec3..c759be9 100644 --- a/README.md +++ b/README.md @@ -189,6 +189,10 @@ header and readers can skip over pages they are not interested in. The data for page follows the header and can be compressed and/or encoded. The compression and encoding is specified in the page metadata. +Additionally, files can contain an optional column index to allow readers to +skip pages more efficiently. See [PageIndex.md](PageIndex.md) for details and +the reasoning behind adding these to the format. + ## Checksumming Data pages can be individually checksummed. This allows disabling of checksums at the HDFS file level, to better support single row lookups. http://git-wip-us.apache.org/repos/asf/parquet-format/blob/f1de77d3/doc/images/PageIndexLayout.png ---------------------------------------------------------------------- diff --git a/doc/images/PageIndexLayout.png b/doc/images/PageIndexLayout.png new file mode 100644 index 0000000..83c5b02 Binary files /dev/null and b/doc/images/PageIndexLayout.png differ http://git-wip-us.apache.org/repos/asf/parquet-format/blob/f1de77d3/src/main/thrift/parquet.thrift ---------------------------------------------------------------------- diff --git a/src/main/thrift/parquet.thrift b/src/main/thrift/parquet.thrift index f955347..fbca9b2 100644 --- a/src/main/thrift/parquet.thrift +++ b/src/main/thrift/parquet.thrift @@ -474,6 +474,16 @@ enum PageType { DATA_PAGE_V2 = 3; } +/** + * Enum to annotate whether lists of min/max elements inside ColumnIndex + * are ordered and if so, in which direction. + */ +enum BoundaryOrder { + UNORDERED = 0; + ASCENDING = 1; + DESCENDING = 2; +} + /** Data page header */ struct DataPageHeader { /** Number of values, including NULLs, in this data page. **/ @@ -663,6 +673,18 @@ struct ColumnChunk { * metadata. **/ 3: optional ColumnMetaData meta_data + + /** File offset of ColumnChunk's OffsetIndex **/ + 4: optional i64 offset_index_offset + + /** Size of ColumnChunk's OffsetIndex, in bytes **/ + 5: optional i32 offset_index_length + + /** File offset of ColumnChunk's ColumnIndex **/ + 6: optional i64 column_index_offset + + /** Size of ColumnChunk's ColumnIndex, in bytes **/ + 7: optional i32 column_index_length } struct RowGroup { @@ -737,6 +759,69 @@ union ColumnOrder { 1: TypeDefinedOrder TYPE_ORDER; } +struct PageLocation { + /** Offset of the page in the file **/ + 1: required i64 offset + + /** + * Size of the page, including header. Sum of compressed_page_size and header + * length + */ + 2: required i32 compressed_page_size + + /** + * Index within the RowGroup of the first row of the page; this means pages + * change on record boundaries (r = 0). + */ + 3: required i64 first_row_index +} + +struct OffsetIndex { + /** + * PageLocations, ordered by increasing PageLocation.offset. It is required + * that page_locations[i].first_row_index < page_locations[i+1].first_row_index. + */ + 1: required list<PageLocation> page_locations +} + +/** + * Description for ColumnIndex. + * Each <array-field>[i] refers to the page at OffsetIndex.page_locations[i] + */ +struct ColumnIndex { + /** + * A list of Boolean values to determine the validity of the corresponding + * min and max values. If true, a page contains only null values, and writers + * have to set the corresponding entries in min_values and max_values to + * byte[0], so that all lists have the same length. If false, the + * corresponding entries in min_values and max_values must be valid. + */ + 1: required list<bool> null_pages + + /** + * Two lists containing lower and upper bounds for the values of each page. + * These may be the actual minimum and maximum values found on a page, but + * can also be (more compact) values that do not exist on a page. For + * example, instead of storing ""Blart Versenwald III", a writer may set + * min_values[i]="B", max_values[i]="C". Such more compact values must still + * be valid values within the column's logical type. Readers must make sure + * that list entries are populated before using them by inspecting null_pages. + */ + 2: required list<binary> min_values + 3: required list<binary> max_values + + /** + * Stores whether both min_values and max_values are orderd and if so, in + * which direction. This allows readers to perform binary searches in both + * lists. Readers cannot assume that max_values[i] <= min_values[i+1], even + * if the lists are ordered. + */ + 4: required BoundaryOrder boundary_order + + /** A list containing the number of null values for each page **/ + 5: optional list<i64> null_counts +} + /** * Description for file metadata */