svn commit: r22500 - in /dev/parquet/apache-parquet-format-2.4.0-rc1: ./ apache-parquet-format-2.4.0.tar.gz apache-parquet-format-2.4.0.tar.gz.asc apache-parquet-format-2.4.0.tar.gz.md5 apache-parquet
Author: blue
Date: Tue Oct 17 00:10:07 2017
New Revision: 22500

Log:
Apache Parquet Format 2.4.0 RC1

Added:
    dev/parquet/apache-parquet-format-2.4.0-rc1/
    dev/parquet/apache-parquet-format-2.4.0-rc1/apache-parquet-format-2.4.0.tar.gz (with props)
    dev/parquet/apache-parquet-format-2.4.0-rc1/apache-parquet-format-2.4.0.tar.gz.asc
    dev/parquet/apache-parquet-format-2.4.0-rc1/apache-parquet-format-2.4.0.tar.gz.md5
    dev/parquet/apache-parquet-format-2.4.0-rc1/apache-parquet-format-2.4.0.tar.gz.sha

Added: dev/parquet/apache-parquet-format-2.4.0-rc1/apache-parquet-format-2.4.0.tar.gz
==
Binary file - no diff available.

Propchange: dev/parquet/apache-parquet-format-2.4.0-rc1/apache-parquet-format-2.4.0.tar.gz
--
    svn:mime-type = application/octet-stream

Added: dev/parquet/apache-parquet-format-2.4.0-rc1/apache-parquet-format-2.4.0.tar.gz.asc
==
--- dev/parquet/apache-parquet-format-2.4.0-rc1/apache-parquet-format-2.4.0.tar.gz.asc (added)
+++ dev/parquet/apache-parquet-format-2.4.0-rc1/apache-parquet-format-2.4.0.tar.gz.asc Tue Oct 17 00:10:07 2017
@@ -0,0 +1,17 @@
+-----BEGIN PGP SIGNATURE-----
+Version: GnuPG v1
+
+iQIcBAABAgAGBQJZ5UpbAAoJEIZ4HU+ksum1lyMQAIolpocrgjb8/HLepd2k0YZC
+hZNApEKYPgKBi0Vcv1osQqU6R7Ann6EV0yM9lInpo8Wb1OO6+aIh3dEX0dx3eX51
+rTg+z8lL73r5FXySn+zqJ30gmVlhkMZboviW5cZQRKJBTSWc/ATsz6dJ1HADZmGn
+4y8z18kirNhdlxEOJ/HRP26mCjYyQ6sasLHDfmQz4RK8lRb4XrcSBeSLWBpUY/TV
+EW/9DN64SuxaaT5dVszthzx6QxFKqwApUQJq9e1xVYLZTvxcL7sKljwCPJDhpIFq
+gVbnXYwzMnXOmX7OdTVaMyi2irRKsbNm2a8kPZq+Ocs+0wrQ1+dDohQLUwDBvRcf
+Tnyc44zGd8Q/3eSBXmXTkv8FpNpB95mpbita4HrPgJ/cF34XJj3x2KzD/3Stbo/B
+iBwfQ2Y9gaGmKmmu2FUfrLhcuszWxm8QOROl8ALCPp9xYx4zEb9hxgOKcCeoigA7
+A+RiOtUoh2cypnVLh1EQCgkbMRFLU7QcCPF68OCQDtr+jFynmgUfINyJSW9IT4F2
+y+KruEqBk112GmlTN0UG4hehg/Zzg442bC/QlHvSW95NUgQhQ000MCuvUFIETj1w
+dZv86dRJq36jcWgB7EgPQAlc9s43w/uX/XjlN07FGLhi+3TQctH0shjVTGV3cx7b
+ml50HL/FO/otGiEA9IKt
+=yJsD
+-----END PGP SIGNATURE-----

Added: dev/parquet/apache-parquet-format-2.4.0-rc1/apache-parquet-format-2.4.0.tar.gz.md5
==
--- dev/parquet/apache-parquet-format-2.4.0-rc1/apache-parquet-format-2.4.0.tar.gz.md5 (added)
+++ dev/parquet/apache-parquet-format-2.4.0-rc1/apache-parquet-format-2.4.0.tar.gz.md5 Tue Oct 17 00:10:07 2017
@@ -0,0 +1,2 @@
+apache-parquet-format-2.4.0.tar.gz: 32 13 9D E2 90 1D AF 7C 98 43 CE 1C 52 3F
+00 C8

Added: dev/parquet/apache-parquet-format-2.4.0-rc1/apache-parquet-format-2.4.0.tar.gz.sha
==
--- dev/parquet/apache-parquet-format-2.4.0-rc1/apache-parquet-format-2.4.0.tar.gz.sha (added)
+++ dev/parquet/apache-parquet-format-2.4.0-rc1/apache-parquet-format-2.4.0.tar.gz.sha Tue Oct 17 00:10:07 2017
@@ -0,0 +1 @@
+b26a31a09870f3805087a863854e35138adeea12 apache-parquet-format-2.4.0.tar.gz
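The two checksum files above use different layouts: the .md5 file lists the file name, a colon, and space-separated hex pairs wrapped over two lines, while the .sha file puts the hex digest first and the file name second. A minimal Python sketch of how a reviewer might check a downloaded tarball against either file follows. The helper names (normalize_digest, file_digest, verify) are hypothetical and are not part of any Apache release tooling; the sketch also assumes the .sha file holds a SHA-1 digest, which its 40 hex characters suggest but the commit does not state.

```python
import hashlib


def normalize_digest(checksum_text: str) -> str:
    """Extract a lowercase hex digest from the text of a checksum file.

    Handles the two layouts seen in this release candidate:
      "<file name>: 32 13 9D ..."  (hex pairs, possibly wrapped over lines)
      "<hex digest> <file name>"   (digest first)
    Assumes the file name itself contains no colon.
    """
    if ":" in checksum_text:
        hex_part = checksum_text.split(":", 1)[1]
    else:
        hex_part = checksum_text.split()[0]
    # Drop all whitespace (including line wraps) and normalize case.
    return "".join(hex_part.split()).lower()


def file_digest(path: str, algorithm: str) -> str:
    """Hash a file in chunks so large tarballs need not fit in memory."""
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def verify(path: str, checksum_text: str, algorithm: str) -> bool:
    """Return True when the file's digest matches the published one."""
    return file_digest(path, algorithm) == normalize_digest(checksum_text)
```

For example, verify("apache-parquet-format-2.4.0.tar.gz", open("apache-parquet-format-2.4.0.tar.gz.sha").read(), "sha1") would compare the local tarball with the published digest. A checksum match only shows the download is intact; the .asc signature still has to be checked separately with an OpenPGP tool.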
[parquet-format] Git Push Summary
Repository: parquet-format
Updated Tags: refs/tags/apache-parquet-format-2.4.0 [created] 403dd0605
parquet-format git commit: [maven-release-plugin] prepare for next development iteration
Repository: parquet-format
Updated Branches: refs/heads/master 3fb6b391d -> da4e39a15

[maven-release-plugin] prepare for next development iteration

Project: http://git-wip-us.apache.org/repos/asf/parquet-format/repo
Commit: http://git-wip-us.apache.org/repos/asf/parquet-format/commit/da4e39a1
Tree: http://git-wip-us.apache.org/repos/asf/parquet-format/tree/da4e39a1
Diff: http://git-wip-us.apache.org/repos/asf/parquet-format/diff/da4e39a1

Branch: refs/heads/master
Commit: da4e39a15b3fa6ad899ab23298e8697ec2c199e8
Parents: 3fb6b39
Author: Ryan Blue
Authored: Mon Oct 16 17:07:13 2017 -0700
Committer: Ryan Blue
Committed: Mon Oct 16 17:07:13 2017 -0700

--
 pom.xml | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
--

http://git-wip-us.apache.org/repos/asf/parquet-format/blob/da4e39a1/pom.xml
--
diff --git a/pom.xml b/pom.xml
index 98ca595..5c9032c 100644
--- a/pom.xml
+++ b/pom.xml
@@ -28,7 +28,7 @@
 org.apache.parquet
 parquet-format
- 2.4.0
+ 2.4.1-SNAPSHOT
 jar
 Apache Parquet Format
@@ -39,7 +39,7 @@
 scm:git:g...@github.com:apache/parquet-format.git
 scm:git:g...@github.com:apache/parquet-format.git
 scm:git:https://git-wip-us.apache.org/repos/asf/parquet-format.git
-apache-parquet-format-2.4.0
+HEAD
parquet-format git commit: PARQUET-1134: Update CHANGES.md.
Repository: parquet-format
Updated Branches: refs/heads/master f1de77d31 -> 54cc08d2c

PARQUET-1134: Update CHANGES.md.

Also cleaning up old PRs:
Closes #37

Project: http://git-wip-us.apache.org/repos/asf/parquet-format/repo
Commit: http://git-wip-us.apache.org/repos/asf/parquet-format/commit/54cc08d2
Tree: http://git-wip-us.apache.org/repos/asf/parquet-format/tree/54cc08d2
Diff: http://git-wip-us.apache.org/repos/asf/parquet-format/diff/54cc08d2

Branch: refs/heads/master
Commit: 54cc08d2c752d51f054f87efe4aa3d984794a6b0
Parents: f1de77d
Author: Ryan Blue
Authored: Mon Oct 16 17:01:33 2017 -0700
Committer: Ryan Blue
Committed: Mon Oct 16 17:01:33 2017 -0700

--
 CHANGES.md | 37 +
 1 file changed, 37 insertions(+)
--

http://git-wip-us.apache.org/repos/asf/parquet-format/blob/54cc08d2/CHANGES.md
--
diff --git a/CHANGES.md b/CHANGES.md
index befe532..85d710c 100644
--- a/CHANGES.md
+++ b/CHANGES.md
@@ -19,6 +19,43 @@
 # Parquet #
 
+### Version 2.4.0 ###
+
+#### Bug
+
+* [PARQUET-255](https://issues.apache.org/jira/browse/PARQUET-255) - Typo in decimal type specification
+* [PARQUET-322](https://issues.apache.org/jira/browse/PARQUET-322) - Document ENUM as a logical type
+* [PARQUET-412](https://issues.apache.org/jira/browse/PARQUET-412) - Format: Do not shade slf4j-api
+* [PARQUET-419](https://issues.apache.org/jira/browse/PARQUET-419) - Update dev script in parquet-cpp to remove incubator.
+* [PARQUET-655](https://issues.apache.org/jira/browse/PARQUET-655) - The LogicalTypes.md link in README.md points to the old Parquet GitHub repository
+* [PARQUET-1031](https://issues.apache.org/jira/browse/PARQUET-1031) - Fix spelling errors, whitespace, GitHub urls
+* [PARQUET-1032](https://issues.apache.org/jira/browse/PARQUET-1032) - Change link in Encodings.md for variable length encoding
+* [PARQUET-1050](https://issues.apache.org/jira/browse/PARQUET-1050) - The comment of Parquet Format Thrift definition file error
+* [PARQUET-1076](https://issues.apache.org/jira/browse/PARQUET-1076) - [Format] Switch to long key ids in KEYs file
+* [PARQUET-1091](https://issues.apache.org/jira/browse/PARQUET-1091) - Wrong and broken links in README
+* [PARQUET-1102](https://issues.apache.org/jira/browse/PARQUET-1102) - Travis CI builds are failing for parquet-format PRs
+* [PARQUET-1134](https://issues.apache.org/jira/browse/PARQUET-1134) - Release Parquet format 2.4.0
+* [PARQUET-1136](https://issues.apache.org/jira/browse/PARQUET-1136) - Makefile is broken
+
+#### Improvement
+
+* [PARQUET-371](https://issues.apache.org/jira/browse/PARQUET-371) - Bumps Thrift version to 0.9.3
+* [PARQUET-407](https://issues.apache.org/jira/browse/PARQUET-407) - Incorrect delta-encoding example
+* [PARQUET-428](https://issues.apache.org/jira/browse/PARQUET-428) - Support INT96 and FIXED_LEN_BYTE_ARRAY types
+* [PARQUET-601](https://issues.apache.org/jira/browse/PARQUET-601) - Add support in Parquet to configure the encoding used by ValueWriters
+* [PARQUET-609](https://issues.apache.org/jira/browse/PARQUET-609) - Add Brotli compression to Parquet format
+* [PARQUET-757](https://issues.apache.org/jira/browse/PARQUET-757) - Add NULL type to Bring Parquet logical types to par with Arrow
+* [PARQUET-804](https://issues.apache.org/jira/browse/PARQUET-804) - parquet-format README.md still links to the old Google group
+* [PARQUET-922](https://issues.apache.org/jira/browse/PARQUET-922) - Add index pages to the format to support efficient page skipping
+* [PARQUET-1049](https://issues.apache.org/jira/browse/PARQUET-1049) - Make thrift version a property in pom.xml
+
+#### Task
+
+* [PARQUET-450](https://issues.apache.org/jira/browse/PARQUET-450) - Small typos/issues in parquet-format documentation
+* [PARQUET-667](https://issues.apache.org/jira/browse/PARQUET-667) - Update committers lists to point to apache website
+* [PARQUET-1124](https://issues.apache.org/jira/browse/PARQUET-1124) - Add new compression codecs to the Parquet spec
+* [PARQUET-1125](https://issues.apache.org/jira/browse/PARQUET-1125) - Add UUID logical type
+
 ### Version 2.2.0 ###
 
 * [PARQUET-23](https://issues.apache.org/jira/browse/PARQUET-23): Rename packages and maven coordinates to org.apache
parquet-format git commit: PARQUET-922: Add column indexes to parquet.thrift
Repository: parquet-format
Updated Branches: refs/heads/master 65f105707 -> f1de77d31

PARQUET-922: Add column indexes to parquet.thrift

I moved the design doc to a .md file and addressed the first round of review comments.

closes #63

This is based on work done by @mkornacker and @lekv who wrote the initial proposal and @poojanilangekar who evolved the design, wrote a prototypical implementation, and evaluated its performance.

Author: Lars Volker
Author: poojanilangekar
Author: Lars Volker

Closes #72 from lekv/index and squashes the following commits:

babb356 [Lars Volker] Address comments from Marcel and Zoltan.
6897c2b [Lars Volker] Address Marcel's comments.
bbb3670 [Lars Volker] Reinstate PageIndex.md
ebcb33f [Lars Volker] Revert "Extend comments in parquet.thrift, remove PageIndex.md"
877e14c [Lars Volker] Revert "Remove picture"
5df2bbc [Lars Volker] Remove picture
a39bf49 [Lars Volker] Extend comments in parquet.thrift, remove PageIndex.md
9ea100a [Lars Volker] Address comments from Zoltan.
9f79d72 [Lars Volker] Merge branch 'master' into index
5e8ea1c [Lars Volker] Fix Typo
da6f648 [Lars Volker] Addressing more comments
8541da7 [Lars Volker] Addressing review comments from the Parquet sync meeting
8e3c533 [Lars Volker] More review comments
109b20d [Lars Volker] Address more review comments, clarify the description of ColumnIndex
f5bfe55 [Lars Volker] Address review comments on parquet.thrift.
700cc00 [Lars Volker] PARQUET-922: Add documentation on page indexes
f983794 [poojanilangekar] PARQUET-922: ColumnIndex Layout to Support Page Skipping

Project: http://git-wip-us.apache.org/repos/asf/parquet-format/repo
Commit: http://git-wip-us.apache.org/repos/asf/parquet-format/commit/f1de77d3
Tree: http://git-wip-us.apache.org/repos/asf/parquet-format/tree/f1de77d3
Diff: http://git-wip-us.apache.org/repos/asf/parquet-format/diff/f1de77d3

Branch: refs/heads/master
Commit: f1de77d31936f4d50f1286676a0034b6339918ee
Parents: 65f1057
Author: Lars Volker
Authored: Mon Oct 16 16:47:12 2017 -0700
Committer: Ryan Blue
Committed: Mon Oct 16 16:47:12 2017 -0700

--
 Makefile | 7 +++
 PageIndex.md | 101
 README.md | 4 ++
 doc/images/PageIndexLayout.png | Bin 0 -> 7442 bytes
 src/main/thrift/parquet.thrift | 85 ++
 5 files changed, 197 insertions(+)
--

http://git-wip-us.apache.org/repos/asf/parquet-format/blob/f1de77d3/Makefile
--
diff --git a/Makefile b/Makefile
index d4cbf83..17750c1 100644
--- a/Makefile
+++ b/Makefile
@@ -17,7 +17,14 @@
 # under the License.
 #
 
+.PHONY: doc
+
 thrift:
 	mkdir -p generated
 	thrift --gen cpp -o generated src/main/thrift/parquet.thrift
 	thrift --gen java -o generated src/main/thrift/parquet.thrift
+
+%.html: %.md
+	pandoc -f markdown_github -t html -o $@ $<
+
+doc: README.html PageIndex.html LogicalTypes.html

http://git-wip-us.apache.org/repos/asf/parquet-format/blob/f1de77d3/PageIndex.md
--
diff --git a/PageIndex.md b/PageIndex.md
new file mode 100644
index 0000000..7ac6e42
--- /dev/null
+++ b/PageIndex.md
@@ -0,0 +1,101 @@
+
+
+# ColumnIndex Layout to Support Page Skipping
+
+This document describes the format for column index pages in the Parquet
+footer. These pages contain statistics for DataPages and can be used to skip
+pages when scanning data in ordered and unordered columns.
+
+## Problem Statement
+In previous versions of the format, Statistics are stored for ColumnChunks in
+ColumnMetaData and for individual pages inside DataPageHeader structs. When
+reading pages, a reader had to process the page header in order to determine
+whether the page could be skipped based on the statistics. This means the reader
+had to access all pages in a column, thus likely reading most of the column
+data from disk.
+
+## Goals
+1. Make both range scans and point lookups I/O efficient by allowing direct
+   access to pages based on their min and max values. In particular:
+    * A single-row lookup in a rowgroup based on the sort column of that rowgroup
+      will only read one data page per retrieved column.
+    * Range scans on the sort column will only need to read the exact data
+      pages that contain relevant data.
+    * Make other selective scans I/O efficient: if we have a very selective
+      predicate on a non-sorting column, for the other retrieved columns we
+      should only need to access data pages that contain matching rows.
+2. No additional decoding effort for scans without selective predicates, e.g.,
+   full-row group
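The goals quoted above come down to deciding, from per-page min and max values alone, which data pages a reader has to fetch. The following Python sketch illustrates that idea only. PageStats, pages_to_read, and pages_to_read_sorted are hypothetical names invented here; the actual ColumnIndex structs added to parquet.thrift are not shown in this excerpt and may be organized differently. The sketch assumes integer values and, for the sorted variant, pages in ascending order.

```python
from bisect import bisect_left, bisect_right
from dataclasses import dataclass
from typing import List


@dataclass
class PageStats:
    """Hypothetical per-page statistics: smallest and largest value in a page."""
    min_value: int
    max_value: int


def pages_to_read(pages: List[PageStats], lo: int, hi: int) -> List[int]:
    """Unordered column: keep every page whose [min, max] overlaps [lo, hi].

    Every index entry is inspected, but no data page is touched.
    """
    return [
        i for i, page in enumerate(pages)
        if page.max_value >= lo and page.min_value <= hi
    ]


def pages_to_read_sorted(pages: List[PageStats], lo: int, hi: int) -> List[int]:
    """Sorted column (ascending pages): locate the matching run by binary search."""
    maxes = [page.max_value for page in pages]
    mins = [page.min_value for page in pages]
    first = bisect_left(maxes, lo)   # first page whose max is >= lo
    last = bisect_right(mins, hi)    # one past the last page whose min is <= hi
    return list(range(first, last))
```

With four sorted pages covering 1-10, 11-20, 21-30, and 31-40, a point lookup for 15 selects only the second page, and a range scan for 18 to 25 selects the second and third. On an unordered column the linear overlap test still avoids reading pages that cannot contain a match.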
parquet-cpp git commit: PARQUET-1138: Fix Arrow 0.7.1 build
Repository: parquet-cpp
Updated Branches: refs/heads/master 475be0ba7 -> 06c5fb88c

PARQUET-1138: Fix Arrow 0.7.1 build

This is a very minor issue with the 1.3.1 RC0. If this build passes cleanly I will vote to approve the release as this only affects this unit test

Author: Wes McKinney

Closes #410 from wesm/arrow-0.7.1-fix-build and squashes the following commits:

fd6a527 [Wes McKinney] Add comment
f95ff0b [Wes McKinney] Fix compilation with Arrow 0.7.1, set 0.7.1 in ThirdpartyToolchain.cmake

Project: http://git-wip-us.apache.org/repos/asf/parquet-cpp/repo
Commit: http://git-wip-us.apache.org/repos/asf/parquet-cpp/commit/06c5fb88
Tree: http://git-wip-us.apache.org/repos/asf/parquet-cpp/tree/06c5fb88
Diff: http://git-wip-us.apache.org/repos/asf/parquet-cpp/diff/06c5fb88

Branch: refs/heads/master
Commit: 06c5fb88c722158be5f9413cd55b988af8f9ef82
Parents: 475be0b
Author: Wes McKinney
Authored: Mon Oct 16 20:58:26 2017 +0200
Committer: Uwe L. Korn
Committed: Mon Oct 16 20:58:26 2017 +0200

--
 cmake_modules/ThirdpartyToolchain.cmake | 2 +-
 src/parquet/arrow/arrow-reader-writer-test.cc | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)
--

http://git-wip-us.apache.org/repos/asf/parquet-cpp/blob/06c5fb88/cmake_modules/ThirdpartyToolchain.cmake
--
diff --git a/cmake_modules/ThirdpartyToolchain.cmake b/cmake_modules/ThirdpartyToolchain.cmake
index 3961abd..a470fc1 100644
--- a/cmake_modules/ThirdpartyToolchain.cmake
+++ b/cmake_modules/ThirdpartyToolchain.cmake
@@ -366,7 +366,7 @@ if (NOT ARROW_FOUND)
 -DARROW_BUILD_TESTS=OFF)
 
 if ("$ENV{PARQUET_ARROW_VERSION}" STREQUAL "")
-set(ARROW_VERSION "8309556c7d2b0e14df1422baa574cf2de8c1bd3b")
+set(ARROW_VERSION "0e21f84c2fc26dba949a03ee7d7ebfade0a65b81") # Arrow 0.7.1
 else()
 set(ARROW_VERSION "$ENV{PARQUET_ARROW_VERSION}")
 endif()

http://git-wip-us.apache.org/repos/asf/parquet-cpp/blob/06c5fb88/src/parquet/arrow/arrow-reader-writer-test.cc
--
diff --git a/src/parquet/arrow/arrow-reader-writer-test.cc b/src/parquet/arrow/arrow-reader-writer-test.cc
index fc6410d..a18c565 100644
--- a/src/parquet/arrow/arrow-reader-writer-test.cc
+++ b/src/parquet/arrow/arrow-reader-writer-test.cc
@@ -951,7 +951,7 @@ TEST_F(TestNullParquetIO, NullDictionaryColumn) {
 std::shared_ptr expected_values = std::make_shared<::arrow::NullArray>(SMALL_SIZE);
- AssertArraysEqual(*expected_values, *chunked_array->chunk(0));
+ internal::AssertArraysEqual(*expected_values, *chunked_array->chunk(0));
 }
 
 template