Repository: parquet-format
Updated Branches:
  refs/heads/master 65e851eae -> 041708da1

PARQUET-686: Add Order to store the order used for min/max stats.

This adds a new enum, `Order`, that will be set to the order used to produce the min and max values in all `Statistics` objects (at the page level). `Order` has 8 symbols: `SIGNED`, `UNSIGNED`, and 6 symbols for custom orderings.

This also adds a `CustomOrder` struct that is used to map the custom order symbols to string descriptors, such as [order keywords used by ICU collating sequences](http://userguide.icu-project.org/collation/api#TOC-Instantiating-the-Predefined-Collators). `CustomOrder` mappings are stored in the file footer.

Author: Ryan Blue <b...@apache.org>

Closes #46 from rdblue/PARQUET-686-add-stats-ordering and squashes the following commits:

f878c34 [Ryan Blue] PARQUET-686: Remove Order enum.
9447fb8 [Ryan Blue] PARQUET-686: Use "is" instead of "must be".
ffbb60b [Ryan Blue] PARQUET-686: Store ColumnOrder as a union.
c6e43b0 [Ryan Blue] PARQUET-686: Add new min_value and max_value stats.
eed4d47 [Ryan Blue] PARQUET-686: Add clarifications from review comments.
9962df8 [Ryan Blue] PARQUET-686: Remove is_ascending and number columns starting with 1.
faa9edb [Ryan Blue] PARQUET-686: Add order specs to logical types.
4534062 [Ryan Blue] PARQUET-686: Add ColumnOrders to FileMetaData.

Project: http://git-wip-us.apache.org/repos/asf/parquet-format/repo
Commit: http://git-wip-us.apache.org/repos/asf/parquet-format/commit/041708da
Tree: http://git-wip-us.apache.org/repos/asf/parquet-format/tree/041708da
Diff: http://git-wip-us.apache.org/repos/asf/parquet-format/diff/041708da

Branch: refs/heads/master
Commit: 041708da1af52e7cb9288c331b542aa25b68a2b6
Parents: 65e851e
Author: Ryan Blue <b...@apache.org>
Authored: Mon Apr 17 11:23:41 2017 -0700
Committer: Ryan Blue <b...@apache.org>
Committed: Mon Apr 17 11:23:41 2017 -0700

----------------------------------------------------------------------
 LogicalTypes.md                | 30 +++++++++++++++++++
 src/main/thrift/parquet.thrift | 59 ++++++++++++++++++++++++++++++++++++-
 2 files changed, 88 insertions(+), 1 deletion(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/parquet-format/blob/041708da/LogicalTypes.md
----------------------------------------------------------------------
diff --git a/LogicalTypes.md b/LogicalTypes.md
index c411dbf..29cf527 100644
--- a/LogicalTypes.md
+++ b/LogicalTypes.md
@@ -37,6 +37,8 @@ may require additional metadata fields, as well as rules for those fields.
 `UTF8` may only be used to annotate the binary primitive type and indicates
 that the byte array should be interpreted as a UTF-8 encoded character string.
 
+The sort order used for `UTF8` strings is `UNSIGNED` byte-wise comparison.
+
 ## Numeric Types
 
 ### Signed Integers
@@ -55,6 +57,8 @@ allows.
 implied by the `int32` and `int64` primitive types if no other annotation is
 present and should be considered optional.
 
+The sort order used for signed integer types is `SIGNED`.
+
 ### Unsigned Integers
 
 `UINT_8`, `UINT_16`, `UINT_32`, and `UINT_64` annotations can be used to
@@ -70,6 +74,8 @@ allows.
 `UINT_8`, `UINT_16`, and `UINT_32` must annotate an `int32` primitive type and
 `UINT_64` must annotate an `int64` primitive type.
 
+The sort order used for unsigned integer types is `UNSIGNED`.
+
 ### DECIMAL
 
 `DECIMAL` annotation represents arbitrary-precision signed decimal numbers of
@@ -98,6 +104,15 @@ integer. A precision too large for the underlying type (see below) is an error.
 A `SchemaElement` with the `DECIMAL` `ConvertedType` must also have both
 `scale` and `precision` fields set, even if scale is 0 by default.
 
+The sort order used for `DECIMAL` values is `SIGNED`. The order is equivalent
+to signed comparison of decimal values.
+
+If the column uses `int32` or `int64` physical types, then signed comparison of
+the integer values produces the correct ordering. If the physical type is
+fixed, then the correct ordering can be produced by flipping the
+most-significant bit in the first byte and then using unsigned byte-wise
+comparison.
+
 ## Date/Time Types
 
 ### DATE
@@ -106,30 +121,40 @@ A `SchemaElement` with the `DECIMAL` `ConvertedType` must also have both
 annotate an `int32` that stores the number of days from the Unix epoch, 1
 January 1970.
 
+The sort order used for `DATE` is `SIGNED`.
+
 ### TIME\_MILLIS
 
 `TIME_MILLIS` is used for a logical time type with millisecond precision,
 without a date. It must annotate an `int32` that stores the number of
 milliseconds after midnight.
 
+The sort order used for `TIME\_MILLIS` is `SIGNED`.
+
 ### TIME\_MICROS
 
 `TIME_MICROS` is used for a logical time type with microsecond precision,
 without a date. It must annotate an `int64` that stores the number of
 microseconds after midnight.
 
+The sort order used for `TIME\_MICROS` is `SIGNED`.
+
 ### TIMESTAMP\_MILLIS
 
 `TIMESTAMP_MILLIS` is used for a combined logical date and time type, with
 millisecond precision. It must annotate an `int64` that stores the number of
 milliseconds from the Unix epoch, 00:00:00.000 on 1 January 1970, UTC.
 
+The sort order used for `TIMESTAMP\_MILLIS` is `SIGNED`.
+
 ### TIMESTAMP\_MICROS
 
 `TIMESTAMP_MICROS` is used for a combined logical date and time type with
 microsecond precision. It must annotate an `int64` that stores the number of
 microseconds from the Unix epoch, 00:00:00.000000 on 1 January 1970, UTC.
 
+The sort order used for `TIMESTAMP\_MICROS` is `SIGNED`.
+
 ### INTERVAL
 
 `INTERVAL` is used for an interval of time. It must annotate a
@@ -144,8 +169,13 @@ example, there is no requirement that a large number of days should be
 expressed as a mix of months and days because there is not a constant
 conversion from days to months.
 
+The sort order used for `INTERVAL` is `UNSIGNED`, produced by sorting by
+the value of months, then days, then milliseconds with unsigned comparison.
+
 ## Embedded Types
 
+Embedded types do not have type-specific orderings.
+
 ### JSON
 
 `JSON` is used for an embedded JSON document. It must annotate a `binary`

http://git-wip-us.apache.org/repos/asf/parquet-format/blob/041708da/src/main/thrift/parquet.thrift
----------------------------------------------------------------------
diff --git a/src/main/thrift/parquet.thrift b/src/main/thrift/parquet.thrift
index e89bc80..47812ab 100644
--- a/src/main/thrift/parquet.thrift
+++ b/src/main/thrift/parquet.thrift
@@ -28,6 +28,17 @@ namespace java org.apache.parquet.format
  * with the encodings to control the on disk storage format.
  * For example INT16 is not included as a type since a good encoding of INT32
  * would handle this.
+ *
+ * When a logical type is not present, the type-defined sort order of these
+ * physical types are:
+ * * BOOLEAN - false, true
+ * * INT32 - signed comparison
+ * * INT64 - signed comparison
+ * * INT96 - signed comparison
+ * * FLOAT - signed comparison
+ * * DOUBLE - signed comparison
+ * * BYTE_ARRAY - unsigned byte-wise comparison
+ * * FIXED_LEN_BYTE_ARRAY - unsigned byte-wise comparison
  */
 enum Type {
   BOOLEAN = 0;
@@ -202,13 +213,33 @@ enum FieldRepetitionType {
  * All fields are optional.
  */
 struct Statistics {
-  /** min and max value of the column, encoded in PLAIN encoding */
+  /**
+   * DEPRECATED: min and max value of the column. Use min_value and max_value.
+   *
+   * Values are encoded using PLAIN encoding, except that variable-length byte
+   * arrays do not include a length prefix.
+   *
+   * These fields encode min and max values determined by SIGNED comparison
+   * only. New files should use the correct order for a column's logical type
+   * and store the values in the min_value and max_value fields.
+   *
+   * To support older readers, these may be set when the column order is
+   * SIGNED.
+   */
   1: optional binary max;
   2: optional binary min;
   /** count of null value in the column */
   3: optional i64 null_count;
   /** count of distinct values occurring */
   4: optional i64 distinct_count;
+  /**
+   * Min and max values for the column, determined by its ColumnOrder.
+   *
+   * Values are encoded using PLAIN encoding, except that variable-length byte
+   * arrays do not include a length prefix.
+   */
+  5: optional binary max_value;
+  6: optional binary min_value;
 }
 
 /**
@@ -547,6 +578,23 @@ struct RowGroup {
   4: optional list<SortingColumn> sorting_columns
 }
 
+/** Empty struct to signal the order defined by the physical or logical type */
+struct TypeDefinedOrder {}
+
+/**
+ * Union to specify the order used for min, max, and sorting values in a column.
+ *
+ * Possible values are:
+ * * TypeDefinedOrder - the column uses the order defined by its logical or
+ *   physical type (if there is no logical type).
+ *
+ * If the reader does not support the value of this union, min and max stats
+ * for this column should be ignored.
+ */
+union ColumnOrder {
+  1: TypeDefinedOrder TYPE_ORDER;
+}
+
 /**
  * Description for file metadata
  */
@@ -576,5 +624,14 @@ struct FileMetaData {
    * e.g. impala version 1.0 (build 6cf94d29b2b7115df4de2c06e2ab4326d721eb55)
    **/
   6: optional string created_by
+
+  /**
+   * Sort order used for each column in this file.
+   *
+   * If this list is not present, then the order for each column is assumed to
+   * be Signed. In addition, min and max values for INTERVAL or DECIMAL stored
+   * as fixed or bytes should be ignored.
+   */
+  7: optional list<ColumnOrder> column_orders;
 }
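
As an illustration of the `DECIMAL` ordering rule that this commit adds to LogicalTypes.md, the Python sketch below shows one way a reader might compare two fixed-length decimal values: flip the most-significant bit of the first byte, then compare the bytes as unsigned. It is not part of the commit. The function names are hypothetical, and the sketch assumes both values are big-endian two's-complement unscaled integers with the same byte length and the same scale.

```python
def decimal_sort_key(raw: bytes) -> bytes:
    """Return a key whose unsigned byte-wise order matches signed order.

    Assumes `raw` is a big-endian two's-complement integer (hypothetical
    helper, not an API from any Parquet library).
    """
    if not raw:
        raise ValueError("expected at least one byte")
    # Flipping the sign bit maps negative values below non-negative ones
    # under unsigned comparison.
    return bytes([raw[0] ^ 0x80]) + raw[1:]


def compare_fixed_decimals(a: bytes, b: bytes) -> int:
    """Return -1, 0, or 1 for two fixed-length decimals of equal width."""
    if len(a) != len(b):
        raise ValueError("fixed-length values must have the same width")
    key_a, key_b = decimal_sort_key(a), decimal_sort_key(b)
    # Python compares bytes objects lexicographically as unsigned bytes.
    return (key_a > key_b) - (key_a < key_b)


if __name__ == "__main__":
    minus_one = (-1).to_bytes(4, "big", signed=True)
    plus_one = (1).to_bytes(4, "big", signed=True)
    print(compare_fixed_decimals(minus_one, plus_one))
```

A plain unsigned comparison without the flip would place negative values (which start with a set sign bit) after positive ones, so the flip is what makes the byte-wise order agree with the `SIGNED` order of the decimal values.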