(parquet-format) branch master updated: PARQUET-2485: Be more consistent with BYTE_ARRAY types (#251)

gangwu Tue, 18 Jun 2024 18:12:50 -0700

This is an automated email from the ASF dual-hosted git repository.

gangwu pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/parquet-format.git



The following commit(s) were added to refs/heads/master by this push:
     new 18df2d4  PARQUET-2485: Be more consistent with BYTE_ARRAY types (#251)
18df2d4 is described below

commit 18df2d48bd9cfb4ce6f60610de98610963c68112
Author: Ed Seidl <etse...@users.noreply.github.com>
AuthorDate: Tue Jun 18 18:12:42 2024 -0700

    PARQUET-2485: Be more consistent with BYTE_ARRAY types (#251)
    
    Changes instances of 'binary' to BYTE_ARRAY where appropriate.
    Also fixes some uses of FIXED_LEN_BYTE_ARRAY.
    
    Co-authored-by: Andrew Lamb <and...@nerdnetworks.org>
---
 LogicalTypes.md                | 47 +++++++++++++++++++++---------------------
 README.md                      |  4 ++--
 src/main/thrift/parquet.thrift | 18 ++++++++--------
 3 files changed, 35 insertions(+), 34 deletions(-)

diff --git a/LogicalTypes.md b/LogicalTypes.md
index 395e8fc..b55a908 100644
--- a/LogicalTypes.md
+++ b/LogicalTypes.md
@@ -23,7 +23,8 @@ Parquet Logical Type Definitions
 Logical types are used to extend the types that parquet can be used to store,
 by specifying how the primitive types should be interpreted. This keeps the set
 of primitive types to a minimum and reuses parquet's efficient encodings. For
-example, strings are stored as byte arrays (binary) with a UTF8 annotation.
+example, strings are stored with the primitive type `BYTE_ARRAY` with a `STRING`
+annotation.
 
 This file contains the specification for all logical types.
 
@@ -59,7 +60,7 @@ Compatibility considerations are mentioned for each annotation in the correspond
 
 ### STRING
 
-`STRING` may only be used to annotate the binary primitive type and indicates
+`STRING` may only be used to annotate the `BYTE_ARRAY` primitive type and indicates
 that the byte array should be interpreted as a UTF-8 encoded character string.
 
 The sort order used for `STRING` strings is unsigned byte-wise comparison.
@@ -70,7 +71,7 @@ The sort order used for `STRING` strings is unsigned byte-wise comparison.
 
 ### ENUM
 
-`ENUM` annotates the binary primitive type and indicates that the value
+`ENUM` annotates the `BYTE_ARRAY` primitive type and indicates that the value
 was converted from an enumerated type in another data model (e.g. Thrift, Avro, Protobuf).
 Applications using a data model lacking a native enum type should interpret `ENUM`
 annotated field as a UTF-8 encoded string. 
@@ -79,9 +80,9 @@ The sort order used for `ENUM` values is unsigned byte-wise comparison.
 
 ### UUID
 
-`UUID` annotates a 16-byte fixed-length binary. The value is encoded using
-big-endian, so that `00112233-4455-6677-8899-aabbccddeeff` is encoded as the
-bytes `00 11 22 33 44 55 66 77 88 99 aa bb cc dd ee ff`
+`UUID` annotates a 16-byte `FIXED_LEN_BYTE_ARRAY` primitive type. The value is
+encoded using big-endian, so that `00112233-4455-6677-8899-aabbccddeeff` is encoded
+as the bytes `00 11 22 33 44 55 66 77 88 99 aa bb cc dd ee ff`
 (This example is from [wikipedia's UUID page][wiki-uuid]).
 
 The sort order used for `UUID` values is unsigned byte-wise comparison.
@@ -211,8 +212,8 @@ unsigned integers with 8, 16, 32, or 64 bit width.
 `DECIMAL` annotation represents arbitrary-precision signed decimal numbers of
 the form `unscaledValue * 10^(-scale)`.
 
-The primitive type stores an unscaled integer value. For byte arrays, binary
-and fixed, the unscaled number must be encoded as two's complement using
+The primitive type stores an unscaled integer value. For `BYTE_ARRAY` and 
+`FIXED_LEN_BYTE_ARRAY`, the unscaled number must be encoded as two's complement using
 big-endian byte order (the most significant byte is the zeroth element). The
 scale stores the number of digits of that value that are to the right of the
 decimal point, and the precision stores the maximum number of digits supported
@@ -228,7 +229,7 @@ integer. A precision too large for the underlying type (see below) is an error.
   warning
 * `fixed_len_byte_array`: precision is limited by the array size. Length `n`
   can store &lt;= `floor(log_10(2^(8*n - 1) - 1))` base-10 digits
-* `binary`: `precision` is not limited, but is required. The minimum number of
+* `byte_array`: `precision` is not limited, but is required. The minimum number of
   bytes to store the unscaled value should be used.
 
 The sort order used for `DECIMAL` values is signed comparison of the represented
@@ -251,7 +252,7 @@ The `FLOAT16` annotation represents half-precision floating-point numbers in the
 
 Used in contexts where precision is traded off for smaller footprint and potentially better performance.
 
-The primitive type is a 2-byte fixed length binary.
+The primitive type is a 2-byte `FIXED_LEN_BYTE_ARRAY`.
 
 The sort order for `FLOAT16` is signed (with special handling of NANs and signed zeros); it uses the same [logic](https://github.com/apache/parquet-format#sort-order) as `FLOAT` and `DOUBLE`.
 
@@ -544,8 +545,8 @@ Embedded types do not have type-specific orderings.
 
 ### JSON
 
-`JSON` is used for an embedded JSON document. It must annotate a `binary`
-primitive type. The `binary` data is interpreted as a UTF-8 encoded character
+`JSON` is used for an embedded JSON document. It must annotate a `BYTE_ARRAY`
+primitive type. The `BYTE_ARRAY` data is interpreted as a UTF-8 encoded character
 string of valid JSON as defined by the [JSON specification][json-spec]
 
 [json-spec]: http://json.org/
@@ -554,8 +555,8 @@ The sort order used for `JSON` is unsigned byte-wise comparison.
 
 ### BSON
 
-`BSON` is used for an embedded BSON document. It must annotate a `binary`
-primitive type. The `binary` data is interpreted as an encoded BSON document as
+`BSON` is used for an embedded BSON document. It must annotate a `BYTE_ARRAY`
+primitive type. The `BYTE_ARRAY` data is interpreted as an encoded BSON document as
 defined by the [BSON specification][bson-spec].
 
 [bson-spec]: http://bsonspec.org/spec.html
@@ -604,14 +605,14 @@ The following examples demonstrate two of the possible lists of string values.
 // List<String> (list non-null, elements nullable)
 required group my_list (LIST) {
   repeated group list {
-    optional binary element (UTF8);
+    optional binary element (STRING);
   }
 }
 
 // List<String> (list nullable, elements non-null)
 optional group my_list (LIST) {
   repeated group list {
-    required binary element (UTF8);
+    required binary element (STRING);
   }
 }
 ```
@@ -642,7 +643,7 @@ even though the repeated group is named `element`.
 ```
 optional group my_list (LIST) {
   repeated group element {
-    required binary str (UTF8);
+    required binary str (STRING);
   };
 }
 ```
@@ -672,7 +673,7 @@ optional group my_list (LIST) {
 // List<Tuple<String, Integer>> (nullable list, non-null elements)
 optional group my_list (LIST) {
   repeated group element {
-    required binary str (UTF8);
+    required binary str (STRING);
     required int32 num;
   };
 }
@@ -680,14 +681,14 @@ optional group my_list (LIST) {
 // List<OneTuple<String>> (nullable list, non-null elements)
 optional group my_list (LIST) {
   repeated group array {
-    required binary str (UTF8);
+    required binary str (STRING);
   };
 }
 
 // List<OneTuple<String>> (nullable list, non-null elements)
 optional group my_list (LIST) {
   repeated group my_list_tuple {
-    required binary str (UTF8);
+    required binary str (STRING);
   };
 }
 ```
@@ -723,7 +724,7 @@ nullable integers:
 // Map<String, Integer>
 required group my_map (MAP) {
   repeated group key_value {
-    required binary key (UTF8);
+    required binary key (STRING);
     optional int32 value;
   }
 }
@@ -752,7 +753,7 @@ Examples that can be interpreted using these rules:
 // Map<String, Integer> (nullable map, non-null values)
 optional group my_map (MAP) {
   repeated group map {
-    required binary str (UTF8);
+    required binary str (STRING);
     required int32 num;
   }
 }
@@ -760,7 +761,7 @@ optional group my_map (MAP) {
 // Map<String, Integer> (nullable map, nullable values)
 optional group my_map (MAP_KEY_VALUE) {
   repeated group map {
-    required binary key (UTF8);
+    required binary key (STRING);
     optional int32 value;
   }
 }
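
The `UUID` and `fixed_len_byte_array` precision rules touched by the LogicalTypes.md hunks above can be checked with a minimal sketch. The helper names `uuid_to_fixed16` and `max_decimal_precision` are hypothetical and are not part of the commit or of any Parquet library; only the Python standard library is assumed.

```python
import uuid


def uuid_to_fixed16(text: str) -> bytes:
    """Return the 16 big-endian bytes a UUID would occupy in a FIXED_LEN_BYTE_ARRAY(16)."""
    # uuid.UUID.bytes yields the fields in big-endian order.
    return uuid.UUID(text).bytes


def max_decimal_precision(n: int) -> int:
    """Evaluate floor(log_10(2^(8*n - 1) - 1)) for an n-byte array using exact integers."""
    largest = 2 ** (8 * n - 1) - 1  # largest signed two's complement value in n bytes
    # For any integer x >= 1, floor(log10(x)) equals the digit count minus one.
    return len(str(largest)) - 1


if __name__ == "__main__":
    print(uuid_to_fixed16("00112233-4455-6677-8899-aabbccddeeff").hex(" "))
    for n in (4, 8, 16):
        print(n, max_decimal_precision(n))
```

Counting digits rather than calling a floating-point logarithm keeps the result exact for large array sizes.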
diff --git a/README.md b/README.md
index d949757..9567c63 100644
--- a/README.md
+++ b/README.md
@@ -143,8 +143,8 @@ readers and writers for the format.  The types are:
 Logical types are used to extend the types that parquet can be used to store,
 by specifying how the primitive types should be interpreted. This keeps the set
 of primitive types to a minimum and reuses parquet's efficient encodings. For
-example, strings are stored as byte arrays (binary) with a UTF8 annotation.
-These annotations define how to further decode and interpret the data.
+example, strings are stored with the primitive type BYTE_ARRAY with a STRING
+annotation. These annotations define how to further decode and interpret the data.
 Annotations are stored as `LogicalType` fields in the file metadata and are
 documented in [LogicalTypes.md][logical-types].
 
diff --git a/src/main/thrift/parquet.thrift b/src/main/thrift/parquet.thrift
index 368461b..934b3ca 100644
--- a/src/main/thrift/parquet.thrift
+++ b/src/main/thrift/parquet.thrift
@@ -60,14 +60,14 @@ enum ConvertedType {
    * values */
   LIST = 3;
 
-  /** an enum is converted into a binary field */
+  /** an enum is converted into a BYTE_ARRAY field */
   ENUM = 4;
 
   /**
    * A decimal value.
    *
-   * This may be used to annotate binary or fixed primitive types. The
-   * underlying byte array stores the unscaled value encoded as two's
+   * This may be used to annotate BYTE_ARRAY or FIXED_LEN_BYTE_ARRAY primitive
+   * types. The underlying byte array stores the unscaled value encoded as two's
    * complement using big-endian byte order (the most significant byte is the
    * zeroth element). The value of the decimal is the value * 10^{-scale}.
    *
@@ -158,7 +158,7 @@ enum ConvertedType {
   /**
    * An embedded BSON document
    *
-   * A BSON document embedded within a single BINARY column.
+   * A BSON document embedded within a single BYTE_ARRAY column.
    */
   BSON = 20;
 
@@ -282,11 +282,11 @@ struct Statistics {
 }
 
 /** Empty structs to use as logical type annotations */
-struct StringType {}  // allowed for BINARY, must be encoded with UTF-8
+struct StringType {}  // allowed for BYTE_ARRAY, must be encoded with UTF-8
 struct UUIDType {}    // allowed for FIXED[16], must encoded raw UUID bytes
 struct MapType {}     // see LogicalTypes.md
 struct ListType {}    // see LogicalTypes.md
-struct EnumType {}    // allowed for BINARY, must be encoded with UTF-8
+struct EnumType {}    // allowed for BYTE_ARRAY, must be encoded with UTF-8
 struct DateType {}    // allowed for INT32
 struct Float16Type {} // allowed for FIXED[2], must encoded raw FLOAT16 bytes
 
@@ -308,7 +308,7 @@ struct NullType {}    // allowed for any physical type, only null values stored
  * To maintain forward-compatibility in v1, implementations using this logical
  * type must also set scale and precision on the annotated SchemaElement.
  *
- * Allowed for physical types: INT32, INT64, FIXED, and BINARY
+ * Allowed for physical types: INT32, INT64, FIXED_LEN_BYTE_ARRAY, and BYTE_ARRAY.
  */
 struct DecimalType {
   1: required i32 scale
@@ -360,7 +360,7 @@ struct IntType {
 /**
  * Embedded JSON logical type annotation
  *
- * Allowed for physical types: BINARY
+ * Allowed for physical types: BYTE_ARRAY
  */
 struct JsonType {
 }
@@ -368,7 +368,7 @@ struct JsonType {
 /**
  * Embedded BSON logical type annotation
  *
- * Allowed for physical types: BINARY
+ * Allowed for physical types: BYTE_ARRAY
  */
 struct BsonType {
 }
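
The `DECIMAL` comments in the diff above describe the unscaled value as big-endian two's complement, with `BYTE_ARRAY` using the minimum number of bytes. The following is a minimal sketch of that encoding, assuming only the Python standard library; `encode_unscaled` and `decode_decimal` are hypothetical names and do not correspond to any Parquet implementation's API.

```python
from decimal import Decimal


def encode_unscaled(unscaled: int) -> bytes:
    """Encode an unscaled integer as minimal-length big-endian two's complement."""
    # Magnitude bits plus one sign bit; ~x maps a negative x to -x - 1.
    magnitude = unscaled if unscaled >= 0 else ~unscaled
    nbytes = max(1, (magnitude.bit_length() + 1 + 7) // 8)
    return unscaled.to_bytes(nbytes, "big", signed=True)


def decode_decimal(data: bytes, scale: int) -> Decimal:
    """Rebuild unscaledValue * 10^(-scale) exactly from the stored bytes."""
    unscaled = int.from_bytes(data, "big", signed=True)
    sign = 1 if unscaled < 0 else 0
    digits = tuple(int(d) for d in str(abs(unscaled)))
    # The tuple constructor is exact and ignores the context precision.
    return Decimal((sign, digits, -scale))


if __name__ == "__main__":
    raw = encode_unscaled(-12345)
    print(raw.hex(), decode_decimal(raw, 2))
```

For a `FIXED_LEN_BYTE_ARRAY` column the same value would be sign-extended to the declared length instead of trimmed to the minimum.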
