[GitHub] [parquet-format] pitrou commented on a change in pull request #164: PARQUET-1950: Define core features

GitBox Tue, 16 Feb 2021 04:59:45 -0800


pitrou commented on a change in pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#discussion_r576804103




##########
File path: CoreFeatures.md
##########
@@ -0,0 +1,188 @@
+<!--
+  - Licensed to the Apache Software Foundation (ASF) under one
+  - or more contributor license agreements.  See the NOTICE file
+  - distributed with this work for additional information
+  - regarding copyright ownership.  The ASF licenses this file
+  - to you under the Apache License, Version 2.0 (the
+  - "License"); you may not use this file except in compliance
+  - with the License.  You may obtain a copy of the License at
+  -
+  -   http://www.apache.org/licenses/LICENSE-2.0
+  -
+  - Unless required by applicable law or agreed to in writing,
+  - software distributed under the License is distributed on an
+  - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  - KIND, either express or implied.  See the License for the
+  - specific language governing permissions and limitations
+  - under the License.
+  -->
+
+# Parquet Core Features
+
+This document lists the core features for each parquet-format release. This
+list is a subset of the features which parquet-format makes available.
+
+## Purpose
+
+The list of core features for a certain release makes a compliance level for
+implementations. If a writer implementation claims that it is at a certain
+compliance level then it must use only features from the *core feature list* of
+that parquet-format release. If a reader implementation claims the same it must
+implement all of the listed features. This way it is easier to ensure
+compatibility between the different Parquet implementations.
+
+We cannot and don't want to stop our clients from using any features that are not
+on this list but it shall be highlighted that using these features might make
+the written Parquet files unreadable by other implementations. We can say that
+the features available in a parquet-format release (and one of the
+implementations of it) and not on the *core feature list* are experimental.
+
+## Versioning
+
+This document is versioned by the parquet-format releases which follows the
+scheme of semantic versioning. It means that no feature will be deleted from
+this document under the same major version. (We might deprecate some, though.)
+Because of the semantic versioning if one implementation supports the core
+features of the parquet-format release `a.c.x` it must be able to read any
+Parquet files written by implementations supporting the release `a.b.y` where
+`c >= b`.
+
+If a Parquet file is written according to a released version of this document
+it might be a good idea to write this version into the field `compliance_level`
+in the Thrift object `FileMetaData`.
+
+## Adding new features
+
+The idea is to only include features which are specified correctly and proven
+to be useful for everyone. Because of that we require to have at least two
+different implementations that are released and widely tested. We also require
+to implement interoperability tests for that feature to prove one
+implementation can read the data written by the other one and vice versa.
+
+## Core feature list
+
+This list is based on the [Parquet Thrift file](src/main/thrift/parquet.thrift)
+where all the data structures we might use in a Parquet file are defined.
+
+### File structure
+
+All of the required fields in the structure (and sub-structures) of
+`FileMetaData` must be set according to the specification.
+The following page types are supported:
+* Data page V1 (see `DataPageHeader`)
+* Dictionary page (see `DictionaryPageHeader`)
+
+**TODO**: list optional fields that must be filled properly.
+
+#### Column chunk file reference
+
+The optional field `file_path` in the `ColumnChunk` object of the Parquet footer
+(aka Parquet Thrift file) makes it available to reference an external file. This
+option was used for different features like _summary files_ or
+_external column chunks_. These features were never specified correctly and
+they did not spread across the different implementations. Because of that we do
+not include these features in this document and therefore the field `file_path`
+is not supported.
+
+### Types
+
+#### Primitive types
+
+The following [primitive types](README.md#types) are supported
+* `BOOLEAN`
+* `INT32`
+* `INT64`
+* `FLOAT`
+* `DOUBLE`
+* `BYTE_ARRAY`
+* `FIXED_LEN_BYTE_ARRAY`
+
+NOTE: The primitive type `INT96` is deprecated so it is intentionally not listed
+here.
+
+#### Logical types
+
+The [logical type](LogicalTypes.md)s are practically annotations helping to
+understand the related primitive type (or structure). Originally we have had
+the `ConvertedType` enum in the Thrift file representing all the possible
+logical types. After a while we realized it is hard to extend and so introduced
+the `LogicalType` union. For backward compatibility reasons we allow to use the
+old `ConvertedType` values according to the specified rules but we expect that
+the logical types in the file schema are defined with `LogicalType` objects.
+
+The following LogicalTypes are supported:
+* `STRING`
+* `MAP`
+* `LIST`
+* `ENUM`
+* `DECIMAL` (for which primitives?)
+* `DATE`
+* `TIME`: **(Which unit, utc?)**
+* `TIMESTAMP`: **(Which unit, utc?)**
+* `INTEGER`: (all bitwidth 8, 16, 32, 64) **(unsigned?)**

Review comment:
       > I think parquet-cpp and parquet-mr both support unsigned right? 
   
   parquet-cpp definitely does, since C++ has native unsigned integers (i.e. no cast to signed is involved).
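
To make the distinction concrete, here is a minimal Python sketch. It is not taken from parquet-cpp or parquet-mr, and the helper names `as_unsigned` and `as_signed` are hypothetical. It assumes the unsigned value is carried in a signed two's-complement slot of the same bit width, which is the reinterpretation a reader without native unsigned integers would presumably need; with native unsigned types, as in C++, no such step is required.

```python
# Illustrative sketch only: helper names are hypothetical, not a Parquet API.
# Assumption: the value sits in a signed two's-complement slot of `bit_width` bits.

VALID_WIDTHS = (8, 16, 32, 64)


def as_unsigned(stored: int, bit_width: int) -> int:
    """Reinterpret a signed two's-complement value as an unsigned integer."""
    if bit_width not in VALID_WIDTHS:
        raise ValueError(f"unsupported bit width: {bit_width}")
    # Masking keeps the low `bit_width` bits, which maps negatives to the
    # upper half of the unsigned range.
    return stored & ((1 << bit_width) - 1)


def as_signed(value: int, bit_width: int) -> int:
    """Inverse mapping: the signed value that shares the same bit pattern."""
    if bit_width not in VALID_WIDTHS:
        raise ValueError(f"unsupported bit width: {bit_width}")
    value &= (1 << bit_width) - 1
    sign_bit = 1 << (bit_width - 1)
    return value - (1 << bit_width) if value & sign_bit else value


if __name__ == "__main__":
    # A 32-bit slot holding -1 carries the bit pattern of the largest
    # unsigned 32-bit value.
    print(as_unsigned(-1, 32))
    print(as_signed(as_unsigned(-1, 32), 32))
```

The two helpers are inverses over the range of each bit width, so a round trip through either one preserves the bit pattern.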




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
