This is an automated email from the ASF dual-hosted git repository.
lzljs3620320 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/paimon.git
The following commit(s) were added to refs/heads/master by this push:
new 473124d636 [doc] Document file format for orc avro csv
473124d636 is described below
commit 473124d6369f19a629b1c97fca1fd1d626c39dfb
Author: JingsongLi <[email protected]>
AuthorDate: Wed Aug 20 14:03:10 2025 +0800
[doc] Document file format for orc avro csv
---
docs/content/concepts/spec/fileformat.md | 350 ++++++++++++++++++++++++++++++-
1 file changed, 341 insertions(+), 9 deletions(-)
diff --git a/docs/content/concepts/spec/fileformat.md b/docs/content/concepts/spec/fileformat.md
index 9ad6164740..98798f7da0 100644
--- a/docs/content/concepts/spec/fileformat.md
+++ b/docs/content/concepts/spec/fileformat.md
@@ -26,10 +26,10 @@ under the License.
# File Format
-Currently, supports Parquet, Avro, ORC, JSON, CSV file formats.
+Currently, Paimon supports the Parquet, Avro, ORC, CSV, and JSON file formats.
- Recommended column format is Parquet, which has a high compression rate and
fast column projection queries.
- Recommended row based format is Avro, which has good performance in reading and writing full rows (all columns).
-- Recommended testing format is JSON, which has better readability but the
worst storage and read-write performance.
+- Recommended testing format is CSV, which has better readability but the worst read-write performance.
## PARQUET
@@ -142,25 +142,357 @@ The following table lists the type mapping from Paimon type to Parquet type.
Limitations:
1. [Parquet does not support nullable map keys](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#maps).
+## AVRO
+
+The following table lists the type mapping from Paimon type to Avro type.
+
+<table class="table table-bordered">
+ <thead>
+ <tr>
+ <th class="text-left">Paimon type</th>
+ <th class="text-left">Avro type</th>
+ <th class="text-left">Avro logical type</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td>CHAR / VARCHAR / STRING</td>
+ <td>string</td>
+ <td></td>
+ </tr>
+ <tr>
+ <td><code>BOOLEAN</code></td>
+ <td><code>boolean</code></td>
+ <td></td>
+ </tr>
+ <tr>
+ <td><code>BINARY / VARBINARY</code></td>
+ <td><code>bytes</code></td>
+ <td></td>
+ </tr>
+ <tr>
+ <td><code>DECIMAL</code></td>
+ <td><code>bytes</code></td>
+ <td><code>decimal</code></td>
+ </tr>
+ <tr>
+ <td><code>TINYINT</code></td>
+ <td><code>int</code></td>
+ <td></td>
+ </tr>
+ <tr>
+ <td><code>SMALLINT</code></td>
+ <td><code>int</code></td>
+ <td></td>
+ </tr>
+ <tr>
+ <td><code>INT</code></td>
+ <td><code>int</code></td>
+ <td></td>
+ </tr>
+ <tr>
+ <td><code>BIGINT</code></td>
+ <td><code>long</code></td>
+ <td></td>
+ </tr>
+ <tr>
+ <td><code>FLOAT</code></td>
+ <td><code>float</code></td>
+ <td></td>
+ </tr>
+ <tr>
+ <td><code>DOUBLE</code></td>
+ <td><code>double</code></td>
+ <td></td>
+ </tr>
+ <tr>
+ <td><code>DATE</code></td>
+ <td><code>int</code></td>
+ <td><code>date</code></td>
+ </tr>
+ <tr>
+ <td><code>TIME</code></td>
+ <td><code>int</code></td>
+ <td><code>time-millis</code></td>
+ </tr>
+ <tr>
+ <td><code>TIMESTAMP</code></td>
+ <td>P <= 3: long, P <= 6: long, P > 6: unsupported</td>
+ <td>P <= 3: timestampMillis, P <= 6: timestampMicros, P > 6: unsupported</td>
+ </tr>
+ <tr>
+ <td><code>TIMESTAMP_LOCAL_ZONE</code></td>
+ <td>P <= 3: long, P <= 6: long, P > 6: unsupported</td>
+ <td>P <= 3: timestampMillis, P <= 6: timestampMicros, P > 6: unsupported</td>
+ </tr>
+ <tr>
+ <td><code>ARRAY</code></td>
+ <td><code>array</code></td>
+ <td></td>
+ </tr>
+ <tr>
+ <td><code>MAP</code><br>
+ (key must be string/char/varchar type)</td>
+ <td><code>map</code></td>
+ <td></td>
+ </tr>
+ <tr>
+ <td><code>MULTISET</code><br>
+ (element must be string/char/varchar type)</td>
+ <td><code>map</code></td>
+ <td></td>
+ </tr>
+ <tr>
+ <td><code>ROW</code></td>
+ <td><code>record</code></td>
+ <td></td>
+ </tr>
+ </tbody>
+</table>
+
+In addition to the types listed above, Paimon maps nullable types to Avro `union(something, null)`,
+where `something` is the Avro type converted from Paimon type.
+
+You can refer to [Avro Specification](https://avro.apache.org/docs/1.12.0/specification/) for more information about Avro types.
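
The Avro mapping above, including the wrapping of nullable types, can be sketched in Python as follows. This is a minimal illustration and not Paimon's own converter: `PRIMITIVES`, `to_avro`, and `row_to_record` are hypothetical names, only part of the table is covered, and the branch order `[type, "null"]` simply follows the `union(something, null)` wording rather than any confirmed implementation detail.

```python
import json

# Subset of the table above: Paimon type -> (Avro type, Avro logical type).
# The dictionary and the helper names are illustrative assumptions.
PRIMITIVES = {
    "STRING": ("string", None),
    "BOOLEAN": ("boolean", None),
    "BINARY": ("bytes", None),
    "TINYINT": ("int", None),
    "SMALLINT": ("int", None),
    "INT": ("int", None),
    "BIGINT": ("long", None),
    "FLOAT": ("float", None),
    "DOUBLE": ("double", None),
    "DATE": ("int", "date"),
    "TIME": ("int", "time-millis"),
}


def to_avro(paimon_type, nullable=True):
    """Return an Avro schema fragment for one primitive Paimon type."""
    avro_type, logical = PRIMITIVES[paimon_type]
    if logical is None:
        schema = avro_type
    else:
        schema = {"type": avro_type, "logicalType": logical}
    # Nullable types become a union of the converted type and null.
    return [schema, "null"] if nullable else schema


def row_to_record(name, fields):
    """Build an Avro record from (field_name, paimon_type, nullable) tuples."""
    return {
        "type": "record",
        "name": name,
        "fields": [{"name": n, "type": to_avro(t, nl)} for n, t, nl in fields],
    }


if __name__ == "__main__":
    record = row_to_record("user", [("id", "BIGINT", False), ("born", "DATE", True)])
    print(json.dumps(record, indent=2))
```

Keeping the mapping in a plain lookup table mirrors the documentation table and makes unsupported types fail loudly with a `KeyError` instead of being mapped silently.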
+
## ORC
-TODO
+The following table lists the type mapping from Paimon type to ORC type.
+
+<table class="table table-bordered">
+ <thead>
+ <tr>
+ <th class="text-left">Paimon Type</th>
+ <th class="text-center">ORC physical type</th>
+ <th class="text-center">ORC logical type</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td>CHAR</td>
+ <td>bytes</td>
+ <td>CHAR</td>
+ </tr>
+ <tr>
+ <td>VARCHAR</td>
+ <td>bytes</td>
+ <td>VARCHAR</td>
+ </tr>
+ <tr>
+ <td>STRING</td>
+ <td>bytes</td>
+ <td>STRING</td>
+ </tr>
+ <tr>
+ <td>BOOLEAN</td>
+ <td>long</td>
+ <td>BOOLEAN</td>
+ </tr>
+ <tr>
+ <td>BYTES</td>
+ <td>bytes</td>
+ <td>BINARY</td>
+ </tr>
+ <tr>
+ <td>DECIMAL</td>
+ <td>decimal</td>
+ <td>DECIMAL</td>
+ </tr>
+ <tr>
+ <td>TINYINT</td>
+ <td>long</td>
+ <td>BYTE</td>
+ </tr>
+ <tr>
+ <td>SMALLINT</td>
+ <td>long</td>
+ <td>SHORT</td>
+ </tr>
+ <tr>
+ <td>INT</td>
+ <td>long</td>
+ <td>INT</td>
+ </tr>
+ <tr>
+ <td>BIGINT</td>
+ <td>long</td>
+ <td>LONG</td>
+ </tr>
+ <tr>
+ <td>FLOAT</td>
+ <td>double</td>
+ <td>FLOAT</td>
+ </tr>
+ <tr>
+ <td>DOUBLE</td>
+ <td>double</td>
+ <td>DOUBLE</td>
+ </tr>
+ <tr>
+ <td>DATE</td>
+ <td>long</td>
+ <td>DATE</td>
+ </tr>
+ <tr>
+ <td>TIMESTAMP</td>
+ <td>timestamp</td>
+ <td>TIMESTAMP</td>
+ </tr>
+ <tr>
+ <td>TIMESTAMP_LOCAL_ZONE</td>
+ <td>timestamp</td>
+ <td>TIMESTAMP_INSTANT</td>
+ </tr>
+ <tr>
+ <td>ARRAY</td>
+ <td>-</td>
+ <td>LIST</td>
+ </tr>
+ <tr>
+ <td>MAP</td>
+ <td>-</td>
+ <td>MAP</td>
+ </tr>
+ <tr>
+ <td>ROW</td>
+ <td>-</td>
+ <td>STRUCT</td>
+ </tr>
+ </tbody>
+</table>
Limitations:
1. ORC has a time zone bias when mapping `TIMESTAMP_LOCAL_ZONE` type, saving the millis value corresponding to the UTC literal time. Due to compatibility issues, this behavior cannot be modified.
-## AVRO
+## CSV
-TODO
+Experimental feature, not recommended for production.
-## JSON
+Format Options:
-Experimental feature, not recommended for production.
+<table class="table table-bordered">
+ <thead>
+ <tr>
+ <th class="text-left" style="width: 25%">Option</th>
+ <th class="text-center" style="width: 7%">Default</th>
+ <th class="text-center" style="width: 10%">Type</th>
+ <th class="text-center" style="width: 42%">Description</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td><h5>csv.field-delimiter</h5></td>
+ <td style="word-wrap: break-word;"><code>,</code></td>
+ <td>String</td>
+ <td>Field delimiter character (<code>','</code> by default), must be a single character. You can use a backslash to specify special characters, e.g. <code>'\t'</code> represents the tab character.
+ </td>
+ </tr>
+ <tr>
+ <td><h5>csv.line-delimiter</h5></td>
+ <td style="word-wrap: break-word;"><code>\n</code></td>
+ <td>String</td>
+ <td>The line delimiter for CSV format</td>
+ </tr>
+ <tr>
+ <td><h5>csv.quote-character</h5></td>
+ <td style="word-wrap: break-word;"><code>"</code></td>
+ <td>String</td>
+ <td>Quote character for enclosing field values (<code>"</code> by default).</td>
+ </tr>
+ <tr>
+ <td><h5>csv.escape-character</h5></td>
+ <td style="word-wrap: break-word;">\</td>
+ <td>String</td>
+ <td>The escape character for CSV format.</td>
+ </tr>
+ <tr>
+ <td><h5>csv.include-header</h5></td>
+ <td style="word-wrap: break-word;">false</td>
+ <td>Boolean</td>
+ <td>Whether to include header in CSV files.</td>
+ </tr>
+ <tr>
+ <td><h5>csv.null-literal</h5></td>
+ <td style="word-wrap: break-word;">(none)</td>
+ <td>String</td>
+ <td>Null literal string that is interpreted as a null value (disabled by default).</td>
+ </tr>
+ </tbody>
+</table>
-TODO
+The Paimon CSV format uses the [jackson databind API](https://github.com/FasterXML/jackson-databind) to parse and generate CSV strings.
-## CSV
+The following table lists the type mapping from Paimon type to CSV type.
+
+<table class="table table-bordered">
+ <thead>
+ <tr>
+ <th class="text-left">Paimon type</th>
+ <th class="text-left">CSV type</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td><code>CHAR / VARCHAR / STRING</code></td>
+ <td><code>string</code></td>
+ </tr>
+ <tr>
+ <td><code>BOOLEAN</code></td>
+ <td><code>boolean</code></td>
+ </tr>
+ <tr>
+ <td><code>BINARY / VARBINARY</code></td>
+ <td><code>string with encoding: base64</code></td>
+ </tr>
+ <tr>
+ <td><code>DECIMAL</code></td>
+ <td><code>number</code></td>
+ </tr>
+ <tr>
+ <td><code>TINYINT</code></td>
+ <td><code>number</code></td>
+ </tr>
+ <tr>
+ <td><code>SMALLINT</code></td>
+ <td><code>number</code></td>
+ </tr>
+ <tr>
+ <td><code>INT</code></td>
+ <td><code>number</code></td>
+ </tr>
+ <tr>
+ <td><code>BIGINT</code></td>
+ <td><code>number</code></td>
+ </tr>
+ <tr>
+ <td><code>FLOAT</code></td>
+ <td><code>number</code></td>
+ </tr>
+ <tr>
+ <td><code>DOUBLE</code></td>
+ <td><code>number</code></td>
+ </tr>
+ <tr>
+ <td><code>DATE</code></td>
+ <td><code>string with format: date</code></td>
+ </tr>
+ <tr>
+ <td><code>TIME</code></td>
+ <td><code>string with format: time</code></td>
+ </tr>
+ <tr>
+ <td><code>TIMESTAMP</code></td>
+ <td><code>string with format: date-time</code></td>
+ </tr>
+ <tr>
+ <td><code>TIMESTAMP_LOCAL_ZONE</code></td>
+ <td><code>string with format: date-time</code></td>
+ </tr>
+ </tbody>
+</table>
+
+## JSON
Experimental feature, not recommended for production.
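
To make the CSV format options and the `BINARY / VARBINARY` base64 mapping documented in the diff above more concrete, here is a small Python sketch. It only mimics the meaning of the options using Python's standard `csv` module; Paimon itself uses Jackson, so edge-case behavior may differ. The `CSV_OPTIONS` dictionary, `encode_value`, and `write_rows` are hypothetical names, and writing an empty field for null when no null literal is configured is an assumption.

```python
import base64
import csv
import io

# Defaults copied from the options table; the dictionary itself is illustrative.
CSV_OPTIONS = {
    "csv.field-delimiter": ",",
    "csv.line-delimiter": "\n",
    "csv.quote-character": '"',
    "csv.escape-character": "\\",
    "csv.include-header": False,
    "csv.null-literal": None,
}


def encode_value(value, null_literal):
    """Render one field as text; binary values are base64-encoded."""
    if value is None:
        # Assumption: with no null literal configured, null is written as empty.
        return "" if null_literal is None else null_literal
    if isinstance(value, (bytes, bytearray)):
        return base64.b64encode(bytes(value)).decode("ascii")
    return str(value)


def write_rows(header, rows, options=CSV_OPTIONS):
    """Serialize rows to a CSV string according to the given options."""
    buf = io.StringIO()
    writer = csv.writer(
        buf,
        delimiter=options["csv.field-delimiter"],
        lineterminator=options["csv.line-delimiter"],
        quotechar=options["csv.quote-character"],
        escapechar=options["csv.escape-character"],
        quoting=csv.QUOTE_MINIMAL,
    )
    if options["csv.include-header"]:
        writer.writerow(header)
    for row in rows:
        writer.writerow([encode_value(v, options["csv.null-literal"]) for v in row])
    return buf.getvalue()


if __name__ == "__main__":
    rows = [(1, "a,b", b"\x00\x01"), (2, None, b"")]
    print(write_rows(["id", "name", "payload"], rows), end="")
```

A field that contains the delimiter is enclosed in the quote character, which is why the delimiter must be a single character that the writer can detect unambiguously.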