(parquet-site) branch production updated: GH-68: Match language from parquet-format after merge of PARQUET-2139 (#69)

gangwu Sun, 07 Jul 2024 19:25:41 -0700

This is an automated email from the ASF dual-hosted git repository.

gangwu pushed a commit to branch production
in repository https://gitbox.apache.org/repos/asf/parquet-site.git



The following commit(s) were added to refs/heads/production by this push:
     new a407d81  GH-68: Match language from parquet-format after merge of PARQUET-2139 (#69)
a407d81 is described below

commit a407d81a41a90b58ae90a6567a84dd084b5d2947
Author: Ed Seidl <etse...@users.noreply.github.com>
AuthorDate: Sun Jul 7 19:25:32 2024 -0700

    GH-68: Match language from parquet-format after merge of PARQUET-2139 (#69)
---
 content/en/docs/File Format/_index.md   | 22 +++++++++++-----------
 content/en/docs/File Format/metadata.md | 13 +++++++++++--
 2 files changed, 22 insertions(+), 13 deletions(-)

diff --git a/content/en/docs/File Format/_index.md b/content/en/docs/File Format/_index.md
index 7d49ccb..3ca8fce 100644
--- a/content/en/docs/File Format/_index.md     
+++ b/content/en/docs/File Format/_index.md     
@@ -11,29 +11,29 @@ This file and the thrift definition should be read together to understand the fo
 
 ```
     4-byte magic number "PAR1"
-    <Column 1 Chunk 1 + Column Metadata>
-    <Column 2 Chunk 1 + Column Metadata>
+    <Column 1 Chunk 1>
+    <Column 2 Chunk 1>
     ...
-    <Column N Chunk 1 + Column Metadata>
-    <Column 1 Chunk 2 + Column Metadata>
-    <Column 2 Chunk 2 + Column Metadata>
+    <Column N Chunk 1>
+    <Column 1 Chunk 2>
+    <Column 2 Chunk 2>
     ...
-    <Column N Chunk 2 + Column Metadata>
+    <Column N Chunk 2>
     ...
-    <Column 1 Chunk M + Column Metadata>
-    <Column 2 Chunk M + Column Metadata>
+    <Column 1 Chunk M>
+    <Column 2 Chunk M>
     ...
-    <Column N Chunk M + Column Metadata>
+    <Column N Chunk M>
     File Metadata
     4-byte length in bytes of file metadata (little endian)
     4-byte magic number "PAR1"
 ```
 In the above example, there are N columns in this table, split into M row
-groups.  The file metadata contains the locations of all the column metadata
+groups.  The file metadata contains the locations of all the column chunk
 start locations.  More details on what is contained in the metadata can be found
 in the Thrift definition.
 
-Metadata is written after the data to allow for single pass writing.
+File metadata is written after the data to allow for single pass writing.
 
 Readers are expected to first read the file metadata to find all the column
 chunks they are interested in.  The columns chunks should then be read sequentially.
diff --git a/content/en/docs/File Format/metadata.md b/content/en/docs/File Format/metadata.md
index a2eae25..f86b160 100644
--- a/content/en/docs/File Format/metadata.md   
+++ b/content/en/docs/File Format/metadata.md   
@@ -3,8 +3,17 @@ title: "Metadata"
 linkTitle: "Metadata"
 weight: 5
 ---
-There are three types of metadata: file metadata, column (chunk) metadata and page
-header metadata.  All thrift structures are serialized using the TCompactProtocol.
+There are two types of metadata: file metadata, and page header metadata.
+In the diagram below, file metadata is described by the `FileMetaData`
+structure. This file metadata provides offset and size information useful
+when navigating the Parquet file. Page header metadata (`PageHeader` and
+children in the diagram) is stored in-line with the page data, and is
+used in the reading and decoding of said data.
+
+
+All thrift structures are serialized using the TCompactProtocol. The full
+definition of these structures is given in the Parquet
+[Thrift definition](https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift).
 
 
 ![File Layout](/images/FileFormat.gif)
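
Note on the layout described in the `_index.md` hunk above: because the file metadata sits at the end of the file, followed by its 4-byte little-endian length and the trailing "PAR1" magic, a reader can locate it by working backwards from the end. The following is a minimal sketch of that step only, not a reference implementation. The function name `read_file_metadata_bytes` is hypothetical, and the sketch assumes an unencrypted file that begins and ends with "PAR1". It returns the raw serialized bytes; decoding them as the Thrift `FileMetaData` structure (TCompactProtocol) is left out.

```python
import struct

MAGIC = b"PAR1"  # 4-byte magic number at both ends of the file


def read_file_metadata_bytes(path):
    """Return the raw (still Thrift-serialized) file metadata bytes.

    Hypothetical helper for illustration; assumes the plain "PAR1" layout
    shown in the diff above.
    """
    with open(path, "rb") as f:
        f.seek(0, 2)  # seek to end to learn the file size
        size = f.tell()
        # Smallest possible file: leading magic + length field + trailing magic.
        if size < 12:
            raise ValueError("file too small to be a Parquet file")

        f.seek(0)
        if f.read(4) != MAGIC:
            raise ValueError("missing leading magic number")

        # Last 8 bytes: 4-byte little-endian metadata length, then the magic.
        f.seek(size - 8)
        tail = f.read(8)
        if tail[4:] != MAGIC:
            raise ValueError("missing trailing magic number")
        (meta_len,) = struct.unpack("<I", tail[:4])
        if meta_len > size - 12:
            raise ValueError("metadata length exceeds file size")

        # The file metadata ends immediately before the length field.
        f.seek(size - 8 - meta_len)
        return f.read(meta_len)
```

The bytes returned here are what a reader would hand to a Thrift compact-protocol decoder to obtain the column chunk locations before reading the chunks themselves.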
