(paimon) branch master updated: [doc] Specific true column names for manifest files

lzljs3620320 Sun, 15 Dec 2024 23:06:14 -0800

This is an automated email from the ASF dual-hosted git repository.

lzljs3620320 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/paimon.git



The following commit(s) were added to refs/heads/master by this push:
     new 9c2f6d15e5 [doc] Specific true column names for manifest files
9c2f6d15e5 is described below

commit 9c2f6d15e503f68841d52b5d70d77417c913cbcc
Author: Jingsong <[email protected]>
AuthorDate: Mon Dec 16 15:04:32 2024 +0800

    [doc] Specific true column names for manifest files
---
 docs/content/concepts/spec/datafile.md | 40 ++++++++++++--
 docs/content/concepts/spec/manifest.md | 97 ++++++++++++++++++++--------------
 2 files changed, 93 insertions(+), 44 deletions(-)

diff --git a/docs/content/concepts/spec/datafile.md b/docs/content/concepts/spec/datafile.md
index 6ba43a421f..923a8da582 100644
--- a/docs/content/concepts/spec/datafile.md
+++ b/docs/content/concepts/spec/datafile.md
@@ -83,11 +83,45 @@ relationship between various table types and buckets in Paimon:
 The name of data file is `data-${uuid}-${id}.${format}`. For the append table, the file stores the data of the table
 without adding any new columns. But for the primary key table, each row of data stores additional system columns:
 
-1. `_VALUE_KIND`: row is deleted or added. Similar to RocksDB, each row of data can be deleted or added, which will be
+## Table with Primary key Data File
+
+1. Primary key columns, `_KEY_` prefix to key columns, this is to avoid conflicts with columns of the table. It's optional,
+   Paimon version 1.0 and above will retrieve the primary key fields from value_columns.
+2. `_VALUE_KIND`: TINYINT, row is deleted or added. Similar to RocksDB, each row of data can be deleted or added, which will be
    used for updating the primary key table.
-2. `_SEQUENCE_NUMBER`: this number is used for comparison during updates, determining which data came first and which
+3. `_SEQUENCE_NUMBER`: BIGINT, this number is used for comparison during updates, determining which data came first and which
    data came later.
-3. `_KEY_` prefix to key columns, this is to avoid conflicts with columns of the table.
+4. Value columns. All columns declared in the table.
+
+For example, data file for table:
+
+```sql
+CREATE TABLE T (
+    a INT PRIMARY KEY NOT ENFORCED,
+    b INT,
+    c INT
+);
+```
+
+Its file has 6 columns: `_KEY_a`, `_VALUE_KIND`, `_SEQUENCE_NUMBER`, `a`, `b`, `c`.
+
+When `data-file.thin-mode` is enabled, its file has 5 columns: `_VALUE_KIND`, `_SEQUENCE_NUMBER`, `a`, `b`, `c`.
+
+## Table w/o Primary key Data File
+
+- Value columns. All columns declared in the table.
+
+For example, data file for table:
+
+```sql
+CREATE TABLE T (
+    a INT,
+    b INT,
+    c INT
+);
+```
+
+Its file has 3 columns: `a`, `b`, `c`.
 
 ## Changelog File
 
diff --git a/docs/content/concepts/spec/manifest.md b/docs/content/concepts/spec/manifest.md
index 8460febf78..9cc5afca0f 100644
--- a/docs/content/concepts/spec/manifest.md
+++ b/docs/content/concepts/spec/manifest.md
@@ -35,13 +35,13 @@ under the License.
 
 Manifest List includes meta of several manifest files. Its name contains UUID, it is a avro file, the schema is:
 
-1. fileName: manifest file name.
-2. fileSize: manifest file size.
-3. numAddedFiles: number added files in manifest.
-4. numDeletedFiles: number deleted files in manifest.
-5. partitionStats: partition stats, the minimum and maximum values of partition fields in this manifest are beneficial
+1. _FILE_NAME: STRING, manifest file name.
+2. _FILE_SIZE: BIGINT, manifest file size.
+3. _NUM_ADDED_FILES: BIGINT, number added files in manifest.
+4. _NUM_DELETED_FILES: BIGINT, number deleted files in manifest.
+5. _PARTITION_STATS: SimpleStats, partition stats, the minimum and maximum values of partition fields in this manifest are beneficial
    for skipping certain manifest files during queries, it is a SimpleStats.
-6. schemaId: schema id when writing this manifest file.
+6. _SCHEMA_ID: BIGINT, schema id when writing this manifest file.
 
 ## Manifest
 
@@ -63,31 +63,31 @@ Data Manifest includes meta of several data files or changelog files.
 
 The schema is:
 
-1. kind: ADD or DELETE,
-2. partition: partition spec, a BinaryRow.
-3. bucket: bucket of this file.
-4. totalBuckets: total buckets when write this file, it is used for verification after bucket changes.
-5. file: data file meta.
+1. _KIND: TINYINT, ADD or DELETE,
+2. _PARTITION: BYTES, partition spec, a BinaryRow.
+3. _BUCKET: INT, bucket of this file.
+4. _TOTAL_BUCKETS: INT, total buckets when write this file, it is used for verification after bucket changes.
+5. _FILE: data file meta.
 
 The data file meta is:
 
-1. fileName: file name.
-2. fileSize: file size.
-3. rowCount: total number of rows (including add & delete) in this file.
-4. minKey: the minimum key of this file.
-5. maxKey: the maximum key of this file.
-6. keyStats: the statistics of the key.
-7. valueStats: the statistics of the value.
-8. minSequenceNumber: the minimum sequence number.
-9. maxSequenceNumber: the maximum sequence number.
-10. schemaId: schema id when write this file.
-11. level: level of this file, in LSM.
-12. extraFiles: extra files for this file, for example, data file index file.
-13. creationTime: creation time of this file.
-14. deleteRowCount: rowCount = addRowCount + deleteRowCount.
-15. embeddedIndex: if data file index is too small, store the index in manifest.
-16. fileSource: indicate whether this file is generated as an APPEND or COMPACT file
-17. valueStatsCols: statistical column in metadata 
+1. _FILE_NAME: STRING, file name.
+2. _FILE_SIZE: BIGINT, file size.
+3. _ROW_COUNT: BIGINT, total number of rows (including add & delete) in this file.
+4. _MIN_KEY: STRING, the minimum key of this file.
+5. _MAX_KEY: STRING, the maximum key of this file.
+6. _KEY_STATS: SimpleStats, the statistics of the key.
+7. _VALUE_STATS: SimpleStats, the statistics of the value.
+8. _MIN_SEQUENCE_NUMBER: BIGINT, the minimum sequence number.
+9. _MAX_SEQUENCE_NUMBER: BIGINT, the maximum sequence number.
+10. _SCHEMA_ID: BIGINT, schema id when write this file.
+11. _LEVEL: INT, level of this file, in LSM.
+12. _EXTRA_FILES: ARRAY<STRING>, extra files for this file, for example, data file index file.
+13. _CREATION_TIME: TIMESTAMP_MILLIS, creation time of this file.
+14. _DELETE_ROW_COUNT: BIGINT, rowCount = addRowCount + deleteRowCount.
+15. _EMBEDDED_FILE_INDEX: BYTES, if data file index is too small, store the index in manifest.
+16. _FILE_SOURCE: TINYINT, indicate whether this file is generated as an APPEND or COMPACT file
+17. _VALUE_STATS_COLS: ARRAY<STRING>, statistical column in metadata 
 
 ### Index Manifest
 
@@ -100,20 +100,35 @@ Index Manifest includes meta of several [table-index]({{< ref "concepts/spec/tab
 
 The schema is:
 
-1. kind: ADD or DELETE,
-2. partition: partition spec, a BinaryRow.
-3. bucket: bucket of this file.
-4. indexFile: index file meta.
-
-The index file meta is:
-
-1. indexType: string, "HASH" or "DELETION_VECTORS".
-2. fileName: file name.
-3. fileSize: file size.
-4. rowCount: total number of rows.
-5. deletionVectorsRanges: Metadata only used by "DELETION_VECTORS", is an array of deletion vector meta, the schema of each deletion vector meta is:
+1. _KIND: TINYINT, ADD or DELETE,
+2. _PARTITION: BYTES, partition spec, a BinaryRow.
+3. _BUCKET: INT, bucket of this file.
+4. _INDEX_TYPE: STRING, "HASH" or "DELETION_VECTORS".
+5. _FILE_NAME: STRING, file name.
+6. _FILE_SIZE: BIGINT, file size.
+7. _ROW_COUNT: BIGINT, total number of rows.
+8. _DELETIONS_VECTORS_RANGES: Metadata only used by "DELETION_VECTORS", is an array of deletion vector meta, the schema of each deletion vector meta is:
    1. f0: the data file name corresponding to this deletion vector.
    2. f1: the starting offset of this deletion vector in the index file.
    3. f2: the length of this deletion vector in the index file.
-   4. cardinality: the number of deleted rows.
+   4. _CARDINALITY: the number of deleted rows.
+
+## Appendix
+
+### SimpleStats
+
+SimpleStats is nested row, the schema is:
+
+1. _MIN_VALUES: BYTES, BinaryRow, the minimum values of the columns.
+2. _MAX_VALUES: BYTES, BinaryRow, the maximum values of the columns.
+3. _NULL_COUNTS: ARRAY<BIGINT>, the number of nulls of the columns.
+
+### BinaryRow
+
+BinaryRow is backed by bytes instead of Object. It can significantly reduce the serialization/deserialization of Java
+objects.
 
+A Row has two parts: a fixed-length part and a variable-length part. The fixed-length part contains a 1-byte header, the null bit
+set, and the field values. The null bit set is used for null tracking and is aligned to 8-byte word boundaries. `Field values`
+holds fixed-length primitive types and variable-length values that can be stored inside 8 bytes. If a variable-length value does not fit
+in the field, the field stores the length and offset of the variable-length part instead.
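
The BinaryRow layout described in the manifest.md appendix can be turned into a small size calculation. This is a minimal sketch, not Paimon's implementation: it assumes the 1-byte header and the null bit set (one bit per field) share one region that is rounded up to whole 8-byte words, and that every field then takes one 8-byte slot. The function names are hypothetical.

```python
# Assumption: the 1-byte header occupies the first 8 bits of the region
# that also holds the null bit set.
HEADER_SIZE_IN_BITS = 8


def null_bits_size_in_bytes(arity: int) -> int:
    """Bytes used by header + null bit set, rounded up to an 8-byte word."""
    return ((arity + 63 + HEADER_SIZE_IN_BITS) // 64) * 8


def fixed_part_size_in_bytes(arity: int) -> int:
    """Header + null bit set, followed by one assumed 8-byte slot per field."""
    return null_bits_size_in_bytes(arity) + 8 * arity
```

Under these assumptions a 3-field row has a 32-byte fixed-length part, and the header-plus-null-bits region grows from 8 to 16 bytes once the row has more than 56 fields.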

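The column lists given in the datafile.md change (6 columns for the primary key table, 5 with `data-file.thin-mode`, 3 for the table without a primary key) follow a simple rule that can be sketched as below. The helper `data_file_columns` and its parameters are hypothetical and only mirror the documented examples; they are not part of any Paimon API.

```python
def data_file_columns(value_columns, primary_keys, thin_mode=False):
    """Return the column names a data file would hold, per the spec text.

    value_columns: all columns declared in the table, in order.
    primary_keys:  primary key column names (empty for an append table).
    thin_mode:     models `data-file.thin-mode`, which drops the _KEY_ columns.
    """
    if not primary_keys:
        # Table without primary key: only the declared value columns.
        return list(value_columns)
    key_columns = [] if thin_mode else [f"_KEY_{name}" for name in primary_keys]
    return key_columns + ["_VALUE_KIND", "_SEQUENCE_NUMBER"] + list(value_columns)


pk_columns = data_file_columns(["a", "b", "c"], ["a"])
thin_columns = data_file_columns(["a", "b", "c"], ["a"], thin_mode=True)
append_columns = data_file_columns(["a", "b", "c"], [])
```

For the example table `T`, `pk_columns` matches the six names listed in the diff, `thin_columns` the five thin-mode names, and `append_columns` the three value columns.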