This is an automated email from the ASF dual-hosted git repository.
lzljs3620320 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/paimon.git
The following commit(s) were added to refs/heads/master by this push:
new 3009807cdc [doc] Fixes on typo, grammar and format (#5048)
3009807cdc is described below
commit 3009807cdcb794c180d71079eaea516c9e54d0a6
Author: Xiaoguang Zhu <[email protected]>
AuthorDate: Fri Mar 7 13:13:34 2025 +0800
[doc] Fixes on typo, grammar and format (#5048)
---
docs/content/append-table/query-performance.md | 18 +--
docs/content/append-table/streaming.md | 16 +--
docs/content/append-table/update.md | 2 +-
docs/content/concepts/spec/fileindex.md | 123 +++++++++++----------
docs/content/primary-key-table/compaction.md | 8 +-
docs/content/primary-key-table/overview.md | 2 +-
.../shortcodes/generated/core_configuration.html | 2 +-
.../paimon/arrow/writer/ArrowFieldWriter.java | 2 +-
.../src/main/java/org/apache/paimon/TableType.java | 2 +-
.../apache/paimon/fileindex/FileIndexFormat.java | 64 +++++------
.../fileindex/bitmap/BitmapFileIndexMeta.java | 26 ++---
.../apache/paimon/table/object/ObjectTable.java | 4 +-
12 files changed, 135 insertions(+), 134 deletions(-)
diff --git a/docs/content/append-table/query-performance.md b/docs/content/append-table/query-performance.md
index 1c6ef7e1ae..e2128bbd89 100644
--- a/docs/content/append-table/query-performance.md
+++ b/docs/content/append-table/query-performance.md
@@ -30,17 +30,17 @@ under the License.
Paimon by default records the maximum and minimum values of each field in the manifest file.
-In the query, according to the `WHERE` condition of the query, according to the statistics in the manifest do files
-filtering, if the filtering effect is good, the query would have been minutes of the query will be accelerated to
+In the query, according to the `WHERE` condition of the query, together with the statistics in the manifest we can
+perform file filtering. If the filtering effect is good, the query that would have cost minutes will be accelerated to
milliseconds to complete the execution.
-Often the data distribution is not always effective filtering, so if we can sort the data by the field in `WHERE` condition?
-You can take a look at [Flink COMPACT Action]({{< ref "maintenance/dedicated-compaction#sort-compact" >}}) or
+Often the data distribution is not always ideal for filtering, so can we sort the data by the field in `WHERE` condition?
+You can take a look at [Flink COMPACT Action]({{< ref "maintenance/dedicated-compaction#sort-compact" >}}),
[Flink COMPACT Procedure]({{< ref "flink/procedures" >}}) or [Spark COMPACT Procedure]({{< ref "spark/procedures" >}}).
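A sort compact invocation through the Flink procedure could look roughly like the sketch below (database, table, strategy and columns are hypothetical; see the linked pages for the authoritative argument list):

```sql
-- Cluster data by the columns that appear in WHERE conditions so that
-- min/max statistics in the manifest can skip files effectively.
CALL sys.compact(`table` => 'default.orders', order_strategy => 'zorder', order_by => 'dt,user_id');
```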
## Data Skipping By File Index
-You can use file index too, it filters files by index on the read side.
+You can use file index too, it filters files by indexing on the reading side.
```sql
CREATE TABLE <PAIMON_TABLE> (<COLUMN> <COLUMN_TYPE> , ...) WITH (
@@ -49,10 +49,10 @@ CREATE TABLE <PAIMON_TABLE> (<COLUMN> <COLUMN_TYPE> , ...) WITH (
);
```
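For illustration, a filled-in version of the skeleton above might look like this (table and column names are hypothetical; the option key is the one shown below):

```sql
CREATE TABLE orders (
    order_id BIGINT,
    user_id  BIGINT,
    dt       STRING
) WITH (
    -- build a bloom-filter file index on the columns used for point lookups
    'file-index.bloom-filter.columns' = 'order_id,user_id'
);
```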
-Define `file-index.bloom-filter.columns`, Data file index is an external index file and Paimon will create its corresponding index file for each file. If the index
-file is too small, it will be stored directly in the manifest, otherwise in the directory of the data file. Each data file
-corresponds to an index file, which has a separate file definition and can contain different types of indexes with
-multiple columns.
+Define `file-index.bloom-filter.columns`, Data file index is an external index file and Paimon will create its
+corresponding index file for each file. If the index file is too small, it will be stored directly in the manifest,
+otherwise in the directory of the data file. Each data file corresponds to an index file, which has a separate file
+definition and can contain different types of indexes with multiple columns.
Different file indexes may be efficient in different scenarios. For example bloom filter may speed up query in point lookup
scenario. Using a bitmap may consume more space but can result in greater accuracy.
diff --git a/docs/content/append-table/streaming.md b/docs/content/append-table/streaming.md
index 80217ff6a9..3d843e4dbb 100644
--- a/docs/content/append-table/streaming.md
+++ b/docs/content/append-table/streaming.md
@@ -26,13 +26,13 @@ under the License.
# Streaming
-You can streaming write to the Append table in a very flexible way through Flink, or through read the Append table
+You can stream write to the Append table in a very flexible way through Flink, or read the Append table through
Flink, using it like a queue. The only difference is that its latency is in minutes. Its advantages are very low cost
and the ability to push down filters and projection.
## Pre small files merging
-Pre means that this compact occurs before committing files to the snapshot.
+"Pre" means that this compact occurs before committing files to the snapshot.
If Flink's checkpoint interval is short (for example, 30 seconds), each snapshot may produce lots of small changelog
files. Too many files may put a burden on the distributed storage cluster.
@@ -43,9 +43,9 @@ operator, which copies changelog files into large ones.
## Post small files merging
-Post means that this compact occurs after committing files to the snapshot.
+"Post" means that this compact occurs after committing files to the snapshot.
-In streaming writing job, without bucket definition, there is no compaction in writer, instead, will use
+In streaming write job, without bucket definition, there is no compaction in writer, instead, will use
`Compact Coordinator` to scan the small files and pass compaction task to `Compact Worker`. In streaming mode, if you
run insert sql in flink, the topology will be like this:
@@ -55,8 +55,8 @@ Do not worry about backpressure, compaction never backpressure.
If you set `write-only` to true, the `Compact Coordinator` and `Compact Worker` will be removed in the topology.
-The auto compaction is only supported in Flink engine streaming mode. You can also start a compaction job in flink by
-flink action in paimon and disable all the other compaction by set `write-only`.
+The auto compaction is only supported in Flink engine streaming mode. You can also start a compaction job in Flink by
+Flink action in Paimon and disable all the other compactions by setting `write-only`.
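As a sketch, assuming a table named `orders`, this combination could be configured like so:

```sql
-- Disable compaction inside the write job ...
ALTER TABLE orders SET ('write-only' = 'true');
-- ... and run compaction separately, e.g. via the Flink compact procedure.
CALL sys.compact(`table` => 'default.orders');
```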
## Streaming Query
@@ -64,8 +64,8 @@ You can stream the Append table and use it like a Message Queue. As with primary
for streaming reads:
1. By default, Streaming read produces the latest snapshot on the table upon first startup, and continue to read the
latest incremental records.
-2. You can specify `scan.mode` or `scan.snapshot-id` or `scan.timestamp-millis` or `scan.file-creation-time-millis` to
-   streaming read incremental only.
+2. You can specify `scan.mode`, `scan.snapshot-id`, `scan.timestamp-millis` and/or `scan.file-creation-time-millis` to
+   stream read incremental only.
Similar to flink-kafka, order is not guaranteed by default, if your data has some sort of order requirement, you also
need to consider defining a `bucket-key`, see [Bucketed Append]({{< ref "append-table/bucketed" >}})
diff --git a/docs/content/append-table/update.md b/docs/content/append-table/update.md
index d50b6be574..5e373cbd69 100644
--- a/docs/content/append-table/update.md
+++ b/docs/content/append-table/update.md
@@ -26,7 +26,7 @@ under the License.
# Update
-Now, only Spark SQL supports DELETE & UPDATE, you can take a look to [Spark Write]({{< ref "spark/sql-write" >}}).
+Now, only Spark SQL supports DELETE & UPDATE, you can take a look at [Spark Write]({{< ref "spark/sql-write" >}}).
Example:
```sql
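-- A minimal hypothetical sketch of the Spark SQL DELETE & UPDATE that Paimon
-- supports on such tables (table, columns and values are illustrative only):
UPDATE my_table SET price = 100 WHERE order_id = 1;
DELETE FROM my_table WHERE dt < '2025-01-01';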
diff --git a/docs/content/concepts/spec/fileindex.md b/docs/content/concepts/spec/fileindex.md
index a5431b8495..230c01939c 100644
--- a/docs/content/concepts/spec/fileindex.md
+++ b/docs/content/concepts/spec/fileindex.md
@@ -36,38 +36,38 @@ multiple columns.
File index file format. Put all column and offset in the header.
<pre>
- _____________________________________ _____________________
-| magic |version|head length |
-|-------------------------------------|
-| column number |
-|-------------------------------------|
-| column 1 | index number |
-|-------------------------------------|
-| index name 1 |start pos |length |
-|-------------------------------------|
-| index name 2 |start pos |length |
-|-------------------------------------|
-| index name 3 |start pos |length |
-|-------------------------------------| HEAD
-| column 2 | index number |
-|-------------------------------------|
-| index name 1 |start pos |length |
-|-------------------------------------|
-| index name 2 |start pos |length |
-|-------------------------------------|
-| index name 3 |start pos |length |
-|-------------------------------------|
-| ... |
-|-------------------------------------|
-| ... |
-|-------------------------------------|
-| redundant length |redundant bytes |
-|-------------------------------------| ---------------------
-| BODY |
-| BODY |
-| BODY | BODY
-| BODY |
-|_____________________________________| _____________________
+ ______________________________________ _____________________
+| magic |version|head length |
+|--------------------------------------|
+| column number |
+|--------------------------------------|
+| column 1 | index number |
+|--------------------------------------|
+| index name 1 |start pos |length |
+|--------------------------------------|
+| index name 2 |start pos |length |
+|--------------------------------------|
+| index name 3 |start pos |length |
+|--------------------------------------| HEAD
+| column 2 | index number |
+|--------------------------------------|
+| index name 1 |start pos |length |
+|--------------------------------------|
+| index name 2 |start pos |length |
+|--------------------------------------|
+| index name 3 |start pos |length |
+|--------------------------------------|
+| ... |
+|--------------------------------------|
+| ... |
+|--------------------------------------|
+| redundant length |redundant bytes |
+|--------------------------------------| ---------------------
+| BODY |
+| BODY |
+| BODY | BODY
+| BODY |
+|______________________________________| _____________________
*
magic: 8 bytes long, value is 1493475289347502L, BIG_ENDIAN
version: 4 bytes int, BIG_ENDIAN
@@ -168,31 +168,31 @@ length: 4 bytes int
Bitmap file index format (V1)
+-------------------------------------------------+-----------------
-| version (1 byte) |
+| version (1 byte) |
+-------------------------------------------------+
-| row count (4 bytes int) |
+| row count (4 bytes int) |
+-------------------------------------------------+
-| non-null value bitmap number (4 bytes int) |
+| non-null value bitmap number (4 bytes int) |
+-------------------------------------------------+
-| has null value (1 byte) |
+| has null value (1 byte) |
+-------------------------------------------------+
-| null value offset (4 bytes if has null value) | HEAD
+| null value offset (4 bytes if has null value) | HEAD
+-------------------------------------------------+
-| value 1 | offset 1 |
+| value 1 | offset 1 |
+-------------------------------------------------+
-| value 2 | offset 2 |
+| value 2 | offset 2 |
+-------------------------------------------------+
-| value 3 | offset 3 |
+| value 3 | offset 3 |
+-------------------------------------------------+
-| ... |
+| ... |
+-------------------------------------------------+-----------------
-| serialized bitmap 1 |
+| serialized bitmap 1 |
+-------------------------------------------------+
-| serialized bitmap 2 |
+| serialized bitmap 2 |
+-------------------------------------------------+ BODY
-| serialized bitmap 3 |
+| serialized bitmap 3 |
+-------------------------------------------------+
-| ... |
+| ... |
+-------------------------------------------------+-----------------
*
value x: var bytes for any data type (as bitmap identifier)
@@ -200,6 +200,7 @@ offset: 4 bytes int (when it is negative, it represents t
and its position is the inverse of the negative value)
</pre>
-Integer are all BIG_ENDIAN. In the paimon version that supports v2, the bitmap index version defaults to v2.
+Integers are all BIG_ENDIAN.
+In the paimon version that supports v2, the bitmap index version defaults to v2.
Bitmap only support the following data type:
@@ -294,7 +295,7 @@ Bitmap only support the following data type:
## Index: Bit-Slice Index Bitmap
-BSI file index is a numeric range index, used to accelerate range query, it can use with bitmap index.
+BSI file index is a numeric range index, used to accelerate range query, it can be used with bitmap index.
Define `'file-index.bsi.columns'`.
@@ -303,17 +304,17 @@ BSI file index format (V1):
<pre>
BSI file index format (V1)
+-------------------------------------------------+
-| version (1 byte) |
+| version (1 byte) |
+-------------------------------------------------+
-| row count (4 bytes int) |
+| row count (4 bytes int) |
+-------------------------------------------------+
-| has positive value (1 byte) |
+| has positive value (1 byte) |
+-------------------------------------------------+
-| positive BSI serialized (if has positive value)|
+| positive BSI serialized (if has positive value) |
+-------------------------------------------------+
-| has negative value (1 byte) |
+| has negative value (1 byte) |
+-------------------------------------------------+
-| negative BSI serialized (if has negative value)|
+| negative BSI serialized (if has negative value) |
+-------------------------------------------------+
</pre>
@@ -321,23 +322,23 @@ BSI serialized format (V1):
<pre>
BSI serialized format (V1)
+-------------------------------------------------+
-| version (1 byte) |
+| version (1 byte) |
+-------------------------------------------------+
-| min value (8 bytes long) |
+| min value (8 bytes long) |
+-------------------------------------------------+
-| max value (8 bytes long) |
+| max value (8 bytes long) |
+-------------------------------------------------+
-| serialized existence bitmap |
+| serialized existence bitmap |
+-------------------------------------------------+
-| bit slice bitmap count (4 bytes int) |
+| bit slice bitmap count (4 bytes int) |
+-------------------------------------------------+
-| serialized bit 0 bitmap |
+| serialized bit 0 bitmap |
+-------------------------------------------------+
-| serialized bit 1 bitmap |
+| serialized bit 1 bitmap |
+-------------------------------------------------+
-| serialized bit 2 bitmap |
+| serialized bit 2 bitmap |
+-------------------------------------------------+
-| ... |
+| ... |
+-------------------------------------------------+
</pre>
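Putting the two index kinds together, a table definition might look like this sketch (hypothetical names; the option keys follow the `file-index.<type>.columns` pattern used above):

```sql
CREATE TABLE events (
    event_type STRING,
    amount     BIGINT,
    dt         STRING
) WITH (
    -- bitmap index for a low-cardinality column used in equality predicates
    'file-index.bitmap.columns' = 'event_type',
    -- BSI index for a numeric column used in range predicates
    'file-index.bsi.columns' = 'amount'
);
```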
diff --git a/docs/content/primary-key-table/compaction.md b/docs/content/primary-key-table/compaction.md
index 38b302a0a0..416cbabfab 100644
--- a/docs/content/primary-key-table/compaction.md
+++ b/docs/content/primary-key-table/compaction.md
@@ -35,7 +35,7 @@ procedure is called compaction.
However, compaction is a resource intensive procedure which consumes a certain amount of CPU time and disk IO, so too
frequent compaction may in turn result in slower writes. It is a trade-off between query and write performance. Paimon
-currently adapts a compaction strategy similar to Rocksdb's [universal compaction](https://github.com/facebook/rocksdb/wiki/Universal-Compaction).
+currently adopts a compaction strategy similar to Rocksdb's [universal compaction](https://github.com/facebook/rocksdb/wiki/Universal-Compaction).
Compaction solves:
@@ -52,8 +52,8 @@ Writing performance is almost always affected by compaction, so its tuning is cr
## Asynchronous Compaction
-Compaction is inherently asynchronous, but if you want it to be completely asynchronous and not blocking writing,
-expect a mode to have maximum writing throughput, the compaction can be done slowly and not in a hurry.
+Compaction is inherently asynchronous, but if you want it to be completely asynchronous without blocking writes,
+expecting a mode for maximum writing throughput, the compaction can be done slowly and not in a hurry.
You can use the following strategies for your table:
```shell
@@ -62,7 +62,7 @@ sort-spill-threshold = 10
lookup-wait = false
```
-This configuration will generate more files during peak write periods and gradually merge into optimal read
+This configuration will generate more files during peak write periods and gradually merge them for optimal read
performance during low write periods.
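Expressed as table options in Flink SQL, the same strategy might look like this sketch (table name hypothetical; only the options quoted above are set):

```sql
ALTER TABLE orders SET (
    'sort-spill-threshold' = '10',
    'lookup-wait' = 'false'
);
```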
## Dedicated compaction job
diff --git a/docs/content/primary-key-table/overview.md b/docs/content/primary-key-table/overview.md
index d99a9ff683..d5bbe83900 100644
--- a/docs/content/primary-key-table/overview.md
+++ b/docs/content/primary-key-table/overview.md
@@ -46,7 +46,7 @@ Also, see [rescale bucket]({{< ref "maintenance/rescale-bucket" >}}) if you want
## LSM Trees
-Paimon adapts the LSM tree (log-structured merge-tree) as the data structure for file storage. This documentation briefly introduces the concepts about LSM trees.
+Paimon adopts the LSM tree (log-structured merge-tree) as the data structure for file storage. This documentation briefly introduces the concepts about LSM trees.
### Sorted Runs
diff --git a/docs/layouts/shortcodes/generated/core_configuration.html b/docs/layouts/shortcodes/generated/core_configuration.html
index 93bbb012cc..a97092b4a3 100644
--- a/docs/layouts/shortcodes/generated/core_configuration.html
+++ b/docs/layouts/shortcodes/generated/core_configuration.html
@@ -1036,7 +1036,7 @@ If the data size allocated for the sorting task is uneven,which may lead to perf
<td><h5>type</h5></td>
<td style="word-wrap: break-word;">table</td>
<td><p>Enum</p></td>
- <td>Type of the table.<br /><br />Possible values:<ul><li>"table": Normal Paimon table.</li><li>"format-table": A file format table refers to a directory that contains multiple files of the same format.</li><li>"materialized-table": A materialized table combines normal Paimon table and materialized SQL.</li><li>"object-table": A object table combines normal Paimon table and object location.</li></ul></td>
+ <td>Type of the table.<br /><br />Possible values:<ul><li>"table": Normal Paimon table.</li><li>"format-table": A file format table refers to a directory that contains multiple files of the same format.</li><li>"materialized-table": A materialized table combines normal Paimon table and materialized SQL.</li><li>"object-table": An object table combines normal Paimon table and object location.</li></ul></td>
</tr>
<tr>
<td><h5>write-buffer-for-append</h5></td>
diff --git a/paimon-arrow/src/main/java/org/apache/paimon/arrow/writer/ArrowFieldWriter.java b/paimon-arrow/src/main/java/org/apache/paimon/arrow/writer/ArrowFieldWriter.java
index 4df6b2f6ae..99a78e863c 100644
--- a/paimon-arrow/src/main/java/org/apache/paimon/arrow/writer/ArrowFieldWriter.java
+++ b/paimon-arrow/src/main/java/org/apache/paimon/arrow/writer/ArrowFieldWriter.java
@@ -45,7 +45,7 @@ public abstract class ArrowFieldWriter {
*
* @param columnVector Which holds the paimon data.
* @param pickedInColumn Which rows is picked to write. Pick all if null. This is used to adapt
- * deletion vector.
+ * to deletion vector.
* @param startIndex From where to start writing.
* @param batchRows How many rows to write.
*/
diff --git a/paimon-common/src/main/java/org/apache/paimon/TableType.java b/paimon-common/src/main/java/org/apache/paimon/TableType.java
index d9ac020f79..5f3f1f0d1f 100644
--- a/paimon-common/src/main/java/org/apache/paimon/TableType.java
+++ b/paimon-common/src/main/java/org/apache/paimon/TableType.java
@@ -33,7 +33,7 @@ public enum TableType implements DescribedEnum {
"materialized-table",
"A materialized table combines normal Paimon table and
materialized SQL."),
OBJECT_TABLE(
- "object-table", "A object table combines normal Paimon table and
object location.");
+ "object-table", "An object table combines normal Paimon table and
object location.");
private final String value;
private final String description;
diff --git a/paimon-common/src/main/java/org/apache/paimon/fileindex/FileIndexFormat.java b/paimon-common/src/main/java/org/apache/paimon/fileindex/FileIndexFormat.java
index 07ee5a18a0..9c8058f135 100644
--- a/paimon-common/src/main/java/org/apache/paimon/fileindex/FileIndexFormat.java
+++ b/paimon-common/src/main/java/org/apache/paimon/fileindex/FileIndexFormat.java
@@ -48,38 +48,38 @@ import java.util.stream.Collectors;
* File index file format. Put all column and offset in the header.
*
* <pre>
- * _____________________________________ _____________________
- * | magic |version|head length |
- * |-------------------------------------|
- * | column number |
- * |-------------------------------------|
- * | column 1 | index number |
- * |-------------------------------------|
- * | index name 1 |start pos |length |
- * |-------------------------------------|
- * | index name 2 |start pos |length |
- * |-------------------------------------|
- * | index name 3 |start pos |length |
- * |-------------------------------------| HEAD
- * | column 2 | index number |
- * |-------------------------------------|
- * | index name 1 |start pos |length |
- * |-------------------------------------|
- * | index name 2 |start pos |length |
- * |-------------------------------------|
- * | index name 3 |start pos |length |
- * |-------------------------------------|
- * | ... |
- * |-------------------------------------|
- * | ... |
- * |-------------------------------------|
- * | redundant length |redundant bytes |
- * |-------------------------------------| ---------------------
- * | BODY |
- * | BODY |
- * | BODY | BODY
- * | BODY |
- * |_____________________________________| _____________________
+ * ______________________________________ _____________________
+ * | magic |version|head length |
+ * |--------------------------------------|
+ * | column number |
+ * |--------------------------------------|
+ * | column 1 | index number |
+ * |--------------------------------------|
+ * | index name 1 |start pos |length |
+ * |--------------------------------------|
+ * | index name 2 |start pos |length |
+ * |--------------------------------------|
+ * | index name 3 |start pos |length |
+ * |--------------------------------------| HEAD
+ * | column 2 | index number |
+ * |--------------------------------------|
+ * | index name 1 |start pos |length |
+ * |--------------------------------------|
+ * | index name 2 |start pos |length |
+ * |--------------------------------------|
+ * | index name 3 |start pos |length |
+ * |--------------------------------------|
+ * | ... |
+ * |--------------------------------------|
+ * | ... |
+ * |--------------------------------------|
+ * | redundant length |redundant bytes |
+ * |--------------------------------------| ---------------------
+ * | BODY |
+ * | BODY |
+ * | BODY | BODY
+ * | BODY |
+ * |______________________________________| _____________________
*
* magic: 8 bytes long
* version: 4 bytes int
diff --git a/paimon-common/src/main/java/org/apache/paimon/fileindex/bitmap/BitmapFileIndexMeta.java b/paimon-common/src/main/java/org/apache/paimon/fileindex/bitmap/BitmapFileIndexMeta.java
index 8f934ca5e8..c631911885 100644
--- a/paimon-common/src/main/java/org/apache/paimon/fileindex/bitmap/BitmapFileIndexMeta.java
+++ b/paimon-common/src/main/java/org/apache/paimon/fileindex/bitmap/BitmapFileIndexMeta.java
@@ -61,31 +61,31 @@ import java.util.function.Function;
* <pre>
* Bitmap file index format (V1)
* +-------------------------------------------------+-----------------
- * | version (1 byte) |
+ * | version (1 byte) |
* +-------------------------------------------------+
- * | row count (4 bytes int) |
+ * | row count (4 bytes int) |
* +-------------------------------------------------+
- * | non-null value bitmap number (4 bytes int) |
+ * | non-null value bitmap number (4 bytes int) |
* +-------------------------------------------------+
- * | has null value (1 byte) |
+ * | has null value (1 byte) |
* +-------------------------------------------------+
- * | null value offset (4 bytes if has null value) | HEAD
+ * | null value offset (4 bytes if has null value) | HEAD
* +-------------------------------------------------+
- * | value 1 | offset 1 |
+ * | value 1 | offset 1 |
* +-------------------------------------------------+
- * | value 2 | offset 2 |
+ * | value 2 | offset 2 |
* +-------------------------------------------------+
- * | value 3 | offset 3 |
+ * | value 3 | offset 3 |
* +-------------------------------------------------+
- * | ... |
+ * | ... |
* +-------------------------------------------------+-----------------
- * | serialized bitmap 1 |
+ * | serialized bitmap 1 |
* +-------------------------------------------------+
- * | serialized bitmap 2 |
+ * | serialized bitmap 2 |
* +-------------------------------------------------+ BODY
- * | serialized bitmap 3 |
+ * | serialized bitmap 3 |
* +-------------------------------------------------+
- * | ... |
+ * | ... |
* +-------------------------------------------------+-----------------
*
* value x: var bytes for any data type (as bitmap identifier)
diff --git a/paimon-core/src/main/java/org/apache/paimon/table/object/ObjectTable.java b/paimon-core/src/main/java/org/apache/paimon/table/object/ObjectTable.java
index 992425ed52..4013fede49 100644
--- a/paimon-core/src/main/java/org/apache/paimon/table/object/ObjectTable.java
+++ b/paimon-core/src/main/java/org/apache/paimon/table/object/ObjectTable.java
@@ -37,8 +37,8 @@ import static org.apache.paimon.utils.Preconditions.checkArgument;
import static org.apache.paimon.utils.Preconditions.checkNotNull;
/**
- * A object table refers to a directory that contains multiple objects (files), Object table
- * provides metadata indexes for unstructured data objects in this directory. Allowing users to
+ * An object table refers to a directory that contains multiple objects (files). Object table
+ * provides metadata indexes for unstructured data objects in this directory, allowing users to
* analyze unstructured data in Object Storage.
*
* <p>Object Table stores the metadata of objects in the underlying table.
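For reference, creating such a table in Flink SQL might look like this sketch (bucket and location names are hypothetical, and `object-location` is assumed to be the option key; check the object table documentation for your version):

```sql
CREATE TABLE my_object_table WITH (
    'type' = 'object-table',
    -- directory of unstructured objects the table indexes (assumed key)
    'object-location' = 'oss://my_bucket/my_objects'
);
```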