This is an automated email from the ASF dual-hosted git repository.
lzljs3620320 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-paimon.git
The following commit(s) were added to refs/heads/master by this push:
new 6c2a95f94 [doc] Document File Format for write performance
6c2a95f94 is described below
commit 6c2a95f948018c0c39467d8fc13eabc11fa40e59
Author: JingsongLi <[email protected]>
AuthorDate: Wed Jul 5 21:05:05 2023 +0800
[doc] Document File Format for write performance
---
docs/content/maintenance/write-performance.md | 20 ++++++++++++++++++++
1 file changed, 20 insertions(+)
diff --git a/docs/content/maintenance/write-performance.md
b/docs/content/maintenance/write-performance.md
index 140173c6f..502b05f1d 100644
--- a/docs/content/maintenance/write-performance.md
+++ b/docs/content/maintenance/write-performance.md
@@ -135,6 +135,26 @@ One can easily see that too many sorted runs will result in poor query performance.
Compaction will become less frequent when `num-sorted-run.compaction-trigger`
becomes larger, thus improving writing performance. However, if this value
becomes too large, more memory and CPU time will be needed when querying the
table. This is a trade-off between writing and query performance.
+## File Format
+
+If you want to achieve ultimate compaction performance, you can consider using the row storage file format AVRO.
+- The advantage is that you can achieve high write throughput and compaction performance.
+- The disadvantage is that your analysis queries will be slow. The biggest problem with row storage is that it
+  does not support query projection. For example, if the table has 100 columns but you only query a few of them,
+  the IO cost of row storage cannot be ignored. Additionally, compression efficiency will decrease and storage
+  costs will increase.
+
+This is a tradeoff.
+
+Enable row storage through the following options:
+```shell
+file.format = avro
+metadata.stats-mode = none
+```
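+
+For example, with Flink SQL these options can be set in the `WITH` clause when creating the table (the table
+name and columns below are illustrative, not part of any real schema):
+
+```sql
+CREATE TABLE my_table (
+    id INT,
+    name STRING
+) WITH (
+    -- use row storage for high write and compaction throughput
+    'file.format' = 'avro',
+    -- skip per-file statistics collection, which is costly for row storage
+    'metadata.stats-mode' = 'none'
+);
+```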
+
+Collecting statistics for row storage is somewhat expensive, so we suggest turning off statistics as well.
+
## Write Initialize
In the initialization of write, the writer of the bucket needs to read all
historical files. If there is a bottleneck