hackeryard commented on code in PR #20377:
URL: https://github.com/apache/flink/pull/20377#discussion_r1036803986


##########
docs/content.zh/docs/connectors/table/hive/hive_read_write.md:
##########
@@ -150,6 +150,41 @@ Flink allows you to flexibly configure the policy of parallelism inference. In `TableConfig` you can
   </tbody>
 </table>
 
+### Tuning Split Size When Reading Hive Tables
+When reading a Hive table, the data files are divided into a number of splits, each of which is a portion of the data to be read.
+Splits are the basic granularity at which Flink assigns tasks and reads data in parallel.
+You can tune the size of each split with the following options to improve read performance.
+
+<table class="table table-bordered">
+  <thead>
+    <tr>
+        <th class="text-left" style="width: 20%">Key</th>
+        <th class="text-left" style="width: 15%">Default</th>
+        <th class="text-left" style="width: 10%">Type</th>
+        <th class="text-left" style="width: 55%">Description</th>
+    </tr>
+  </thead>
+  <tbody>
+    <tr>
+        <td><h5>table.exec.hive.split-max-size</h5></td>
+        <td style="word-wrap: break-word;">128mb</td>
+        <td>MemorySize</td>
+        <td>The maximum number of bytes (default is 128MB) a split can contain when reading a Hive table.</td>
+    </tr>
+    <tr>
+        <td><h5>table.exec.hive.file-open-cost</h5></td>
+        <td style="word-wrap: break-word;">4mb</td>
+        <td>MemorySize</td>
+        <td>The estimated cost in bytes (default is 4MB) of opening a file.
+             If this value is relatively large, Flink tends to split the Hive table into fewer splits, which can be helpful when the table contains a large number of small files.</td>

Review Comment:
   @luoyuxia when there are a lot of small files, e.g. 1MB per file, I think the split count is the same as the file count. So I don't understand why the file-open cost is a good way to solve this problem.


