[GitHub] [flink] luoyuxia commented on a diff in pull request #22805: [FLINK-32365][orc]get orc table statistics in parallel

via GitHub Thu, 06 Jul 2023 18:53:00 -0700


luoyuxia commented on code in PR #22805:
URL: https://github.com/apache/flink/pull/22805#discussion_r1255113307



##########
docs/content.zh/docs/connectors/table/hive/hive_read_write.md:
##########
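@@ -190,6 +190,10 @@ Flink lets you flexibly configure the parallelism inference policy. In `TableConfig` you can
 - Currently, the parameters above only apply to Hive tables in ORC format.
 {{< /hint >}}
 
+### Read Table Statistics
+
+When the Hive metastore (such as `orc` or `parquet`) has no statistics for the table, the table needs to be scanned to obtain them. You can use `table.exec.hive.read-statistics.thread-num` to configure the number of scan threads. The default value is the number of available processors in the current system, and the value you configure should be greater than 0.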

Review Comment:
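   Suggestion:
   When the Hive metastore has no statistics for the table, Flink will try to scan the table to obtain them so that it can generate a suitable execution plan. This process may be time-consuming; you can use `table.exec.hive.read-statistics.thread-num` to configure how many threads are used to scan the table. The default value is the number of available processors in the current system, and the configured value should be greater than 0.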



##########
docs/content/docs/connectors/table/hive/hive_read_write.md:
##########
@@ -208,7 +208,9 @@ Users can do some performance tuning by tuning the split's size with the follow
 
 ### Read Table Statistics
 
-To obtain hive table statistics faster, When hive table format is `orc` or `parquet`. You can use `table.exec.hive.read-format-statistics.thread-num` to configure the thread number. The default value is the number of available processors in the current system and the configured value should be bigger than 0.
+When the table statistic is not available from Hive metastore, such as `orc` or `parquet`. We will then try to get the statistic by scanning the table.

Review Comment:
   Suggestion:
   When table statistics are not available from the Hive metastore, Flink will try to scan the table to collect them so that it can generate a better execution plan. Collecting the statistics may take some time.
   



##########
docs/content/docs/connectors/table/hive/hive_read_write.md:
##########
@@ -208,7 +208,9 @@ Users can do some performance tuning by tuning the split's size with the follow
 
 ### Read Table Statistics
 
-To obtain hive table statistics faster, When hive table format is `orc` or `parquet`. You can use `table.exec.hive.read-format-statistics.thread-num` to configure the thread number. The default value is the number of available processors in the current system and the configured value should be bigger than 0.
+When the table statistic is not available from Hive metastore, such as `orc` or `parquet`. We will then try to get the statistic by scanning the table.

Review Comment:
   Suggestion:
   When table statistics are not available from the Hive metastore, Flink will try to scan the table to collect them so that it can generate a better execution plan. Collecting the statistics may take some time. To speed it up, you can use `table.exec.hive.read-statistics.thread-num` to configure how many threads are used to scan the table.
   The default value is the number of available processors in the current system, and the configured value should be greater than 0.
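
   To illustrate the behaviour the suggested wording describes, here is a minimal, stdlib-only sketch. It is not Flink's implementation: `ParallelStatsSketch`, `resolveThreadNum`, and `totalRowCount` are hypothetical names, and each "file scan" is a stand-in `Callable` that returns a row count. It assumes only what the text states: the thread number defaults to the number of available processors and must be greater than 0.

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical illustration; not part of Flink. */
public class ParallelStatsSketch {

    /** Returns the configured thread number, or the number of available processors if unset. */
    public static int resolveThreadNum(Integer configured) {
        if (configured == null) {
            return Runtime.getRuntime().availableProcessors();
        }
        if (configured <= 0) {
            throw new IllegalArgumentException("thread number must be greater than 0");
        }
        return configured;
    }

    /** Runs the per-file scans on a fixed-size pool and sums the row counts they return. */
    public static long totalRowCount(List<Callable<Long>> fileScans, Integer configuredThreadNum) {
        ExecutorService pool = Executors.newFixedThreadPool(resolveThreadNum(configuredThreadNum));
        try {
            long total = 0;
            // invokeAll blocks until every scan has completed (or failed).
            for (Future<Long> result : pool.invokeAll(fileScans)) {
                total += result.get();
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while reading statistics", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("Failed to read statistics", e.getCause());
        } finally {
            pool.shutdownNow();
        }
    }
}
```

   With more threads, independent file scans overlap, which is why a larger thread number can shorten the time needed to collect statistics for tables with many files.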
   



##########
flink-connectors/flink-connector-hive/src/main/java/org/apache/flink/connectors/hive/HiveOptions.java:
##########
@@ -135,7 +135,7 @@ public class HiveOptions {
                                     + " Support to configure multiple policies: 'metastore,success-file'.");
 
     public static final ConfigOption<Integer> TABLE_EXEC_HIVE_READ_FORMAT_STATISTICS_THREAD_NUM =

Review Comment:
   Rename `TABLE_EXEC_HIVE_READ_FORMAT_STATISTICS_THREAD_NUM` to `TABLE_EXEC_HIVE_READ_STATISTICS_THREAD_NUM`.



