[GitHub] [incubator-hudi] yihua commented on a change in pull request #926: [HUDI-278] Translate Administering page

2019-09-26 Thread GitBox
yihua commented on a change in pull request #926: [HUDI-278] Translate 
Administering page
URL: https://github.com/apache/incubator-hudi/pull/926#discussion_r328912400
 
 

 ##
 File path: docs/admin_guide.cn.md
 ##
 @@ -374,71 +365,69 @@ Compaction successfully repaired
 ```
 
 
-## Metrics {#metrics}
+## 指标 {#metrics}
 
-Once the Hudi Client is configured with the right datasetname and environment 
for metrics, it produces the following graphite metrics, that aid in debugging 
hudi datasets
+为Hudi Client配置正确的数据集名称和指标环境后,它将生成以下graphite指标,以帮助调试hudi数据集
 
- - **Commit Duration** - This is amount of time it took to successfully commit 
a batch of records
- - **Rollback Duration** - Similarly, amount of time taken to undo partial 
data left over by a failed commit (happens everytime automatically after a 
failing write)
- - **File Level metrics** - Shows the amount of new files added, versions, 
deleted (cleaned) in each commit
- - **Record Level Metrics** - Total records inserted/updated etc per commit
- - **Partition Level metrics** - number of partitions upserted (super useful 
to understand sudden spikes in commit duration)
+ - **提交持续时间** - 这是成功提交一批记录所花费的时间
+ - **回滚持续时间** - 同样,撤消失败的提交所剩余的部分数据所花费的时间(每次写入失败后都会自动发生)
+ - **文件级别指标** - 显示每次提交中新增、版本、删除(清除)的新文件数量
+ - **记录级别指标** - 每次提交插入/更新的记录总数
+ - **分区级别指标** - 更新的分区数量(对于了解提交持续时间的突然峰值非常有用)
 
-These metrics can then be plotted on a standard tool like grafana. Below is a 
sample commit duration chart.
+然后可以将这些指标绘制在grafana等标准工具上。以下是提交持续时间图表示例。
 
 
 
 
 
 
-## Troubleshooting Failures {#troubleshooting}
-
-Section below generally aids in debugging Hudi failures. Off the bat, the 
following metadata is added to every record to help triage  issues easily using 
standard Hadoop SQL engines (Hive/Presto/Spark)
-
- - **_hoodie_record_key** - Treated as a primary key within each DFS 
partition, basis of all updates/inserts
- - **_hoodie_commit_time** - Last commit that touched this record
- - **_hoodie_file_name** - Actual file name containing the record (super 
useful to triage duplicates)
- - **_hoodie_partition_path** - Path from basePath that identifies the 
partition containing this record
+## 故障排除 {#troubleshooting}
 
-Note that as of now, Hudi assumes the application passes in the same 
deterministic partitionpath for a given recordKey. i.e the uniqueness of record 
key is only enforced within each partition
+以下部分通常有助于调试Hudi故障。将以下元数据添加到每条记录中,以帮助使用标准Hadoop SQL引擎(Hive/Presto/Spark)轻松分类问题。
 
+ - **_hoodie_record_key** - 作为每个DFS分区内的主键,是所有更新/插入的基础
+ - **_hoodie_commit_time** - 该记录上次的提交
+ - **_hoodie_file_name** - 包含记录的实际文件名(对分类重复非常有用)
+ - **_hoodie_partition_path** - basePath的路径,该路径标识包含此记录的分区
 
- Missing records
+请注意,到目前为止,Hudi假定应用程序为给定的recordKey传递相同的确定性分区路径。即每个分区内强制recordKey的唯一性。
 
-Please check if there were any write errors using the admin commands above, 
during the window at which the record could have been written.
-If you do find errors, then the record was not actually written by Hudi, but 
handed back to the application to decide what to do with it.
+ 缺失记录
 
- Duplicates
+请在可以写入记录的窗口中,使用上面的admin命令检查是否存在任何写入错误。
+如果确实发现错误,那么记录实际上不是由Hudi写入的,而是交还给应用程序来决定如何处理。
 
-First of all, please confirm if you do indeed have duplicates **AFTER** 
ensuring the query is accessing the Hudi datasets [properly](sql_queries.html) .
+ 重复
 
- - If confirmed, please use the metadata fields above, to identify the 
physical files & partition files containing the records .
- - If duplicates span files across partitionpath, then this means your 
application is generating different partitionPaths for same recordKey, Please 
fix your app
- - if duplicates span multiple files within the same partitionpath, please 
engage with mailing list. This should not happen. You can use the `records 
deduplicate` command to fix your data.
+首先,请确认是否确实存在重复**AFTER**,以确保查询可以[正确](sql_queries.html)访问Hudi数据集。
 
- Spark failures {#spark-ui}
+ - 如果确认,请使用上面的元数据字段来标识包含记录的物理文件和分区文件。
+ - 如果重复的文件跨越整个分区路径,则意味着您的应用程序正在为同一recordKey生成不同的分区路径,请修复您的应用程序.
+ - 如果重复跨越同一分区路径中的多个文件,请使用邮件列表。这不应该发生。您可以使用`records deduplicate`命令修复数据。
 
-Typical upsert() DAG looks like below. Note that Hudi client also caches 
intermediate RDDs to intelligently profile workload and size files and spark 
parallelism.
-Also Spark UI shows sortByKey twice due to the probe job also being shown, 
nonetheless its just a single sort.
+ Spark故障 {#spark-ui}
 
+典型的upsert() DAG如下所示。请注意,Hudi客户端会缓存中间的RDD,以智能地分析工作负载和大小文件和并行度。
 
 Review comment:
   “和大小文件和并行度” => “并调整文件大小和Spark并行度”
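
  For readers following this thread, a minimal spark-shell sketch of the upsert write path this comment refers to; workload profiling, file sizing and Spark parallelism are steered through write options. This is an editor's illustration, not part of the PR: `inputDF`, the table name `trips`, the field names and the base path are placeholders, and the option keys are the usual Hudi write configs.

```
// Hypothetical upsert of an incoming DataFrame (inputDF) into a Hudi dataset.
// File sizing and shuffle parallelism are controlled via write options.
import org.apache.spark.sql.SaveMode

inputDF.write
  .format("org.apache.hudi")
  .option("hoodie.table.name", "trips")                          // placeholder table name
  .option("hoodie.datasource.write.recordkey.field", "uuid")     // placeholder record key field
  .option("hoodie.datasource.write.partitionpath.field", "ts")   // placeholder partition path field
  .option("hoodie.datasource.write.precombine.field", "ts")      // placeholder precombine field
  .option("hoodie.upsert.shuffle.parallelism", "200")            // Spark parallelism for upserts
  .option("hoodie.parquet.max.file.size", (120 * 1024 * 1024).toString) // target base file size
  .mode(SaveMode.Append)
  .save("/path/to/trips")   // placeholder base path
```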




[GitHub] [incubator-hudi] yihua commented on a change in pull request #926: [HUDI-278] Translate Administering page

2019-09-26 Thread GitBox
yihua commented on a change in pull request #926: [HUDI-278] Translate 
Administering page
URL: https://github.com/apache/incubator-hudi/pull/926#discussion_r328906434
 
 

 ##
 File path: docs/admin_guide.cn.md
 ##
 @@ -70,8 +70,7 @@ Metadata for table trips loaded
 hoodie:trips->
 ```
 
-Once connected to the dataset, a lot of other commands become available. The 
shell has contextual autocomplete help (press TAB) and below is a list of all 
commands, few of which are reviewed in this section
-are reviewed
+连接到数据集后,便可使用许多其他命令。该shell程序具有上下文自动完成帮助(按TAB键),下面是所有命令的列表,本节中没有对其中的一些内容进行了回顾。
 
 Review comment:
   “本节中没有对其中的一些内容进行了回顾” => “本节中对其中的一些命令进行了详细示例”
   Based on the content, the "few" here should actually be "a few".




[GitHub] [incubator-hudi] yihua commented on a change in pull request #926: [HUDI-278] Translate Administering page

2019-09-26 Thread GitBox
yihua commented on a change in pull request #926: [HUDI-278] Translate 
Administering page
URL: https://github.com/apache/incubator-hudi/pull/926#discussion_r328911799
 
 

 ##
 File path: docs/admin_guide.cn.md
 ##
 @@ -374,71 +365,69 @@ Compaction successfully repaired
 ```
 
 
-## Metrics {#metrics}
+## 指标 {#metrics}
 
-Once the Hudi Client is configured with the right datasetname and environment 
for metrics, it produces the following graphite metrics, that aid in debugging 
hudi datasets
+为Hudi Client配置正确的数据集名称和指标环境后,它将生成以下graphite指标,以帮助调试hudi数据集
 
- - **Commit Duration** - This is amount of time it took to successfully commit 
a batch of records
- - **Rollback Duration** - Similarly, amount of time taken to undo partial 
data left over by a failed commit (happens everytime automatically after a 
failing write)
- - **File Level metrics** - Shows the amount of new files added, versions, 
deleted (cleaned) in each commit
- - **Record Level Metrics** - Total records inserted/updated etc per commit
- - **Partition Level metrics** - number of partitions upserted (super useful 
to understand sudden spikes in commit duration)
+ - **提交持续时间** - 这是成功提交一批记录所花费的时间
+ - **回滚持续时间** - 同样,撤消失败的提交所剩余的部分数据所花费的时间(每次写入失败后都会自动发生)
+ - **文件级别指标** - 显示每次提交中新增、版本、删除(清除)的新文件数量
+ - **记录级别指标** - 每次提交插入/更新的记录总数
+ - **分区级别指标** - 更新的分区数量(对于了解提交持续时间的突然峰值非常有用)
 
-These metrics can then be plotted on a standard tool like grafana. Below is a 
sample commit duration chart.
+然后可以将这些指标绘制在grafana等标准工具上。以下是提交持续时间图表示例。
 
 
 
 
 
 
-## Troubleshooting Failures {#troubleshooting}
-
-Section below generally aids in debugging Hudi failures. Off the bat, the 
following metadata is added to every record to help triage  issues easily using 
standard Hadoop SQL engines (Hive/Presto/Spark)
-
- - **_hoodie_record_key** - Treated as a primary key within each DFS 
partition, basis of all updates/inserts
- - **_hoodie_commit_time** - Last commit that touched this record
- - **_hoodie_file_name** - Actual file name containing the record (super 
useful to triage duplicates)
- - **_hoodie_partition_path** - Path from basePath that identifies the 
partition containing this record
+## 故障排除 {#troubleshooting}
 
-Note that as of now, Hudi assumes the application passes in the same 
deterministic partitionpath for a given recordKey. i.e the uniqueness of record 
key is only enforced within each partition
+以下部分通常有助于调试Hudi故障。将以下元数据添加到每条记录中,以帮助使用标准Hadoop SQL引擎(Hive/Presto/Spark)轻松分类问题。
 
+ - **_hoodie_record_key** - 作为每个DFS分区内的主键,是所有更新/插入的基础
+ - **_hoodie_commit_time** - 该记录上次的提交
+ - **_hoodie_file_name** - 包含记录的实际文件名(对分类重复非常有用)
+ - **_hoodie_partition_path** - basePath的路径,该路径标识包含此记录的分区
 
- Missing records
+请注意,到目前为止,Hudi假定应用程序为给定的recordKey传递相同的确定性分区路径。即每个分区内强制recordKey的唯一性。
 
-Please check if there were any write errors using the admin commands above, 
during the window at which the record could have been written.
-If you do find errors, then the record was not actually written by Hudi, but 
handed back to the application to decide what to do with it.
+ 缺失记录
 
- Duplicates
+请在可以写入记录的窗口中,使用上面的admin命令检查是否存在任何写入错误。
+如果确实发现错误,那么记录实际上不是由Hudi写入的,而是交还给应用程序来决定如何处理。
 
-First of all, please confirm if you do indeed have duplicates **AFTER** 
ensuring the query is accessing the Hudi datasets [properly](sql_queries.html) .
+ 重复
 
- - If confirmed, please use the metadata fields above, to identify the 
physical files & partition files containing the records .
- - If duplicates span files across partitionpath, then this means your 
application is generating different partitionPaths for same recordKey, Please 
fix your app
- - if duplicates span multiple files within the same partitionpath, please 
engage with mailing list. This should not happen. You can use the `records 
deduplicate` command to fix your data.
+首先,请确认是否确实存在重复**AFTER**,以确保查询可以[正确](sql_queries.html)访问Hudi数据集。
 
- Spark failures {#spark-ui}
+ - 如果确认,请使用上面的元数据字段来标识包含记录的物理文件和分区文件。
+ - 如果重复的文件跨越整个分区路径,则意味着您的应用程序正在为同一recordKey生成不同的分区路径,请修复您的应用程序.
 
 Review comment:
   “如果重复的文件跨越整个分区路径” => “如果重复的记录存在于不同分区路径下的文件”
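
  As a concrete illustration of the check this comment is about (the same recordKey landing under different partition paths), a hedged Spark SQL sketch using the `_hoodie` metadata columns quoted above; `trips_table` is a placeholder for however the dataset is registered for querying.

```
// Record keys that show up under more than one _hoodie_partition_path indicate
// that the writing application generated different partitionPaths for the same recordKey.
spark.sql("""
  SELECT _hoodie_record_key,
         COUNT(DISTINCT _hoodie_partition_path) AS num_partition_paths
  FROM trips_table
  GROUP BY _hoodie_record_key
  HAVING COUNT(DISTINCT _hoodie_partition_path) > 1
""").show(false)
```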




[GitHub] [incubator-hudi] yihua commented on a change in pull request #926: [HUDI-278] Translate Administering page

2019-09-26 Thread GitBox
yihua commented on a change in pull request #926: [HUDI-278] Translate 
Administering page
URL: https://github.com/apache/incubator-hudi/pull/926#discussion_r328912072
 
 

 ##
 File path: docs/admin_guide.cn.md
 ##
 @@ -374,71 +365,69 @@ Compaction successfully repaired
 ```
 
 
-## Metrics {#metrics}
+## 指标 {#metrics}
 
-Once the Hudi Client is configured with the right datasetname and environment 
for metrics, it produces the following graphite metrics, that aid in debugging 
hudi datasets
+为Hudi Client配置正确的数据集名称和指标环境后,它将生成以下graphite指标,以帮助调试hudi数据集
 
- - **Commit Duration** - This is amount of time it took to successfully commit 
a batch of records
- - **Rollback Duration** - Similarly, amount of time taken to undo partial 
data left over by a failed commit (happens everytime automatically after a 
failing write)
- - **File Level metrics** - Shows the amount of new files added, versions, 
deleted (cleaned) in each commit
- - **Record Level Metrics** - Total records inserted/updated etc per commit
- - **Partition Level metrics** - number of partitions upserted (super useful 
to understand sudden spikes in commit duration)
+ - **提交持续时间** - 这是成功提交一批记录所花费的时间
+ - **回滚持续时间** - 同样,撤消失败的提交所剩余的部分数据所花费的时间(每次写入失败后都会自动发生)
+ - **文件级别指标** - 显示每次提交中新增、版本、删除(清除)的新文件数量
+ - **记录级别指标** - 每次提交插入/更新的记录总数
+ - **分区级别指标** - 更新的分区数量(对于了解提交持续时间的突然峰值非常有用)
 
-These metrics can then be plotted on a standard tool like grafana. Below is a 
sample commit duration chart.
+然后可以将这些指标绘制在grafana等标准工具上。以下是提交持续时间图表示例。
 
 
 
 
 
 
-## Troubleshooting Failures {#troubleshooting}
-
-Section below generally aids in debugging Hudi failures. Off the bat, the 
following metadata is added to every record to help triage  issues easily using 
standard Hadoop SQL engines (Hive/Presto/Spark)
-
- - **_hoodie_record_key** - Treated as a primary key within each DFS 
partition, basis of all updates/inserts
- - **_hoodie_commit_time** - Last commit that touched this record
- - **_hoodie_file_name** - Actual file name containing the record (super 
useful to triage duplicates)
- - **_hoodie_partition_path** - Path from basePath that identifies the 
partition containing this record
+## 故障排除 {#troubleshooting}
 
-Note that as of now, Hudi assumes the application passes in the same 
deterministic partitionpath for a given recordKey. i.e the uniqueness of record 
key is only enforced within each partition
+以下部分通常有助于调试Hudi故障。将以下元数据添加到每条记录中,以帮助使用标准Hadoop SQL引擎(Hive/Presto/Spark)轻松分类问题。
 
+ - **_hoodie_record_key** - 作为每个DFS分区内的主键,是所有更新/插入的基础
+ - **_hoodie_commit_time** - 该记录上次的提交
+ - **_hoodie_file_name** - 包含记录的实际文件名(对分类重复非常有用)
+ - **_hoodie_partition_path** - basePath的路径,该路径标识包含此记录的分区
 
- Missing records
+请注意,到目前为止,Hudi假定应用程序为给定的recordKey传递相同的确定性分区路径。即每个分区内强制recordKey的唯一性。
 
-Please check if there were any write errors using the admin commands above, 
during the window at which the record could have been written.
-If you do find errors, then the record was not actually written by Hudi, but 
handed back to the application to decide what to do with it.
+ 缺失记录
 
- Duplicates
+请在可以写入记录的窗口中,使用上面的admin命令检查是否存在任何写入错误。
+如果确实发现错误,那么记录实际上不是由Hudi写入的,而是交还给应用程序来决定如何处理。
 
-First of all, please confirm if you do indeed have duplicates **AFTER** 
ensuring the query is accessing the Hudi datasets [properly](sql_queries.html) .
+ 重复
 
- - If confirmed, please use the metadata fields above, to identify the 
physical files & partition files containing the records .
- - If duplicates span files across partitionpath, then this means your 
application is generating different partitionPaths for same recordKey, Please 
fix your app
- - if duplicates span multiple files within the same partitionpath, please 
engage with mailing list. This should not happen. You can use the `records 
deduplicate` command to fix your data.
+首先,请确认是否确实存在重复**AFTER**,以确保查询可以[正确](sql_queries.html)访问Hudi数据集。
 
- Spark failures {#spark-ui}
+ - 如果确认,请使用上面的元数据字段来标识包含记录的物理文件和分区文件。
+ - 如果重复的文件跨越整个分区路径,则意味着您的应用程序正在为同一recordKey生成不同的分区路径,请修复您的应用程序.
+ - 如果重复跨越同一分区路径中的多个文件,请使用邮件列表。这不应该发生。您可以使用`records deduplicate`命令修复数据。
 
 Review comment:
   “如果重复跨越同一分区路径中的多个文件” => “如果重复的记录存在于同一分区路径下的多个文件”
   “请使用邮件列表” => “...汇报这个问题”
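
  Similarly, a hedged sketch for spotting the case this comment covers (duplicates within one partition path, spread across several files); `trips_table` is again a placeholder table name.

```
// Record keys duplicated across multiple files under the same partition path;
// these are the rows the `records deduplicate` repair mentioned above targets.
spark.sql("""
  SELECT _hoodie_record_key,
         _hoodie_partition_path,
         COUNT(DISTINCT _hoodie_file_name) AS num_files
  FROM trips_table
  GROUP BY _hoodie_record_key, _hoodie_partition_path
  HAVING COUNT(DISTINCT _hoodie_file_name) > 1
""").show(false)
```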




[GitHub] [incubator-hudi] yihua commented on a change in pull request #926: [HUDI-278] Translate Administering page

2019-09-26 Thread GitBox
yihua commented on a change in pull request #926: [HUDI-278] Translate 
Administering page
URL: https://github.com/apache/incubator-hudi/pull/926#discussion_r328908683
 
 

 ##
 File path: docs/admin_guide.cn.md
 ##
 @@ -216,16 +211,14 @@ hoodie:trips->stats wa
 ```
 
 
- Archived Commits
+ 归档的提交
 
-In order to limit the amount of growth of .commit files on DFS, Hudi archives 
older .commit files (with due respect to the cleaner policy) into a 
commits.archived file.
-This is a sequence file that contains a mapping from commitNumber => json with 
raw information about the commit (same that is nicely rolled up above).
+为了限制DFS上.commit文件的增长量,Hudi将较旧的.commit文件(适当考虑清理策略)归档到commits.archived文件中。
+这是一个序列文件,其包含commitNumber => json的映射,其包含有关提交的原始信息(上面已很好地汇总了相同的信息)。
 
 Review comment:
   “其包含有关提交的原始信息” => “及有关提交的原始信息”




[GitHub] [incubator-hudi] yihua commented on a change in pull request #926: [HUDI-278] Translate Administering page

2019-09-26 Thread GitBox
yihua commented on a change in pull request #926: [HUDI-278] Translate 
Administering page
URL: https://github.com/apache/incubator-hudi/pull/926#discussion_r328906022
 
 

 ##
 File path: docs/admin_guide.cn.md
 ##
 @@ -107,12 +106,12 @@ hoodie:trips->
 ```
 
 Review comment:
   Maybe in another PR, it would be good to also describe these commands in 
Chinese, besides showing the sample help output.




[GitHub] [incubator-hudi] yihua commented on a change in pull request #926: [HUDI-278] Translate Administering page

2019-09-26 Thread GitBox
yihua commented on a change in pull request #926: [HUDI-278] Translate 
Administering page
URL: https://github.com/apache/incubator-hudi/pull/926#discussion_r328908048
 
 

 ##
 File path: docs/admin_guide.cn.md
 ##
 @@ -188,9 +184,9 @@ order (See Concepts). The below commands allow users to 
view the file-slices for
 ```
 
 
- Statistics
+ 统计信息
 
-Since Hudi directly manages file sizes for DFS dataset, it might be good to 
get an overall picture
+由于Hudi直接管理DFS数据集的文件大小,因此了解可能会很全面
 
 Review comment:
   “因此了解可能会很全面” => “这些信息会帮助你全面了解Hudi的运行状况”




[GitHub] [incubator-hudi] yihua commented on a change in pull request #926: [HUDI-278] Translate Administering page

2019-09-26 Thread GitBox
yihua commented on a change in pull request #926: [HUDI-278] Translate 
Administering page
URL: https://github.com/apache/incubator-hudi/pull/926#discussion_r328905466
 
 

 ##
 File path: docs/admin_guide.cn.md
 ##
 @@ -4,24 +4,24 @@ keywords: hudi, administration, operation, devops
 sidebar: mydoc_sidebar
 permalink: admin_guide.html
 toc: false
-summary: This section offers an overview of tools available to operate an 
ecosystem of Hudi datasets
+summary: 本节概述了可用于操作Hudi数据集生态系统的工具
 ---
 
-Admins/ops can gain visibility into Hudi datasets/pipelines in the following 
ways
+管理员/运维人员可以通过以下方式了解Hudi数据集/管道
 
- - [Administering via the Admin CLI](#admin-cli)
- - [Graphite metrics](#metrics)
- - [Spark UI of the Hudi Application](#spark-ui)
+ - [通过Admin CLI进行管理](#admin-cli)
+ - [Graphite指标](#metrics)
+ - [Hudi应用程序的Spark UI](#spark-ui)
 
-This section provides a glimpse into each of these, with some general guidance 
on [troubleshooting](#troubleshooting)
+本节简要介绍了每一种方法,并提供了有关[疑难解答](#troubleshooting)的一些常规指南
 
 Review comment:
“疑难解答” => “故障排除”




[GitHub] [incubator-hudi] yihua commented on a change in pull request #926: [HUDI-278] Translate Administering page

2019-09-26 Thread GitBox
yihua commented on a change in pull request #926: [HUDI-278] Translate 
Administering page
URL: https://github.com/apache/incubator-hudi/pull/926#discussion_r328910547
 
 

 ##
 File path: docs/admin_guide.cn.md
 ##
 @@ -374,71 +365,69 @@ Compaction successfully repaired
 ```
 
 
-## Metrics {#metrics}
+## 指标 {#metrics}
 
-Once the Hudi Client is configured with the right datasetname and environment 
for metrics, it produces the following graphite metrics, that aid in debugging 
hudi datasets
+为Hudi Client配置正确的数据集名称和指标环境后,它将生成以下graphite指标,以帮助调试hudi数据集
 
- - **Commit Duration** - This is amount of time it took to successfully commit 
a batch of records
- - **Rollback Duration** - Similarly, amount of time taken to undo partial 
data left over by a failed commit (happens everytime automatically after a 
failing write)
- - **File Level metrics** - Shows the amount of new files added, versions, 
deleted (cleaned) in each commit
- - **Record Level Metrics** - Total records inserted/updated etc per commit
- - **Partition Level metrics** - number of partitions upserted (super useful 
to understand sudden spikes in commit duration)
+ - **提交持续时间** - 这是成功提交一批记录所花费的时间
+ - **回滚持续时间** - 同样,撤消失败的提交所剩余的部分数据所花费的时间(每次写入失败后都会自动发生)
+ - **文件级别指标** - 显示每次提交中新增、版本、删除(清除)的新文件数量
+ - **记录级别指标** - 每次提交插入/更新的记录总数
+ - **分区级别指标** - 更新的分区数量(对于了解提交持续时间的突然峰值非常有用)
 
-These metrics can then be plotted on a standard tool like grafana. Below is a 
sample commit duration chart.
+然后可以将这些指标绘制在grafana等标准工具上。以下是提交持续时间图表示例。
 
 
 
 
 
 
-## Troubleshooting Failures {#troubleshooting}
-
-Section below generally aids in debugging Hudi failures. Off the bat, the 
following metadata is added to every record to help triage  issues easily using 
standard Hadoop SQL engines (Hive/Presto/Spark)
-
- - **_hoodie_record_key** - Treated as a primary key within each DFS 
partition, basis of all updates/inserts
- - **_hoodie_commit_time** - Last commit that touched this record
- - **_hoodie_file_name** - Actual file name containing the record (super 
useful to triage duplicates)
- - **_hoodie_partition_path** - Path from basePath that identifies the 
partition containing this record
+## 故障排除 {#troubleshooting}
 
-Note that as of now, Hudi assumes the application passes in the same 
deterministic partitionpath for a given recordKey. i.e the uniqueness of record 
key is only enforced within each partition
+以下部分通常有助于调试Hudi故障。将以下元数据添加到每条记录中,以帮助使用标准Hadoop SQL引擎(Hive/Presto/Spark)轻松分类问题。
 
+ - **_hoodie_record_key** - 作为每个DFS分区内的主键,是所有更新/插入的基础
+ - **_hoodie_commit_time** - 该记录上次的提交
+ - **_hoodie_file_name** - 包含记录的实际文件名(对分类重复非常有用)
 
 Review comment:
   “对分类重复非常有用” => “对检查重复非常有用”




[GitHub] [incubator-hudi] yihua commented on a change in pull request #926: [HUDI-278] Translate Administering page

2019-09-26 Thread GitBox
yihua commented on a change in pull request #926: [HUDI-278] Translate 
Administering page
URL: https://github.com/apache/incubator-hudi/pull/926#discussion_r328912529
 
 

 ##
 File path: docs/admin_guide.cn.md
 ##
 @@ -374,71 +365,69 @@ Compaction successfully repaired
 ```
 
 
-## Metrics {#metrics}
+## 指标 {#metrics}
 
-Once the Hudi Client is configured with the right datasetname and environment 
for metrics, it produces the following graphite metrics, that aid in debugging 
hudi datasets
+为Hudi Client配置正确的数据集名称和指标环境后,它将生成以下graphite指标,以帮助调试hudi数据集
 
- - **Commit Duration** - This is amount of time it took to successfully commit 
a batch of records
- - **Rollback Duration** - Similarly, amount of time taken to undo partial 
data left over by a failed commit (happens everytime automatically after a 
failing write)
- - **File Level metrics** - Shows the amount of new files added, versions, 
deleted (cleaned) in each commit
- - **Record Level Metrics** - Total records inserted/updated etc per commit
- - **Partition Level metrics** - number of partitions upserted (super useful 
to understand sudden spikes in commit duration)
+ - **提交持续时间** - 这是成功提交一批记录所花费的时间
+ - **回滚持续时间** - 同样,撤消失败的提交所剩余的部分数据所花费的时间(每次写入失败后都会自动发生)
+ - **文件级别指标** - 显示每次提交中新增、版本、删除(清除)的新文件数量
+ - **记录级别指标** - 每次提交插入/更新的记录总数
+ - **分区级别指标** - 更新的分区数量(对于了解提交持续时间的突然峰值非常有用)
 
-These metrics can then be plotted on a standard tool like grafana. Below is a 
sample commit duration chart.
+然后可以将这些指标绘制在grafana等标准工具上。以下是提交持续时间图表示例。
 
 
 
 
 
 
-## Troubleshooting Failures {#troubleshooting}
-
-Section below generally aids in debugging Hudi failures. Off the bat, the 
following metadata is added to every record to help triage  issues easily using 
standard Hadoop SQL engines (Hive/Presto/Spark)
-
- - **_hoodie_record_key** - Treated as a primary key within each DFS 
partition, basis of all updates/inserts
- - **_hoodie_commit_time** - Last commit that touched this record
- - **_hoodie_file_name** - Actual file name containing the record (super 
useful to triage duplicates)
- - **_hoodie_partition_path** - Path from basePath that identifies the 
partition containing this record
+## 故障排除 {#troubleshooting}
 
-Note that as of now, Hudi assumes the application passes in the same 
deterministic partitionpath for a given recordKey. i.e the uniqueness of record 
key is only enforced within each partition
+以下部分通常有助于调试Hudi故障。将以下元数据添加到每条记录中,以帮助使用标准Hadoop SQL引擎(Hive/Presto/Spark)轻松分类问题。
 
+ - **_hoodie_record_key** - 作为每个DFS分区内的主键,是所有更新/插入的基础
+ - **_hoodie_commit_time** - 该记录上次的提交
+ - **_hoodie_file_name** - 包含记录的实际文件名(对分类重复非常有用)
+ - **_hoodie_partition_path** - basePath的路径,该路径标识包含此记录的分区
 
- Missing records
+请注意,到目前为止,Hudi假定应用程序为给定的recordKey传递相同的确定性分区路径。即每个分区内强制recordKey的唯一性。
 
-Please check if there were any write errors using the admin commands above, 
during the window at which the record could have been written.
-If you do find errors, then the record was not actually written by Hudi, but 
handed back to the application to decide what to do with it.
+ 缺失记录
 
- Duplicates
+请在可以写入记录的窗口中,使用上面的admin命令检查是否存在任何写入错误。
+如果确实发现错误,那么记录实际上不是由Hudi写入的,而是交还给应用程序来决定如何处理。
 
-First of all, please confirm if you do indeed have duplicates **AFTER** 
ensuring the query is accessing the Hudi datasets [properly](sql_queries.html) .
+ 重复
 
- - If confirmed, please use the metadata fields above, to identify the 
physical files & partition files containing the records .
- - If duplicates span files across partitionpath, then this means your 
application is generating different partitionPaths for same recordKey, Please 
fix your app
- - if duplicates span multiple files within the same partitionpath, please 
engage with mailing list. This should not happen. You can use the `records 
deduplicate` command to fix your data.
+首先,请确认是否确实存在重复**AFTER**,以确保查询可以[正确](sql_queries.html)访问Hudi数据集。
 
- Spark failures {#spark-ui}
+ - 如果确认,请使用上面的元数据字段来标识包含记录的物理文件和分区文件。
+ - 如果重复的文件跨越整个分区路径,则意味着您的应用程序正在为同一recordKey生成不同的分区路径,请修复您的应用程序.
+ - 如果重复跨越同一分区路径中的多个文件,请使用邮件列表。这不应该发生。您可以使用`records deduplicate`命令修复数据。
 
-Typical upsert() DAG looks like below. Note that Hudi client also caches 
intermediate RDDs to intelligently profile workload and size files and spark 
parallelism.
-Also Spark UI shows sortByKey twice due to the probe job also being shown, 
nonetheless its just a single sort.
+ Spark故障 {#spark-ui}
 
+典型的upsert() DAG如下所示。请注意,Hudi客户端会缓存中间的RDD,以智能地分析工作负载和大小文件和并行度。
+另外,由于还显示了探针作业,Spark UI两次显示了sortByKey,但它只是一个排序。
 
 Review comment:
   “两次显示了” => “显示了两次”




[GitHub] [incubator-hudi] yihua commented on a change in pull request #926: [HUDI-278] Translate Administering page

2019-09-26 Thread GitBox
yihua commented on a change in pull request #926: [HUDI-278] Translate 
Administering page
URL: https://github.com/apache/incubator-hudi/pull/926#discussion_r328913967
 
 

 ##
 File path: docs/admin_guide.cn.md
 ##
 @@ -374,71 +365,69 @@ Compaction successfully repaired
 ```
 
 
-## Metrics {#metrics}
+## 指标 {#metrics}
 
-Once the Hudi Client is configured with the right datasetname and environment 
for metrics, it produces the following graphite metrics, that aid in debugging 
hudi datasets
+为Hudi Client配置正确的数据集名称和指标环境后,它将生成以下graphite指标,以帮助调试hudi数据集
 
- - **Commit Duration** - This is amount of time it took to successfully commit 
a batch of records
- - **Rollback Duration** - Similarly, amount of time taken to undo partial 
data left over by a failed commit (happens everytime automatically after a 
failing write)
- - **File Level metrics** - Shows the amount of new files added, versions, 
deleted (cleaned) in each commit
- - **Record Level Metrics** - Total records inserted/updated etc per commit
- - **Partition Level metrics** - number of partitions upserted (super useful 
to understand sudden spikes in commit duration)
+ - **提交持续时间** - 这是成功提交一批记录所花费的时间
+ - **回滚持续时间** - 同样,撤消失败的提交所剩余的部分数据所花费的时间(每次写入失败后都会自动发生)
+ - **文件级别指标** - 显示每次提交中新增、版本、删除(清除)的新文件数量
+ - **记录级别指标** - 每次提交插入/更新的记录总数
+ - **分区级别指标** - 更新的分区数量(对于了解提交持续时间的突然峰值非常有用)
 
-These metrics can then be plotted on a standard tool like grafana. Below is a 
sample commit duration chart.
+然后可以将这些指标绘制在grafana等标准工具上。以下是提交持续时间图表示例。
 
 
 
 
 
 
-## Troubleshooting Failures {#troubleshooting}
-
-Section below generally aids in debugging Hudi failures. Off the bat, the 
following metadata is added to every record to help triage  issues easily using 
standard Hadoop SQL engines (Hive/Presto/Spark)
-
- - **_hoodie_record_key** - Treated as a primary key within each DFS 
partition, basis of all updates/inserts
- - **_hoodie_commit_time** - Last commit that touched this record
- - **_hoodie_file_name** - Actual file name containing the record (super 
useful to triage duplicates)
- - **_hoodie_partition_path** - Path from basePath that identifies the 
partition containing this record
+## 故障排除 {#troubleshooting}
 
-Note that as of now, Hudi assumes the application passes in the same 
deterministic partitionpath for a given recordKey. i.e the uniqueness of record 
key is only enforced within each partition
+以下部分通常有助于调试Hudi故障。将以下元数据添加到每条记录中,以帮助使用标准Hadoop SQL引擎(Hive/Presto/Spark)轻松分类问题。
 
+ - **_hoodie_record_key** - 作为每个DFS分区内的主键,是所有更新/插入的基础
+ - **_hoodie_commit_time** - 该记录上次的提交
+ - **_hoodie_file_name** - 包含记录的实际文件名(对分类重复非常有用)
+ - **_hoodie_partition_path** - basePath的路径,该路径标识包含此记录的分区
 
- Missing records
+请注意,到目前为止,Hudi假定应用程序为给定的recordKey传递相同的确定性分区路径。即每个分区内强制recordKey的唯一性。
 
-Please check if there were any write errors using the admin commands above, 
during the window at which the record could have been written.
-If you do find errors, then the record was not actually written by Hudi, but 
handed back to the application to decide what to do with it.
+ 缺失记录
 
- Duplicates
+请在可以写入记录的窗口中,使用上面的admin命令检查是否存在任何写入错误。
+如果确实发现错误,那么记录实际上不是由Hudi写入的,而是交还给应用程序来决定如何处理。
 
-First of all, please confirm if you do indeed have duplicates **AFTER** 
ensuring the query is accessing the Hudi datasets [properly](sql_queries.html) .
+ 重复
 
- - If confirmed, please use the metadata fields above, to identify the 
physical files & partition files containing the records .
- - If duplicates span files across partitionpath, then this means your 
application is generating different partitionPaths for same recordKey, Please 
fix your app
- - if duplicates span multiple files within the same partitionpath, please 
engage with mailing list. This should not happen. You can use the `records 
deduplicate` command to fix your data.
+首先,请确认是否确实存在重复**AFTER**,以确保查询可以[正确](sql_queries.html)访问Hudi数据集。
 
- Spark failures {#spark-ui}
+ - 如果确认,请使用上面的元数据字段来标识包含记录的物理文件和分区文件。
+ - 如果重复的文件跨越整个分区路径,则意味着您的应用程序正在为同一recordKey生成不同的分区路径,请修复您的应用程序.
+ - 如果重复跨越同一分区路径中的多个文件,请使用邮件列表。这不应该发生。您可以使用`records deduplicate`命令修复数据。
 
-Typical upsert() DAG looks like below. Note that Hudi client also caches 
intermediate RDDs to intelligently profile workload and size files and spark 
parallelism.
-Also Spark UI shows sortByKey twice due to the probe job also being shown, 
nonetheless its just a single sort.
+ Spark故障 {#spark-ui}
 
+典型的upsert() DAG如下所示。请注意,Hudi客户端会缓存中间的RDD,以智能地分析工作负载和大小文件和并行度。
+另外,由于还显示了探针作业,Spark UI两次显示了sortByKey,但它只是一个排序。
 
 
 
 
 
-At a high level, there are two steps
+概括地说,有两个步骤
 
-**Index Lookup to identify files to be changed**
+**索引查找以标识要更改的文件**
 
- - Job 1 : Triggers the input data read, converts to HoodieRecord object and 
then stops at obtaining a spread of input records to target partition paths
- - Job 2 : Load the set of file names which we need check against
- - Job 3  & 4 : Actual lookup after smart sizing of spark join parallelism, by 
joining RDDs in 1 &

[GitHub] [incubator-hudi] yihua commented on a change in pull request #926: [HUDI-278] Translate Administering page

2019-09-26 Thread GitBox
yihua commented on a change in pull request #926: [HUDI-278] Translate 
Administering page
URL: https://github.com/apache/incubator-hudi/pull/926#discussion_r328906596
 
 

 ##
 File path: docs/admin_guide.cn.md
 ##
 @@ -107,12 +106,12 @@ hoodie:trips->
 ```
 
 
- Inspecting Commits
+ 检查提交
 
-The task of upserting or inserting a batch of incoming records is known as a 
**commit** in Hudi. A commit provides basic atomicity guarantees such that only 
commited data is available for querying.
-Each commit has a monotonically increasing string/number called the **commit 
number**. Typically, this is the time at which we started the commit.
+在Hudi中,更新或插入一批记录的任务称为**提交**。提交可提供基本的原子性保证,以便仅提交的数据可用于查询。
 
 Review comment:
   “称为” => “被称为”
   “以便仅提交的数据可用于查询” => “即只有提交的数据可用于查询”




[GitHub] [incubator-hudi] yihua commented on a change in pull request #926: [HUDI-278] Translate Administering page

2019-09-26 Thread GitBox
yihua commented on a change in pull request #926: [HUDI-278] Translate 
Administering page
URL: https://github.com/apache/incubator-hudi/pull/926#discussion_r328910381
 
 

 ##
 File path: docs/admin_guide.cn.md
 ##
 @@ -374,71 +365,69 @@ Compaction successfully repaired
 ```
 
 
-## Metrics {#metrics}
+## 指标 {#metrics}
 
-Once the Hudi Client is configured with the right datasetname and environment 
for metrics, it produces the following graphite metrics, that aid in debugging 
hudi datasets
+为Hudi Client配置正确的数据集名称和指标环境后,它将生成以下graphite指标,以帮助调试hudi数据集
 
- - **Commit Duration** - This is amount of time it took to successfully commit 
a batch of records
- - **Rollback Duration** - Similarly, amount of time taken to undo partial 
data left over by a failed commit (happens everytime automatically after a 
failing write)
- - **File Level metrics** - Shows the amount of new files added, versions, 
deleted (cleaned) in each commit
- - **Record Level Metrics** - Total records inserted/updated etc per commit
- - **Partition Level metrics** - number of partitions upserted (super useful 
to understand sudden spikes in commit duration)
+ - **提交持续时间** - 这是成功提交一批记录所花费的时间
+ - **回滚持续时间** - 同样,撤消失败的提交所剩余的部分数据所花费的时间(每次写入失败后都会自动发生)
+ - **文件级别指标** - 显示每次提交中新增、版本、删除(清除)的新文件数量
+ - **记录级别指标** - 每次提交插入/更新的记录总数
+ - **分区级别指标** - 更新的分区数量(对于了解提交持续时间的突然峰值非常有用)
 
-These metrics can then be plotted on a standard tool like grafana. Below is a 
sample commit duration chart.
+然后可以将这些指标绘制在grafana等标准工具上。以下是提交持续时间图表示例。
 
 
 
 
 
 
-## Troubleshooting Failures {#troubleshooting}
-
-Section below generally aids in debugging Hudi failures. Off the bat, the 
following metadata is added to every record to help triage  issues easily using 
standard Hadoop SQL engines (Hive/Presto/Spark)
-
- - **_hoodie_record_key** - Treated as a primary key within each DFS 
partition, basis of all updates/inserts
- - **_hoodie_commit_time** - Last commit that touched this record
- - **_hoodie_file_name** - Actual file name containing the record (super 
useful to triage duplicates)
- - **_hoodie_partition_path** - Path from basePath that identifies the 
partition containing this record
+## 故障排除 {#troubleshooting}
 
-Note that as of now, Hudi assumes the application passes in the same 
deterministic partitionpath for a given recordKey. i.e the uniqueness of record 
key is only enforced within each partition
+以下部分通常有助于调试Hudi故障。将以下元数据添加到每条记录中,以帮助使用标准Hadoop SQL引擎(Hive/Presto/Spark)轻松分类问题。
 
 Review comment:
   “将以下元数据添加到每条记录中” => “以下元数据已被添加到每条记录中”
   “以帮助轻松分类问题” => “可以通过标准Hadoop SQL引擎(Hive/Presto/Spark)检索,来更容易地诊断问题的严重性”
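
  To make the triage flow in this hunk concrete, a small sketch of pulling those metadata columns for a suspect key via Spark SQL (any engine that can read the dataset, e.g. Hive or Presto, works the same way); `trips_table` and the key value are placeholders.

```
// Fetch the Hudi-added metadata columns for one suspect record key to see
// which commit, partition path and file it currently lives in.
spark.sql("""
  SELECT _hoodie_commit_time,
         _hoodie_record_key,
         _hoodie_partition_path,
         _hoodie_file_name
  FROM trips_table
  WHERE _hoodie_record_key = 'some-suspect-key'
""").show(false)
```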




[GitHub] [incubator-hudi] yihua commented on a change in pull request #926: [HUDI-278] Translate Administering page

2019-09-26 Thread GitBox
yihua commented on a change in pull request #926: [HUDI-278] Translate 
Administering page
URL: https://github.com/apache/incubator-hudi/pull/926#discussion_r328914068
 
 

 ##
 File path: docs/admin_guide.cn.md
 ##
 @@ -374,71 +365,69 @@ Compaction successfully repaired
 ```
 
 
-## Metrics {#metrics}
+## 指标 {#metrics}
 
-Once the Hudi Client is configured with the right datasetname and environment 
for metrics, it produces the following graphite metrics, that aid in debugging 
hudi datasets
+为Hudi Client配置正确的数据集名称和指标环境后,它将生成以下graphite指标,以帮助调试hudi数据集
 
- - **Commit Duration** - This is amount of time it took to successfully commit 
a batch of records
- - **Rollback Duration** - Similarly, amount of time taken to undo partial 
data left over by a failed commit (happens everytime automatically after a 
failing write)
- - **File Level metrics** - Shows the amount of new files added, versions, 
deleted (cleaned) in each commit
- - **Record Level Metrics** - Total records inserted/updated etc per commit
- - **Partition Level metrics** - number of partitions upserted (super useful 
to understand sudden spikes in commit duration)
+ - **提交持续时间** - 这是成功提交一批记录所花费的时间
+ - **回滚持续时间** - 同样,撤消失败的提交所剩余的部分数据所花费的时间(每次写入失败后都会自动发生)
+ - **文件级别指标** - 显示每次提交中新增、版本、删除(清除)的新文件数量
+ - **记录级别指标** - 每次提交插入/更新的记录总数
+ - **分区级别指标** - 更新的分区数量(对于了解提交持续时间的突然峰值非常有用)
 
-These metrics can then be plotted on a standard tool like grafana. Below is a 
sample commit duration chart.
+然后可以将这些指标绘制在grafana等标准工具上。以下是提交持续时间图表示例。
 
 
 
 
 
 
-## Troubleshooting Failures {#troubleshooting}
-
-Section below generally aids in debugging Hudi failures. Off the bat, the 
following metadata is added to every record to help triage  issues easily using 
standard Hadoop SQL engines (Hive/Presto/Spark)
-
- - **_hoodie_record_key** - Treated as a primary key within each DFS 
partition, basis of all updates/inserts
- - **_hoodie_commit_time** - Last commit that touched this record
- - **_hoodie_file_name** - Actual file name containing the record (super 
useful to triage duplicates)
- - **_hoodie_partition_path** - Path from basePath that identifies the 
partition containing this record
+## 故障排除 {#troubleshooting}
 
-Note that as of now, Hudi assumes the application passes in the same 
deterministic partitionpath for a given recordKey. i.e the uniqueness of record 
key is only enforced within each partition
+以下部分通常有助于调试Hudi故障。将以下元数据添加到每条记录中,以帮助使用标准Hadoop SQL引擎(Hive/Presto/Spark)轻松分类问题。
 
+ - **_hoodie_record_key** - 作为每个DFS分区内的主键,是所有更新/插入的基础
+ - **_hoodie_commit_time** - 该记录上次的提交
+ - **_hoodie_file_name** - 包含记录的实际文件名(对分类重复非常有用)
+ - **_hoodie_partition_path** - basePath的路径,该路径标识包含此记录的分区
 
- Missing records
+请注意,到目前为止,Hudi假定应用程序为给定的recordKey传递相同的确定性分区路径。即每个分区内强制recordKey的唯一性。
 
-Please check if there were any write errors using the admin commands above, 
during the window at which the record could have been written.
-If you do find errors, then the record was not actually written by Hudi, but 
handed back to the application to decide what to do with it.
+ 缺失记录
 
- Duplicates
+请在可以写入记录的窗口中,使用上面的admin命令检查是否存在任何写入错误。
+如果确实发现错误,那么记录实际上不是由Hudi写入的,而是交还给应用程序来决定如何处理。
 
-First of all, please confirm if you do indeed have duplicates **AFTER** 
ensuring the query is accessing the Hudi datasets [properly](sql_queries.html) .
+ 重复
 
- - If confirmed, please use the metadata fields above, to identify the 
physical files & partition files containing the records .
- - If duplicates span files across partitionpath, then this means your 
application is generating different partitionPaths for same recordKey, Please 
fix your app
- - if duplicates span multiple files within the same partitionpath, please 
engage with mailing list. This should not happen. You can use the `records 
deduplicate` command to fix your data.
+首先,请确认是否确实存在重复**AFTER**,以确保查询可以[正确](sql_queries.html)访问Hudi数据集。
 
- Spark failures {#spark-ui}
+ - 如果确认,请使用上面的元数据字段来标识包含记录的物理文件和分区文件。
+ - 如果重复的文件跨越整个分区路径,则意味着您的应用程序正在为同一recordKey生成不同的分区路径,请修复您的应用程序.
+ - 如果重复跨越同一分区路径中的多个文件,请使用邮件列表。这不应该发生。您可以使用`records deduplicate`命令修复数据。
 
-Typical upsert() DAG looks like below. Note that Hudi client also caches 
intermediate RDDs to intelligently profile workload and size files and spark 
parallelism.
-Also Spark UI shows sortByKey twice due to the probe job also being shown, 
nonetheless its just a single sort.
+ Spark故障 {#spark-ui}
 
+典型的upsert() DAG如下所示。请注意,Hudi客户端会缓存中间的RDD,以智能地分析工作负载和大小文件和并行度。
+另外,由于还显示了探针作业,Spark UI两次显示了sortByKey,但它只是一个排序。
 
 
 
 
 
-At a high level, there are two steps
+概括地说,有两个步骤
 
-**Index Lookup to identify files to be changed**
+**索引查找以标识要更改的文件**
 
- - Job 1 : Triggers the input data read, converts to HoodieRecord object and 
then stops at obtaining a spread of input records to target partition paths
- - Job 2 : Load the set of file names which we need check against
- - Job 3  & 4 : Actual lookup after smart sizing of spark join parallelism, by 
joining RDDs in 1 &

[GitHub] [incubator-hudi] yihua commented on a change in pull request #926: [HUDI-278] Translate Administering page

2019-09-26 Thread GitBox
yihua commented on a change in pull request #926: [HUDI-278] Translate 
Administering page
URL: https://github.com/apache/incubator-hudi/pull/926#discussion_r328909279
 
 

 ##
 File path: docs/admin_guide.cn.md
 ##
 @@ -333,38 +326,36 @@ hoodie:stock_ticks_mor->compaction validate --instant 
20181005222601
 | 05320e98-9a57-4c38-b809-a6beaaeb36bd| 20181005222445   | 
hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor/2018/08/31/05320e98-9a57-4c38-b809-a6beaaeb36bd_0_20181005222445.parquet|
 1  | false| All log files specified in compaction operation is not 
present. Missing |
 ```
 
-# NOTE
+# 注意
 
-The following commands must be executed without any other writer/ingestion 
application running.
+必须在不运行任何其他写如/摄取程序的情况下执行以下命令。
 
-Sometimes, it becomes necessary to remove a fileId from a compaction-plan 
inorder to speed-up or unblock compaction
-operation. Any new log-files that happened on this file after the compaction 
got scheduled will be safely renamed
-so that are preserved. Hudi provides the following CLI to support it
+有时,有必要从压缩计划中删除fileId以便加快或取消压缩操作。
+压缩计划之后在此文件上发生的所有新日志文件都将被安全地重命名以便进行保留。Hudi提供以下CLI来支持
 
 
-# UnScheduling Compaction
+# 取消调度压缩
 
 ```
 hoodie:trips->compaction unscheduleFileId --fileId 
 
 No File renames needed to unschedule file from pending compaction. Operation 
successful.
 ```
 
-In other cases, an entire compaction plan needs to be reverted. This is 
supported by the following CLI
+在其他情况下,需要恢复整个压缩计划。以下CLI支持此功能
 
 Review comment:
   “恢复” => “撤销”




[GitHub] [incubator-hudi] yihua commented on a change in pull request #926: [HUDI-278] Translate Administering page

2019-09-26 Thread GitBox
yihua commented on a change in pull request #926: [HUDI-278] Translate 
Administering page
URL: https://github.com/apache/incubator-hudi/pull/926#discussion_r328911497
 
 

 ##
 File path: docs/admin_guide.cn.md
 ##
 @@ -374,71 +365,69 @@ Compaction successfully repaired
 ```
 
 
-## Metrics {#metrics}
+## 指标 {#metrics}
 
-Once the Hudi Client is configured with the right datasetname and environment 
for metrics, it produces the following graphite metrics, that aid in debugging 
hudi datasets
+为Hudi Client配置正确的数据集名称和指标环境后,它将生成以下graphite指标,以帮助调试hudi数据集
 
- - **Commit Duration** - This is amount of time it took to successfully commit 
a batch of records
- - **Rollback Duration** - Similarly, amount of time taken to undo partial 
data left over by a failed commit (happens everytime automatically after a 
failing write)
- - **File Level metrics** - Shows the amount of new files added, versions, 
deleted (cleaned) in each commit
- - **Record Level Metrics** - Total records inserted/updated etc per commit
- - **Partition Level metrics** - number of partitions upserted (super useful 
to understand sudden spikes in commit duration)
+ - **提交持续时间** - 这是成功提交一批记录所花费的时间
+ - **回滚持续时间** - 同样,撤消失败的提交所剩余的部分数据所花费的时间(每次写入失败后都会自动发生)
+ - **文件级别指标** - 显示每次提交中新增、版本、删除(清除)的新文件数量
+ - **记录级别指标** - 每次提交插入/更新的记录总数
+ - **分区级别指标** - 更新的分区数量(对于了解提交持续时间的突然峰值非常有用)
 
-These metrics can then be plotted on a standard tool like grafana. Below is a 
sample commit duration chart.
+然后可以将这些指标绘制在grafana等标准工具上。以下是提交持续时间图表示例。
 
 
 
 
 
 
-## Troubleshooting Failures {#troubleshooting}
-
-Section below generally aids in debugging Hudi failures. Off the bat, the 
following metadata is added to every record to help triage  issues easily using 
standard Hadoop SQL engines (Hive/Presto/Spark)
-
- - **_hoodie_record_key** - Treated as a primary key within each DFS 
partition, basis of all updates/inserts
- - **_hoodie_commit_time** - Last commit that touched this record
- - **_hoodie_file_name** - Actual file name containing the record (super 
useful to triage duplicates)
- - **_hoodie_partition_path** - Path from basePath that identifies the 
partition containing this record
+## 故障排除 {#troubleshooting}
 
-Note that as of now, Hudi assumes the application passes in the same 
deterministic partitionpath for a given recordKey. i.e the uniqueness of record 
key is only enforced within each partition
+以下部分通常有助于调试Hudi故障。将以下元数据添加到每条记录中,以帮助使用标准Hadoop SQL引擎(Hive/Presto/Spark)轻松分类问题。
 
+ - **_hoodie_record_key** - 作为每个DFS分区内的主键,是所有更新/插入的基础
+ - **_hoodie_commit_time** - 该记录上次的提交
+ - **_hoodie_file_name** - 包含记录的实际文件名(对分类重复非常有用)
+ - **_hoodie_partition_path** - basePath的路径,该路径标识包含此记录的分区
 
- Missing records
+请注意,到目前为止,Hudi假定应用程序为给定的recordKey传递相同的确定性分区路径。即每个分区内强制recordKey的唯一性。
 
-Please check if there were any write errors using the admin commands above, 
during the window at which the record could have been written.
-If you do find errors, then the record was not actually written by Hudi, but 
handed back to the application to decide what to do with it.
+ 缺失记录
 
- Duplicates
+请在可以写入记录的窗口中,使用上面的admin命令检查是否存在任何写入错误。
+如果确实发现错误,那么记录实际上不是由Hudi写入的,而是交还给应用程序来决定如何处理。
 
-First of all, please confirm if you do indeed have duplicates **AFTER** 
ensuring the query is accessing the Hudi datasets [properly](sql_queries.html) .
+ 重复
 
- - If confirmed, please use the metadata fields above, to identify the 
physical files & partition files containing the records .
- - If duplicates span files across partitionpath, then this means your 
application is generating different partitionPaths for same recordKey, Please 
fix your app
- - if duplicates span multiple files within the same partitionpath, please 
engage with mailing list. This should not happen. You can use the `records 
deduplicate` command to fix your data.
+首先,请确认是否确实存在重复**AFTER**,以确保查询可以[正确](sql_queries.html)访问Hudi数据集。
 
 Review comment:
   =>
   “首先,请确保访问Hudi数据集的查询是[没有问题的](sql_queries.html),并之后确认的确有重复。”
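
  For the "query the dataset properly first" part of this suggestion, a hedged spark-shell sketch that reads through the Hudi datasource (so only committed data is visible) before checking for duplicates; the base path and the partition glob depth are placeholders that depend on how the dataset is laid out.

```
// Read via the Hudi datasource rather than raw parquet paths, then verify
// whether duplicate record keys actually exist in the committed view.
val tripsDF = spark.read
  .format("org.apache.hudi")
  .load("/path/to/trips/*/*/*")   // placeholder base path + partition glob

tripsDF.createOrReplaceTempView("trips_snapshot")
spark.sql("""
  SELECT _hoodie_record_key, COUNT(*) AS dup_count
  FROM trips_snapshot
  GROUP BY _hoodie_record_key
  HAVING COUNT(*) > 1
""").show(false)
```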




[GitHub] [incubator-hudi] yihua commented on a change in pull request #926: [HUDI-278] Translate Administering page

2019-09-26 Thread GitBox
yihua commented on a change in pull request #926: [HUDI-278] Translate 
Administering page
URL: https://github.com/apache/incubator-hudi/pull/926#discussion_r328908457
 
 

 ##
 File path: docs/admin_guide.cn.md
 ##
 @@ -203,8 +199,7 @@ hoodie:trips->stats filesizes --partitionPath 2016/09/01 
--sortBy "95th" --desc
 
 ```
 
-In case of Hudi write taking much longer, it might be good to see the write 
amplification for any sudden increases
-
+如果Hudi写入花费的时间更长,那么如果突然增加写入量可查看写放大
 
 Review comment:
   “那么如果突然增加写入量可查看写放大” => “那么可以通过观察写放大指标来发现任何异常”




[GitHub] [incubator-hudi] yihua commented on a change in pull request #926: [HUDI-278] Translate Administering page

2019-09-26 Thread GitBox
yihua commented on a change in pull request #926: [HUDI-278] Translate 
Administering page
URL: https://github.com/apache/incubator-hudi/pull/926#discussion_r328912878
 
 

 ##
 File path: docs/admin_guide.cn.md
 ##
 @@ -374,71 +365,69 @@ Compaction successfully repaired
 ```
 
 
-## Metrics {#metrics}
+## 指标 {#metrics}
 
-Once the Hudi Client is configured with the right datasetname and environment 
for metrics, it produces the following graphite metrics, that aid in debugging 
hudi datasets
+为Hudi Client配置正确的数据集名称和指标环境后,它将生成以下graphite指标,以帮助调试hudi数据集
 
- - **Commit Duration** - This is amount of time it took to successfully commit 
a batch of records
- - **Rollback Duration** - Similarly, amount of time taken to undo partial 
data left over by a failed commit (happens everytime automatically after a 
failing write)
- - **File Level metrics** - Shows the amount of new files added, versions, 
deleted (cleaned) in each commit
- - **Record Level Metrics** - Total records inserted/updated etc per commit
- - **Partition Level metrics** - number of partitions upserted (super useful 
to understand sudden spikes in commit duration)
+ - **提交持续时间** - 这是成功提交一批记录所花费的时间
+ - **回滚持续时间** - 同样,撤消失败的提交所剩余的部分数据所花费的时间(每次写入失败后都会自动发生)
+ - **文件级别指标** - 显示每次提交中新增、版本、删除(清除)的新文件数量
+ - **记录级别指标** - 每次提交插入/更新的记录总数
+ - **分区级别指标** - 更新的分区数量(对于了解提交持续时间的突然峰值非常有用)
 
-These metrics can then be plotted on a standard tool like grafana. Below is a 
sample commit duration chart.
+然后可以将这些指标绘制在grafana等标准工具上。以下是提交持续时间图表示例。
 
 
 
 
 
 
-## Troubleshooting Failures {#troubleshooting}
-
-Section below generally aids in debugging Hudi failures. Off the bat, the 
following metadata is added to every record to help triage  issues easily using 
standard Hadoop SQL engines (Hive/Presto/Spark)
-
- - **_hoodie_record_key** - Treated as a primary key within each DFS 
partition, basis of all updates/inserts
- - **_hoodie_commit_time** - Last commit that touched this record
- - **_hoodie_file_name** - Actual file name containing the record (super 
useful to triage duplicates)
- - **_hoodie_partition_path** - Path from basePath that identifies the 
partition containing this record
+## 故障排除 {#troubleshooting}
 
-Note that as of now, Hudi assumes the application passes in the same 
deterministic partitionpath for a given recordKey. i.e the uniqueness of record 
key is only enforced within each partition
+以下部分通常有助于调试Hudi故障。将以下元数据添加到每条记录中,以帮助使用标准Hadoop SQL引擎(Hive/Presto/Spark)轻松分类问题。
 
+ - **_hoodie_record_key** - 作为每个DFS分区内的主键,是所有更新/插入的基础
+ - **_hoodie_commit_time** - 该记录上次的提交
+ - **_hoodie_file_name** - 包含记录的实际文件名(对分类重复非常有用)
+ - **_hoodie_partition_path** - basePath的路径,该路径标识包含此记录的分区
 
- Missing records
+请注意,到目前为止,Hudi假定应用程序为给定的recordKey传递相同的确定性分区路径。即每个分区内强制recordKey的唯一性。
 
-Please check if there were any write errors using the admin commands above, 
during the window at which the record could have been written.
-If you do find errors, then the record was not actually written by Hudi, but 
handed back to the application to decide what to do with it.
+ 缺失记录
 
- Duplicates
+请在可以写入记录的窗口中,使用上面的admin命令检查是否存在任何写入错误。
+如果确实发现错误,那么记录实际上不是由Hudi写入的,而是交还给应用程序来决定如何处理。
 
-First of all, please confirm if you do indeed have duplicates **AFTER** 
ensuring the query is accessing the Hudi datasets [properly](sql_queries.html) .
+ 重复
 
- - If confirmed, please use the metadata fields above, to identify the 
physical files & partition files containing the records .
- - If duplicates span files across partitionpath, then this means your 
application is generating different partitionPaths for same recordKey, Please 
fix your app
- - if duplicates span multiple files within the same partitionpath, please 
engage with mailing list. This should not happen. You can use the `records 
deduplicate` command to fix your data.
+首先,请确认是否确实存在重复**AFTER**,以确保查询可以[正确](sql_queries.html)访问Hudi数据集。
 
- Spark failures {#spark-ui}
+ - 如果确认,请使用上面的元数据字段来标识包含记录的物理文件和分区文件。
+ - 如果重复的文件跨越整个分区路径,则意味着您的应用程序正在为同一recordKey生成不同的分区路径,请修复您的应用程序.
+ - 如果重复跨越同一分区路径中的多个文件,请使用邮件列表。这不应该发生。您可以使用`records deduplicate`命令修复数据。
 
-Typical upsert() DAG looks like below. Note that Hudi client also caches 
intermediate RDDs to intelligently profile workload and size files and spark 
parallelism.
-Also Spark UI shows sortByKey twice due to the probe job also being shown, 
nonetheless its just a single sort.
+ Spark故障 {#spark-ui}
 
+典型的upsert() DAG如下所示。请注意,Hudi客户端会缓存中间的RDD,以智能地分析工作负载和大小文件和并行度。
+另外,由于还显示了探针作业,Spark UI两次显示了sortByKey,但它只是一个排序。
 
 
 
 
 
-At a high level, there are two steps
+概括地说,有两个步骤
 
-**Index Lookup to identify files to be changed**
+**索引查找以标识要更改的文件**
 
- - Job 1 : Triggers the input data read, converts to HoodieRecord object and 
then stops at obtaining a spread of input records to target partition paths
- - Job 2 : Load the set of file names which we need check against
- - Job 3  & 4 : Actual lookup after smart sizing of spark join parallelism, by 
joining RDDs in 1 &

[GitHub] [incubator-hudi] yihua commented on a change in pull request #926: [HUDI-278] Translate Administering page

2019-09-26 Thread GitBox
yihua commented on a change in pull request #926: [HUDI-278] Translate 
Administering page
URL: https://github.com/apache/incubator-hudi/pull/926#discussion_r328913367
 
 

 ##
 File path: docs/admin_guide.cn.md
 ##

[GitHub] [incubator-hudi] yihua commented on a change in pull request #926: [HUDI-278] Translate Administering page

2019-09-26 Thread GitBox
yihua commented on a change in pull request #926: [HUDI-278] Translate 
Administering page
URL: https://github.com/apache/incubator-hudi/pull/926#discussion_r328909718
 
 

 ##
 File path: docs/admin_guide.cn.md
 ##
 @@ -374,71 +365,69 @@ Compaction successfully repaired
 ```
 
 
+ - **文件级别指标** - 显示每次提交中新增、版本、删除(清除)的新文件数量
 
 Review comment:
   “新文件” => “文件”


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] yihua commented on a change in pull request #926: [HUDI-278] Translate Administering page

2019-09-26 Thread GitBox
yihua commented on a change in pull request #926: [HUDI-278] Translate 
Administering page
URL: https://github.com/apache/incubator-hudi/pull/926#discussion_r328907242
 
 

 ##
 File path: docs/admin_guide.cn.md
 ##
 @@ -126,18 +125,17 @@ hoodie:trips->commits show --sortBy "Total Bytes 
Written" --desc true --limit 10
 hoodie:trips->
 ```
 
-At the start of each write, Hudi also writes a .inflight commit to the .hoodie 
folder. You can use the timestamp there to estimate how long the commit has 
been inflight
-
+在每次写入开始时,Hudi还将.inflight提交写入.hoodie文件夹。您可以使用那里的时间戳来估计提交使用的时间
 
 Review comment:
   “提交使用的时间” => “正在进行的提交已经花费的时间”


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] yihua commented on a change in pull request #926: [HUDI-278] Translate Administering page

2019-09-26 Thread GitBox
yihua commented on a change in pull request #926: [HUDI-278] Translate 
Administering page
URL: https://github.com/apache/incubator-hudi/pull/926#discussion_r328913585
 
 

 ##
 File path: docs/admin_guide.cn.md
 ##

[GitHub] [incubator-hudi] yihua commented on a change in pull request #926: [HUDI-278] Translate Administering page

2019-09-26 Thread GitBox
yihua commented on a change in pull request #926: [HUDI-278] Translate 
Administering page
URL: https://github.com/apache/incubator-hudi/pull/926#discussion_r328909060
 
 

 ##
 File path: docs/admin_guide.cn.md
 ##
 @@ -333,38 +326,36 @@ hoodie:stock_ticks_mor->compaction validate --instant 
20181005222601
 | 05320e98-9a57-4c38-b809-a6beaaeb36bd| 20181005222445   | 
hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor/2018/08/31/05320e98-9a57-4c38-b809-a6beaaeb36bd_0_20181005222445.parquet|
 1  | false| All log files specified in compaction operation is not 
present. Missing |
 ```
 
-# NOTE
+# 注意
 
-The following commands must be executed without any other writer/ingestion 
application running.
+必须在不运行任何其他写如/摄取程序的情况下执行以下命令。
 
 Review comment:
   “必须在不运行任何其他写如/摄取程序的情况下执行以下命令”
   =>
   “必须在其他写入/摄取程序没有运行的情况下执行以下命令”


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] yihua commented on a change in pull request #926: [HUDI-278] Translate Administering page

2019-09-26 Thread GitBox
yihua commented on a change in pull request #926: [HUDI-278] Translate 
Administering page
URL: https://github.com/apache/incubator-hudi/pull/926#discussion_r328910903
 
 

 ##
 File path: docs/admin_guide.cn.md
 ##
 @@ -374,71 +365,69 @@ Compaction successfully repaired
 ```
 
 
+请注意,到目前为止,Hudi假定应用程序为给定的recordKey传递相同的确定性分区路径。即每个分区内强制recordKey的唯一性。
 
 Review comment:
   “即每个分区内强制recordKey的唯一性”
   =>
   “即仅在每个分区内保证recordKey(主键)的唯一性”


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] yihua commented on a change in pull request #926: [HUDI-278] Translate Administering page

2019-09-26 Thread GitBox
yihua commented on a change in pull request #926: [HUDI-278] Translate 
Administering page
URL: https://github.com/apache/incubator-hudi/pull/926#discussion_r328911008
 
 

 ##
 File path: docs/admin_guide.cn.md
 ##
 @@ -374,71 +365,69 @@ Compaction successfully repaired
 ```
 
 
+请在可以写入记录的窗口中,使用上面的admin命令检查是否存在任何写入错误。
 
 Review comment:
   “可以” => “可能”


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


Build failed in Jenkins: hudi-snapshot-deployment-0.5 #50

2019-09-26 Thread Apache Jenkins Server
See 


--
Started by timer
[EnvInject] - Loading node environment variables.
Building remotely on H40 (ubuntu xenial) in workspace 

No credentials specified
Wiping out workspace first.
Cloning the remote Git repository
Using shallow clone
Cloning repository https://github.com/apache/incubator-hudi.git
 > git init  # 
 > timeout=10
Fetching upstream changes from https://github.com/apache/incubator-hudi.git
 > git --version # timeout=10
 > git --version # timeout=10
 > git fetch --tags --progress https://github.com/apache/incubator-hudi.git 
 > +refs/heads/*:refs/remotes/origin/* --depth=1
ERROR: Error cloning remote repo 'origin'
hudson.plugins.git.GitException: Error performing git command
at 
org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandIn(CliGitAPIImpl.java:2051)
at 
org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandWithCredentials(CliGitAPIImpl.java:1761)
at 
org.jenkinsci.plugins.gitclient.CliGitAPIImpl.access$400(CliGitAPIImpl.java:72)
at 
org.jenkinsci.plugins.gitclient.CliGitAPIImpl$1.execute(CliGitAPIImpl.java:442)
at 
org.jenkinsci.plugins.gitclient.CliGitAPIImpl$2.execute(CliGitAPIImpl.java:655)
at 
org.jenkinsci.plugins.gitclient.RemoteGitImpl$CommandInvocationHandler$1.call(RemoteGitImpl.java:153)
at 
org.jenkinsci.plugins.gitclient.RemoteGitImpl$CommandInvocationHandler$1.call(RemoteGitImpl.java:146)
at hudson.remoting.UserRequest.perform(UserRequest.java:212)
at hudson.remoting.UserRequest.perform(UserRequest.java:54)
at hudson.remoting.Request$2.run(Request.java:369)
at 
hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:72)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
Suppressed: hudson.remoting.Channel$CallSiteStackTrace: Remote call to 
H40
at 
hudson.remoting.Channel.attachCallSiteStackTrace(Channel.java:1741)
at 
hudson.remoting.UserRequest$ExceptionResponse.retrieve(UserRequest.java:357)
at hudson.remoting.Channel.call(Channel.java:955)
at 
org.jenkinsci.plugins.gitclient.RemoteGitImpl$CommandInvocationHandler.execute(RemoteGitImpl.java:146)
at sun.reflect.GeneratedMethodAccessor1084.invoke(Unknown 
Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.jenkinsci.plugins.gitclient.RemoteGitImpl$CommandInvocationHandler.invoke(RemoteGitImpl.java:132)
at com.sun.proxy.$Proxy135.execute(Unknown Source)
at hudson.plugins.git.GitSCM.retrieveChanges(GitSCM.java:1152)
at hudson.plugins.git.GitSCM.checkout(GitSCM.java:1192)
at hudson.scm.SCM.checkout(SCM.java:504)
at 
hudson.model.AbstractProject.checkout(AbstractProject.java:1208)
at 
hudson.model.AbstractBuild$AbstractBuildExecution.defaultCheckout(AbstractBuild.java:574)
at 
jenkins.scm.SCMCheckoutStrategy.checkout(SCMCheckoutStrategy.java:86)
at 
hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:499)
at hudson.model.Run.execute(Run.java:1810)
at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43)
at 
hudson.model.ResourceController.execute(ResourceController.java:97)
at hudson.model.Executor.run(Executor.java:429)
Caused by: java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:717)
at hudson.Proc$LocalProc.<init>(Proc.java:281)
at hudson.Proc$LocalProc.<init>(Proc.java:218)
at hudson.Launcher$LocalLauncher.launch(Launcher.java:936)
at hudson.Launcher$ProcStarter.start(Launcher.java:455)
at 
org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandIn(CliGitAPIImpl.java:2038)
... 14 more
ERROR: Error cloning remote repo 'origin'
Retrying after 10 seconds
No credentials specified
Wiping out workspace first.
Cloning the remote Git repository
Using shallow clone
Cloning repository https://github.com/apache/incubator-hudi.git
 > git init  # 
 > timeout=10
Fetching upstream changes from https://github.com/apache/incubator-hudi.git
 > git --version # timeout

[jira] [Created] (HUDI-285) Implement HoodieStorageWriter based on the metadata

2019-09-26 Thread leesf (Jira)
leesf created HUDI-285:
--

 Summary: Implement HoodieStorageWriter based on the metadata
 Key: HUDI-285
 URL: https://issues.apache.org/jira/browse/HUDI-285
 Project: Apache Hudi (incubating)
  Issue Type: Improvement
  Components: Write Client
Reporter: leesf
Assignee: leesf
 Fix For: 0.5.1


Currently the _getStorageWriter_ method in HoodieStorageWriterFactory is hard-coded 
to return HoodieParquetWriter, since parquet is currently the only format supported 
for HoodieStorageWriter. However, it is better to implement HoodieStorageWriter 
selection based on the metadata, for extensibility. And if _StorageWriterType_ is 
empty in the metadata, the default HoodieParquetWriter is returned so as not to 
affect the current logic.

cc [~vinoth] [~vbalaji]
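
For context, a hedged sketch of what "pick the writer from metadata, default to parquet" could look like. The trait and class names below are illustrative stand-ins, not the actual HoodieStorageWriterFactory API.

```scala
// Hypothetical sketch only: select a storage writer from a metadata-provided type,
// falling back to parquet when the type is empty so current behaviour is unchanged.
trait StorageWriter {
  def write(recordKey: String, payload: Array[Byte]): Unit
}

class ParquetStorageWriter extends StorageWriter {
  override def write(recordKey: String, payload: Array[Byte]): Unit =
    println(s"parquet write: $recordKey") // stand-in for the real parquet writer
}

class OrcStorageWriter extends StorageWriter {
  override def write(recordKey: String, payload: Array[Byte]): Unit =
    println(s"orc write: $recordKey") // stand-in for a possible future writer
}

object StorageWriterFactory {
  // storageWriterType would come from the table metadata; None keeps today's default
  def getStorageWriter(storageWriterType: Option[String]): StorageWriter =
    storageWriterType.map(_.toLowerCase) match {
      case Some("orc")            => new OrcStorageWriter
      case Some("parquet") | None => new ParquetStorageWriter
      case Some(other) =>
        throw new IllegalArgumentException(s"Unknown storage writer type: $other")
    }
}
```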



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-232) Implement sealing/unsealing for HoodieRecord class

2019-09-26 Thread leesf (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leesf reassigned HUDI-232:
--

Assignee: leesf

> Implement sealing/unsealing for HoodieRecord class
> --
>
> Key: HUDI-232
> URL: https://issues.apache.org/jira/browse/HUDI-232
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Write Client
>Affects Versions: 0.5.0
>Reporter: Vinoth Chandar
>Assignee: leesf
>Priority: Major
>
> HoodieRecord class sometimes is modified to set the record location. We can 
> get into issues like HUDI-170 if the modification is misplaced. We need a 
> mechanism to seal the class and unseal it for modification explicitly. Trying to 
> modify it in the sealed state should throw an error
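
A hedged sketch of the seal/unseal idea (illustrative names, not the actual HoodieRecord API): mutation is only allowed while the record is explicitly unsealed, so a misplaced modification fails fast.

```scala
// Hypothetical sketch: a record holding a mutable location that can only be
// changed between an explicit unseal() and seal().
final class SealableRecord(val key: String) {
  private var sealedState: Boolean = false
  private var currentLocation: Option[String] = None

  def seal(): Unit = sealedState = true
  def unseal(): Unit = sealedState = false

  // Any mutation while sealed throws, surfacing misplaced modifications early
  def setCurrentLocation(location: String): Unit = {
    if (sealedState)
      throw new UnsupportedOperationException(
        s"Record $key is sealed; call unseal() before modifying")
    currentLocation = Some(location)
  }
}
```

In this shape, the write client would unseal just before assigning the record location and seal again immediately afterwards.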



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] bvaradar commented on issue #918: [HUDI-121] : Address comments during RC2 voting

2019-09-26 Thread GitBox
bvaradar commented on issue #918: [HUDI-121] : Address comments during RC2 
voting
URL: https://github.com/apache/incubator-hudi/pull/918#issuecomment-535664328
 
 
   @vinothchandar @lresende : 
   Some examples of NOTICE and LICENSE from other incubator projects that show 
these files are generated for the source release:
   
   1. https://github.com/apache/incubator-gobblin/blob/master/LICENSE
   2. https://github.com/apache/incubator-gobblin/blob/master/NOTICE
   3. https://github.com/apache/incubator-heron/blob/master/LICENSE
   4. https://github.com/apache/incubator-heron/blob/master/NOTICE


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] bvaradar commented on issue #918: [HUDI-121] : Address comments during RC2 voting

2019-09-26 Thread GitBox
bvaradar commented on issue #918: [HUDI-121] : Address comments during RC2 
voting
URL: https://github.com/apache/incubator-hudi/pull/918#issuecomment-535585438
 
 
   @lresende : Updated based on review comments. Please take a look again and 
review.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #918: [HUDI-121] : Address comments during RC2 voting

2019-09-26 Thread GitBox
bvaradar commented on a change in pull request #918: [HUDI-121] : Address 
comments during RC2 voting
URL: https://github.com/apache/incubator-hudi/pull/918#discussion_r328712890
 
 

 ##
 File path: pom.xml
 ##
 @@ -394,18 +369,18 @@
   
 
 
+  NOTICE
   **/.*
-  **/*.txt
-  **/*.sh
-  **/*.log
+  **/*.json
+  **/*.sqltemplate
+  **/compose_env
+  **/*NOTICE*
+  **/*LICENSE*
   **/dependency-reduced-pom.xml
-  **/test/resources/*.avsc
   **/test/resources/*.data
-  **/test/resources/*.schema
-  **/test/resources/*.csv
-  **/main/avro/*.avsc
-  **/target/*
+  **/target/**
   **/style/*
+  **/generated-sources/**
 
 Review comment:
   Fixed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #918: [HUDI-121] : Address comments during RC2 voting

2019-09-26 Thread GitBox
bvaradar commented on a change in pull request #918: [HUDI-121] : Address 
comments during RC2 voting
URL: https://github.com/apache/incubator-hudi/pull/918#discussion_r328712793
 
 

 ##
 File path: NOTICE
 ##
 @@ -1,120 +1,574 @@
 Apache HUDI
 Copyright 2019 The Apache Software Foundation
 
-Licensed under the Apache License, Version 2.0 (the "License"); you may not 
use this file except in compliance with the License. You may obtain a copy of 
the License at
-
-   http://www.apache.org/licenses/LICENSE-2.0
-
- Unless required by applicable law or agreed to in writing, software 
distributed under the License is distributed on an "AS IS" BASIS, WITHOUT 
WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the 
License for the specific language governing permissions and limitations under 
the License.
-
-This project includes:
-  An open source Java toolkit for Amazon S3 under Apache License, Version 2.0
-  ANTLR StringTemplate 4.0.2 under BSD licence
-  Apache Commons Collections under Apache License, Version 2.0
-  Apache Hadoop Annotations under Apache License, Version 2.0
-  Apache HttpClient under Apache License, Version 2.0
-  Apache HttpCore under Apache License, Version 2.0
-  Apache Kafka under The Apache Software License, Version 2.0
-  Apache Log4j under The Apache Software License, Version 2.0
-  Apache Parquet Avro under The Apache Software License, Version 2.0
-  ASM Core under 3-Clause BSD License
-  bijection-avro under Apache 2
-  bijection-core under Apache 2
-  Commons BeanUtils Core under The Apache Software License, Version 2.0
-  Commons CLI under The Apache Software License, Version 2.0
-  Commons Codec under The Apache Software License, Version 2.0
-  Commons Compress under The Apache Software License, Version 2.0
-  Commons Configuration under The Apache Software License, Version 2.0
-  Commons Daemon under The Apache Software License, Version 2.0
-  Commons IO under The Apache Software License, Version 2.0
-  Commons Lang under The Apache Software License, Version 2.0
-  Commons Logging under The Apache Software License, Version 2.0
-  Commons Math under The Apache Software License, Version 2.0
-  Commons Net under The Apache Software License, Version 2.0
-  commons-beanutils under Apache License
-  Curator Client under The Apache Software License, Version 2.0
-  Curator Framework under The Apache Software License, Version 2.0
-  Curator Recipes under The Apache Software License, Version 2.0
-  Data Mapper for Jackson under The Apache Software License, Version 2.0
-  Digester under The Apache Software License, Version 2.0
-  FindBugs-jsr305 under The Apache Software License, Version 2.0
-  Fluent API for Apache HttpClient under Apache License, Version 2.0
-  Graphite Integration for Metrics under Apache License 2.0
-  Guava: Google Core Libraries for Java under The Apache Software License, 
Version 2.0
-  Hive Common under The Apache Software License, Version 2.0
-  Hive JDBC under The Apache Software License, Version 2.0
-  Hive Metastore under The Apache Software License, Version 2.0
-  Hive Service under The Apache Software License, Version 2.0
-  Hive Service RPC under The Apache Software License, Version 2.0
-  htrace-core under The Apache Software License, Version 2.0
-  HttpClient under Apache License
-  hudi-client under Apache License, Version 2.0
-  hudi-common under Apache License, Version 2.0
-  hudi-hadoop-mr under Apache License, Version 2.0
-  hudi-hive under Apache License, Version 2.0
-  hudi-spark under Apache License, Version 2.0
-  hudi-timeline-service under Apache License, Version 2.0
-  hudi-utilities under Apache License, Version 2.0
-  IntelliJ IDEA Annotations under The Apache Software License, Version 2.0
-  io.confluent:common-config under Apache License, Version 2.0
-  io.confluent:common-utils under Apache License, Version 2.0
-  io.confluent:kafka-avro-serializer under Apache License, Version 2.0
-  io.confluent:kafka-schema-registry-client under Apache License, Version 2.0
-  Jackson under The Apache Software License, Version 2.0
-  Jackson-annotations under The Apache Software License, Version 2.0
-  Jackson-core under The Apache Software License, Version 2.0
-  jackson-databind under The Apache Software License, Version 2.0
-  Java Servlet API under CDDL + GPLv2 with classpath exception
-  java-xmlbuilder under Apache License, Version 2.0
-  Javalin under The Apache Software License, Version 2.0
-  JAX-RS provider for JSON content type under The Apache Software License, 
Version 2.0 or GNU Lesser General Public License (LGPL), Version 2.1
-  JAXB RI under CDDL 1.1 or GPL2 w/ CPE
-  jcommander under Apache 2.0
-  jersey-core under CDDL 1.1 or GPL2 w/ CPE
-  jersey-json under CDDL 1.1 or GPL2 w/ CPE
-  jersey-server under CDDL 1.1 or GPL2 w/ CPE
-  Jettison under Apache License, Version 2.0
-  Jetty :: Asynchronous HTTP Client under Apache Software License - Version 
2.0 or Eclipse Public License - Version 1.0
-  

[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #918: [HUDI-121] : Address comments during RC2 voting

2019-09-26 Thread GitBox
bvaradar commented on a change in pull request #918: [HUDI-121] : Address 
comments during RC2 voting
URL: https://github.com/apache/incubator-hudi/pull/918#discussion_r328712247
 
 

 ##
 File path: LICENSE
 ##
 @@ -175,32 +175,6 @@
   of your accepting any such warranty or additional liability.
 
END OF TERMS AND CONDITIONS
-
-   APPENDIX: How to apply the Apache License to your work.
-
-  To apply the Apache License to your work, attach the following
-  boilerplate notice, with the fields enclosed by brackets "[]"
-  replaced with your own identifying information. (Don't include
-  the brackets!)  The text should be enclosed in the appropriate
-  comment syntax for the file format. We also recommend that a
-  file or class name and description of purpose be included on the
-  same "printed page" as the copyright notice for easier
-  identification within third-party archives.
-
-   Copyright [] [name of copyright owner]
-
-   Licensed under the Apache License, Version 2.0 (the "License");
-   you may not use this file except in compliance with the License.
-   You may obtain a copy of the License at
-
-   http://www.apache.org/licenses/LICENSE-2.0
-
-   Unless required by applicable law or agreed to in writing, software
-   distributed under the License is distributed on an "AS IS" BASIS,
-   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-   See the License for the specific language governing permissions and
-   limitations under the License.
-
 
 This project bundles org.slf4j:slf4j-api, org.slf4j:slf4j-log4j under the 
terms of the MIT license.
 
 Review comment:
   @lresende : Cleaned it up. I looked at other incubator projects' LICENSE 
files and they all contain only the licenses for source code contributions (i.e., what is 
part of the source release). I have made the changes to be consistent with this. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (HUDI-232) Implement sealing/unsealing for HoodieRecord class

2019-09-26 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16938783#comment-16938783
 ] 

Vinoth Chandar commented on HUDI-232:
-

+1 . 

> Implement sealing/unsealing for HoodieRecord class
> --
>
> Key: HUDI-232
> URL: https://issues.apache.org/jira/browse/HUDI-232
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Write Client
>Affects Versions: 0.5.0
>Reporter: Vinoth Chandar
>Priority: Major
>
> HoodieRecord class sometimes is modified to set the record location. We can 
> get into issues like HUDI-170 if the modification is misplaced. We need a 
> mechanism to seal the class and unseal it for modification explicitly. Trying to 
> modify it in the sealed state should throw an error



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (HUDI-279) Regression in Schema Evolution due to PR-755

2019-09-26 Thread leesf (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leesf closed HUDI-279.
--
Resolution: Fixed

Fixed via master: 2ea8b0c3f1eeb19f4dc1e9946331c8fd93e6daab

> Regression in Schema Evolution due to PR-755
> 
>
> Key: HUDI-279
> URL: https://issues.apache.org/jira/browse/HUDI-279
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Common Core
>Reporter: Balaji Varadarajan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Reported by Alex:
> [https://github.com/apache/incubator-hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/table/HoodieCopyOnWriteTable.java#L200]
> this sets an Avro Schema on the config
>  
> but I see that AvroReadSupport.init is getting a different config instance, 
> with avro schema set to null and falls back to what is in parquet. Which 
> breaks during the old/new data merge. I’m pretty sure it worked before as we 
> had successful schema evolutions. Any idea why it might be happening? 
>  
> Caused by changes in :
> [https://github.com/apache/incubator-hudi/pull/755]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-265) Failed to delete tmp dirs created in unit tests

2019-09-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-265:

Labels: pull-request-available  (was: )

> Failed to delete tmp dirs created in unit tests
> ---
>
> Key: HUDI-265
> URL: https://issues.apache.org/jira/browse/HUDI-265
> Project: Apache Hudi (incubating)
>  Issue Type: Test
>  Components: Testing
>Reporter: leesf
>Assignee: leesf
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.1
>
>
> In some unit tests, such as TestHoodieSnapshotCopier and TestUpdateMapFunction, 
> the tmp dir created in _init (with the before annotation)_ is not deleted in 
> _clean (with the after annotation)_ after the tests run, which leaves too 
> many folders in /tmp. We need to delete these dirs after the unit tests finish.
> I will go through all the unit tests that do not properly delete the tmp dir 
> and send a patch.
>  
> cc [~vinoth] [~vbalaji]
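
A hedged sketch of the cleanup pattern the ticket asks for (Scala with JUnit 4; class and prefix names are illustrative): whatever `init` creates under /tmp, `clean` removes.

```scala
import java.nio.file.{Files, Path}
import org.apache.commons.io.FileUtils
import org.junit.{After, Before, Test}

// Hypothetical example test: the temp directory created in @Before is always
// removed in @After, so repeated runs do not accumulate folders in /tmp.
class ExampleTmpDirTest {
  private var tmpDir: Path = _

  @Before def init(): Unit =
    tmpDir = Files.createTempDirectory("hudi-ut-")

  @After def clean(): Unit =
    FileUtils.deleteDirectory(tmpDir.toFile) // recursively deletes the directory created above

  @Test def writesIntoTmpDir(): Unit = {
    val f = Files.createTempFile(tmpDir, "data", ".tmp")
    assert(Files.exists(f))
  }
}
```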



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] leesf opened a new pull request #928: [HUDI-265] Failed to delete tmp dirs created in unit tests

2019-09-26 Thread GitBox
leesf opened a new pull request #928: [HUDI-265] Failed to delete tmp dirs 
created in unit tests
URL: https://github.com/apache/incubator-hudi/pull/928
 
 
   see 
[jira-265](https://jira.apache.org/jira/projects/HUDI/issues/HUDI-265?filter=allopenissues)
   
   cc @vinothchandar 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (HUDI-269) Provide ability to throttle DeltaStreamer sync runs

2019-09-26 Thread Balaji Varadarajan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16938543#comment-16938543
 ] 

Balaji Varadarajan commented on HUDI-269:
-

cc [~vinoth]. We will look into this

> Provide ability to throttle DeltaStreamer sync runs
> ---
>
> Key: HUDI-269
> URL: https://issues.apache.org/jira/browse/HUDI-269
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: deltastreamer
>Reporter: Balaji Varadarajan
>Assignee: Xing Pan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.0
>
> Attachments: hudi_request_test.tar.gz, 
> image-2019-09-25-08-51-19-686.png, image-2019-09-26-09-02-24-761.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Copied from [https://github.com/apache/incubator-hudi/issues/922]
> In some scenarios in our cluster, we may want the delta streamer to slow down a 
> bit, so it's nice to have a parameter to control the minimum sync interval of each 
> sync in continuous mode.
> This param defaults to 0, so it should not affect the current logic.
> Minor PR: [#921|https://github.com/apache/incubator-hudi/pull/921]
> The main reason we want to slow it down is that AWS S3 charges by S3 
> get/put/list requests; we don't want to pay for too many requests for a 
> table that changes really slowly.
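
A hedged sketch of the throttling described above (the parameter name and loop shape are illustrative, not DeltaStreamer's actual code): if a sync round finishes faster than the configured minimum interval, sleep for the remainder.

```scala
// Hypothetical continuous-mode loop with a minimum sync interval.
object ContinuousSyncLoop {
  def run(minSyncIntervalSeconds: Long, syncOnce: () => Unit): Unit = {
    while (true) {
      val start = System.currentTimeMillis()
      syncOnce() // one ingest round
      val elapsedMs = System.currentTimeMillis() - start
      val remainingMs = minSyncIntervalSeconds * 1000 - elapsedMs
      if (remainingMs > 0) {
        // a default of 0 keeps today's behaviour: no sleep, back-to-back syncs
        Thread.sleep(remainingMs)
      }
    }
  }
}
```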



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] taherk77 commented on issue #917: [HUDI-251] JDBC incremental load to HUDI with DeltaStreamer

2019-09-26 Thread GitBox
taherk77 commented on issue #917: [HUDI-251] JDBC incremental load to HUDI with 
DeltaStreamer
URL: https://github.com/apache/incubator-hudi/pull/917#issuecomment-535381708
 
 
   Also, can we write the incremental column along with the last_val to the 
checkpoint? I am saying this because, say for run 1 we use "contract_id" as the incremental 
column and the max value written was 8, and then for 
run 2 I change the params to use "contract_created_at" as the incremental column; 
then the ppd query will be "select * from contract where contract_created_at > 
"8"", which is incorrect and will fail. So if we write the column, we can first verify 
the column and only then run, or else throw an exception straight away 
and exit. 
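
A hedged sketch of that suggestion (the checkpoint format, names and query shape are illustrative only, not the PR's actual code): persist the column together with the last value, and refuse to build the pushed-down predicate when the configured column no longer matches.

```scala
// Hypothetical checkpoint that records both the incremental column and its last value.
final case class JdbcCheckpoint(incrementalColumn: String, lastValue: String) {
  def serialize: String = s"$incrementalColumn:$lastValue"
}

object JdbcCheckpoint {
  def deserialize(s: String): JdbcCheckpoint = {
    val Array(col, value) = s.split(":", 2)
    JdbcCheckpoint(col, value)
  }

  // Build the incremental pull query, validating the column before reusing the checkpoint.
  def buildQuery(table: String, configuredColumn: String, previous: Option[JdbcCheckpoint]): String =
    previous match {
      case Some(cp) if cp.incrementalColumn != configuredColumn =>
        // e.g. a checkpoint built on contract_id cannot drive a contract_created_at filter
        throw new IllegalArgumentException(
          s"Checkpoint column ${cp.incrementalColumn} does not match configured column $configuredColumn")
      case Some(cp) =>
        s"SELECT * FROM $table WHERE $configuredColumn > '${cp.lastValue}'"
      case None =>
        s"SELECT * FROM $table" // no checkpoint yet: full pull
    }
}
```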


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] taherk77 commented on issue #917: [HUDI-251] JDBC incremental load to HUDI with DeltaStreamer

2019-09-26 Thread GitBox
taherk77 commented on issue #917: [HUDI-251] JDBC incremental load to HUDI with 
DeltaStreamer
URL: https://github.com/apache/incubator-hudi/pull/917#issuecomment-535380366
 
 
   Hi @vinothchandar, I have created what we discussed. If a checkpoint is found we 
resume from the checkpoint, or else do a full pull. I have also added a few other 
options and tested through JUnit locally.
   
   I am really unsure how to mock the MySQL database in the JUnit test! Can you 
give me some pointers on that, please? It would really be helpful. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services