This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/asf-site by this push:
     new 667b1fb55e [DOCS][MINOR] Fixing migration guide for 0.12.0 release page (#7330)
667b1fb55e is described below

commit 667b1fb55e55b86b04b50035e7dedf4b3e8df6d0
Author: Sivabalan Narayanan <n.siv...@gmail.com>
AuthorDate: Tue Nov 29 12:00:00 2022 -0800

    [DOCS][MINOR] Fixing migration guide for 0.12.0 release page (#7330)
---
 website/releases/release-0.12.0.md | 110 ++++++++++++++++++-------------------
 1 file changed, 55 insertions(+), 55 deletions(-)

diff --git a/website/releases/release-0.12.0.md b/website/releases/release-0.12.0.md
index c76781d037..07daa59b2a 100644
--- a/website/releases/release-0.12.0.md
+++ b/website/releases/release-0.12.0.md
@@ -7,6 +7,61 @@ last_modified_at: 2022-08-17T10:30:00+05:30
 ---
 # [Release 0.12.0](https://github.com/apache/hudi/releases/tag/release-0.12.0) ([docs](/docs/quick-start-guide))
 
+## Migration Guide
+
+In this release, there have been a few API and configuration updates, listed below, that warranted a new table version.
+Hence, the latest [table version](https://github.com/apache/hudi/blob/bf86efef719b7760ea379bfa08c537431eeee09a/hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableVersion.java#L41)
+is `5`. For existing Hudi tables on older versions, a one-time upgrade step will be executed automatically. Please take
+note of the following updates before upgrading to Hudi 0.12.0.
+
+### Configuration Updates
+
+In this release, the default values of a few configurations have been changed. They are as follows (a configuration sketch follows at the end of this guide):
+
+- `hoodie.bulkinsert.sort.mode`: This config determines the mode for sorting records during bulk insert. Its default value has been changed from `GLOBAL_SORT` to `NONE`, which means no sorting is done and the write matches `spark.write.parquet()` in terms of overhead.
+- `hoodie.datasource.hive_sync.partition_value_extractor`: This config is used to extract and transform partition values during Hive sync. Its default value has been changed from `SlashEncodedDayPartitionValueExtractor` to `MultiPartKeysValueExtractor`. If you relied on the previous default value (i.e., have not set it explicitly), you are required to set the config to `org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor`. From this release, if this config is not set and Hive sync [...]
+- The following configs will be inferred, if not set manually, from other configs' values:
+  - `META_SYNC_BASE_FILE_FORMAT`: inferred from `org.apache.hudi.common.table.HoodieTableConfig.BASE_FILE_FORMAT`
+  - `META_SYNC_ASSUME_DATE_PARTITION`: inferred from `org.apache.hudi.common.config.HoodieMetadataConfig.ASSUME_DATE_PARTITIONING`
+  - `META_SYNC_DECODE_PARTITION`: inferred from `org.apache.hudi.common.table.HoodieTableConfig.URL_ENCODE_PARTITIONING`
+  - `META_SYNC_USE_FILE_LISTING_FROM_METADATA`: inferred from `org.apache.hudi.common.config.HoodieMetadataConfig.ENABLE`
+
+### API Updates
+
+In `SparkKeyGeneratorInterface`, the return type of the `getRecordKey` API has been changed from `String` to `UTF8String`, as shown below.
+```java
+// Before
+String getRecordKey(InternalRow row, StructType schema);
+
+// After
+UTF8String getRecordKey(InternalRow row, StructType schema);
+```
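If you maintain a custom key generator, migrating to the new signature usually amounts to wrapping the key you already compute. Below is a minimal, purely illustrative sketch; the class name and key-field ordinal are hypothetical and not part of the release:

```java
import org.apache.spark.sql.catalyst.InternalRow;
import org.apache.spark.sql.types.StructType;
import org.apache.spark.unsafe.types.UTF8String;

// Hypothetical custom key generator adapted to the 0.12.0 signature.
public class ExampleKeyGenerator {
  public UTF8String getRecordKey(InternalRow row, StructType schema) {
    // Build the record key as before (here: the string field at ordinal 0) ...
    String recordKey = row.getString(0);
    // ... then wrap it to satisfy the new UTF8String return type.
    return UTF8String.fromString(recordKey);
  }
}
```

If the key field is already materialized as a `UTF8String` (e.g., via `row.getUTF8String(0)`), returning it directly avoids the intermediate `String` conversion.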
+
+### Fallback Partition
+
+If the partition field value is null, Hudi employs a fallback mechanism instead of failing the write. Until 0.9.0,
+`__HIVE_DEFAULT_PARTITION__` was used as the fallback partition. After 0.9.0, due to some refactoring, the fallback
+partition was changed to `default`. This `default` partition does not sit well with some of the query engines, so we are
+switching the fallback partition back to `__HIVE_DEFAULT_PARTITION__` from 0.12.0 onwards. We have added an upgrade step that
+fails the upgrade if the existing Hudi table has a partition named `default`. Users are expected to rewrite the data
+in this partition to a partition named [\_\_HIVE_DEFAULT_PARTITION\_\_](https://github.com/apache/hudi/blob/0d0a4152cfd362185066519ae926ac4513c7a152/hudi-common/src/main/java/org/apache/hudi/common/util/PartitionPathEncodeUtils.java#L29).
+However, if you have intentionally named your partition `default`, you can bypass this check using the config `hoodie.skip.default.partition.validation` (see the sketch below).
+
+### Bundle Updates
+
+- `hudi-aws-bundle` extracts the AWS-related dependencies out of `hudi-utilities-bundle` and `hudi-spark-bundle`. In order to use features such as Glue sync, the CloudWatch metrics reporter, or the DynamoDB lock provider, users need to provide the `hudi-aws-bundle` jar along with the `hudi-utilities-bundle` or `hudi-spark-bundle` jar.
+- Spark 3.3 support is added; users on Spark 3.3 can use `hudi-spark3.3-bundle` or `hudi-spark3-bundle` (legacy bundle name).
+- Spark 3.2 will continue to be supported via `hudi-spark3.2-bundle`.
+- Spark 3.1 will continue to be supported via `hudi-spark3.1-bundle`.
+- Spark 2.4 will continue to be supported via `hudi-spark2.4-bundle` or `hudi-spark-bundle` (legacy bundle name).
+- Flink 1.15 support is added; users on Flink 1.15 can use `hudi-flink1.15-bundle`.
+- Flink 1.14 will continue to be supported via `hudi-flink1.14-bundle`.
+- Flink 1.13 will continue to be supported via `hudi-flink1.13-bundle`.
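To put the configuration notes in this guide into practice, here is a minimal sketch of a Spark datasource write that explicitly pins the pre-0.12.0 defaults and bypasses the `default` partition validation. The option keys and values come from the sections above; treating `hoodie.skip.default.partition.validation` as a write option, along with the table name and paths, is an assumption for illustration:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class Pre012WriteExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("hudi-0.12.0-migration").getOrCreate();
    Dataset<Row> df = spark.read().parquet("/tmp/source_data"); // hypothetical input

    df.write()
        .format("hudi")
        // Restore the pre-0.12.0 bulk insert sorting behavior.
        .option("hoodie.bulkinsert.sort.mode", "GLOBAL_SORT")
        // Keep the pre-0.12.0 Hive sync partition value extractor.
        .option("hoodie.datasource.hive_sync.partition_value_extractor",
            "org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor")
        // Only needed if you intentionally keep a partition named `default`
        // (assumption: the flag is supplied as a write option).
        .option("hoodie.skip.default.partition.validation", "true")
        .option("hoodie.table.name", "example_table") // hypothetical table name
        .mode(SaveMode.Append)
        .save("/tmp/hudi/example_table"); // hypothetical base path
  }
}
```

Omit the bypass option unless you intentionally keep a partition named `default`.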
+
 ## Release Highlights
 
 ### Presto-Hudi Connector
@@ -105,61 +160,6 @@ This version brings more improvements to make Hudi the most performant lake stor
 We recently benchmarked Hudi against TPC-DS workload. Please check out [our blog](/blog/2022/06/29/Apache-Hudi-vs-Delta-Lake-transparent-tpc-ds-lakehouse-performance-benchmarks) for more details.
 
-### Migration Guide
-
-In this release, there have been a few API and configuration updates listed below that warranted a new table version.
-Hence, the latest [table version](https://github.com/apache/hudi/blob/bf86efef719b7760ea379bfa08c537431eeee09a/hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableVersion.java#L41)
-is `5`. For existing Hudi tables on older version, a one-time upgrade step will be executed automatically. Please take
-note of the following updates before upgrading to Hudi 0.12.0.
-
-#### Configuration Updates
-
-In this release, the default value for a few configurations have been changed. They are as follows:
-
-- `hoodie.bulkinsert.sort.mode`: This config is used to determine mode for sorting records for bulk insert. Its default value has been changed from `GLOBAL_SORT` to `NONE`, which means no sorting is done and it matches `spark.write.parquet()` in terms of overhead.
-- `hoodie.datasource.hive_sync.partition_value_extractor`: This config is used to extract and transform partition value during Hive sync. Its default value has been changed from `SlashEncodedDayPartitionValueExtractor` to `MultiPartKeysValueExtractor`. If you relied on the previous default value (i.e., have not set it explicitly), you are required to set the config to `org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor`. From this release, if this config is not set and Hive sync [...]
-- The following configs will be inferred, if not set manually, from other configs' values:
-  - `META_SYNC_BASE_FILE_FORMAT`: infer from `org.apache.hudi.common.table.HoodieTableConfig.BASE_FILE_FORMAT`
-
-  - `META_SYNC_ASSUME_DATE_PARTITION`: infer from `org.apache.hudi.common.config.HoodieMetadataConfig.ASSUME_DATE_PARTITIONING`
-
-  - `META_SYNC_DECODE_PARTITION`: infer from `org.apache.hudi.common.table.HoodieTableConfig.URL_ENCODE_PARTITIONING`
-
-  - `META_SYNC_USE_FILE_LISTING_FROM_METADATA`: infer from `org.apache.hudi.common.config.HoodieMetadataConfig.ENABLE`
-
-#### API Updates
-
-In `SparkKeyGeneratorInterface`, return type of the `getRecordKey` API has been changed from String to UTF8String.
-```java
-// Before
-String getRecordKey(InternalRow row, StructType schema);
-
-// After
-UTF8String getRecordKey(InternalRow row, StructType schema);
-```
-
-#### Fallback Partition
-
-If partition field value was null, Hudi has a fallback mechanism instead of failing the write. Until 0.9.0,
-`__HIVE_DEFAULT_PARTITION__` was used as the fallback partition. After 0.9.0, due to some refactoring, fallback
-partition changed to `default`. This default partition does not sit well with some of the query engines. So, we are
-switching the fallback partition to `__HIVE_DEFAULT_PARTITION__` from 0.12.0. We have added an upgrade step where in,
-we fail the upgrade if the existing Hudi table has a partition named `default`. Users are expected to rewrite the data
-in this partition to a partition named [\_\_HIVE_DEFAULT_PARTITION\_\_](https://github.com/apache/hudi/blob/0d0a4152cfd362185066519ae926ac4513c7a152/hudi-common/src/main/java/org/apache/hudi/common/util/PartitionPathEncodeUtils.java#L29).
-However, if you had intentionally named your partition as `default`, you can bypass this using the config `hoodie.skip.default.partition.validation`.
-
-#### Bundle Updates
-
-- `hudi-aws-bundle` extracts away aws-related dependencies from hudi-utilities-bundle or hudi-spark-bundle. In order to use features such as Glue sync, Cloudwatch metrics reporter or DynamoDB lock provider, users need to provide hudi-aws-bundle jar along with hudi-utilities-bundle or hudi-spark-bundle jars.
-- Spark 3.3 support is added; users who are on Spark 3.3 can use `hudi-spark3.3-bundle` or `hudi-spark3-bundle` (legacy bundle name).
-- Spark 3.2 will continue to be supported via `hudi-spark3.2-bundle`.
-- Spark 3.1 will continue to be supported via `hudi-spark3.1-bundle`.
-- Spark 2.4 will continue to be supported via `hudi-spark2.4-bundle` or `hudi-spark-bundle` (legacy bundle name).
-- Flink 1.15 support is added; users who are on Flink 1.15 can use `hudi-flink1.15-bundle`.
-- Flink 1.14 will continue to be supported via `hudi-flink1.14-bundle`.
-- Flink 1.13 will continue to be supported via `hudi-flink1.13-bundle`.
-
 ## Known Regressions:
 
 We discovered a regression in Hudi 0.12 release related to Bloom