Congrats! Another awesome release.

On Wed, Sep 1, 2021 at 11:49 AM Pratyaksh Sharma <[email protected]> wrote:
Great news! This one really feels like a major release with so many good features getting added. :)

On Wed, Sep 1, 2021 at 7:19 AM Udit Mehrotra <[email protected]> wrote:

The Apache Hudi team is pleased to announce the release of Apache Hudi 0.9.0.

This release comes almost 5 months after 0.8.0. It includes 387 resolved issues, comprising new features as well as general improvements and bug fixes. Here are a few quick highlights:

*Spark SQL DML and DDL Support*
We have added experimental support for DDL/DML using Spark SQL, taking a huge step towards making Hudi more easily accessible and operable by all personas (non-engineers, analysts, etc.). Users can now use SQL statements like "CREATE TABLE ... USING HUDI" and "CREATE TABLE ... AS SELECT" to create/manage tables in catalogs like Hive, and "INSERT", "INSERT OVERWRITE", "UPDATE", "MERGE INTO" and "DELETE" statements to manipulate data. For more information, check out our docs here <https://hudi.apache.org/docs/quick-start-guide>, clicking on the Spark SQL tab.

*Query Side Improvements*
Hudi tables are now registered with Hive as Spark datasource tables, meaning Spark SQL on these tables now uses the datasource as well, instead of relying on the ill-maintained and cumbersome Hive fallbacks within Spark. This unlocks many optimizations, such as the use of Hudi's own FileIndex implementation <https://github.com/apache/hudi/blob/bf5a52e51bbeaa089995335a0a4c55884792e505/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/HoodieFileIndex.scala#L46> for optimized caching, and the use of the Hudi metadata table for faster listing of large tables. We have also added support for time travel queries <https://hudi.apache.org/docs/quick-start-guide#time-travel-query> for the Spark datasource.
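As a quick illustration, the new DDL/DML statements can be issued through a SparkSession with the Hudi bundle on the classpath. This is a minimal sketch, not from the release notes: the table name, columns, and property values below are illustrative, and the quick start's Spark SQL tab remains the authoritative syntax reference.

```scala
// Minimal sketch of the new Spark SQL DDL/DML support (Hudi 0.9.0).
// Assumes a SparkSession `spark` configured with the Hudi Spark bundle
// and an existing source table `trip_updates`; names are illustrative.
spark.sql("""
  CREATE TABLE hudi_trips (id INT, rider STRING, fare DOUBLE, ts BIGINT)
  USING hudi
  TBLPROPERTIES (primaryKey = 'id', preCombineField = 'ts')
""")

spark.sql("INSERT INTO hudi_trips VALUES (1, 'rider-A', 19.10, 1)")
spark.sql("UPDATE hudi_trips SET fare = 25.0 WHERE id = 1")
spark.sql("""
  MERGE INTO hudi_trips t
  USING trip_updates u ON t.id = u.id
  WHEN MATCHED THEN UPDATE SET *
  WHEN NOT MATCHED THEN INSERT *
""")
spark.sql("DELETE FROM hudi_trips WHERE id = 1")
```

The same statements also work from the spark-sql shell or a notebook session.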
*Writer Side Improvements*
This release has several major writer side improvements. Virtual key support has been added to avoid populating meta fields, leveraging existing fields to populate record keys and partition paths instead.
The bulk insert operation using the row writer is now enabled by default for faster inserts.
Hudi's automatic cleaning of uncommitted data has been enhanced to be performant over cloud stores. You can learn more about this new centrally coordinated marker mechanism in this blog <https://hudi.apache.org/blog/2021/08/18/improving-marker-mechanism/>.
Async clustering support has been added to both DeltaStreamer and the Spark Structured Streaming sink. More on this can be found in this blog <https://hudi.apache.org/blog/2021/08/23/async-clustering/>.
Users can choose to drop the fields used to generate partition paths.
Added support for a new write operation, "delete_partition", in Spark. Users can leverage this to delete older partitions in bulk, in addition to record-level deletes.
Added support for Huawei Cloud Object Storage, Baidu AFS, and Baidu BOS storage in Hudi.
A pre-commit validator framework <https://github.com/apache/hudi/blob/bf5a52e51bbeaa089995335a0a4c55884792e505/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/validator/SparkPreCommitValidator.java> has been added for the Spark engine, which can be used with DeltaStreamer and Spark datasource writers. Users can leverage this to add any validations to be executed before committing writes to Hudi.
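For instance, the bulk partition delete mentioned above might be invoked through the Spark datasource roughly like this. This is a hedged sketch: "hoodie.datasource.write.operation" is a documented key, but the partitions-to-delete option key here is my assumption following the hoodie.datasource.write.* naming convention, so please verify it against the 0.9.0 configuration docs before use.

```scala
// Hedged sketch: deleting older partitions in bulk with the new
// "delete_partition" write operation (Hudi 0.9.0, Spark datasource).
// `spark` is an existing SparkSession; the table name, base path and
// partition values are illustrative, and the partitions-to-delete
// option key is an assumption, not confirmed from the release notes.
spark.emptyDataFrame.write
  .format("hudi")
  .option("hoodie.table.name", "hudi_trips")
  .option("hoodie.datasource.write.operation", "delete_partition")
  .option("hoodie.datasource.write.partitions.to.delete", "2020/01/01,2020/01/02")
  .mode("append")
  .save("/tmp/hudi_trips")
```

The input DataFrame's contents are not what drives the delete here; the partitions to remove are named via the write options.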
A few out-of-the-box validators are available, like SqlQueryEqualityPreCommitValidator <https://github.com/apache/hudi/blob/release-0.9.0/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/validator/SqlQueryEqualityPreCommitValidator.java>, SqlQueryInequalityPreCommitValidator <https://github.com/apache/hudi/blob/release-0.9.0/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/validator/SqlQueryInequalityPreCommitValidator.java> and SqlQuerySingleResultPreCommitValidator <https://github.com/apache/hudi/blob/master/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/validator/SqlQuerySingleResultPreCommitValidator.java>.

*Flink Integration Improvements*
The Flink writer now supports propagation of the CDC format for MOR tables by turning on the option "changelog.enabled=true". Hudi will then persist all change flags of each record, allowing users to do stateful computation based on these change logs.
Flink writing is now close to feature parity with Spark writing, with the addition of write operations like "bulk_insert" and "insert_overwrite", support for non-partitioned tables, automatic cleanup of uncommitted data, global indexing support, Hive-style partitioning, and handling of partition path updates.
Writing also supports a new log append mode, where no records are de-duplicated and base files are directly written for each flush.
Flink readers now support streaming reads from COW/MOR tables. Deletions are emitted by default in streaming read mode; the downstream receives the "DELETE" message as a Hoodie record with an empty payload.
Hive sync has been improved by adding support for different Hive versions and asynchronous execution.
The Flink Streamer tool now supports transformers.

*DeltaStreamer Improvements*
We have enhanced the DeltaStreamer utility with 3 new sources.
JDBC <https://github.com/apache/hudi/blob/release-0.9.0/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/JdbcSource.java> will help with fetching data from RDBMS sources, and SQLSource <https://github.com/apache/hudi/blob/release-0.9.0/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/SqlSource.java> will assist in backfilling use cases. S3EventsHoodieIncrSource <https://github.com/apache/hudi/blob/release-0.9.0/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/S3EventsHoodieIncrSource.java> and S3EventsSource <https://github.com/apache/hudi/blob/release-0.9.0/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/S3EventsSource.java> assist in reading data from S3 reliably and efficiently ingesting it into Hudi. In addition, we have added support for timestamp-based fetch from Kafka and added basic auth support for the schema registry.

Please find more information about the release here:
https://hudi.apache.org/releases/release-0.9.0

For details on how to use Hudi, please look at the quick start page located here:
https://hudi.apache.org/docs/quick-start-guide.html

If you'd like to download the source release, you can find it here:
https://github.com/apache/hudi/releases/tag/release-0.9.0

You can read more about the release (including release notes) here:
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12322822&version=12350027

We welcome your help and feedback. For more information on how to report problems and to get involved, visit the project website at https://hudi.apache.org/

Thanks to everyone involved!

Udit Mehrotra
(on behalf of the Hudi Community)
