Re: [ANNOUNCE] Apache Hudi 0.9.0 released

2021-09-03 Thread Rajesh Mahindra
A really awesome release, kudos to everyone!


Re: [ANNOUNCE] Apache Hudi 0.9.0 released

2021-09-03 Thread leesf
Thanks Udit for driving the release. Great news!

Re: [ANNOUNCE] Apache Hudi 0.9.0 released

2021-09-03 Thread Rubens Rodrigues
Hello

I'm from Brazil and I have been following Hudi since version 0.5. Congratulations to everyone; Hudi's evolution in only one year is impressive.

My colleagues and I are very happy to have chosen Hudi for our data lake.

Thank you so much for this wonderful work

Re: [ANNOUNCE] Apache Hudi 0.9.0 released

2021-09-03 Thread Raymond Xu
Congrats! Another awesome release.


Re: [ANNOUNCE] Apache Hudi 0.9.0 released

2021-09-01 Thread Pratyaksh Sharma
Great news! This one really feels like a major release with so many good
features getting added. :)


[ANNOUNCE] Apache Hudi 0.9.0 released

2021-08-31 Thread Udit Mehrotra
The Apache Hudi team is pleased to announce the release of Apache Hudi
0.9.0.

This release comes almost 5 months after 0.8.0. It includes 387 resolved
issues, comprising new features as well as
general improvements and bug-fixes. Here are a few quick highlights:

*Spark SQL DML and DDL Support*
We have added experimental support for DDL/DML using Spark SQL, taking a
huge step towards making Hudi more easily accessible and operable by all
personas (non-engineers, analysts, etc.). Users can now use SQL statements like
"CREATE TABLE ... USING HUDI" and "CREATE TABLE .. AS SELECT" to
create and manage tables in catalogs like Hive,
and "INSERT", "INSERT OVERWRITE", "UPDATE", "MERGE INTO" and "DELETE"
statements to manipulate data.
For more information, check out our docs and click on the SparkSQL tab.
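As a quick illustrative sketch, the new support lets you drive a Hudi table entirely from SQL. The table name, columns, and property names below are made up for illustration; check the 0.9.0 Spark SQL docs for the exact options your catalog requires:

```sql
-- Hypothetical table and columns; primaryKey/preCombineField property
-- names should be verified against the 0.9.0 Spark SQL docs.
CREATE TABLE hudi_trips (
  uuid STRING,
  rider STRING,
  fare DOUBLE,
  ts BIGINT
) USING hudi
TBLPROPERTIES (
  primaryKey = 'uuid',
  preCombineField = 'ts'
);

INSERT INTO hudi_trips VALUES ('id-1', 'rider-A', 27.70, 1630000000);

UPDATE hudi_trips SET fare = 30.00 WHERE uuid = 'id-1';

-- updates_source is an assumed staging table for the example
MERGE INTO hudi_trips AS target
USING updates_source AS source
ON target.uuid = source.uuid
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;

DELETE FROM hudi_trips WHERE uuid = 'id-1';
```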

*Query Side Improvements*
Hudi tables are now registered with Hive as Spark datasource tables,
meaning Spark SQL on these tables now uses the datasource as well,
instead of relying on the Hive fallbacks within Spark, which are
ill-maintained and cumbersome. This unlocks many optimizations, such as
the use of Hudi's own FileIndex implementation for optimized caching and
the use of the Hudi metadata table for faster listing of large tables.
We have also added support for time travel queries for the Spark
datasource.

*Writer Side Improvements*
This release has several major writer side improvements. Virtual key
support has been added to avoid populating meta fields and to leverage
existing fields to populate record keys and partition paths.
The Bulk Insert operation using the row writer is now enabled by default
for faster inserts.
Hudi's automatic cleaning of uncommitted data has been enhanced to be
performant over cloud stores. You can learn more about this new centrally
coordinated marker mechanism in this blog post.
Async Clustering support has been added to both DeltaStreamer and the
Spark Structured Streaming sink. More on this can be found in this blog
post.
Users can choose to drop fields used to generate partition paths.
Added support for a new write operation, "delete_partition", in Spark.
Users can leverage this to delete older partitions in bulk, in addition
to record-level deletes.
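A minimal sketch of how such a write might be configured follows. The operation key below is standard Hudi writer configuration; the key naming the partitions to drop, and the partition values themselves, are assumptions to be checked against the 0.9.0 configuration reference:

```properties
# Select the new bulk partition-delete operation
hoodie.datasource.write.operation=delete_partition
# Assumed key for listing the partition paths to drop (illustrative values)
hoodie.datasource.write.partitions.to.delete=2020/01/01,2020/01/02
```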
Added support for Huawei Cloud Object Storage, Baidu AFS, and Baidu BOS
storage in Hudi.
A pre-commit validator framework has been added for the Spark engine,
which can be used with DeltaStreamer and Spark datasource writers. Users
can leverage this to add any validations to be executed before committing
writes to Hudi.
A few out-of-the-box validators are available, such as
SqlQueryEqualityPreCommitValidator, SqlQueryInequalityPreCommitValidator,
and SqlQuerySingleResultPreCommitValidator.
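To give a feel for how a validator might be wired up, here is an illustrative configuration fragment. The property names below are assumptions based on the validator class names and should be verified against the 0.9.0 docs; the SQL query is hypothetical:

```properties
# Assumed key for registering pre-commit validator classes
hoodie.precommit.validators=org.apache.hudi.client.validator.SqlQueryEqualityPreCommitValidator
# Assumed key: a query whose result must be equal before and after the commit
# (e.g. guard against a write introducing null fares)
hoodie.precommit.validators.equality.sql.queries=select count(*) from <TABLE_NAME> where fare is null
```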

*Flink Integration Improvements*
The Flink writer now supports propagation of the CDC format for MOR
tables, enabled by turning on the option "changelog.enabled=true". Hudi
then persists all change flags of each record, allowing users to do
stateful computation based on these change logs.
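A sketch of enabling this from Flink SQL is below. The option "changelog.enabled" is the one named above; the table schema, path, and the other connector options are illustrative and should be checked against the 0.9.0 Flink configuration docs:

```sql
-- Illustrative Flink SQL DDL for a changelog-enabled MOR table
CREATE TABLE hudi_trips (
  uuid STRING PRIMARY KEY NOT ENFORCED,
  rider STRING,
  fare DOUBLE,
  ts BIGINT
) WITH (
  'connector' = 'hudi',
  'path' = 'file:///tmp/hudi_trips',       -- hypothetical path
  'table.type' = 'MERGE_ON_READ',
  'changelog.enabled' = 'true'             -- persist per-record change flags
);
```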
Flink writing is now close to feature parity with Spark writing, with the
addition of write operations like "bulk_insert" and "insert_overwrite",
support for non-partitioned tables, automatic cleanup of uncommitted
data, global indexing support, Hive-style partitioning, and handling of
partition path updates.
Writing also supports a new log append mode, where no records are
de-duplicated and base files are directly written for each flush.
Flink readers now support streaming reads from COW/MOR tables. Deletions
are emitted by default in streaming read mode; the downstream receives
the "DELETE" message as a Hoodie record with an empty payload.
Hive sync has been improved by adding support for different Hive versions
and asynchronous execution.
The Flink Streamer tool now supports transformers.

*DeltaStreamer Improvements*
We have enhanced the DeltaStreamer utility with 3 new sources. JDBC