Hello all,

We are excited to share that the 0.7.0 release is out - by far our biggest release yet, with lots of code moving around, new unique features, and bug fixes.
Please find more information and provide feedback here: http://hudi.apache.org/releases.html#release-070-docs

A few quick highlights:

*Clustering* <http://hudi.apache.org/releases.html#clustering>: 0.7.0 brings the ability to cluster your Hudi tables, to optimize for file sizes and storage layout. Hudi will continue to enforce file sizes during the write, as it always has. Clustering adds the flexibility to increase file sizes down the line, or to ingest data at much fresher intervals and later coalesce it into bigger files. This is very similar to the benefits of clustering delivered by cloud data warehouses <https://docs.snowflake.com/en/user-guide/tables-clustering-micropartitions.html>. We are proud to announce that such a capability is freely available in open source, for the first time, through the 0.7.0 release.

*Metadata Table*: 0.7.0 lays the foundation for storing more indexes and metadata in an internal metadata table, which is implemented using a Hudi MOR table - meaning it is compacted, cleaned, and incrementally updated like any other Hudi table. Setting hoodie.metadata.enable=true on the writer side populates the metadata table with file system listings, so operations no longer have to explicitly call fs.listStatus() on data partitions. In our testing on a large table with 250K files, the metadata table delivers a 2-3x speedup <https://github.com/apache/hudi/pull/2441#issuecomment-761742963> over the parallelized listing done by the Hudi Spark writer. Users can also leverage the metadata table on the query side: for Hive, set the hoodie.metadata.enable=true session property, and for Spark SQL on Hive-registered tables, pass --conf spark.hadoop.hoodie.metadata.enable=true. This allows the file listings for a partition to be fetched from the metadata table, instead of listing the underlying DFS partition. Support for more engines is coming.

*Java/Flink Writers*: In 0.7.0, we have additionally added Java and Flink based writers, as initial steps. Specifically, the HoodieFlinkStreamer allows a Hudi Copy-On-Write table to be built by streaming data from a Kafka topic.

*Spark3 Support*: We have added support for writing/querying data using Spark 3. Please be sure to use the scala 2.12 hudi-spark-bundle.

*Insert Overwrite/Insert Overwrite Table*: We have added these two new write operation types, predominantly to help existing batch ETL jobs, which typically overwrite entire tables/partitions each run. These operations are much cheaper than issuing upserts, since they bulk replace the target table/partitions. Check here <http://hudi.apache.org/docs/quick-start-guide.html#insert-overwrite-table> for examples.

*Incremental Query on MOR (Spark Datasource)*: The Spark datasource now has experimental support for incremental queries on MOR tables. This feature will be hardened and certified in the next release, along with a large overhaul of the Spark datasource implementation. (sshh! :))

(A few quick, illustrative usage sketches are included in the PS below.)

Thanks,
Vinoth (on behalf of the Hudi Community)
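PS: For the curious, here are a few quick sketches of what some of these features look like from the Spark datasource (spark-shell style, Scala). These are rough, illustrative examples only - the table name, schema fields, and paths are made up for illustration, and the exact config names/values should be checked against the release docs.

First, a minimal inline clustering sketch, assuming the clustering knobs described in the 0.7.0 release notes (async clustering via a separate job is also possible):

  // spark-shell --packages org.apache.hudi:hudi-spark-bundle_2.12:0.7.0 \
  //   --conf spark.serializer=org.apache.spark.serializer.KryoSerializer
  import org.apache.spark.sql.SaveMode
  import spark.implicits._  // `spark` is the session provided by spark-shell

  // Hypothetical input batch: (uuid, region, ts, fare)
  val df = Seq(("u1", "nyc", 1L, 9.5), ("u2", "sfo", 2L, 3.2)).toDF("uuid", "region", "ts", "fare")

  df.write.format("hudi").
    option("hoodie.table.name", "trips").                                          // hypothetical table
    option("hoodie.datasource.write.recordkey.field", "uuid").
    option("hoodie.datasource.write.partitionpath.field", "region").
    option("hoodie.datasource.write.precombine.field", "ts").
    option("hoodie.clustering.inline", "true").                                    // schedule + run clustering inline
    option("hoodie.clustering.inline.max.commits", "4").                           // e.g. every 4 commits
    option("hoodie.clustering.plan.strategy.target.file.max.bytes", "1073741824"). // ~1 GB target files
    option("hoodie.clustering.plan.strategy.small.file.limit", "629145600").       // coalesce files under ~600 MB
    option("hoodie.clustering.plan.strategy.sort.columns", "region,ts").           // optional layout sort
    mode(SaveMode.Append).
    save("/tmp/hudi/trips")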
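Next, enabling the metadata table. The hoodie.metadata.enable configs are the ones mentioned above; everything else (table, fields, path) is the same hypothetical setup:

  // Writer side: maintain the internal metadata table with file listings,
  // so the writer avoids fs.listStatus() on data partitions.
  df.write.format("hudi").
    option("hoodie.table.name", "trips").
    option("hoodie.datasource.write.recordkey.field", "uuid").
    option("hoodie.datasource.write.partitionpath.field", "region").
    option("hoodie.datasource.write.precombine.field", "ts").
    option("hoodie.metadata.enable", "true").
    mode(SaveMode.Append).
    save("/tmp/hudi/trips")

  // Query side (Spark SQL on Hive-registered tables): launch the session with
  //   --conf spark.hadoop.hoodie.metadata.enable=true
  // so partition file listings are served from the metadata table.
  // For Hive, set hoodie.metadata.enable=true as a session property instead.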
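Then, the new insert overwrite operations for batch-ETL-style jobs - same hypothetical table, with only the write operation type changed:

  // Bulk replace just the partitions present in this batch; use
  // "insert_overwrite_table" to replace the entire table instead.
  df.write.format("hudi").
    option("hoodie.table.name", "trips").
    option("hoodie.datasource.write.recordkey.field", "uuid").
    option("hoodie.datasource.write.partitionpath.field", "region").
    option("hoodie.datasource.write.precombine.field", "ts").
    option("hoodie.datasource.write.operation", "insert_overwrite").  // or "insert_overwrite_table"
    mode(SaveMode.Append).
    save("/tmp/hudi/trips")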
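Finally, the experimental incremental query on a MOR table via the Spark datasource. These are the standard Hudi incremental query options; the begin instant time below is a hypothetical commit timestamp:

  // Pull only the records changed since the given instant.
  val incDF = spark.read.format("hudi").
    option("hoodie.datasource.query.type", "incremental").
    option("hoodie.datasource.read.begin.instanttime", "20210101000000").  // hypothetical instant
    load("/tmp/hudi/trips")
  incDF.show()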