Hello all,

We are excited to share that the 0.7.0 release is out - by far our biggest release yet, with lots of code moving around, new unique features, and bug fixes.
Please find more information and provide feedback here: http://hudi.apache.org/releases.html#release-070-docs

A few quick highlights:

*Clustering* <http://hudi.apache.org/releases.html#clustering>: 0.7.0 brings the ability to cluster your Hudi tables, to optimize for file sizes and storage layout. Hudi will continue to enforce file sizes during the write, as it always has. Clustering adds the flexibility to increase file sizes down the line, or to ingest data at much fresher intervals and later coalesce it into bigger files. This is very similar to the benefits of clustering delivered by cloud data warehouses <https://docs.snowflake.com/en/user-guide/tables-clustering-micropartitions.html>. We are proud to announce that such a capability is freely available in open source, for the first time, through the 0.7.0 release.

*Metadata Table*: 0.7.0 lays the foundation for storing more indexes and metadata in an internal metadata table, which is implemented using a Hudi MOR table - meaning it is compacted, cleaned, and incrementally updated like any other Hudi table. Setting hoodie.metadata.enable=true on the writer side populates the metadata table with file system listings, so operations no longer have to explicitly call fs.listStatus() on data partitions. In our testing on a large table with 250K files, the metadata table delivers a 2-3x speedup <https://github.com/apache/hudi/pull/2441#issuecomment-761742963> over the parallelized listing done by the Hudi Spark writer. Users can also leverage the metadata table on the query side: for Hive, set the hoodie.metadata.enable=true session property, and for Spark SQL on Hive-registered tables, pass --conf spark.hadoop.hoodie.metadata.enable=true. This allows the file listings for a partition to be fetched from the metadata table, instead of listing the underlying DFS partition. Support for more engines is coming.

*Java/Flink Writers*: In 0.7.0, we have additionally added Java and Flink based writers, as initial steps. Specifically, the HoodieFlinkStreamer allows a Hudi Copy-On-Write table to be built by streaming data from a Kafka topic.

*Spark3 Support*: We have added support for writing/querying data using Spark 3. Please be sure to use the scala 2.12 hudi-spark-bundle.

*Insert Overwrite/Insert Overwrite Table*: We have added these two new write operation types, predominantly to help existing batch ETL jobs, which typically overwrite entire tables/partitions each run. These operations are much cheaper than issuing upserts, since they bulk replace the target table/partitions. Check here <http://hudi.apache.org/docs/quick-start-guide.html#insert-overwrite-table> for examples.

*Incremental Query on MOR (Spark Datasource)*: The Spark datasource now has experimental support for incremental queries on MOR tables. This feature will be hardened and certified in the next release, along with a large overhaul of the Spark datasource implementation. (sshh! :))

(A few quick, illustrative usage sketches are included in the PS below.)

Thanks,
Vinoth (on behalf of the Hudi Community)
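PS: For the curious, here are a few quick sketches of what some of these features look like from the Spark datasource (spark-shell style, Scala). These are rough, illustrative examples only - the table name, schema fields, and paths are made up for illustration, and the exact config names/values should be checked against the release docs.

First, a minimal inline clustering sketch, assuming the clustering knobs described in the 0.7.0 release notes (async clustering via a separate job is also possible):

  // spark-shell --packages org.apache.hudi:hudi-spark-bundle_2.12:0.7.0 \
  //   --conf spark.serializer=org.apache.spark.serializer.KryoSerializer
  import org.apache.spark.sql.SaveMode
  import spark.implicits._  // `spark` is the session provided by spark-shell

  // Hypothetical input batch: (uuid, region, ts, fare)
  val df = Seq(("u1", "nyc", 1L, 9.5), ("u2", "sfo", 2L, 3.2)).toDF("uuid", "region", "ts", "fare")

  df.write.format("hudi").
    option("hoodie.table.name", "trips").                                          // hypothetical table
    option("hoodie.datasource.write.recordkey.field", "uuid").
    option("hoodie.datasource.write.partitionpath.field", "region").
    option("hoodie.datasource.write.precombine.field", "ts").
    option("hoodie.clustering.inline", "true").                                    // schedule + run clustering inline
    option("hoodie.clustering.inline.max.commits", "4").                           // e.g. every 4 commits
    option("hoodie.clustering.plan.strategy.target.file.max.bytes", "1073741824"). // ~1 GB target files
    option("hoodie.clustering.plan.strategy.small.file.limit", "629145600").       // coalesce files under ~600 MB
    option("hoodie.clustering.plan.strategy.sort.columns", "region,ts").           // optional layout sort
    mode(SaveMode.Append).
    save("/tmp/hudi/trips")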
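Next, enabling the metadata table. The hoodie.metadata.enable configs are the ones mentioned above; everything else (table, fields, path) is the same hypothetical setup:

  // Writer side: maintain the internal metadata table with file listings,
  // so the writer avoids fs.listStatus() on data partitions.
  df.write.format("hudi").
    option("hoodie.table.name", "trips").
    option("hoodie.datasource.write.recordkey.field", "uuid").
    option("hoodie.datasource.write.partitionpath.field", "region").
    option("hoodie.datasource.write.precombine.field", "ts").
    option("hoodie.metadata.enable", "true").
    mode(SaveMode.Append).
    save("/tmp/hudi/trips")

  // Query side (Spark SQL on Hive-registered tables): launch the session with
  //   --conf spark.hadoop.hoodie.metadata.enable=true
  // so partition file listings are served from the metadata table.
  // For Hive, set hoodie.metadata.enable=true as a session property instead.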
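Then, the new insert overwrite operations for batch-ETL-style jobs - same hypothetical table, with only the write operation type changed:

  // Bulk replace just the partitions present in this batch; use
  // "insert_overwrite_table" to replace the entire table instead.
  df.write.format("hudi").
    option("hoodie.table.name", "trips").
    option("hoodie.datasource.write.recordkey.field", "uuid").
    option("hoodie.datasource.write.partitionpath.field", "region").
    option("hoodie.datasource.write.precombine.field", "ts").
    option("hoodie.datasource.write.operation", "insert_overwrite").  // or "insert_overwrite_table"
    mode(SaveMode.Append).
    save("/tmp/hudi/trips")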
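Finally, the experimental incremental query on a MOR table via the Spark datasource. These are the standard Hudi incremental query options; the begin instant time below is a hypothetical commit timestamp:

  // Pull only the records changed since the given instant.
  val incDF = spark.read.format("hudi").
    option("hoodie.datasource.query.type", "incremental").
    option("hoodie.datasource.read.begin.instanttime", "20210101000000").  // hypothetical instant
    load("/tmp/hudi/trips")
  incDF.show()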