Dear community,

Here are the Hudi community bi-weekly updates for 2022-02-28 ~ 2022-03-13, covering features, bug fixes, and tests.

=======================================
Features

[Spark] Adding Datatable validator tool [1]
[Core] Add support for "marker delete" in hudi-cli [2]
[Flink] RFC-35 Part-1 Support bucket index in Flink writer [3]
[Spark] RFC-27: Data skipping index to improve query performance [4]
[Spark] Support Clustering Command Based on Call Procedure Command for Spark SQL [5]
[Spark] [RFC-47] Add Call Procedure Command for Spark SQL [6]
[Core] Introduce DeleteSupportSchemaPostProcessor to support adding _hoodie_is_deleted column to schema [7]
[Core] Introduce JsonkafkaSourceProcessor to support data preprocessing before it is transformed to DataSet [8]
[Flink] Add DFS based message queue for Flink writer [part3] [9]
[Core] Support querying a table as of a savepoint [10]
[Core] Introduce ChainedSchemaPostProcessor to support setting multiple processors at once [11]
[Core] Introduce DropColumnSchemaPostProcessor to support dropping columns from schema [12]
[Core] [RFC-42] RFC for consistent hashing index [13]
[Core] Support savepoints command based on Call Procedure Command [14]

[1] https://issues.apache.org/jira/browse/HUDI-3497
[2] https://issues.apache.org/jira/browse/HUDI-3441
[3] https://issues.apache.org/jira/browse/HUDI-3315
[4] https://issues.apache.org/jira/browse/HUDI-2973
[5] https://issues.apache.org/jira/browse/HUDI-3445
[6] https://issues.apache.org/jira/browse/HUDI-3161
[7] https://issues.apache.org/jira/browse/HUDI-3520
[8] https://issues.apache.org/jira/browse/HUDI-3525
[9] https://issues.apache.org/jira/browse/HUDI-2677
[10] https://issues.apache.org/jira/browse/HUDI-3221
[11] https://issues.apache.org/jira/browse/HUDI-3568
[12] https://issues.apache.org/jira/browse/HUDI-3522
[13] https://issues.apache.org/jira/browse/HUDI-2999
[14] https://issues.apache.org/jira/browse/HUDI-3501

=======================================
Bugs

[Core] Fixing Kafka key and value serializer value type from class to string [1]
[Core] Adding validation to dataframe schema to ensure reserved field does not have a different data type [2]
[Core] Roll back insert data appended to log file when using HBase Index [3]
[Core] Fix String conversion issue and override putAll method in TypedProperties.java [4]
[Core] Fix log file reader for S3 with hadoop-aws 2.7.x [5]
[CLI] Avoid passing an empty-string Spark master to hudi-cli [6]
[Core] Save timeout option for remote RemoteFileSystemView [7]
[Core] Add validation of column stats and bloom filters in HoodieMetadataTableValidator [8]
[Core] Implement record iterator for HoodieDataBlock [9]
[Flink] In CompactFunction, set up the write schema each time with the latest schema [10]
[Core] Make schema registry URLs configurable with MTDS [11]
[Core] Fixing "populate meta fields" update to metadata table [12]
[Core] Fix async clustering not working when the user specifies the key "hoodie.datasource.clustering.async.enable" directly [13]
[Flink] Add reader merge memory option for Flink [14]
[Core] Fixing timeline server for repeated refreshes [15]
[Core] Fixing Hive getSchema for RT tables, addressing different partitions having different schemas [16]
[Core] Improve HoodieMergedLogRecordScanner to avoid putting unnecessary hoodie records [17]
[Core] Setting commit preserve metadata to true for compaction [18]
[Core] Avoid including whole MultipleSparkJobExecutionStrategy object into the closure for Spark to serialize [19]
[Core] Make sure Metadata Table records are updated appropriately on HDFS [20]
[Core] Support setting --sparkMaster for MDT CLI [21]
[Core] Configuring timeline refreshes based on latest commit [22]
[Flink] Flink CleanFunction executes clean on initialization [23]
[Build] Improve Maven module configs for different Spark profiles [24]
[Core] HoodieData for metadata index records; BloomFilter construction from index based on the type param [25]
[Core] Sync column comments while syncing a Hive table [26]
[Core] Make sure BaseFileOnlyViewRelation only reads projected columns [27]
[Core] Fixing NULL schema provider for empty batch [28]
[Core] Refactor HoodieCommonUtils to make code more reasonable [29]
[Core] Make sure Column Stats does not fail in case it fails to load previous Index Table state [30]
[Core] Fix NPE of DefaultHoodieRecordPayload if Property is empty [31]
[Core] Re-use rollback instant for rolling back of clustering and compaction if rollback failed mid-way [32]
[Core] Restore TypedProperties and flush checksum in table config [33]
[Core] Fix MarkerBasedRollbackStrategy NoSuchElementException [34]

[1] https://issues.apache.org/jira/browse/HUDI-3521
[2] https://issues.apache.org/jira/browse/HUDI-3018
[3] https://issues.apache.org/jira/browse/HUDI-2917
[4] https://issues.apache.org/jira/browse/HUDI-3528
[5] https://issues.apache.org/jira/browse/HUDI-3341
[6] https://issues.apache.org/jira/browse/HUDI-3450
[7] https://issues.apache.org/jira/browse/HUDI-3418
[8] https://issues.apache.org/jira/browse/HUDI-3465
[9] https://issues.apache.org/jira/browse/HUDI-3516
[10] https://issues.apache.org/jira/browse/HUDI-2631
[11] https://issues.apache.org/jira/browse/HUDI-3264
[12] https://issues.apache.org/jira/browse/HUDI-3544
[13] https://issues.apache.org/jira/browse/HUDI-3548
[14] https://issues.apache.org/jira/browse/HUDI-3460
[15] https://issues.apache.org/jira/browse/HUDI-2761
[16] https://issues.apache.org/jira/browse/HUDI-3130
[17] https://issues.apache.org/jira/browse/HUDI-3069
[18] https://issues.apache.org/jira/browse/HUDI-3213
[19] https://issues.apache.org/jira/browse/HUDI-3561
[20] https://issues.apache.org/jira/browse/HUDI-3365
[21] https://issues.apache.org/jira/browse/HUDI-2747
[22] https://issues.apache.org/jira/browse/HUDI-3576
[23] https://issues.apache.org/jira/browse/HUDI-3573
[24] https://issues.apache.org/jira/browse/HUDI-3574
[25] https://issues.apache.org/jira/browse/HUDI-3356
[26] https://issues.apache.org/jira/browse/HUDI-3383
[27] https://issues.apache.org/jira/browse/HUDI-3396
[28] https://issues.apache.org/jira/browse/HUDI-3595
[29] https://issues.apache.org/jira/browse/HUDI-3567
[30] https://issues.apache.org/jira/browse/HUDI-3513
[31] https://issues.apache.org/jira/browse/HUDI-3592
[32] https://issues.apache.org/jira/browse/HUDI-3556
[33] https://issues.apache.org/jira/browse/HUDI-3593
[34] https://issues.apache.org/jira/browse/HUDI-3583

=======================================
Tests

[Tests] Refactor HoodieTestDataGenerator to provide reproducible builds [1]
[Tests] Add UT to verify HoodieRealtimeFileSplit serde [2]
[Tests] Skip integration test modules by default [3]
[Tests] Add Trino queries in integration tests [4]
[Tests] Use HoodieTestDataGenerator#TRIP_SCHEMA as example schema in TestSchemaPostProcessor [5]

[1] https://issues.apache.org/jira/browse/HUDI-3469
[2] https://issues.apache.org/jira/browse/HUDI-3348
[3] https://issues.apache.org/jira/browse/HUDI-3584
[4] https://issues.apache.org/jira/browse/HUDI-3586
[5] https://issues.apache.org/jira/browse/HUDI-3575

Best,
Leesf