GitHub user geserdugarov edited a discussion: Spark DataSource V2 read and write benchmarks?
Integration of Spark Datasource V2 was done in [RFC-38](https://github.com/apache/hudi/pull/3964). However, there were multiple issues with advertising a Hudi table as V2 without actually implementing certain APIs, and with using a custom relation rule to fall back to the V1 API. As a result, the current implementation of `HoodieCatalog` and `Spark3DefaultSource` returns a `V1Table` instead of `HoodieInternalV2Table`, in order to [address performance regressions](https://github.com/apache/hudi/pull/5737).

The performance issues were not caught in the initial PR because there was no proper benchmarking for such changes. Therefore, to restart this work, it is important to decide first how to benchmark the changes. Among other things, Datasource V1 allows custom logic, such as the use of Hudi indexes, which is not straightforward to implement in Datasource V2, so the benchmarking scenarios need to cover cases like this.

If anybody has already gone down this path, please share your insights. Any suggestions about scenarios that should be considered are also welcome.

GitHub link: https://github.com/apache/hudi/discussions/13955
