Hi Nicolas,

Thanks for bringing up the discussion. Spark's MOR snapshot relation already provides different readers for different kinds of splits, such as a base-file-only split and a regular split with both base and log files.
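To illustrate the idea only, here is a minimal sketch of per-split reader selection. It is not the Hudi code: the `Split` class, the `choose_reader` function, and the reader names are hypothetical, and the actual logic lives in the Scala files linked in this message.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Split:
    """Hypothetical stand-in for a MOR file split (not a Hudi class)."""
    base_file: Optional[str]
    log_files: List[str] = field(default_factory=list)


def choose_reader(split: Split) -> str:
    """Pick a reader per split rather than per table or per partition."""
    if split.base_file is not None and not split.log_files:
        # Base-file-only split: no merging is needed, so a plain
        # Parquet-style read path can be used.
        return "base_file_reader"
    # Split with log files: records must be merged at read time.
    return "merging_reader"
```

The decision is made per split, so a partition without log files would only ever hit the base-file branch in this sketch.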
https://github.com/apache/hudi/blob/master/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/MergeOnReadSnapshotRelation.scala#L124
https://github.com/apache/hudi/blob/master/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieMergeOnReadRDD.scala#L93

Jon is working on a new Hudi Spark integration that relies on a new implementation of the ParquetFileFormat, so that Spark optimizations can kick in for MOR; see the draft RFC here: https://github.com/apache/hudi/pull/9235. Feel free to give feedback there.

Best,
- Ethan

On Sat, Jul 22, 2023 at 1:23 PM Nicolas Paris <nicolas.pa...@riseup.net> wrote:

> Just to clarify: the read path described here concerns RT views only,
> not RO.
>
> On July 22, 2023 8:14:09 PM UTC, Nicolas Paris <nicolas.pa...@riseup.net>
> wrote:
> > I have been playing with the StarRocks MOR Hudi reader recently and it
> > does an amazing job. It has two read paths:
> >
> > 1. For partitions with log files, use the merging logic
> > 2. For partitions with only parquet files, use the COW read logic
> >
> > As you know, the first path is slow because it has merging overhead and
> > can't provide any parquet benefit (pushdown, blooms...). In contrast,
> > the second path is blazing fast.
> >
> > MOR comes with tons of compaction rules, and having such behavior makes
> > hot/cold partition management possible.
> >
> > One particular case is GDPR, where old records are usually
> > deleted/masked on a random distribution, while new partitions are free
> > of changes.
> >
> > So far Spark does not distinguish between partitions with log files and
> > log-free partitions, and I suspect adding such an improvement would
> > make MOR tables more performant.
> >
> > I would be glad to work on such a feature, so please give early feedback
> > if there is any blocker.