Hi Nicolas,

Thanks for bringing up the discussion.  Spark's MOR snapshot relation
already uses different readers for different kinds of splits, such as
base-file-only splits and regular splits with both base and log files.

https://github.com/apache/hudi/blob/master/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/MergeOnReadSnapshotRelation.scala#L124
https://github.com/apache/hudi/blob/master/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieMergeOnReadRDD.scala#L93
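
As a rough illustration of the idea only (a sketch in Python, not Hudi's
actual code; `FileSplit` and `choose_reader` are hypothetical names, and
sending log-only splits to the merging reader is an assumption), the
per-split reader selection could look like this:

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FileSplit:
    """Hypothetical stand-in for one MOR file split (not a Hudi class)."""
    base_file: Optional[str]
    log_files: List[str] = field(default_factory=list)


def choose_reader(split: FileSplit) -> str:
    """Pick a read path for a split based on the files it contains."""
    if split.base_file is not None and not split.log_files:
        # Only a base file: a plain Parquet read is enough, no merging needed.
        return "base-file-only"
    # Log files are present (assumed: with or without a base file),
    # so the records have to go through the merging reader.
    return "merge"


if __name__ == "__main__":
    splits = [
        FileSplit("part-0001.parquet"),
        FileSplit("part-0002.parquet", [".part-0002.log.1"]),
    ]
    for s in splits:
        print(s.base_file, "->", choose_reader(s))
```

The point is that the choice is made per split rather than per table, so
a split without log files never pays the merging cost.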

Jon is working on a new Hudi Spark integration that relies on a new
implementation of the ParquetFileFormat, so that Spark optimizations can
kick in for MOR; see the draft RFC here:
https://github.com/apache/hudi/pull/9235.
Feel free to give feedback there.

Best,
- Ethan

On Sat, Jul 22, 2023 at 1:23 PM Nicolas Paris <nicolas.pa...@riseup.net>
wrote:

> Just to clarify: the read paths described here concern RT views only,
> not RO.
>
> On July 22, 2023 8:14:09 PM UTC, Nicolas Paris <nicolas.pa...@riseup.net>
> wrote:
> >I have been playing with the StarRocks MOR Hudi reader recently, and it
> does an amazing job. It has two read paths:
> >
> >1. For partitions with log files, use the merging logic
> >2. For partitions with only Parquet files, use the COW read logic
> >
> >As you know, the first path is slow because it has merging overhead and
> can't benefit from any Parquet optimization (pushdown, bloom filters...).
> In contrast, the second path is blazing fast.
> >
> >MOR comes with tons of compaction rules, and having such behavior makes
> hot/cold partition management possible.
> >
> >One particular case is GDPR, where old records are usually deleted/masked
> in a random distribution, while new partitions are free of changes.
> >
> >So far Spark does not distinguish between partitions with log files and
> log-free partitions, and I suspect adding such an improvement would make
> MOR tables more performant.
> >
> >I would be glad to work on such a feature, so please give early feedback
> if there is any blocker.
>
