[ https://issues.apache.org/jira/browse/HUDI-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Udit Mehrotra updated HUDI-1879:
--------------------------------
    Description: 
The *Read as DataSource Tables* and *HoodieFileIndex* implementations that went in [https://github.com/apache/hudi/pull/2283] and [https://github.com/apache/hudi/pull/2651] have introduced a couple of major regressions for *Merge on Read* tables:
 * *_ro tables returning snapshot results*: Since we now use the Hudi DataSource directly to query both *_ro* and *_rt* MOR tables, the DataSource cannot tell read optimized tables from real time tables, because it has no way to check the *table name*. In both cases *QUERY_TYPE_OPT_KEY* defaults to *snapshot*, so *MergeOnReadSnapshotRelation* is used for the query and snapshot results are always returned.
 * *Partition pruning does not work for realtime queries*: *MergeOnReadSnapshotRelation* uses *allFiles* directly and always fetches all the files without doing any partition pruning. This is a regression for Spark SQL real time queries, because partition pruning previously worked via the InputFormat for these queries. It will therefore degrade the performance of rt queries.

  was:
*HoodieFileIndex* implementation that went in [https://github.com/apache/hudi/pull/2651] has introduced a couple of major regressions for *Merge on Read* tables:
 * *_ro* *tables returning Snapshot results*: Since we are directly using DataSource now to query both

> Spark DataSource tables/HoodieFileIndex issues for Merge On Read
> ----------------------------------------------------------------
>
>                 Key: HUDI-1879
>                 URL: https://issues.apache.org/jira/browse/HUDI-1879
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: Spark Integration
>            Reporter: Udit Mehrotra
>            Priority: Blocker
>              Labels: sev:critical

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
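The two regressions described in this issue can be illustrated with a small toy model. This is only a sketch and not Hudi code: the functions `resolve_query_type`, `choose_relation`, and `list_files`, the partition layout, and the "read-optimized relation" placeholder string are all hypothetical. Only the option key `hoodie.datasource.query.type` and its values `snapshot` and `read_optimized` are taken from Hudi's read options, and the model assumes, as the issue states, that the query type falls back to `snapshot` when it is not set.

```python
# Toy model of the two regressions. The function names and file layout are
# hypothetical; only the option key/value strings mirror Hudi's read options.
QUERY_TYPE_KEY = "hoodie.datasource.query.type"
SNAPSHOT = "snapshot"
READ_OPTIMIZED = "read_optimized"


def resolve_query_type(options):
    """Return the requested query type, falling back to snapshot."""
    return options.get(QUERY_TYPE_KEY, SNAPSHOT)


def choose_relation(options):
    """Pick a relation from the options alone; the table name is not visible here."""
    if resolve_query_type(options) == SNAPSHOT:
        return "MergeOnReadSnapshotRelation"
    # Placeholder label, not a real Hudi class name.
    return "read-optimized relation (base files only)"


def list_files(all_files, partition_filter=None):
    """Return the files to scan; with no filter, every file is scanned."""
    if partition_filter is None:
        return list(all_files)
    return [f for f in all_files if partition_filter(f.split("/")[0])]


# Regression 1: the _ro and the _rt table both arrive with the same (empty)
# options, so both resolve to the snapshot relation.
ro_relation = choose_relation({})
rt_relation = choose_relation({})

# Only an explicitly set query type selects the read-optimized path.
explicit_ro = choose_relation({QUERY_TYPE_KEY: READ_OPTIMIZED})

# Regression 2: scanning all files versus scanning only the matching partition.
files = ["2021-05-01/f1.parquet", "2021-05-02/f2.parquet", "2021-05-02/f3.parquet"]
unpruned = list_files(files)
pruned = list_files(files, lambda partition: partition == "2021-05-02")
```

In this model `ro_relation` and `rt_relation` are identical because nothing in the options distinguishes the two tables, and `unpruned` contains every file even when the query only needs one partition.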