[ https://issues.apache.org/jira/browse/HUDI-3081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinoth Chandar updated HUDI-3081:
---------------------------------
    Fix Version/s: 0.11.0

> Revisiting Read Path Infra across Query Engines
> -----------------------------------------------
>
>                 Key: HUDI-3081
>                 URL: https://issues.apache.org/jira/browse/HUDI-3081
>             Project: Apache Hudi
>          Issue Type: Epic
>            Reporter: Alexey Kudinkin
>            Assignee: Alexey Kudinkin
>            Priority: Blocker
>             Fix For: 0.11.0
>
>
> Currently, our read-path infrastructure is implemented separately for each
> Query Engine, with the same flow replicated multiple times:
> * Hive leverages a hierarchy based on the `InputFormat` class
> * Spark leverages a hierarchy based on `SnapshotRelation`
> This leads to substantial duplication of virtually identical flows, which
> are unfortunately now diverging because their lifecycles (bug fixes, etc.)
> are out of sync.
> h3. Proposal
>
> *Phase 1: Abstracting Common Functionality*
>
> _T-shirt_: 1-1.5 weeks
> _Goal_: Abstract the following common items to avoid duplicating the
> complex sequences across Engines
> * Unify Hive’s RecordReaders (`RealtimeCompactedRecordReader`,
> {{RealtimeUnmergedRecordReader}})
> ** _These Readers should differ only in the way they handle the payload;
> everything else should remain constant_
> * Abstract within a common component (name TBD)
> ** Listing current file-slices at the requested instant (handling the
> timeline)
> ** Creating a Record Iterator for the provided file-slice
>
> REF
> [https://app.clickup.com/18029943/v/dc/h67bq-1900/h67bq-6680]



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
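
For illustration only, the Phase 1 split described in the issue could be sketched roughly as below: one shared component lists the file-slices visible at a requested instant and creates a record iterator for a slice, while the only pluggable piece is how the payload is handled. This is a language-neutral toy model written in Python, not Hudi code. `FileSlice`, `CommonReadPath`, `list_file_slices`, `record_iterator`, `merged` and `unmerged` are all hypothetical names (the issue itself says the component name is TBD), and the merge rule shown (a log record replaces the base record with the same key) is an assumed simplification.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterator, List, Tuple

# Toy stand-in for a real record: (record key, payload).
Record = Tuple[str, str]


@dataclass
class FileSlice:
    """Hypothetical model: a base file plus the log records written after it."""
    file_id: str
    instant: str  # commit time that produced this slice (fixed-width string)
    base_records: List[Record]
    log_records: List[Record] = field(default_factory=list)


# The only reader-specific piece: how base and log payloads are combined.
PayloadHandler = Callable[[List[Record], List[Record]], Iterator[Record]]


def merged(base: List[Record], log: List[Record]) -> Iterator[Record]:
    """Compacted view (assumed rule): a log record replaces the base record
    that has the same key."""
    latest: Dict[str, str] = dict(base)
    latest.update(log)
    return iter(sorted(latest.items()))


def unmerged(base: List[Record], log: List[Record]) -> Iterator[Record]:
    """Unmerged view: emit base and log records as-is, without combining."""
    return iter(list(base) + list(log))


class CommonReadPath:
    """Hypothetical shared component; every engine would call into this."""

    def __init__(self, slices: List[FileSlice]):
        self._slices = slices

    def list_file_slices(self, instant: str) -> List[FileSlice]:
        """Return, per file id, the latest slice committed at or before
        `instant` (the timeline handling)."""
        latest: Dict[str, FileSlice] = {}
        for s in self._slices:
            if s.instant > instant:
                continue  # not yet visible at the requested instant
            current = latest.get(s.file_id)
            if current is None or s.instant > current.instant:
                latest[s.file_id] = s
        return [latest[k] for k in sorted(latest)]

    def record_iterator(self, file_slice: FileSlice,
                        handler: PayloadHandler) -> Iterator[Record]:
        """Create a record iterator; only `handler` varies between readers."""
        return handler(file_slice.base_records, file_slice.log_records)


if __name__ == "__main__":
    read_path = CommonReadPath([
        FileSlice("f1", "001", [("k1", "a"), ("k2", "b")], [("k2", "b2")]),
        FileSlice("f2", "002", [("k9", "z")]),
    ])
    for fs in read_path.list_file_slices("002"):
        print(fs.file_id, list(read_path.record_iterator(fs, merged)))
```

In this sketch both the compacted and the unmerged reader go through the same listing and iteration code and differ only in the handler they pass, which mirrors the issue's statement that the readers should differ only in how they handle the payload.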