[ 
https://issues.apache.org/jira/browse/HUDI-3396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3396:
-----------------------------
    Sprint: Hudi-Sprint-Feb-7, Hudi-Sprint-Feb-14, Hudi-Sprint-Feb-22, 
Hudi-Sprint-Mar-01, Hudi-Sprint-Mar-07  (was: Hudi-Sprint-Feb-7, 
Hudi-Sprint-Feb-14, Hudi-Sprint-Feb-22, Hudi-Sprint-Mar-01)

> Make sure Spark reads only Projected Columns for both MOR/COW
> -------------------------------------------------------------
>
>                 Key: HUDI-3396
>                 URL: https://issues.apache.org/jira/browse/HUDI-3396
>             Project: Apache Hudi
>          Issue Type: Bug
>            Reporter: Alexey Kudinkin
>            Assignee: Alexey Kudinkin
>            Priority: Blocker
>              Labels: performance, pull-request-available, spark
>             Fix For: 0.11.0
>
>         Attachments: Screen Shot 2022-02-08 at 4.58.12 PM.png
>
>
> The Spark Relation impl for MOR tables seems to have the following issues:
>  * `requiredSchemaParquetReader` still uses the full table schema, so 
> we're fetching *all* columns from Parquet (even though the query might 
> be projecting just a handful) 
>  * `fullSchemaParquetReader` always reads the full table schema, 
> (presumably) to be able to do merging, which might access arbitrary key 
> fields. This seems superfluous, since we could fetch only the fields 
> designated as `PRECOMBINE_FIELD_NAME` and `RECORDKEY_FIELD_NAME`. We won't 
> be able to do that if either of the following is true:
>  ** Virtual Keys are used (key-gen will require whole payload)
>  ** Non-trivial merging strategy is used requiring whole record payload
>  * We don't seem to properly push down data filters to the Parquet reader 
> when reading the whole table
>  
> Action items:
>  * Make sure COW tables _only_ read projected columns
>  * Make sure MOR tables _only_ read projected columns, except when either of 
> the following holds:
>  ** Non-standard Record Payload class is used (for merging) 
>  ** Virtual keys are used
>  * +Write tests for Spark DataSource asserting that only projected columns 
> are being fetched+
>  
> !Screen Shot 2022-02-08 at 4.58.12 PM.png!
>  
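
The projection rule in the quoted description can be sketched as a small pure function. This is an illustrative sketch only, not Hudi's implementation: the helper name `columns_to_read`, its parameters, and the sample column names are all hypothetical, and it assumes the record key and precombine fields are ordinary columns of the table.

```python
from typing import Iterable, List


def columns_to_read(
    table_columns: Iterable[str],
    projected_columns: Iterable[str],
    record_key_field: str,
    precombine_field: str,
    is_mor: bool,
    uses_virtual_keys: bool = False,
    uses_custom_payload: bool = False,
) -> List[str]:
    """Hypothetical helper: which columns a base-file reader has to fetch."""
    table = list(table_columns)
    projected = set(projected_columns)

    unknown = projected - set(table)
    if unknown:
        raise ValueError(f"unknown projected columns: {sorted(unknown)}")

    if not is_mor:
        # COW: nothing is merged at read time, so the projection is enough.
        required = projected
    elif uses_virtual_keys or uses_custom_payload:
        # Key generation or a non-standard payload may touch any field,
        # so fall back to the whole record.
        required = set(table)
    else:
        # MOR with default merging: the projection plus the two fields
        # that merging relies on.
        required = projected | {record_key_field, precombine_field}

    # Keep the table's column order so the result is deterministic.
    return [c for c in table if c in required]


if __name__ == "__main__":
    table = ["id", "ts", "name", "city", "amount"]
    print(columns_to_read(table, ["amount"], "id", "ts", is_mor=False))
    print(columns_to_read(table, ["amount"], "id", "ts", is_mor=True))
    print(columns_to_read(table, ["amount"], "id", "ts", is_mor=True,
                          uses_virtual_keys=True))
```

A test along the lines of the last action item could compare the columns actually requested from the reader against the output of such a function for each table type and configuration.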



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
