[ https://issues.apache.org/jira/browse/HUDI-3396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Raymond Xu updated HUDI-3396:
-----------------------------
    Sprint: Hudi-Sprint-Feb-7, Hudi-Sprint-Feb-14, Hudi-Sprint-Feb-22, Hudi-Sprint-Mar-01, Hudi-Sprint-Mar-07  (was: Hudi-Sprint-Feb-7, Hudi-Sprint-Feb-14, Hudi-Sprint-Feb-22, Hudi-Sprint-Mar-01)

> Make sure Spark reads only Projected Columns for both MOR/COW
> -------------------------------------------------------------
>
>                 Key: HUDI-3396
>                 URL: https://issues.apache.org/jira/browse/HUDI-3396
>             Project: Apache Hudi
>          Issue Type: Bug
>            Reporter: Alexey Kudinkin
>            Assignee: Alexey Kudinkin
>            Priority: Blocker
>              Labels: performance, pull-request-available, spark
>             Fix For: 0.11.0
>
>         Attachments: Screen Shot 2022-02-08 at 4.58.12 PM.png
>
>
> The Spark Relation implementation for MOR tables seems to have the
> following issues:
> * `requiredSchemaParquetReader` still uses the full table schema, which
> means we fetch *all* columns from Parquet (even though the query might
> project just a handful)
> * `fullSchemaParquetReader` always reads the full table, presumably to be
> able to do merging, which might access arbitrary key fields. This seems
> superfluous, since it would be enough to fetch only the fields designated
> as `PRECOMBINE_FIELD_NAME` and `RECORDKEY_FIELD_NAME`. We won't be able
> to do that if either of the following is true:
> ** Virtual keys are used (the key generator will require the whole payload)
> ** A non-trivial merging strategy is used that requires the whole record
> payload
> * We don't seem to properly push down data filters to the Parquet reader
> when reading the whole table
>
> Action items:
> * Make sure COW tables _only_ read projected columns
> * Make sure MOR tables _only_ read projected columns, except when either
> of the following holds:
> ** A non-standard Record Payload class is used (for merging)
> ** Virtual keys are used
> * +Write tests for Spark DataSource asserting that only projected columns
> are being fetched+
>
> !Screen Shot 2022-02-08 at 4.58.12 PM.png!
>



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
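
For illustration only, the column-selection rule described in the issue could be sketched roughly as below. This is not Hudi code: the function `columns_to_read`, its parameters, and the `"COW"`/`"MOR"` string tags are hypothetical names chosen for the sketch, and it assumes the record key and precombine field are each a single top-level column.

```python
from typing import List, Sequence


def columns_to_read(
    table_schema: Sequence[str],
    projected: Sequence[str],
    record_key: str,
    precombine: str,
    table_type: str,
    virtual_keys: bool = False,
    standard_payload: bool = True,
) -> List[str]:
    """Return the columns a reader would need, in table-schema order.

    Hypothetical helper: it only mirrors the rule stated in the issue,
    not the actual Hudi implementation.
    """
    unknown = [c for c in projected if c not in table_schema]
    if unknown:
        raise ValueError(f"columns not in table schema: {unknown}")

    needed = set(projected)
    if table_type == "MOR":
        # Merging may need the whole record when keys are virtual or when
        # a non-standard payload class is used: fall back to all columns.
        if virtual_keys or not standard_payload:
            return list(table_schema)
        # Otherwise merging only needs the key and the precombine field
        # on top of whatever the query projects.
        needed.update({record_key, precombine})
    elif table_type != "COW":
        raise ValueError(f"unknown table type: {table_type}")

    return [c for c in table_schema if c in needed]


schema = ["id", "ts", "name", "price", "city"]
cow_cols = columns_to_read(schema, ["price"], "id", "ts", "COW")
mor_cols = columns_to_read(schema, ["price"], "id", "ts", "MOR")
mor_virtual = columns_to_read(schema, ["price"], "id", "ts", "MOR", virtual_keys=True)
```

A test along the lines of the last action item would then compare the columns actually fetched by the reader against such an expected set.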