[GitHub] [hudi] danny0405 commented on pull request #8082: [HUDI-5868] Upgrade Spark to 3.3.2

via GitHub Tue, 02 May 2023 22:50:12 -0700


danny0405 commented on PR #8082:
URL: https://github.com/apache/hudi/pull/8082#issuecomment-1532480189


   Okay, I finally found the reason for the failure in test 
`TestNestedSchemaPruningOptimization`.
   
   It is because we hard-code the Parquet vectorized reader in the base relation: 
https://github.com/apache/hudi/blob/620f39a5fd5e1392819d530ea963f866c3f1c301/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/BaseFileOnlyRelation.scala#L78
 This was done in https://github.com/apache/hudi/pull/5168.
   
   After we upgrade to Spark 3.3.2, whole-stage codegen is triggered for the 
Spark physical plan, and the whole-stage codegen generates its code under a 
per-row assumption (the vectorized reader is not supported there).
   
   I have created a patch to fix this (it also fixes the compile error in the 
hudi-sync module). The patch disables the vectorized reader, but I'm not sure 
how it would impact performance, so I would ping @xiarixiaoyao for a review ~
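   
   For anyone who wants to compare performance locally, here is a minimal 
sketch of turning the vectorized Parquet reader off at the session level. It is 
not the content of the patch. It only uses the standard Spark SQL option 
`spark.sql.parquet.enableVectorizedReader`; the object name, app name and 
local master are placeholders, and a relation that hard-codes this option may 
still override the session value.
   
   ```scala
   import org.apache.spark.sql.SparkSession

   // Hypothetical standalone sketch, not taken from the attached patch.
   object DisableVectorizedReaderSketch {
     def main(args: Array[String]): Unit = {
       val spark = SparkSession.builder()
         .appName("disable-vectorized-reader-sketch") // placeholder name
         .master("local[1]")                          // assumption: local run
         // Standard Spark SQL option; "false" makes Parquet reads row-based.
         .config("spark.sql.parquet.enableVectorizedReader", "false")
         .getOrCreate()

       // Print the effective session value to confirm the setting was applied.
       println(spark.conf.get("spark.sql.parquet.enableVectorizedReader"))

       spark.stop()
     }
   }
   ```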
   
   
[5868.patch.zip](https://github.com/apache/hudi/files/11379613/5868.patch.zip)
   


