danny0405 commented on PR #8082: URL: https://github.com/apache/hudi/pull/8082#issuecomment-1532480189
Okay, I finally found the reason for the failure in test `TestNestedSchemaPruningOptimization`. It is because we hard-coded the Parquet vectorized reader in the base relation: https://github.com/apache/hudi/blob/620f39a5fd5e1392819d530ea963f866c3f1c301/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/BaseFileOnlyRelation.scala#L78 (introduced in https://github.com/apache/hudi/pull/5168). After we upgraded to Spark 3.3.2, whole-stage codegen is triggered in the Spark physical plan, and whole-stage codegen generates its code on a per-row assumption (no vectorized reader supported).

I have created a patch to fix this (it also fixes the compile error in the hudi-sync module). The patch disables the vectorized reader, but I'm not sure how it would impact performance, so I would ping @xiarixiaoyao for a review ~ [5868.patch.zip](https://github.com/apache/hudi/files/11379613/5868.patch.zip)
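
For anyone who wants to check the performance impact locally, here is a minimal sketch of turning the Parquet vectorized reader off at the session level. This is not the content of the attached patch (which changes the relation code itself); it only uses the standard Spark SQL option `spark.sql.parquet.enableVectorizedReader`. The object name, app name, and local master are made up for illustration, and whether Hudi's relation honors the session setting depends on the hard-coded line referenced above.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical driver object, for illustration only.
object VectorizedReaderToggle {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("vectorized-reader-toggle") // assumed name
      .master("local[1]")                  // assumed local run
      // Standard Spark SQL option: fall back to the row-based Parquet reader.
      .config("spark.sql.parquet.enableVectorizedReader", "false")
      .getOrCreate()

    // Print the effective setting so the run can be verified.
    println(spark.conf.get("spark.sql.parquet.enableVectorizedReader"))

    spark.stop()
  }
}
```

Running the same query with the option set to `true` and then `false` would give a rough idea of the read-side cost of disabling vectorization.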