[ https://issues.apache.org/jira/browse/HUDI-5688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yue Zhang updated HUDI-5688:
----------------------------
    Fix Version/s: 0.14.0
                       (was: 0.13.1)

> schema field of EmptyRelation subtype of BaseRelation should not be null
> ------------------------------------------------------------------------
>
>                 Key: HUDI-5688
>                 URL: https://issues.apache.org/jira/browse/HUDI-5688
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: core
>            Reporter: Pramod Biligiri
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.14.0
>
>         Attachments: 1-userSpecifiedSchema-is-null.png, 2-empty-relation.png, 3-table-schema-will-not-resolve.png, 4-resolve-schema-returns-null.png, Main.java, pom.xml
>
>
> If there are no completed instants in the table, and there is also no
> user-defined schema for it (as represented by the userSpecifiedSchema field in
> DataSource.scala), then the EmptyRelation returned by
> DefaultSource.createRelation has its schema set to null. This breaks the
> contract of Spark's BaseRelation, where the schema is a StructType and is not
> expected to be null.
> Module versions: current apache-hudi master (commit hash
> abe26d4169c04da05b99941161621876e3569e96) built with spark3.2 and scala-2.12.
> The following Hudi session reproduces the above issue:
>
>   spark.read.format("hudi")
>     .option("hoodie.datasource.query.type", "incremental")
>     .load("SOME_HUDI_TABLE_WITH_NO_COMPLETED_INSTANTS_OR_USER_SPECIFIED_SCHEMA")
>
>   java.lang.NullPointerException
>     at org.apache.spark.sql.catalyst.util.CharVarcharUtils$.replaceCharVarcharWithStringInSchema(CharVarcharUtils.scala:41)
>     at org.apache.spark.sql.execution.datasources.LogicalRelation$.apply(LogicalRelation.scala:76)
>     at org.apache.spark.sql.SparkSession.baseRelationToDataFrame(SparkSession.scala:440)
>     at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:274)
>     at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:245)
>     at scala.Option.getOrElse(Option.scala:189)
>     at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:245)
>     at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:188)
>     ... 50 elided
>
> Attached are a few screenshots that show the code flow and the buggy state
> of the variables, along with a Java file and pom.xml that can be used to
> reproduce the issue (sorry, I don't have an anonymized table to share yet).
> The bug seems to have been introduced in this PR change:
> [https://github.com/apache/hudi/pull/6727/files#diff-4cfd87bb9200170194a633746094de138c3a0e3976d351d0d911ee95651256acR220]
> Initial work on that file happened in this Jira
> (https://issues.apache.org/jira/browse/HUDI-4363) and PR
> (https://github.com/apache/hudi/pull/6046), respectively.

--
This message was sent by Atlassian Jira (v8.20.10#820010)
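
The contract violation described in the report can be modeled without Spark or Hudi on the classpath. The sketch below is only an illustration under stated assumptions: `Schema` and `EmptyRelation` are hypothetical stand-ins written for this example, not Spark's `StructType` or Hudi's actual `EmptyRelation`, and the null guard shown is one plausible way to honor the "schema is never null" expectation, not necessarily the change made in the linked pull request. In real Spark code, an empty `StructType` (for example `new StructType()`) would likely play the role of `Schema.EMPTY`.

```java
import java.util.Collections;
import java.util.List;

public class EmptyRelationSketch {

    /** Hypothetical stand-in for a schema type: an ordered list of field names. */
    public static final class Schema {
        public static final Schema EMPTY = new Schema(Collections.<String>emptyList());

        private final List<String> fieldNames;

        public Schema(List<String> fieldNames) {
            this.fieldNames = fieldNames;
        }

        public List<String> fieldNames() {
            return fieldNames;
        }
    }

    /** Hypothetical stand-in for the relation returned when a table has no completed instants. */
    public static final class EmptyRelation {
        private final Schema schema;

        public EmptyRelation(Schema userSpecifiedSchema) {
            // Assumption: when no schema can be resolved, fall back to an
            // empty schema instead of storing null.
            this.schema = (userSpecifiedSchema != null) ? userSpecifiedSchema : Schema.EMPTY;
        }

        public Schema schema() {
            return schema;
        }
    }

    /** Models a caller that dereferences the schema without a null check. */
    public static int countFields(EmptyRelation relation) {
        return relation.schema().fieldNames().size();
    }

    public static void main(String[] args) {
        // No user-specified schema: the guard keeps the caller from hitting an NPE.
        EmptyRelation relation = new EmptyRelation(null);
        System.out.println("fields: " + countFields(relation));
    }
}
```

Without the guard in the constructor, `countFields` would throw a `NullPointerException` on `relation.schema().fieldNames()`, which mirrors the failure at the top of the stack trace in the report.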