[ https://issues.apache.org/jira/browse/HUDI-5688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yue Zhang updated HUDI-5688:
----------------------------
    Fix Version/s: 0.14.0
                       (was: 0.13.1)

> schema field of EmptyRelation subtype of BaseRelation should not be null
> ------------------------------------------------------------------------
>
>                 Key: HUDI-5688
>                 URL: https://issues.apache.org/jira/browse/HUDI-5688
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: core
>            Reporter: Pramod Biligiri
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.14.0
>
>         Attachments: 1-userSpecifiedSchema-is-null.png, 2-empty-relation.png, 
> 3-table-schema-will-not-resolve.png, 4-resolve-schema-returns-null.png, 
> Main.java, pom.xml
>
>
> If there are no completed instants in the table, and there is also no user-defined
> schema for it (as represented by the userSpecifiedSchema field in DataSource.scala),
> then the EmptyRelation returned by DefaultSource.createRelation has its schema set
> to null. This breaks the contract of Spark's BaseRelation, whose schema is a
> StructType and is not expected to be null.
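> To make the contract concrete, here is a minimal sketch of a relation that carries
> no data but still reports a non-null schema. The class name is hypothetical and is
> not Hudi's actual EmptyRelation; it only illustrates what Spark expects from any
> BaseRelation:
>
> import org.apache.spark.sql.SQLContext
> import org.apache.spark.sql.sources.BaseRelation
> import org.apache.spark.sql.types.StructType
>
> // Illustration only: even an "empty" relation should expose an empty StructType
> // rather than null, because Spark reads relation.schema unconditionally
> // (e.g. LogicalRelation.apply -> CharVarcharUtils, as in the stack trace below).
> class EmptyRelationSketch(val sqlContext: SQLContext) extends BaseRelation {
>   override def schema: StructType = StructType(Nil)
> }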
> Module versions: current apache-hudi master (commit hash
> abe26d4169c04da05b99941161621876e3569e96) built with Spark 3.2 and Scala 2.12.
> The following spark-shell session reproduces the issue:
> spark.read.format("hudi")
>             .option("hoodie.datasource.query.type", "incremental") 
> .load("SOME_HUDI_TABLE_WITH_NO_COMPLETED_INSTANTS_OR_USER_SPECIFIED_SCHEMA")
> java.lang.NullPointerException
>   at org.apache.spark.sql.catalyst.util.CharVarcharUtils$.replaceCharVarcharWithStringInSchema(CharVarcharUtils.scala:41)
>   at org.apache.spark.sql.execution.datasources.LogicalRelation$.apply(LogicalRelation.scala:76)
>   at org.apache.spark.sql.SparkSession.baseRelationToDataFrame(SparkSession.scala:440)
>   at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:274)
>   at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:245)
>   at scala.Option.getOrElse(Option.scala:189)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:245)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:188)
>   ... 50 elided
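> As a hedged, untested workaround sketch (assumption: supplying an explicit schema
> populates the userSpecifiedSchema field mentioned above, so the relation no longer
> ends up with a null schema), the read can pass a schema explicitly. The column list
> below is hypothetical:
>
> import org.apache.spark.sql.types.{StringType, StructField, StructType}
>
> // Hypothetical column set; replace with the table's real columns.
> val explicitSchema = StructType(Seq(StructField("uuid", StringType)))
>
> spark.read.format("hudi")
>   .schema(explicitSchema)  // assumed to fill userSpecifiedSchema in DataSource.scala
>   .option("hoodie.datasource.query.type", "incremental")
>   .load("SOME_HUDI_TABLE_WITH_NO_COMPLETED_INSTANTS_OR_USER_SPECIFIED_SCHEMA")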
> Attached are a few screenshots showing the code flow and the buggy state of the
> variables, along with a Java file and pom.xml that can be used to reproduce the
> issue (sorry, I don't have a deanonymized table to share yet).
> The bug appears to have been introduced by this PR change:
> [https://github.com/apache/hudi/pull/6727/files#diff-4cfd87bb9200170194a633746094de138c3a0e3976d351d0d911ee95651256acR220]
> Initial work on that file was done in HUDI-4363
> (https://issues.apache.org/jira/browse/HUDI-4363) and the corresponding PR
> (https://github.com/apache/hudi/pull/6046).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
