noahtaite opened a new issue, #9845:
URL: https://github.com/apache/hudi/issues/9845
**Describe the problem you faced**
On AWS EMR 6.12 and 6.11.1 (running Hudi 0.13.0-amzn-0 with Spark 3.4.0/3.3.2), we are getting a NullPointerException when attempting to materialize (count or save) a result generated from an existing Hudi lake that has a nullable "integer" column that was converted from its original type of "short".

The Hudi lake itself was generated with EMR 6.12 with no problem. My original field had a Parquet type of "ShortType", which was loaded into the Hudi table as "IntegerType". But when we read a field that was originally "ShortType" and has both null and non-null values, we get a NullPointerException.
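For reference, this is roughly how we confirmed the promotion on the table as read back from Hudi (output abbreviated, Hudi metadata columns omitted):

```scala
// The column written as ShortType comes back as integer after the Hudi write.
spark.read.format("hudi").load("s3://hudi-lake/hudi-table").printSchema()
// ...
// |-- shortid: integer (nullable = true)
// ...
```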
We can materialize the same column using EMR 6.9 (running Hudi 0.12.1-amzn-0
+ Spark 3.3.0). So our users have had to downgrade their applications in order
to use this table properly.
Why are nullable fields that were originally short (and therefore converted to integer by Hudi) failing when they contain null values? Is there a workaround that lets us use the latest version of Hudi and still read this column?
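One idea we have not verified yet is disabling Spark's vectorized Parquet reader for this read, since the stacktrace below points at `OnHeapColumnVector.getInt`. This is only a guess at a workaround, not a confirmed fix:

```scala
// Untested workaround idea: fall back to the non-vectorized Parquet reader,
// since the NPE originates in the vectorized read path (OnHeapColumnVector.getInt).
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")

val loaded_df = spark.read.format("hudi").load("s3://hudi-lake/hudi-table")
loaded_df.groupBy("shortid").count.sort(col("count").desc).show(1000, false)
```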
**To Reproduce**
Steps to reproduce the behavior:
```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.SaveMode.Append
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig.TABLE_NAME

val schema = StructType(Array(
  StructField("datasource", StringType, true),
  StructField("id", IntegerType, true),
  StructField("shortid", ShortType, true),
  StructField("longid", LongType, true)
))

val data = Seq(
  Row("partition1", 11, 1011.toShort, 1011L),
  Row("partition1", 22, 2011.toShort, 2011L),
  Row("partition1", 33, null, 3011L),
  Row("partition1", 44, 4011.toShort, null)
)

val df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
df.count // Returns 4

df.write.format("hudi").
  option(PRECOMBINE_FIELD_OPT_KEY, "id").
  option(RECORDKEY_FIELD_OPT_KEY, "id").
  option(PARTITIONPATH_FIELD_OPT_KEY, "datasource").
  option(OPERATION_OPT_KEY, "bulk_insert").
  option(TABLE_NAME, "test.all_hudi").
  mode(Append).
  save("s3://hudi-lake/hudi-table")

val loaded_df = spark.read.format("hudi").load("s3://hudi-lake/hudi-table")

// NULL POINTER EXCEPTION:
loaded_df.groupBy("shortid").count.sort(col("count").desc).show(1000, false)
```
**Expected behavior**
I expect my nullable short columns to be materializable regardless of whether they contain null values.

I understand Hudi converts Spark ShortType columns to IntegerType. That conversion is expected, but it should not cause a failure when attempting to materialize the "shortid" field.
**Environment Description**
* Hudi version : 0.13.1-amzn-0
* Spark version : 3.4.0 + 3.3.2
* Hive version : 3.1.3
* Hadoop version : 3.3.3
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : No
**Additional context**
Tested on AWS EMR 6.12, 6.11.1, 6.11.0, 6.10.0, and 6.9.0. Only 6.9.0 and 6.10.0 were successful, so this appears to be a regression.

Maybe related to https://github.com/apache/hudi/issues/4233? Not sure, because this works in the older versions but not in the latest code.
**Stacktrace**
```
23/10/10 15:51:39 INFO S3NativeFileSystem: Opening 's3://hudi-lake/hudi-table/datasource=partition1/c978d385-a0a7-4634-b92b-4ab2204192ef-0_95-768-83167_20231010135439571.parquet' for reading
23/10/10 15:51:39 ERROR Utils: Aborting task
java.lang.NullPointerException: null
	at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getInt(OnHeapColumnVector.java:314) ~[spark-sql_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
	at org.apache.spark.sql.vectorized.ColumnarBatchRow.getInt(ColumnarBatchRow.java:106) ~[spark-catalyst_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) ~[?:?]
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:35) ~[spark-sql_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.hasNext(Unknown Source) ~[?:?]
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:959) ~[spark-sql_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
	at org.apache.spark.sql.execution.datasources.FileFormatDataWriter.writeWithIterator(FileFormatDataWriter.scala:91) ~[spark-sql_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$executeTask$1(FileFormatWriter.scala:404) ~[spark-sql_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1575) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
	at org.apache.spark.sql.executio