Re: [I] [SUPPORT] Materializing nullable ShortType columns throws NullPointerException [hudi]

2023-10-11 Thread via GitHub


ad1happy2go commented on issue #9845:
URL: https://github.com/apache/hudi/issues/9845#issuecomment-1757851940

   @noahtaite Yes, converting to integer type before saving will work.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Materializing nullable ShortType columns throws NullPointerException [hudi]

2023-10-11 Thread via GitHub


noahtaite commented on issue #9845:
URL: https://github.com/apache/hudi/issues/9845#issuecomment-1757792128

   Hello @danny0405, we are ingesting from ~300 tables across ~2k customer databases 
which are not fully constrained, so we can expect to see null values in many 
fields. In this case I believe it is a missing linking ID.
   
   @ad1happy2go happy to hear you have reproduced the issue; looking forward to 
hearing about a workaround and a timeline for a fix.
   
   Two things to note:
   1 - I also reproduced this issue with ByteType, which Hudi appears to be 
handling exactly the same as ShortType.
   2 - Our current workaround (temporary and hacky) is to convert all incoming 
ShortType + ByteType columns to IntegerType before saving to Hudi. This is working in 
our dev environment.
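
   The cast workaround described above can be sketched as follows. This is my own illustration, not code from this thread: the helper name `widenExprs` is hypothetical, and it relies on the type-name strings (`"ShortType"`, `"ByteType"`) that `DataFrame.dtypes` returns in Scala Spark.

   ```scala
   // Hypothetical helper: given the (name, typeName) pairs from df.dtypes,
   // build SELECT expressions that widen short/byte columns to int.
   // Assumes simple column names that need no backquoting.
   def widenExprs(cols: Seq[(String, String)]): Seq[String] =
     cols.map {
       case (name, "ShortType" | "ByteType") => s"CAST($name AS INT) AS $name"
       case (name, _)                        => name
     }

   // Applied before the Hudi write (requires a live SparkSession):
   //   val widened = df.selectExpr(widenExprs(df.dtypes.toSeq): _*)
   //   widened.write.format("hudi"). /* options as below */ .save(path)
   ```

   The string-expression approach keeps the helper free of Spark runtime dependencies; an equivalent `withColumn(name, col(name).cast(IntegerType))` loop would work just as well.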





Re: [I] [SUPPORT] Materializing nullable ShortType columns throws NullPointerException [hudi]

2023-10-10 Thread via GitHub


ad1happy2go commented on issue #9845:
URL: https://github.com/apache/hudi/issues/9845#issuecomment-1756828539

   @noahtaite Thanks for raising this issue. I confirmed that Hudi is not 
handling ShortType well, and it is not even related to whether the column is nullable. 
This works fine with other file formats.
   
   Created JIRA for the same - https://issues.apache.org/jira/browse/HUDI-6936
   
   ```scala
   val schema = StructType(Array(
     StructField("datasource", StringType, true),
     StructField("id", IntegerType, true),
     StructField("shortid", ShortType, false),
     StructField("longid", LongType, true)
   ))
   
   val data = Seq(
     Row("partition1", 11, 1011.toShort, 1011L),
     Row("partition1", 22, 2011.toShort, 2011L),
     Row("partition1", 33, 2012.toShort, 3011L)
   )
   
   var df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
   
   df.count // Returns 3
   val path = "file:///tmp/test_hudi40"
   
   df.write.format("hudi").
     option(PRECOMBINE_FIELD_OPT_KEY, "id").
     option(RECORDKEY_FIELD_OPT_KEY, "id").
     option(PARTITIONPATH_FIELD_OPT_KEY, "datasource").
     option(OPERATION_OPT_KEY, "bulk_insert").
     option(TABLE_NAME, "test.all_hudi").
     mode(Append).
     save(path)
   
   val loaded_df = spark.read.format("hudi").load(path)
   loaded_df.printSchema
   
   root
    |-- _hoodie_commit_time: string (nullable = true)
    |-- _hoodie_commit_seqno: string (nullable = true)
    |-- _hoodie_record_key: string (nullable = true)
    |-- _hoodie_partition_path: string (nullable = true)
    |-- _hoodie_file_name: string (nullable = true)
    |-- id: integer (nullable = true)
    |-- shortid: integer (nullable = false)
    |-- longid: long (nullable = true)
    |-- datasource: string (nullable = true)
   
   loaded_df.show()
   ```
   





Re: [I] [SUPPORT] Materializing nullable ShortType columns throws NullPointerException [hudi]

2023-10-10 Thread via GitHub


danny0405 commented on issue #9845:
URL: https://github.com/apache/hudi/issues/9845#issuecomment-1756711558

   Why could a short type be null?





[I] [SUPPORT] Materializing nullable ShortType columns throws NullPointerException [hudi]

2023-10-10 Thread via GitHub


noahtaite opened a new issue, #9845:
URL: https://github.com/apache/hudi/issues/9845

   **Describe the problem you faced**
   
   On AWS EMR 6.12 + 6.11.1 (running Hudi 0.13.0-amzn-0 + Spark 3.4.0/3.3.2), 
we are getting a NullPointerException when attempting to materialize (count or 
save) a result generated from an existing Hudi lake that has a nullable 
"integer" column that was converted from its original type of "short".
   
   The Hudi lake was generated with EMR 6.12 with no problem. My original field 
had a Parquet type of ShortType, which was loaded into the Hudi table as 
IntegerType. But when we read a field that was originally ShortType 
and has both null and non-null values, we get a NullPointerException.
   
   We can materialize the same column using EMR 6.9 (running Hudi 0.12.1-amzn-0 
+ Spark 3.3.0), so our users have had to downgrade their applications in order 
to use this table properly.
   
   Why are nullable fields that were originally short, and so converted to 
integer by Hudi, failing when we have null values? Is there a workaround that lets us 
use the latest version of Hudi and still use this column?
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   ```scala
   val schema = StructType(Array(
     StructField("datasource", StringType, true),
     StructField("id", IntegerType, true),
     StructField("shortid", ShortType, true),
     StructField("longid", LongType, true)
   ))
   
   val data = Seq(
     Row("partition1", 11, 1011.toShort, 1011L),
     Row("partition1", 22, 2011.toShort, 2011L),
     Row("partition1", 33, null, 3011L),
     Row("partition1", 44, 4011.toShort, null)
   )
   
   var df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
   
   df.count // Returns 4
   
   df.write.format("hudi").
     option(PRECOMBINE_FIELD_OPT_KEY, "id").
     option(RECORDKEY_FIELD_OPT_KEY, "id").
     option(PARTITIONPATH_FIELD_OPT_KEY, "datasource").
     option(OPERATION_OPT_KEY, "bulk_insert").
     option(TABLE_NAME, "test.all_hudi").
     mode(Append).
     save("s3://hudi-lake/hudi-table")
   
   val loaded_df = spark.read.format("hudi").load("s3://hudi-lake/hudi-table")
   
   // NULL POINTER EXCEPTION:
   loaded_df.groupBy("shortid").count.sort(col("count").desc).show(1000, false)
   ```
   
   **Expected behavior**
   
   I expect my nullable short columns to be materializable regardless 
of whether there are null values.
   
   I understand Hudi is converting Spark ShortType -> IntegerType. This 
is expected, but it should not fail when attempting to materialize the 
"shortid" field.
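   
   The Short -> Integer promotion is likely because Hudi derives an Avro schema for the table, and Avro's primitive types have no 16-bit (or 8-bit) integer. A sketch of that mapping rule, as my own illustration rather than Hudi's actual schema converter:
   
   ```scala
   // Illustration only: Avro primitives are null, boolean, int, long, float,
   // double, bytes, and string, so narrow Spark integral types must widen
   // to "int" when an Avro-compatible schema is derived.
   def avroPrimitiveFor(sparkTypeName: String): String = sparkTypeName match {
     case "ShortType" | "ByteType" | "IntegerType" => "int"
     case "LongType"                               => "long"
     case "StringType"                             => "string"
     case other => sys.error(s"not covered in this sketch: $other")
   }
   ```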
   
   
   **Environment Description**
   
   * Hudi version : 0.13.1-amzn-0
   
   * Spark version : 3.4.0 + 3.3.2
   
   * Hive version : 3.1.3
   
   * Hadoop version : 3.3.3
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : No
   
   
   **Additional context**
   
   Tested on AWS EMR 6.12, 6.11.1, 6.11.0, 6.10.0, and 6.9.0. Only 6.9.0 and 
6.10.0 were successful, so this appears to be a regression.
   
   Maybe related to https://github.com/apache/hudi/issues/4233? Not sure, 
because this is working in older versions but not in the latest and greatest 
code.
   
   **Stacktrace**
   
   ```
   23/10/10 15:51:39 INFO S3NativeFileSystem: Opening 's3://hudi-lake/hudi-table/datasource=partition1/c978d385-a0a7-4634-b92b-4ab2204192ef-0_95-768-83167_20231010135439571.parquet' for reading
   23/10/10 15:51:39 ERROR Utils: Aborting task
   java.lang.NullPointerException: null
       at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getInt(OnHeapColumnVector.java:314) ~[spark-sql_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
       at org.apache.spark.sql.vectorized.ColumnarBatchRow.getInt(ColumnarBatchRow.java:106) ~[spark-catalyst_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
       at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) ~[?:?]
       at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:35) ~[spark-sql_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
       at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.hasNext(Unknown Source) ~[?:?]
       at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:959) ~[spark-sql_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
       at org.apache.spark.sql.execution.datasources.FileFormatDataWriter.writeWithIterator(FileFormatDataWriter.scala:91) ~[spark-sql_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
       at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$executeTask$1(FileFormatWriter.scala:404) ~[spark-sql_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
       at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1575) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
       at org.apache.spark.sql.executio
   ```