MichaelUryukin opened a new issue, #7617:
URL: https://github.com/apache/hudi/issues/7617

   **Describe the problem you faced**
   
   
When we write a DataFrame to a Hudi table that is partitioned by a column of type 
"date", and the value of that column is NULL in one of the rows, Hudi writes the row 
under the "default" partition value instead 
(https://hudi.apache.org/docs/0.10.1/configurations#partitiondefault_name). The 
write command (`df.write.format("hudi")....`) **succeeds**, but the read 
command (`spark.read.format("hudi")...`) **fails** when casting the value `default` 
to `DateType` for the partition column.
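
   For illustration (untested, not copied from the actual table): with 
`'hoodie.datasource.write.hive_style_partitioning': 'true'`, the NULL rows end up 
under a directory named after the partition-default value, roughly:

   ```
   s3://bucket-name/glue_hudi_null_date_partition_issue/
   ├── birth_date=2000-01-01/
   └── birth_date=default/   <-- NULL rows; "default" cannot be cast to DateType on read
   ```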
   
   
   Steps to reproduce the behaviour:
   1. Create a sample DataFrame with at least one row where birth_date = NULL
   ```
   import datetime

   from pyspark.sql.types import StructType, StructField, StringType, DateType, IntegerType

   data = [
       ("James", "", "Smith", "36636", datetime.date(2000, 1, 1), 3000),
       ("Michael", "Rose", "", "40288", None, 4000),
       ("Robert", "", "Williams", "42114", None, 4000),
       ("Maria", "Anne", "Jones", "39192", None, 4000)
   ]

   schema = StructType([
       StructField("firstname", StringType(), True),
       StructField("middlename", StringType(), True),
       StructField("lastname", StringType(), True),
       StructField("id", StringType(), True),
       StructField("birth_date", DateType(), True),
       StructField("salary", IntegerType(), True)])

   df = spark.createDataFrame(data, schema)
   ```
   2. Set up the Hudi table configs and write the DataFrame
   
   ```
   table_name = 'glue_hudi_null_date_partition_issue'
   hudi_options = {
       'className': 'org.apache.hudi',
       'hoodie.datasource.write.precombine.field': 'id',
       'hoodie.datasource.write.recordkey.field': 'id',
       'hoodie.table.name': table_name,
       'hoodie.consistency.check.enabled': 'true',
       'hive_sync.ignore_exceptions': 'false',
       'hoodie.insert.shuffle.parallelism': '200',
       'hoodie.bulkinsert.shuffle.parallelism': '200',
       'hoodie.upsert.shuffle.parallelism': '200',
       'hoodie.datasource.write.partitionpath.field': 'birth_date',
       'hoodie.datasource.write.hive_style_partitioning': 'true',
       'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.ComplexKeyGenerator'
   }

   df.write.format("hudi").options(**hudi_options).mode("overwrite").save(f"s3://bucket-name/{table_name}/")
   ```
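
   A possible write-side workaround (untested sketch, not something Hudi does for you) 
is to replace the NULLs with a sentinel date before writing, so every partition value 
remains castable to `DateType` on read:

   ```
   from pyspark.sql import functions as F

   # Untested sketch: map NULL birth_date to a sentinel date (1900-01-01 here)
   # so no row falls into the uncastable "default" partition.
   df_safe = df.withColumn(
       "birth_date",
       F.coalesce(F.col("birth_date"), F.to_date(F.lit("1900-01-01")))
   )
   df_safe.write.format("hudi").options(**hudi_options).mode("overwrite").save(f"s3://bucket-name/{table_name}/")
   ```
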
   3. Read from this table:
   ```
   spark.read.format("hudi").options(**hudi_options).load(f"s3://bucket-name/{table_name}/").show()
   ```
   
   
   **Expected behavior**
    I would expect the "write" command to fail when the partition column contains NULL values, instead of silently producing a table that cannot be read back.
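
   Until then, a minimal fail-fast guard (untested sketch, using the DataFrame from 
step 1) can enforce this before the write:

   ```
   from pyspark.sql import functions as F

   # Untested sketch: abort before writing if the partition column contains
   # NULLs, since Hudi 0.10.1 accepts them and the read then fails.
   null_rows = df.filter(F.col("birth_date").isNull()).count()
   if null_rows > 0:
       raise ValueError(f"{null_rows} row(s) have NULL birth_date; aborting write")
   ```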
   
   **Environment Description**
   
   * Hudi version : 0.10.1
   
   * Spark version : 3.1.2
   
   * Hive version : 
   
   * Hadoop version :
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : no
   
   
   
   **Stacktrace**
   
   ```
   Py4JJavaError: An error occurred while calling o444.load.
   : java.lang.RuntimeException: Failed to cast value `default` to `DateType` for partition column `birth_date`
        at org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitionColumn(PartitioningUtils.scala:313)
        at org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartition(PartitioningUtils.scala:251)
        at org.apache.spark.sql.execution.datasources.Spark3ParsePartitionUtil.parsePartition(Spark3ParsePartitionUtil.scala:37)
        at org.apache.hudi.HoodieFileIndex.$anonfun$getAllQueryPartitionPaths$3(HoodieFileIndex.scala:586)
        at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:233)
        at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:58)
        at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:51)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at scala.collection.TraversableLike.map(TraversableLike.scala:233)
        at scala.collection.TraversableLike.map$(TraversableLike.scala:226)
        at scala.collection.AbstractTraversable.map(Traversable.scala:104)
        at org.apache.hudi.HoodieFileIndex.getAllQueryPartitionPaths(HoodieFileIndex.scala:538)
        at org.apache.hudi.HoodieFileIndex.loadPartitionPathFiles(HoodieFileIndex.scala:602)
        at org.apache.hudi.HoodieFileIndex.refresh0(HoodieFileIndex.scala:387)
        at org.apache.hudi.HoodieFileIndex.<init>(HoodieFileIndex.scala:184)
        at org.apache.hudi.DefaultSource.getBaseFileOnlyView(DefaultSource.scala:199)
        at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:119)
        at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:69)
        at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:354)
        at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:326)
        at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:308)
        at scala.Option.getOrElse(Option.scala:121)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:308)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:240)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.lang.Thread.run(Thread.java:750)
   ```
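
   My reading of the trace (an interpretation, not confirmed): `HoodieFileIndex` 
passes the table schema's `DateType` for `birth_date` to Spark's partition parser 
(`PartitioningUtils.parsePartitionColumn`), and Spark only knows how to turn its own 
NULL marker `__HIVE_DEFAULT_PARTITION__` back into NULL, not Hudi's `default`. A 
Hudi-free sketch that should hit the same code path (untested; paths and names are 
placeholders of mine):

   ```
   from pyspark.sql.types import StructType, StructField, LongType, DateType

   # Untested sketch: create a directory literally named birth_date=default,
   # mimicking what Hudi writes for NULL partition values.
   spark.range(1).write.parquet("/tmp/repro/birth_date=default")

   # Reading with an explicit DateType for the partition column should fail the
   # same way, since "default" is not a special marker to Spark.
   read_schema = StructType([
       StructField("id", LongType(), True),
       StructField("birth_date", DateType(), True),
   ])
   spark.read.schema(read_schema).parquet("/tmp/repro").show()
   ```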
   
   

