MichaelUryukin opened a new issue, #7617: URL: https://github.com/apache/hudi/issues/7617
**Describe the problem you faced**

When we write a DataFrame to a Hudi table that is partitioned by a column of type `date`, and the value of that column is NULL for one of the rows, Hudi writes the row under the "default" partition value instead (https://hudi.apache.org/docs/0.10.1/configurations#partitiondefault_name). The write command (`df.write.format("hudi")....`) **succeeds**, but the read command (`spark.read.format("hudi")...`) **fails** while casting the value `default` to `DateType` for the partition column.

Steps to reproduce the behavior:

1. Create a sample DataFrame with at least one row where `birth_date` is NULL:

```python
import datetime
from pyspark.sql.types import StructType, StructField, StringType, DateType, IntegerType

data = [
    ("James", "", "Smith", "36636", datetime.date(2000, 1, 1), 3000),
    ("Michael", "Rose", "", "40288", None, 4000),
    ("Robert", "", "Williams", "42114", None, 4000),
    ("Maria", "Anne", "Jones", "39192", None, 4000)
]

schema = StructType([
    StructField("firstname", StringType(), True),
    StructField("middlename", StringType(), True),
    StructField("lastname", StringType(), True),
    StructField("id", StringType(), True),
    StructField("birth_date", DateType(), True),
    StructField("salary", IntegerType(), True)])

df = spark.createDataFrame(data=data, schema=schema)
```

2. Set up the Hudi table configs and write to the table:

```python
table_name = 'glue_hudi_null_date_partition_issue'
hudi_options = {
    'className': 'org.apache.hudi',
    'hoodie.datasource.write.precombine.field': 'id',
    'hoodie.datasource.write.recordkey.field': 'id',
    'hoodie.table.name': table_name,
    'hoodie.consistency.check.enabled': 'true',
    'hive_sync.ignore_exceptions': 'false',
    'hoodie.insert.shuffle.parallelism': '200',
    'hoodie.bulkinsert.shuffle.parallelism': '200',
    'hoodie.upsert.shuffle.parallelism': '200',
    'hoodie.datasource.write.partitionpath.field': 'birth_date',
    'hoodie.datasource.write.hive_style_partitioning': 'true',
    'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.ComplexKeyGenerator'
}

df.write.format("hudi").options(**hudi_options).mode("overwrite").save(f"s3://bucket-name/{table_name}/")
```

3. Read from the table:

```python
spark.read.format("hudi").options(**hudi_options).load(f"s3://bucket-name/{table_name}/").show()
```

**Expected behavior**

I would expect the write command to fail, rather than succeeding and producing a table that cannot be read.

**Environment Description**

* Hudi version : 0.10.1
* Spark version : 3.1.2
* Hive version :
* Hadoop version :
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : no

**Stacktrace**

```
Py4JJavaError: An error occurred while calling o444.load.
: java.lang.RuntimeException: Failed to cast value `default` to `DateType` for partition column `birth_date`
	at org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitionColumn(PartitioningUtils.scala:313)
	at org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartition(PartitioningUtils.scala:251)
	at org.apache.spark.sql.execution.datasources.Spark3ParsePartitionUtil.parsePartition(Spark3ParsePartitionUtil.scala:37)
	at org.apache.hudi.HoodieFileIndex.$anonfun$getAllQueryPartitionPaths$3(HoodieFileIndex.scala:586)
	at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:233)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:58)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:51)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
	at scala.collection.TraversableLike.map(TraversableLike.scala:233)
	at scala.collection.TraversableLike.map$(TraversableLike.scala:226)
	at scala.collection.AbstractTraversable.map(Traversable.scala:104)
	at org.apache.hudi.HoodieFileIndex.getAllQueryPartitionPaths(HoodieFileIndex.scala:538)
	at org.apache.hudi.HoodieFileIndex.loadPartitionPathFiles(HoodieFileIndex.scala:602)
	at org.apache.hudi.HoodieFileIndex.refresh0(HoodieFileIndex.scala:387)
	at org.apache.hudi.HoodieFileIndex.<init>(HoodieFileIndex.scala:184)
	at org.apache.hudi.DefaultSource.getBaseFileOnlyView(DefaultSource.scala:199)
	at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:119)
	at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:69)
	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:354)
	at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:326)
	at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:308)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:308)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:240)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:750)
```
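The write/read asymmetry described above can be modeled in a few lines of plain Python. This is an illustrative sketch, not Spark's or Hudi's actual code: `DEFAULT_PARTITION_NAME`, `partition_dir`, and `read_partition_value` are hypothetical names, and the reader's cast to `DateType` is approximated here by a strict ISO-date parse. The point is that the writer substitutes the default partition name for a NULL date and succeeds, while the reader can no longer cast that path segment back to a date:

```python
from datetime import date

# "default" mirrors the default partition name Hudi uses for NULL/empty
# partition values (see the linked partitiondefault_name config).
DEFAULT_PARTITION_NAME = "default"

def partition_dir(value):
    # Write side (modeled): a NULL partition value is replaced by the
    # default name, so the write itself succeeds.
    part = value.isoformat() if value is not None else DEFAULT_PARTITION_NAME
    return f"birth_date={part}"

def read_partition_value(dirname):
    # Read side (modeled): the path segment is cast back to DateType;
    # a strict date parse raises on the literal "default".
    return date.fromisoformat(dirname.split("=", 1)[1])

print(read_partition_value(partition_dir(date(2000, 1, 1))))  # 2000-01-01
try:
    read_partition_value(partition_dir(None))  # birth_date=default
except ValueError:
    print("read fails, matching the Failed-to-cast stack trace")
```

Under this model, any fix has to act on the write side (reject NULL partition values, or write a value the reader can cast), which is why the expected behavior above is for the write command to fail.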