Laurens Versluis created SPARK-41242:
----------------------------------------
Summary: Spark cannot infer the correct partition columnType for timestamps with millisecond or microsecond fractions
Key: SPARK-41242
URL: https://issues.apache.org/jira/browse/SPARK-41242
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 3.2.1
Environment: Windows 10, Spark 3.2.1 using Java 11
Reporter: Laurens Versluis

We are observing some inconsistencies in how Spark infers the type of partition columns.

Summary:
# If a partition's (i.e., folder name's) timestamp is in the format `yyyy-MM-dd HH:mm:ss`, Spark *correctly* infers TimestampType.
# If a partition's (i.e., folder name's) timestamp is in the format `yyyy-MM-dd HH:mm:ss.S` (millisecond fraction of a second added), Spark *incorrectly* infers StringType.
# If a partition's (i.e., folder name's) timestamp is in the format `yyyy-MM-dd HH:mm:ss.n` (nano- or microsecond fraction of a second added), Spark *incorrectly* infers StringType.

Spark will, however, happily partitionBy a column filled with Java Instant objects (nanosecond accuracy), but it truncates them to microsecond accuracy (6 digits) in the folder name, likely because of Java 8 limitations.

Details:
We observed that queries sending strings such as `2017-05-03T05:17:58.143Z` were returning no results. After ruling out timezone differences, we looked at the dataset schema and found that the column had StringType. This means the string in the query was compared as a string rather than as a timestamp, leading to no results.

After validating that no basePath setting was forcing things to StringType, we ran a few local tests. We noticed that whenever a folder has a name like `colName=2019-08-11 19%3A33%3A05`, Spark correctly infers the type. However, if a partition folder has a name like `colName=2020-04-03 04%3A48%3A28.467`, Spark infers it as StringType.

Looking at the source code of [PartitioningUtils|https://github.com/apache/spark/blob/a80899f8bef84bb471606f0bbef889aa5e81628b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala#L66], Spark should be matching the millisecond fraction just fine.

Now this is where it gets more interesting. With Spark 3.0, some changes were made to timestamp parsing due to the switch in calendars; [https://github.com/apache/spark/commit/3493162c78822de0563f7d736040aebdb81e936b] appears to be one of them. With the config option `spark.sql.datetime.java8API.enabled`, Spark uses java.time objects such as Instant. Instants have higher precision than milliseconds: nanoseconds.
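As a side note, plain java.time shows how sensitive a fixed pattern is to the fraction part. The snippet below is a standalone illustration of that behavior only (it is JDK code, not Spark's actual inference path), reusing the example values from above:
{code:java}
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.time.format.DateTimeParseException;
import java.time.temporal.ChronoField;

// A fixed pattern without a fraction part parses the plain value fine...
DateTimeFormatter noFraction = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");
System.out.println(LocalDateTime.parse("2019-08-11 19:33:05", noFraction));

// ...but rejects a value with a millisecond fraction appended.
try {
    LocalDateTime.parse("2020-04-03 04:48:28.467", noFraction);
} catch (DateTimeParseException e) {
    System.out.println("rejected: " + e.getMessage());
}

// A formatter with an optional, variable-length fraction (0 to 9 digits)
// accepts the plain, millisecond, and microsecond variants alike.
DateTimeFormatter withFraction = new DateTimeFormatterBuilder()
    .appendPattern("yyyy-MM-dd HH:mm:ss")
    .appendFraction(ChronoField.NANO_OF_SECOND, 0, 9, true)
    .toFormatter();
System.out.println(LocalDateTime.parse("2019-08-11 19:33:05", withFraction));
System.out.println(LocalDateTime.parse("2020-04-03 04:48:28.467", withFraction));
System.out.println(LocalDateTime.parse("2022-11-23 15:59:01.122726", withFraction));
{code}
If the partition-value inference effectively behaves like the first, fixed-pattern formatter, that would explain the fallback to StringType for the fractional folder names above, but I have not confirmed that this is what happens inside Spark.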
So I ran the following Java PoC code to observe Spark's behavior and to rule out that conflicting types in the folder names caused Spark to fall back to StringType:
{code:java}
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

SparkSession sparkSession = SparkSession.builder()
    .master("local[*]")
    .config("spark.sql.datetime.java8API.enabled", true)
    .config("spark.sql.session.timeZone", "UTC")
    .getOrCreate();

var inst = Instant.now();
System.out.println(inst);

StructType schema = new StructType(new StructField[]{
    DataTypes.createStructField("datacol", DataTypes.StringType, false),
    DataTypes.createStructField("partitioncol", DataTypes.TimestampType, false)
});

List<Row> data = new ArrayList<>();
data.add(RowFactory.create("test", inst));

Dataset<Row> testDataset = sparkSession.createDataFrame(data, schema);
testDataset.show(false);

// Write partitioned by the timestamp column, then read the data back in.
testDataset.write().mode(SaveMode.Overwrite).partitionBy("partitioncol").parquet("C:\\Dev\\testdata");
sparkSession.read().parquet("C:\\Dev\\testdata").printSchema();
{code}
The Instant prints as `2022-11-23T15:59:01.122726300Z`, yet `show()` prints:
{noformat}
+-------+--------------------------+
|datacol|partitioncol              |
+-------+--------------------------+
|test   |2022-11-23 15:59:01.122726|
+-------+--------------------------+{noformat}
Sure enough, the schema shows that StringType was inferred:
{noformat}
root
 |-- datacol: string (nullable = true)
 |-- partitioncol: string (nullable = true){noformat}
Also note that we went from 9 digits to 6, i.e., from nanosecond to microsecond precision. This is likely due to Java 8, as explained in this SO answer: https://stackoverflow.com/a/30135772

Nonetheless, Spark happily creates the partition folder with the name `partitioncol=2022-11-23%2015%3A59%3A01.122726`. This raises the question: will Spark infer the type correctly given the folder name `partitioncol=2022-11-23%2015%3A59%3A01.122726` and the pattern `yyyy-MM-dd HH:mm:ss[.S]`? My guess is not. While the Oracle docs mention no support for nanosecond or microsecond parsing in the date formatter ([https://docs.oracle.com/javase/8/docs/api/java/text/SimpleDateFormat.html]), it appears that one can specify the `n` letter for nanosecond parsing; see [https://stackoverflow.com/a/41879425].

I tried to build Spark locally and extend the `org.apache.spark.sql.execution.datasources.parquet.ParquetPartitionDiscoverySuite` tests, but I am getting exceptions when building Spark. Having already spent some time on this, I am fairly sure this is a bug in Spark, but help or additional pointers are welcome.

I know that manually specifying a schema will resolve it, as per [https://github.com/apache/spark/blob/a80899f8bef84bb471606f0bbef889aa5e81628b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala#L312], but ideally we would like to leave this to Spark.
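For reference, here is a rough sketch of that manual-schema workaround, reusing the column names and path from the PoC above (an illustration under those assumptions, not a drop-in fix):
{code:java}
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Declare partitioncol explicitly as TimestampType so the partition column
// type no longer depends on inference from the folder names.
StructType explicitSchema = new StructType(new StructField[]{
    DataTypes.createStructField("datacol", DataTypes.StringType, true),
    DataTypes.createStructField("partitioncol", DataTypes.TimestampType, true)
});

sparkSession.read()
    .schema(explicitSchema)
    .parquet("C:\\Dev\\testdata")
    .printSchema();
// Expected: partitioncol now comes back as timestamp instead of string.
{code}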