Laurens Versluis created SPARK-41242:
----------------------------------------

             Summary: Spark cannot infer the correct partition column type for 
timestamps with millisecond or microsecond fractions
                 Key: SPARK-41242
                 URL: https://issues.apache.org/jira/browse/SPARK-41242
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.2.1
         Environment: Windows 10, Spark 3.2.1 using Java 11
            Reporter: Laurens Versluis


We are observing some inconsistencies in how Spark infers the type of 
partition columns.

Summary:
 # If a partition folder name's timestamp is in the format `yyyy-MM-dd 
HH:mm:ss`, then Spark will *correctly* infer that the type is TimestampType.
 # If a partition folder name's timestamp is in the format `yyyy-MM-dd 
HH:mm:ss.S` (millisecond fraction of a second added), then Spark will 
*incorrectly* infer that the type is StringType.
 # If a partition folder name's timestamp is in the format `yyyy-MM-dd 
HH:mm:ss.n` (nanosecond or millisecond fraction of a second added), then Spark 
will *incorrectly* infer that the type is StringType. Spark will, however, 
happily partitionBy a column filled with Java Instant objects (nanosecond 
accuracy), but lowers them to microsecond accuracy (6 digits) in the folder 
name, likely because of Java 8 limitations. A small repro sketch follows this 
list.
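
A minimal sketch of how these three cases can be reproduced locally (the output 
paths and class name are made up for illustration): partition a dataset by a 
plain string column holding the timestamp text, then read it back so Spark has 
to infer the partition column type purely from the folder name.
{code:java}
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class PartitionInferenceCases {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .master("local[*]")
                .config("spark.sql.session.timeZone", "UTC")
                .getOrCreate();

        // Both columns are strings on write; the partition column type on read
        // is inferred from the resulting folder names alone.
        StructType schema = new StructType(new StructField[]{
                DataTypes.createStructField("datacol", DataTypes.StringType, false),
                DataTypes.createStructField("ts", DataTypes.StringType, false)
        });

        String[] values = {
                "2019-08-11 19:33:05",          // case 1: read back as timestamp
                "2020-04-03 04:48:28.467",      // case 2: read back as string
                "2022-11-23 15:59:01.122726"    // case 3: read back as string
        };
        for (int i = 0; i < values.length; i++) {
            String path = "C:\\Dev\\case" + (i + 1);  // hypothetical output path
            Dataset<Row> df = spark.createDataFrame(
                    List.of(RowFactory.create("test", values[i])), schema);
            df.write().mode(SaveMode.Overwrite).partitionBy("ts").parquet(path);
            spark.read().parquet(path).printSchema();
        }
    }
}
{code}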

 

Details:

We observed that queries using strings such as `2017-05-03T05:17:58.143Z` were 
returning no results. After ruling out timezone differences, we looked at the 
dataset schema and found that the column was of type StringType. This means the 
string in the query was compared as a string rather than as a timestamp, 
leading to no results. After verifying that no basePath setting was forcing 
things to StringType, we ran a few local tests.
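
For illustration, with the column inferred as StringType a filter like the one 
below compares text against text, so the ISO-8601 literal from the query never 
matches the folder-name value (the path, column name and `spark` session here 
are placeholders):
{code:java}
import static org.apache.spark.sql.functions.col;

// Because the partition column is a string, this is a plain string comparison;
// the ISO-8601 literal never equals the folder-name text, so nothing is returned.
spark.read().parquet("C:\\Dev\\testdata")
        .filter(col("partitioncol").equalTo("2017-05-03T05:17:58.143Z"))
        .show();
{code}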

We noticed that whenever a folder has a name like `colName=2019-08-11 
19%3A33%3A05`, Spark correctly infers the type.

However, if a partition folder has a name like `colName=2020-04-03 
04%3A48%3A28.467`, then Spark infers it as StringType. Looking at the source 
code of 
[PartitioningUtils|https://github.com/apache/spark/blob/a80899f8bef84bb471606f0bbef889aa5e81628b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala#L66],
 Spark should be matching the millisecond fraction just fine.

 

Now this is where it gets more interesting. In Spark 3.0, the parsing of 
timestamps changed due to the switch in calendars; 
[https://github.com/apache/spark/commit/3493162c78822de0563f7d736040aebdb81e936b]
 seems to be one of the relevant changes. With the config option 
`spark.sql.datetime.java8API.enabled`, Spark makes use of java.time objects 
such as Instant.

Instant has higher precision than milliseconds: nanoseconds. So I ran the 
following Java PoC code to observe Spark's behavior and to validate that no 
conflicting types in the folder names caused Spark to fall back to StringType:
{code:java}
// Imports needed for this PoC:
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

SparkSession sparkSession = SparkSession.builder()
        .master("local[*]")
        .config("spark.sql.datetime.java8API.enabled", true)
        .config("spark.sql.session.timeZone", "UTC")
        .getOrCreate();

var inst = Instant.now();
System.out.println(inst);

StructType schema = new StructType(new StructField[]{
        DataTypes.createStructField("datacol", DataTypes.StringType, false),
        DataTypes.createStructField("partitioncol", DataTypes.TimestampType, false)
});

List<Row> data = new ArrayList<>();
data.add(RowFactory.create("test", inst));
Dataset<Row> testDataset = sparkSession.createDataFrame(data, schema);

testDataset.show(false);

// write out, partitioned on the timestamp column
testDataset.write().mode(SaveMode.Overwrite).partitionBy("partitioncol").parquet("C:\\Dev\\testdata");

// read the data in again and print the inferred schema
sparkSession.read().parquet("C:\\Dev\\testdata").printSchema();
{code}
The instant prints as `2022-11-23T15:59:01.122726300Z` yet the `show()` 
function prints
{noformat}
+-------+--------------------------+
|datacol|partitioncol              |
+-------+--------------------------+
|test   |2022-11-23 15:59:01.122726|
+-------+--------------------------+{noformat}
Sure enough, the schema shows that StringType was inferred:
{noformat}
root
 |-- datacol: string (nullable = true)
 |-- partitioncol: string (nullable = true){noformat}

Also note that we went from 9 digits to 6, i.e., from nanosecond to microsecond 
precision. This is likely due to Java 8, as explained in this SO post: 
[https://stackoverflow.com/a/30135772]. Nonetheless, Spark happily creates the 
partition folder with the name `partitioncol=2022-11-23%2015%3A59%3A01.122726`.
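
For reference, the value that `show()` prints matches the original Instant 
truncated to microseconds (sketch, reusing `inst` from the PoC above):
{code:java}
import java.time.temporal.ChronoUnit;

// The displayed value corresponds to the Instant truncated to microsecond precision.
System.out.println(inst.truncatedTo(ChronoUnit.MICROS)); // 2022-11-23T15:59:01.122726Z
{code}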

This raises the question: will Spark infer the type correctly given the folder 
name `partitioncol=2022-11-23%2015%3A59%3A01.122726` and the pattern 
`yyyy-MM-dd HH:mm:ss[.S]`? My guess is that it will not. While the Oracle docs 
mention no support for nanosecond or microsecond parsing in the date formatter 
([https://docs.oracle.com/javase/8/docs/api/java/text/SimpleDateFormat.html]), 
it appears that one can specify the `n` letter for nanosecond parsing, see 
[https://stackoverflow.com/a/41879425]
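
For what it is worth, a plain java.time check (this may not match Spark's 
internal TimestampFormatter exactly, so treat it as an illustration only) shows 
how the single-`S` fraction behaves compared to a formatter that accepts up to 
nine fraction digits:
{code:java}
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.time.format.DateTimeParseException;
import java.time.temporal.ChronoField;

public class FractionPatternCheck {
    public static void main(String[] args) {
        String value = "2022-11-23 15:59:01.122726";

        // The pattern mentioned above: a single S parses exactly one fraction
        // digit, so the remaining digits are left over and parsing fails.
        DateTimeFormatter singleS = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss[.S]");
        try {
            System.out.println(LocalDateTime.parse(value, singleS));
        } catch (DateTimeParseException e) {
            System.out.println("single S fails: " + e.getMessage());
        }

        // A formatter accepting 0 to 9 fraction digits parses the same value fine.
        DateTimeFormatter upToNanos = new DateTimeFormatterBuilder()
                .appendPattern("yyyy-MM-dd HH:mm:ss")
                .appendFraction(ChronoField.NANO_OF_SECOND, 0, 9, true)
                .toFormatter();
        System.out.println(LocalDateTime.parse(value, upToNanos));
    }
}
{code}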

I tried to build Spark locally and extend the 
`org.apache.spark.sql.execution.datasources.parquet.ParquetPartitionDiscoverySuite`
 tests, but I am getting exceptions when trying to build Spark. Having already 
spent some time on this, I am fairly sure this is a bug in Spark, but help or 
additional pointers are welcome.

 

I know that manually specifying a schema will resolve it, as per 
[https://github.com/apache/spark/blob/a80899f8bef84bb471606f0bbef889aa5e81628b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala#L312],
 but ideally we would like to leave this to Spark. A sketch of the workaround 
follows.
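
For completeness, a sketch of that workaround (reusing the column names and 
path from the PoC above): pass an explicit schema to the reader so the 
partition column type is not inferred from the folder name.
{code:java}
StructType explicitSchema = new StructType(new StructField[]{
        DataTypes.createStructField("datacol", DataTypes.StringType, true),
        DataTypes.createStructField("partitioncol", DataTypes.TimestampType, true)
});

// With a user-specified schema the partition column comes back as TimestampType
// instead of the inferred StringType.
sparkSession.read()
        .schema(explicitSchema)
        .parquet("C:\\Dev\\testdata")
        .printSchema();
{code}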

 


