[
https://issues.apache.org/jira/browse/SPARK-54601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated SPARK-54601:
-----------------------------------
Labels: pull-request-available (was: )
> Data loss in TimestampType when writing to MS SQL Server: Microsecond
> precision truncated to milliseconds
> ---------------------------------------------------------------------------------------------------------
>
> Key: SPARK-54601
> URL: https://issues.apache.org/jira/browse/SPARK-54601
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.5.0, 4.0.0
> Reporter: Vinod KC
> Priority: Major
> Labels: pull-request-available
>
> Spark silently loses microsecond precision when writing *timestamps* to MS
> SQL Server via JDBC, corrupting data.
> Spark maps TimestampType and TimestampNTZType to SQL Server's legacy DATETIME
> type, which supports only millisecond precision (3 fractional digits). The
> microsecond digits (fractional digits 4-6) are truncated with no warning.
> *Data loss example - complete round-trip*
>
> {code:java}
> import java.sql.Timestamp
> import spark.implicits._ // needed for toDF on a local Seq
>
> // Create data with Spark's full microsecond precision
> val data = Seq(
>   (1, Timestamp.valueOf("2024-12-03 14:25:37.789123")),
>   (2, Timestamp.valueOf("2024-12-03 14:25:37.789456"))
> ).toDF("id", "timestamp")
>
> // Original microseconds: 789123 and 789456 (Spark stores 6 fractional digits)
> data.collect().foreach(r =>
>   println(s"Microseconds: ${r.getAs[Timestamp]("timestamp").getNanos / 1000}")
> )
> // Output: Microseconds: 789123, Microseconds: 789456
>
> // Write to SQL Server (the column is created as DATETIME - only 3 fractional digits!)
> data.write.mode("overwrite").jdbc(jdbcUrl, "test_table", props)
>
> // Read back
> val result = spark.read.jdbc(jdbcUrl, "test_table", props)
> result.collect().foreach(r =>
>   println(s"Microseconds: ${r.getAs[Timestamp]("timestamp").getNanos / 1000}")
> )
> // Output: Microseconds: 789000, Microseconds: 789000  ❌ DATA LOST!
> // Lost: 123 and 456 microseconds respectively (fractional digits 4-6)
> // Both records now appear simultaneous - ordering is corrupted
> {code}
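>
> The truncation arithmetic alone reproduces the values observed above (a minimal
> sketch, independent of any database):
>
> {code:java}
> // Keep only the first 3 fractional digits, as a millisecond-precision column does
> val micros = Seq(789123L, 789456L)        // microsecond fields from the example
> val stored = micros.map(_ / 1000 * 1000)  // integer division drops digits 4-6
> // stored == Seq(789000L, 789000L): the two rows become indistinguishable
> {code}
>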
> *Root cause:*
> Spark currently maps TimestampType and TimestampNTZType to SQL Server's
> legacy DATETIME type, which supports only 3 fractional digits (millisecond
> precision). Any microsecond data beyond the third fractional digit is
> automatically truncated.
>
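> One possible workaround, shown here only as a sketch against Spark's public
> JdbcDialect developer API (the dialect name is illustrative, not a confirmed
> fix): register a custom dialect that overrides the type mapping so that new
> tables are created with DATETIME2 instead of DATETIME.
>
> {code:java}
> import java.sql.Types
> import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects, JdbcType}
> import org.apache.spark.sql.types.{DataType, TimestampNTZType, TimestampType}
>
> // Illustrative dialect: map Spark timestamps to DATETIME2 (100 ns precision)
> // instead of the legacy DATETIME (~3.33 ms) when writing to SQL Server.
> object MsSqlDatetime2Dialect extends JdbcDialect {
>   override def canHandle(url: String): Boolean =
>     url.toLowerCase.startsWith("jdbc:sqlserver")
>
>   override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
>     case TimestampType | TimestampNTZType =>
>       Some(JdbcType("DATETIME2", Types.TIMESTAMP))
>     case _ => None // defer all other types to the built-in dialect
>   }
> }
>
> // Register before writing; registered dialects are consulted ahead of built-ins.
> JdbcDialects.registerDialect(MsSqlDatetime2Dialect)
> data.write.mode("overwrite").jdbc(jdbcUrl, "test_table", props)
> {code}
>
> With this mapping in place, the round-trip above should retain all 6
> fractional digits, since DATETIME2 stores up to 100 ns resolution.
>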
> *References:*
> Microsoft Documentation: "Avoid using datetime for new work. Instead, use the
> datetime2 data type."
> [https://learn.microsoft.com/en-us/sql/t-sql/data-types/datetime-transact-sql]
>
> {code}
> DATETIME precision: rounded to increments of .000, .003, or .007 seconds (~3.33 ms)
> DATETIME2 precision: 100 nanoseconds (0.0000001 seconds)
> {code}