Simeon Simeonov created SPARK-32630:
---------------------------------------

             Summary: Reduce user confusion and subtle bugs by optionally 
preventing date & timestamp comparison
                 Key: SPARK-32630
                 URL: https://issues.apache.org/jira/browse/SPARK-32630
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.0.0
            Reporter: Simeon Simeonov


https://issues.apache.org/jira/browse/SPARK-23549 made Spark's handling of date 
vs. timestamp comparison consistent with SQL, which, unfortunately, isn't 
consistent with common sense.

When a date is compared with a timestamp, the date is promoted to a timestamp 
at midnight of that date in the server timezone, which is almost always UTC. 
This only works well if every timestamp in the data is logically a time 
instant, as opposed to a date plus a time of day, which only becomes an instant 
once a timezone is known.

The fundamental issue is that dates are a human time concept and instants are a 
machine time concept. While we can technically promote one to the other, 
logically, the promotion is only fully correct if midnight for all dates in the 
system is midnight in the server timezone.
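
To make the problem concrete, here is a minimal sketch in plain java.time (not 
Spark code). The class name, the promote helper, and the sample values are made 
up for illustration; the helper only mimics "promote the date to midnight in a 
given zone". The same "on or after this date" check gives different answers 
depending on which zone's midnight is used.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneId;
import java.time.ZoneOffset;

class DatePromotionSketch {
    // Hypothetical helper: promote a date to the instant of midnight in `zone`.
    static Instant promote(LocalDate date, ZoneId zone) {
        return date.atStartOfDay(zone).toInstant();
    }

    public static void main(String[] args) {
        LocalDate date = LocalDate.of(2020, 8, 15);
        // An event recorded as an instant: 02:00 UTC, which is 19:00 on
        // August 14 for a user in Los Angeles.
        Instant event = Instant.parse("2020-08-15T02:00:00Z");

        Instant utcMidnight = promote(date, ZoneOffset.UTC);
        Instant laMidnight = promote(date, ZoneId.of("America/Los_Angeles"));

        // "Did the event happen on or after 2020-08-15?"
        System.out.println(!event.isBefore(utcMidnight)); // true
        System.out.println(!event.isBefore(laMidnight));  // false
    }
}
```

Neither answer is wrong in itself. The point is that the comparison silently 
picks one of them.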

Every major modern platform offers a clear distinction between machine time 
(instants) and human time (an instant with a timezone, UTC offset, etc.), 
because we have learned the hard way that date & time handling is a 
never-ending source of confusion and bugs. SQL, being an ancient language (40+ 
years old), is well behind software engineering best practices; using it as a 
guiding light is necessary for Spark to win market share, but unfortunate in 
every other way.

For example, Java has:
 * java.time.LocalDate
 * java.time.Instant
 * java.time.ZonedDateTime
 * java.time.OffsetDateTime
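
As a rough illustration of how these types relate, the sketch below starts 
from a single Instant and derives the human-time views from it. The class 
name, the dateIn helper, and the sample values are assumptions of mine; the 
java.time calls are standard. Note that going from an instant to a date 
always requires naming a zone explicitly.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.OffsetDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;

class TimeTypesSketch {
    // Hypothetical helper: the calendar date of an instant as seen in `zone`.
    static LocalDate dateIn(Instant instant, ZoneId zone) {
        return instant.atZone(zone).toLocalDate();
    }

    public static void main(String[] args) {
        Instant instant = Instant.parse("2020-08-15T02:00:00Z"); // machine time

        // Human time: the same instant viewed with a zone or a fixed offset.
        ZonedDateTime zoned = instant.atZone(ZoneId.of("America/Los_Angeles"));
        OffsetDateTime offset = zoned.toOffsetDateTime();
        System.out.println(zoned);
        System.out.println(offset);

        // The calendar date depends on the zone that was chosen.
        System.out.println(dateIn(instant, ZoneOffset.UTC));                    // 2020-08-15
        System.out.println(dateIn(instant, ZoneId.of("America/Los_Angeles"))); // 2020-08-14
    }
}
```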

I am not suggesting we add new data types to Spark. I am suggesting we go to 
the heart of the matter, which is that most date vs. time handling issues are 
the result of confusion or carelessness.

What about introducing a new setting that makes comparisons between dates and 
timestamps illegal, preferably with a helpful exception message?

If it existed, I would certainly make it the default for all our clusters. The 
minor coding convenience that comes from being able to compare dates & 
timestamps with an automatic type promotion pales in comparison with the risk 
of subtle bugs that remain undetected for a long time.
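
To sketch the behavior I have in mind, here is a toy model in plain Java. 
Nothing in it is a Spark API: the class, the method, the strict flag, and the 
exception message are all hypothetical stand-ins for the proposed setting. 
With the flag off, the date is promoted to midnight in the given zone, as 
described above; with the flag on, the comparison is rejected with a message 
that tells the user what to do instead.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneId;
import java.time.ZoneOffset;

class StrictComparisonSketch {
    // Hypothetical model of "date < timestamp" under the proposed setting.
    static boolean dateBeforeTimestamp(LocalDate date, Instant ts,
                                       ZoneId zone, boolean strict) {
        if (strict) {
            throw new IllegalArgumentException(
                "Cannot compare a date with a timestamp. Convert one side "
                + "explicitly and state the time zone you intend.");
        }
        // Lenient mode: implicit promotion to midnight in `zone`.
        return date.atStartOfDay(zone).toInstant().isBefore(ts);
    }

    public static void main(String[] args) {
        LocalDate date = LocalDate.of(2020, 8, 15);
        Instant ts = Instant.parse("2020-08-15T02:00:00Z");

        // Lenient: midnight UTC is before 02:00 UTC.
        System.out.println(dateBeforeTimestamp(date, ts, ZoneOffset.UTC, false)); // true

        try {
            dateBeforeTimestamp(date, ts, ZoneOffset.UTC, true);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```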
