[ https://issues.apache.org/jira/browse/SPARK-32630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17178624#comment-17178624 ]
Simeon Simeonov commented on SPARK-32630: ----------------------------------------- [~rxin] fyi, one of the subtle issues that add friction to new users being safely productive on the platform. > Reduce user confusion and subtle bugs by optionally preventing date & > timestamp comparison > ------------------------------------------------------------------------------------------ > > Key: SPARK-32630 > URL: https://issues.apache.org/jira/browse/SPARK-32630 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 3.0.0 > Reporter: Simeon Simeonov > Priority: Major > Labels: comparison, sql, timestamps > > https://issues.apache.org/jira/browse/SPARK-23549 made Spark's handling of > date vs. timestamp comparison consistent with SQL, which, unfortunately, > isn't consistent with common sense. > When dates are compared with timestamps, they are promoted to timestamps at > midnight of the date, in the server timezone, which is almost always UTC. > This only works well if all timestamps in the data are logically time > instants as opposed to dates + times, which only become instants with a known > timezone. > The fundamental issue is that dates are a human time concept and instant are > a machine time concept. While we can technically promote one to the other, > logically, it only works 100% if midnight for all dates in the system is in > the server timezone. > Every major modern platform offers a clear distinction between machine time > (instants) and human time (an instant with a timezone, UTC offset, etc.), > because we have learned the hard way that date & time handling is a > never-ending source of confusion and bugs. SQL, being an ancient language > (40+ years old), is well behind software engineering best practices; using it > as a guiding light is necessary for Spark to win market share, but > unfortunate in every other way. > For example, Java has: > * java.time.LocalDate > * java.time.Instant > * java.time.ZonedDateTime > * java.time.OffsetDateTime > I am not suggesting we add new data types to Spark. I am suggesting we go to > the heart of the matter, which is that most date vs. time handling issues are > the result of confusion or carelessness. > What about introducing a new setting that makes comparisons between dates and > timestamps illegal, preferably with a helpful exception message? > If it existed, I would certainly make it the default for all our clusters. > The minor coding convenience that comes from being able to compare dates & > timestamps with an automatic type promotion pales in comparison with the risk > of subtle bugs that remain undetected for a long time. > -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org