[
https://issues.apache.org/jira/browse/DRILL-8101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
James Turton updated DRILL-8101:
--------------------------------
Fix Version/s: (was: 1.21.0)
> Resolve the TIMESTAMP madness
> -----------------------------
>
> Key: DRILL-8101
> URL: https://issues.apache.org/jira/browse/DRILL-8101
> Project: Apache Drill
> Issue Type: Bug
> Affects Versions: 1.19.0
> Reporter: Paul Rogers
> Priority: Major
>
> Drill's TIMESAMP type tries to be two different things at the same time,
> causing incorrect results when the two interpretations collide.
> Drill has the classic DATE and TIME data types. A DATE is just that: a day
> wherever you happen to be. Your birthday goes from midnight to midnight in
> the time zone where you find yourself. If you happen to travel around the
> world, you can make your birthday last almost 48 hours as midnight of your
> birthday starts at the international date line, circles the globe, followed
> by the midnight of the next day.
> Similarly, a time is a time where you are. 12:00PM is noon (more-or-less) as
> determined by the sun. 12:00PM occurs once in every time zone every day.
> Since there are many time zones, there are many noons each day.
> These are both examples of local time. Most databases combine these two ideas
> to get a DATETIME: a date and time wherever you are.
> In our modern world, knowing something occurred on 2022-01-02 12:00:00 is not
> good enough. Did it occur at that time in my time zone or yours? If the event
> is a user login, or a network breach, then it occurred once, at a specific
> time, it did not occur many times: once in each time zone. Hence, machines
> often use UTC time to coordinate.
> Unix-like systems also define the idea of a "timestamp", the number of
> seconds (or milliseconds or nanoseconds) since 1970-01-01 00:00:00. This is
> the time reported by Java in the {{System.currentTime()}} function. It is the
> time most often found in machine-generated logs. It may be as a number (ms
> since the epoch) or as an ISO-formatted string.
> Thus, users of Drill would expect to find a "timestamp" type that represents
> a UTC timestamp in Unix format. The will be disappointed, however.
> Drill's TIMESTAMP type is essentially a DATETIME type: it is a date/time in
> an unspecified timezone and that zone can be whatever you want it to be. UTC?
> Fine. Local? OK. Nairobi? Sure, why not.
> This works fine as long as _all_ your data is in the same time zone, and you
> don't need a concept of "now". As described in DRILL-8099 and DRILL-8100,
> this is how the authors of CTAS thought of it: read Parquet data straight
> into Drill with no conversion, then write it back out to JSON with no
> conversion. Both work with UTC, so the result is fine: who cares that the
> 32-bit number, when in Drill, had no implied time zone? It is just a number
> we read then write. All good.
> It is even possible to compute the difference of two DATETIMEs with
> unspecified time zone: that's what an INTERVAL does. As long as the times are
> actually in the same zone (UTC, say, or local, or Nairobi), then all is fine.
> Everything collapses, however, when someone wants to know, "but how long ago
> was that event"? "Long enough ago that I need to raise the escalation level?"
> Drill has the INTERVAL type to give us the difference, but how do I get
> "now"? Drill has {{CURRENT_TIMESTAMP}}. But, how we have a problem, what
> timezone is that time in? UTC? My local timezone? Nairobi? And, what if my
> data is UTC but {{CURRENT_TIMESTAMP}} is local? Or visa-versa? The whole
> house of cards comes crashing down.
> Over the years, this bug has appeared again and again. Sometimes people
> change the logic to assume TIMESTAMP is UTC. Sometimes things are changed to
> assume TIMESTAMP is local time (I've been guilty of this). Sometimes we just
> punt, and require that the machine (or test) run only in UTC, since that's
> the only place the two systems coincide.
> But, in fact, I believe that the original designers of Drill meant TIMESTAMP
> to have _no_ timezone: two TIMESTAMP values could be in entirely different
> (unknown) timezones! One can see vestiges of this in the value vector code.
> It seems the original engineers imagined a "TIMESTAMP_WITH_ZONE" type,
> similar to Java's (or Joda's) {{ZonedDateTime}} type. Other bits of code
> (Parquet) refers to a never-built "TIMESTAMPZ" type for a UTC timestamp. When
> faced with the {{CURRENT_TIMESTAMP}} issue, fixes started down the path of
> saying that TIMESTAMP is local time, but this is probably a misunderstanding
> of the original design, forced upon us by the gaps in that original design.
> Further, each time we make a change (such as DRILL-8099 and DRILL-8100), we
> change behavior, potentially breaking a kludge that someone found to
> kinda-sorta make things work.
> Since computers can't deal with ambiguity the way humans can, we need a
> solution. It is not good enough for you to think "TIMESTAMP is UTC" and me to
> think "TIMESTAMP is local" and for Bob to think "TIMESTAMP is Java's
> {{LocalDateTime}}, it has no zone." The software needs to work one way.
> (Software is a pain that way.)
> One way to solve the issue is to introduce two new types: DATETIME and
> UTC_TIMESTAMP. DATETIME has no timezone. It is not local time: it is an
> unknown, unspecified timezone, as in all other SQL tools. It is noon on your
> birthday, wherever you happen to be. UTC_TIMESTAMP is a Unix-like timestamp.
> We then need functions to convert. Here, we can just follow Java's
> {{java.time}} package: to compare a UTC_TIMESTAMP with a DATETIME, you have
> to specify a timezone.
> This will leave the current TIMESTAMP to settle down into a LOCAL_TIMESTAMP,
> which will turn out to be useless and will (we can hope) fade away.
> All readers and writers that work with data that is known to be UTC (such as
> Parquet) will use UTC_TIMESTAMP. Where the zone is not known (JDBC DATETIME),
> then the data type will be DATETIME.
> The functions that deal with the current time need new versions: "NOW" for
> "CURRENT_UTC_TIMESTAMP". The "AGE" function would use a UTC baseline for
> UTC_TIMESTAMP, perhaps local time for a DATETIME argument.
> A design should spell out the details. Since time handing is HARD, the design
> should lean heavily on the {{java.time}} package and just follow the enormous
> work that went into that package.
> After years of fighting this beast, I've come to the conclusion that only the
> two new types (and the deprecation and gradual fading away of the "local"
> TIMESTAMP type) can finally resolve our never-ending stream of bugs in this
> area.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)