Hi,

Sorry if you receive this mail twice; it seems that my first attempt did
not make it to the list for some reason.

I would like to start a discussion about SPARK-18350
<https://issues.apache.org/jira/browse/SPARK-18350> before it gets released
because it seems to be going in a different direction than what other SQL
engines of the Hadoop stack do.

ANSI SQL defines the TIMESTAMP type (also known as TIMESTAMP WITHOUT TIME
ZONE) to have timezone-agnostic semantics - basically a type that expresses
readings from calendars and clocks and is unaffected by time zone. In the
Hadoop stack, Impala has always worked like this and recently Presto also
took steps <https://github.com/prestodb/presto/issues/7122> to become
standards compliant. (Presto's design doc
<https://docs.google.com/document/d/1UUDktZDx8fGwHZV4VyaEDQURorFbbg6ioeZ5KMHwoCk/edit>
also contains a great summary of the different semantics.) Hive has a
timezone-agnostic TIMESTAMP type as well (except for Parquet, a major
source of incompatibility that is already being addressed
<https://issues.apache.org/jira/browse/HIVE-12767>). A TIMESTAMP in
SparkSQL, however, has UTC-normalized local time semantics (except for
textfile), which is generally the semantics of the TIMESTAMP WITH TIME ZONE
type.
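
To make the difference concrete, here is a minimal Python sketch of the two
semantics. It is only an illustration, not how any of these engines is
implemented: the function names, the sample value, and the two fixed UTC
offsets (standing in for a writer and a reader session time zone) are all
assumptions made up for this example.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical session time zones, modeled as fixed UTC offsets so the
# sketch needs no time zone database.
WRITER_TZ = timezone(timedelta(hours=-8))  # e.g. a writer at UTC-8
READER_TZ = timezone(timedelta(hours=1))   # e.g. a reader at UTC+1


def read_timezone_agnostic(wall_clock: datetime) -> datetime:
    """TIMESTAMP WITHOUT TIME ZONE: the calendar/clock reading is stored
    and returned unchanged, regardless of who reads it."""
    return wall_clock


def read_utc_normalized(wall_clock: datetime,
                        writer_tz: timezone,
                        reader_tz: timezone) -> datetime:
    """UTC-normalized semantics: the writer's reading is interpreted in the
    writer's zone, stored as an instant in UTC, and rendered in the
    reader's zone."""
    stored_instant = wall_clock.replace(tzinfo=writer_tz).astimezone(timezone.utc)
    return stored_instant.astimezone(reader_tz).replace(tzinfo=None)


# The writer inserts the literal '2017-01-15 10:00:00'.
written = datetime(2017, 1, 15, 10, 0, 0)

agnostic = read_timezone_agnostic(written)
normalized = read_utc_normalized(written, WRITER_TZ, READER_TZ)

print("timezone-agnostic:", agnostic)    # same reading the writer inserted
print("UTC-normalized:   ", normalized)  # shifted by the 9-hour offset gap
```

With timezone-agnostic semantics both users see the same reading; with
UTC-normalized semantics the displayed value depends on the reader's session
time zone.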

Given that timezone-agnostic TIMESTAMP semantics provide standards
compliance and consistency with most SQL engines, I was wondering whether
SparkSQL should also consider it in order to become ANSI SQL compliant and
interoperable with other SQL engines of the Hadoop stack. Should SparkSQL
adopt these semantics in the future, SPARK-18350
<https://issues.apache.org/jira/browse/SPARK-18350> may turn out to be a
source of problems. Please correct me if I'm wrong, but this change seems
to explicitly assign TIMESTAMP WITH TIME ZONE semantics to the TIMESTAMP
type. I think SPARK-18350 would be a great feature for a separate TIMESTAMP
WITH TIME ZONE type, but the plain, unqualified TIMESTAMP type would be
better off becoming timezone-agnostic instead of gaining further timezone-aware
capabilities. (Of course becoming timezone-agnostic would be a behavior
change, so it must be optional and configurable by the user, as in Presto.)

I would like to hear your opinions about this concern and about TIMESTAMP
semantics in general. Does the community agree that a standards-compliant
and interoperable TIMESTAMP type is desired? Do you perceive SPARK-18350 as
a potential problem in achieving this or do I misunderstand the effects of
this change?

Thanks,

Zoltan

---

List of links in case in-line links do not work:

   - SPARK-18350: https://issues.apache.org/jira/browse/SPARK-18350
   - Presto's change: https://github.com/prestodb/presto/issues/7122
   - Presto's design doc:
     https://docs.google.com/document/d/1UUDktZDx8fGwHZV4VyaEDQURorFbbg6ioeZ5KMHwoCk/edit
