Yeah, I don't see why this needs to be a per-table config. If the user wants to configure it per table, can't they just declare the data type on a per-table basis, once we have separate types for timestamp w/ tz and w/o tz?
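E.g., once both types exist, something like this would pin the semantics per column (a sketch only -- these type spellings are not in Spark today):

  spark.sql("""
    CREATE TABLE events (
      id            BIGINT,
      created_local TIMESTAMP WITHOUT TIME ZONE, -- wall-clock reading, same in every zone
      created_inst  TIMESTAMP WITH TIME ZONE     -- an instant, rendered in the session zone
    )
  """)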
On Thu, Jun 1, 2017 at 4:14 PM, Michael Allman <mich...@videoamp.com> wrote:

> I would suggest that making timestamp type behavior configurable and persisted per-table could introduce some real confusion, e.g. in queries involving tables with different timestamp type semantics.
>
> I suggest starting with the assumption that timestamp type behavior is a per-session flag that can be set in a global `spark-defaults.conf`, and consider more granular levels of configuration as people identify solid use cases.
>
> Cheers,
>
> Michael
>
>
> On May 30, 2017, at 7:41 AM, Zoltan Ivanfi <z...@cloudera.com> wrote:
>
> Hi,
>
> If I remember correctly, the TIMESTAMP type had UTC-normalized local time semantics even before Spark 2, so I can understand that Spark considers it to be the "established" behavior that must not be broken. Unfortunately, this behavior does not provide interoperability with other SQL engines of the Hadoop stack.
>
> Let me summarize the findings of this e-mail thread so far:
>
> - Timezone-agnostic TIMESTAMP semantics would be beneficial for interoperability and SQL compliance.
> - Spark cannot make a breaking change. For backward compatibility with existing data, timestamp semantics should be user-configurable on a per-table level.
>
> Before going into the specifics of a possible solution, do we all agree on these points?
>
> Thanks,
>
> Zoltan
>
> On Sat, May 27, 2017 at 8:57 PM Imran Rashid <iras...@cloudera.com> wrote:
>
>> I had asked Zoltan to bring this discussion to the dev list because I think it's a question that extends beyond a single JIRA (we can't figure out the semantics of timestamp in Parquet if we don't know the overall goal of the timestamp type), and since it's a design question the entire community should be involved.
>>
>> I think that a lot of the confusion comes because we're talking about two different ways time zones affect behavior: (1) parsing, and (2) behavior when changing time zones for processing data.
>>
>> It seems we agree that Spark should eventually provide a timestamp type which does conform to the standard. The question is, how do we get there? Has Spark already broken compliance so much that it's impossible to go back without breaking user behavior? Or perhaps Spark already has inconsistent behavior / broken compatibility within the 2.x line, so it's not unthinkable to have another breaking change?
>>
>> (Another part of the confusion is on me -- I believed the behavior change was in 2.2, but actually it looks like it's in 2.0.1. That changes how we think about this in the context of what goes into a 2.2 release. SPARK-18350 isn't the origin of the difference in behavior.)
>>
>> First: consider processing data that is already stored in tables and then accessing it from machines in different time zones. The standard is clear that "timestamp" should be just like "timestamp without time zone": it does not represent one instant in time; rather, it is always displayed the same, regardless of time zone. This was the behavior in Spark 2.0.0 (and 1.6) for Hive tables stored as text files, and for Spark's JSON format.
>>
>> Spark 2.0.1 changed the behavior of the JSON format (I believe with SPARK-16216) so that it behaves more like timestamp *with* time zone. It also made CSV behave the same (timestamp in CSV was basically broken in 2.0.0). However, it did *not* change the behavior of a Hive textfile; that still behaves like "timestamp with*out* time zone".
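>>
>> As a minimal sketch of the kind of difference I mean (illustrative paths and values only; the actual experiments are in the gist below):
>>
>>   // write a timestamp from a spark-shell session in one time zone, e.g.
>>   //   spark-shell --driver-java-options "-Duser.timezone=America/Los_Angeles"
>>   Seq("2016-01-01 00:00:00").toDF("ts_str")
>>     .selectExpr("cast(ts_str as timestamp) as ts")
>>     .write.json("/tmp/ts_json")
>>
>>   // ...then read it back from a JVM in a different time zone, e.g.
>>   //   spark-shell --driver-java-options "-Duser.timezone=UTC"
>>   spark.read.json("/tmp/ts_json").show()
>>
>>   // "without time zone" semantics: the same wall-clock string in both sessions
>>   // "with time zone" semantics (JSON in 2.0.1+): the displayed time shifts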
>> Here are some experiments I tried -- there are a bunch of files there for completeness, but mostly focus on the difference between query_output_2_0_0.txt vs. query_output_2_0_1.txt:
>>
>> https://gist.github.com/squito/f348508ca7903ec2e1a64f4233e7aa70
>>
>> Given that Spark has changed this behavior post 2.0.0, is it still out of the question to change this behavior to bring it back in line with the SQL standard for timestamp (without time zone) in the 2.x line? Or, as Reynold proposes, is the only option at this point to add an off-by-default feature flag to get "timestamp without time zone" semantics?
>>
>> Second, there is the question of parsing strings into the timestamp type. I'm far less knowledgeable about this, so I mostly just have questions:
>>
>> * does the standard dictate what the parsing behavior should be for timestamp (without time zone) when a time zone is present?
>>
>> * if it does, and Spark violates this standard, is it worth trying to retain the *other* semantics of timestamp without time zone, even if we violate the parsing part?
>>
>> I did look at what Postgres does for comparison:
>>
>> https://gist.github.com/squito/cb81a1bb07e8f67e9d27eaef44cc522c
>>
>> Spark's timestamp certainly does not match Postgres's timestamp for parsing; it seems closer to Postgres's "timestamp with timezone" -- though I don't know if that is standard behavior at all.
>>
>> thanks,
>> Imran
>>
>> On Fri, May 26, 2017 at 1:27 AM, Reynold Xin <r...@databricks.com> wrote:
>>
>>> That's just my point 4, isn't it?
>>>
>>> On Fri, May 26, 2017 at 1:07 AM, Ofir Manor <ofir.ma...@equalum.io> wrote:
>>>
>>>> Reynold,
>>>> my point is that Spark should aim to follow the SQL standard instead of rolling its own type system.
>>>> If I understand correctly, the existing implementation is similar to the TIMESTAMP WITH LOCAL TIMEZONE data type in Oracle. In addition, there are the standard TIMESTAMP and TIMESTAMP WITH TIMEZONE data types, which are missing from Spark.
>>>> So, it is better (for me) if, instead of extending the existing types, Spark would just implement the additional well-defined types properly. Just trying to copy-paste CREATE TABLE between SQL engines should not be an exercise in flags and incompatibilities.
>>>>
>>>> Regarding the current behaviour: if I remember correctly, I had to force our Spark O/S user into UTC so Spark won't change my timestamps.
>>>>
>>>> Ofir Manor
>>>> Co-Founder & CTO | Equalum
>>>> Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io
>>>>
>>>> On Thu, May 25, 2017 at 1:33 PM, Reynold Xin <r...@databricks.com> wrote:
>>>>
>>>>> Zoltan,
>>>>>
>>>>> Thanks for raising this again, although I'm a bit confused, since I've communicated with you a few times on JIRA and in private emails to explain that you have some misunderstanding of the timestamp type in Spark and some of your statements are wrong (e.g. the "except text file" part). Not sure why you didn't get any of those.
>>>>>
>>>>> Here's another try:
>>>>>
>>>>> 1. I think you guys misunderstood the semantics of timestamp in Spark before the session local timezone change. IIUC, Spark has always assumed timestamps to be with timezone, since it parses timestamps with timezone and does all the datetime conversions with timezone in mind (it doesn't ignore the timezone if a timestamp string has a timezone specified).
>>>>> The session local timezone change further pushes Spark in that direction, but the semantics were already "with timezone" before that change. Just run Spark on machines with different timezones and you will know what I'm talking about.
>>>>>
>>>>> 2. CSV/Text is not different. The data type has always been "with timezone". If you put a timezone in the timestamp string, it parses the timezone.
>>>>>
>>>>> 3. We can't change the semantics now, because it'd break all existing Spark apps.
>>>>>
>>>>> 4. We can, however, introduce a new timestamp without timezone type, and have a config flag to specify which one (with tz or without tz) is the default behavior.
>>>>>
>>>>> On Wed, May 24, 2017 at 5:46 PM, Zoltan Ivanfi <z...@cloudera.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Sorry if you receive this mail twice; it seems that my first attempt did not make it to the list for some reason.
>>>>>>
>>>>>> I would like to start a discussion about SPARK-18350 <https://issues.apache.org/jira/browse/SPARK-18350> before it gets released, because it seems to be going in a different direction than what other SQL engines of the Hadoop stack do.
>>>>>>
>>>>>> ANSI SQL defines the TIMESTAMP type (also known as TIMESTAMP WITHOUT TIME ZONE) to have timezone-agnostic semantics - basically a type that expresses readings from calendars and clocks and is unaffected by time zone. In the Hadoop stack, Impala has always worked like this, and recently Presto also took steps <https://github.com/prestodb/presto/issues/7122> to become standards compliant. (Presto's design doc <https://docs.google.com/document/d/1UUDktZDx8fGwHZV4VyaEDQURorFbbg6ioeZ5KMHwoCk/edit> also contains a great summary of the different semantics.) Hive has a timezone-agnostic TIMESTAMP type as well (except for Parquet, a major source of incompatibility that is already being addressed <https://issues.apache.org/jira/browse/HIVE-12767>). A TIMESTAMP in SparkSQL, however, has UTC-normalized local time semantics (except for textfile), which is generally the semantics of the TIMESTAMP WITH TIME ZONE type.
>>>>>>
>>>>>> Given that timezone-agnostic TIMESTAMP semantics provide standards compliance and consistency with most SQL engines, I was wondering whether SparkSQL should also consider them in order to become ANSI SQL compliant and interoperable with other SQL engines of the Hadoop stack. Should SparkSQL adopt these semantics in the future, SPARK-18350 <https://issues.apache.org/jira/browse/SPARK-18350> may turn out to be a source of problems. Please correct me if I'm wrong, but this change seems to explicitly assign TIMESTAMP WITH TIME ZONE semantics to the TIMESTAMP type. I think SPARK-18350 would be a great feature for a separate TIMESTAMP WITH TIME ZONE type, but the plain unqualified TIMESTAMP type would be better off becoming timezone-agnostic instead of gaining further timezone-aware capabilities. (Of course, becoming timezone-agnostic would be a behavior change, so it must be optional and configurable by the user, as in Presto.)
>>>>>>
>>>>>> I would like to hear your opinions about this concern and about TIMESTAMP semantics in general. Does the community agree that a standards-compliant and interoperable TIMESTAMP type is desired?
>>>>>> Do you perceive SPARK-18350 as a potential problem in achieving this, or do I misunderstand the effects of this change?
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Zoltan
>>>>>>
>>>>>> ---
>>>>>>
>>>>>> List of links in case in-line links do not work:
>>>>>>
>>>>>> - SPARK-18350: https://issues.apache.org/jira/browse/SPARK-18350
>>>>>> - Presto's change: https://github.com/prestodb/presto/issues/7122
>>>>>> - Presto's design doc: https://docs.google.com/document/d/1UUDktZDx8fGwHZV4VyaEDQURorFbbg6ioeZ5KMHwoCk/edit
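As a sketch of what Reynold's point 4 (and Michael's per-session flag) could look like in practice -- the config key spark.sql.timestampType and the type spelling below are hypothetical, not existing Spark settings:

  // In spark-defaults.conf (hypothetical key, set once globally):
  //   spark.sql.timestampType  TIMESTAMP_WITHOUT_TIME_ZONE
  // or per session:
  spark.conf.set("spark.sql.timestampType", "TIMESTAMP_WITHOUT_TIME_ZONE")  // hypothetical

  // With the flag set, an unqualified TIMESTAMP in DDL would resolve to the
  // timezone-agnostic type; without it, existing apps keep today's with-tz behavior.
  spark.sql("CREATE TABLE t (ts TIMESTAMP)")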