Hi, I would like to hear the community's opinion about the short-term solution I described in my previous mail. Do you agree with it or are there any objections? If it sounds good to everyone, I will prepare a more detailed design doc describing the suggested behavior.

Thanks,

Zoltan

On Tue, Jun 6, 2017 at 10:34 PM Zoltan Ivanfi <z...@cloudera.com> wrote:

> Hi Michael,
>
> To answer this I think we should distinguish between the long-term fix and the short-term fix.
>
> If I understand the replies correctly, everyone agrees that the desired long-term fix is to have two separate SQL types (TIMESTAMP [WITH|WITHOUT] TIME ZONE). With separate types, mixing them as you described cannot happen (unless a new feature intentionally allows it). Of course, conversions are still needed, but there are many examples from different database systems that we can follow.
>
> Since having two separate types is a huge effort, as a short-term solution I would suggest allowing the single existing TIMESTAMP type to carry either semantics, configurable per table. The implementation of timezone-agnostic semantics could be similar to Hive's. In Hive, just like in Spark, a timestamp is UTC-normalized internally but is shown as local time when it gets displayed. To achieve timezone-agnostic behavior, Hive still uses UTC-based timestamps in memory and adjusts on-disk data to/from this internal representation as needed. When the on-disk data is UTC-normalized as well, it matches the internal representation, so the on-disk value directly corresponds to the UTC instant of the in-memory representation.
>
> When the on-disk data is supposed to have timezone-agnostic semantics, the on-disk value is made to match the local-time value of the in-memory timestamp, so the value that ultimately gets displayed to the user has timezone-agnostic semantics (although the corresponding UTC value will differ depending on the local time zone). So instead of implementing a separate in-memory representation for timezone-agnostic timestamps, the desired on-disk semantics are simulated on top of the existing representation, and timestamps are adjusted during reading/writing as needed.
>
> Implementing this workaround takes a lot less effort and simplifies some scenarios as well. For example, the situation you described (a union of two queries returning timestamps of different semantics) does not have to be handled explicitly, since the in-memory representations are the same, including their interpretation. Semantics only matter when reading/writing timestamps from/to disk.
>
> A disadvantage of this workaround is that it is not perfect. In most time zones, one hour is skipped by the DST change every year. Timezone-agnostic timestamps from that single hour cannot be emulated this way, because they are invalid in the local time zone, so there is no UTC instant that would ultimately get displayed as the desired timestamp. But that only affects ~0.01% of all timestamps, and adopting this workaround would allow interoperability with 99.99% of the timezone-agnostic timestamps written by Impala and Hive, instead of the current situation in which 0% of these timestamps are interpreted correctly.
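>
> To make this concrete, here is a minimal sketch of the two adjustments in Scala using java.time. This is purely illustrative -- not Hive's or Spark's actual code, just the shape of the conversion:
>
>   import java.time.{Instant, LocalDateTime, ZoneId, ZoneOffset}
>
>   // In memory: a UTC-normalized instant (as in Spark and Hive today).
>   // On disk with timezone-agnostic semantics: the local wall-clock
>   // fields, stored as if they were UTC.
>
>   // Read path: reinterpret the stored wall-clock value in the local
>   // zone, so that displaying the resulting instant locally reproduces
>   // the original wall-clock value.
>   def readAgnostic(onDisk: Instant, zone: ZoneId): Instant = {
>     val wallClock = LocalDateTime.ofInstant(onDisk, ZoneOffset.UTC)
>     // Wall-clock values from the skipped DST hour do not exist in the
>     // local zone and get shifted here -- the ~0.01% imperfection above.
>     wallClock.atZone(zone).toInstant
>   }
>
>   // Write path: the exact inverse of the read path.
>   def writeAgnostic(inMemory: Instant, zone: ZoneId): Instant =
>     LocalDateTime.ofInstant(inMemory, zone).toInstant(ZoneOffset.UTC)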
>
> Please let me know if some parts of my description were unclear and I will gladly elaborate on them.
>
> Thanks,
>
> Zoltan
>
> On Fri, Jun 2, 2017 at 9:41 PM Michael Allman <mich...@videoamp.com> wrote:
>
>> Hi Zoltan,
>>
>> I don't fully understand your proposal for table-specific timestamp type semantics. I think it will be helpful to everyone in this conversation if you can identify the expected behavior for a few concrete scenarios.
>>
>> Suppose we have a Hive metastore table hivelogs with a column named hts with the Hive timestamp type as described here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-timestamp. This table was created by Hive and is usually accessed through Hive or Presto.
>>
>> Suppose again we have a Hive metastore table sparklogs with a column named sts with the Spark SQL timestamp type as described here: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.types.TimestampType$. This table was created by Spark SQL and is usually accessed through Spark SQL.
>>
>> Let's say Spark SQL sets and reads a table property called timestamp_interp to determine timestamp type semantics for that table. Consider a dataframe df defined by sql("SELECT sts as ts FROM sparklogs UNION ALL SELECT hts as ts FROM hivelogs"). Suppose the timestamp_interp table property is absent from hivelogs. For each possible value of timestamp_interp set on the table sparklogs,
>>
>> 1. does df successfully pass analysis (i.e. is it a valid query)?
>> 2. if it's a valid dataframe, what is the type of the ts column?
>> 3. if it's a valid dataframe, what are the semantics of the type of the ts column?
>>
>> Suppose further that Spark SQL sets the timestamp_interp on hivelogs. Can you answer the same three questions for each combination of timestamp_interp on hivelogs and sparklogs?
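>>
>> To be fully explicit, in code form the scenario is something like this (the property value below is a made-up placeholder, since the set of legal values is exactly what I'm asking about):
>>
>>   // hypothetical: mark sparklogs timestamps as timezone-agnostic
>>   spark.sql("ALTER TABLE sparklogs SET TBLPROPERTIES ('timestamp_interp' = 'TIMEZONE_AGNOSTIC')")
>>   // first case: timestamp_interp is absent from hivelogs
>>   val df = spark.sql("SELECT sts as ts FROM sparklogs UNION ALL SELECT hts as ts FROM hivelogs")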
>>
>> Thank you.
>>
>> Michael
>>
>> On Jun 2, 2017, at 8:33 AM, Zoltan Ivanfi <z...@cloudera.com> wrote:
>>
>> Hi,
>>
>> We would like to solve the problem of interoperability of existing data, and that is the main use case for having table-level control. Spark should be able to read timestamps written by Impala or Hive and at the same time read back its own data. These have different semantics, so having a single flag is not enough.
>>
>> Two separate types will indeed solve this problem, but only once every component involved supports them. Unfortunately, adding these separate SQL types is a larger effort that is only feasible in the long term, and we would like to provide a short-term solution for interoperability in the meantime.
>>
>> Br,
>>
>> Zoltan
>>
>> On Fri, Jun 2, 2017 at 1:32 AM Reynold Xin <r...@databricks.com> wrote:
>>
>>> Yea I don't see why this needs to be a per-table config. If the user wants to configure it per table, can't they just declare the data type on a per-table basis, once we have separate types for timestamp w/ tz and w/o tz?
>>>
>>> On Thu, Jun 1, 2017 at 4:14 PM, Michael Allman <mich...@videoamp.com> wrote:
>>>
>>>> I would suggest that making timestamp type behavior configurable and persisted per-table could introduce some real confusion, e.g. in queries involving tables with different timestamp type semantics.
>>>>
>>>> I suggest starting with the assumption that timestamp type behavior is a per-session flag that can be set in a global `spark-defaults.conf`, and considering more granular levels of configuration as people identify solid use cases.
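>>>>
>>>> For illustration, the kind of configuration I have in mind (the flag name and its values are hypothetical):
>>>>
>>>>   # spark-defaults.conf
>>>>   # hypothetical flag: timezone_agnostic vs. utc_normalized
>>>>   spark.sql.timestampSemantics  timezone_agnostic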
>>>>
>>>> Cheers,
>>>>
>>>> Michael
>>>>
>>>> On May 30, 2017, at 7:41 AM, Zoltan Ivanfi <z...@cloudera.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> If I remember correctly, the TIMESTAMP type had UTC-normalized local time semantics even before Spark 2, so I can understand that Spark considers it to be the "established" behavior that must not be broken. Unfortunately, this behavior does not provide interoperability with other SQL engines of the Hadoop stack.
>>>>
>>>> Let me summarize the findings of this e-mail thread so far:
>>>>
>>>> - Timezone-agnostic TIMESTAMP semantics would be beneficial for interoperability and SQL compliance.
>>>> - Spark cannot make a breaking change. For backward-compatibility with existing data, timestamp semantics should be user-configurable on a per-table level.
>>>>
>>>> Before going into the specifics of a possible solution, do we all agree on these points?
>>>>
>>>> Thanks,
>>>>
>>>> Zoltan
>>>>
>>>> On Sat, May 27, 2017 at 8:57 PM Imran Rashid <iras...@cloudera.com> wrote:
>>>>
>>>>> I had asked zoltan to bring this discussion to the dev list because I think it's a question that extends beyond a single jira (we can't figure out the semantics of timestamp in parquet if we don't know the overall goal of the timestamp type) and since it's a design question the entire community should be involved.
>>>>>
>>>>> I think that a lot of the confusion comes because we're talking about different ways time zones affect behavior: (1) parsing and (2) behavior when changing time zones for processing data.
>>>>>
>>>>> It seems we agree that spark should eventually provide a timestamp type which does conform to the standard. The question is, how do we get there? Has spark already broken compliance so much that it's impossible to go back without breaking user behavior? Or perhaps spark already has inconsistent behavior / broken compatibility within the 2.x line, so it's not unthinkable to have another breaking change?
>>>>>
>>>>> (Another part of the confusion is on me -- I believed the behavior change was in 2.2, but actually it looks like it's in 2.0.1. That changes how we think about this in the context of what goes into a 2.2 release. SPARK-18350 isn't the origin of the difference in behavior.)
>>>>>
>>>>> First: consider processing data that is already stored in tables, and then accessing it from machines in different time zones. The standard is clear that "timestamp" should be just like "timestamp without time zone": it does not represent one instant in time, rather it's always displayed the same, regardless of time zone. This was the behavior in spark 2.0.0 (and 1.6), for hive tables stored as text files, and for spark's json formats.
>>>>>
>>>>> Spark 2.0.1 changed the behavior of the json format (I believe with SPARK-16216), so that it behaves more like timestamp *with* time zone. It also made csv behave the same (timestamp in csv was basically broken in 2.0.0). However it did *not* change the behavior of a hive textfile; that still behaves like "timestamp with*out* time zone". Here are some experiments I tried -- there are a bunch of files there for completeness, but mostly focus on the difference between query_output_2_0_0.txt vs. query_output_2_0_1.txt:
>>>>>
>>>>> https://gist.github.com/squito/f348508ca7903ec2e1a64f4233e7aa70
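>>>>>
>>>>> (If you want to try something similar locally -- the path and the flag values here are just placeholders -- run the same read with the JVM pinned to two different zones and compare the output:
>>>>>
>>>>>   spark-shell --conf spark.driver.extraJavaOptions=-Duser.timezone=UTC
>>>>>   spark-shell --conf spark.driver.extraJavaOptions=-Duser.timezone=America/Los_Angeles
>>>>>
>>>>> then in each shell run spark.read.json("/tmp/ts.json").show(). Under "timestamp without time zone" semantics the displayed value is identical in both shells; under "timestamp with time zone" semantics it shifts by the zone offset.)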
>>>>>
>>>>> Given that spark has changed this behavior post 2.0.0, is it still out of the question to change this behavior to bring it back in line with the sql standard for timestamp (without time zone) in the 2.x line? Or, as reynold proposes, is the only option at this point to add an off-by-default feature flag to get "timestamp without time zone" semantics?
>>>>>
>>>>> Second, there is the question of parsing strings into the timestamp type. I'm far less knowledgeable about this, so I mostly just have questions:
>>>>>
>>>>> * does the standard dictate what the parsing behavior should be for timestamp (without time zone) when a time zone is present?
>>>>>
>>>>> * if it does, and spark violates this standard, is it worth trying to retain the *other* semantics of timestamp without time zone, even if we violate the parsing part?
>>>>>
>>>>> I did look at what postgres does for comparison:
>>>>>
>>>>> https://gist.github.com/squito/cb81a1bb07e8f67e9d27eaef44cc522c
>>>>>
>>>>> spark's timestamp certainly does not match postgres's timestamp for parsing; it seems closer to postgres's "timestamp with timezone" -- though I dunno if that is standard behavior at all.
>>>>>
>>>>> thanks,
>>>>> Imran
>>>>>
>>>>> On Fri, May 26, 2017 at 1:27 AM, Reynold Xin <r...@databricks.com> wrote:
>>>>>
>>>>>> That's just my point 4, isn't it?
>>>>>>
>>>>>> On Fri, May 26, 2017 at 1:07 AM, Ofir Manor <ofir.ma...@equalum.io> wrote:
>>>>>>
>>>>>>> Reynold,
>>>>>>>
>>>>>>> my point is that Spark should aim to follow the SQL standard instead of rolling its own type system. If I understand correctly, the existing implementation is similar to the TIMESTAMP WITH LOCAL TIME ZONE data type in Oracle. In addition, there are the standard TIMESTAMP and TIMESTAMP WITH TIME ZONE data types, which are missing from Spark. So, it is better (for me) if instead of extending the existing types, Spark would just implement the additional well-defined types properly. Just trying to copy-paste CREATE TABLE between SQL engines should not be an exercise in flags and incompatibilities.
>>>>>>>
>>>>>>> Regarding the current behaviour, if I remember correctly I had to force our spark O/S user into UTC so Spark won't change my timestamps.
>>>>>>>
>>>>>>> Ofir Manor
>>>>>>>
>>>>>>> Co-Founder & CTO | Equalum
>>>>>>>
>>>>>>> Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io
>>>>>>>
>>>>>>> On Thu, May 25, 2017 at 1:33 PM, Reynold Xin <r...@databricks.com> wrote:
>>>>>>>
>>>>>>>> Zoltan,
>>>>>>>>
>>>>>>>> Thanks for raising this again, although I'm a bit confused, since I've communicated with you a few times on JIRA and in private emails to explain that you have some misunderstanding of the timestamp type in Spark and some of your statements are wrong (e.g. the "except for textfile" part). Not sure why you didn't get any of those.
>>>>>>>>
>>>>>>>> Here's another try:
>>>>>>>>
>>>>>>>> 1. I think you guys misunderstood the semantics of timestamp in Spark before the session local timezone change. IIUC, Spark has always assumed timestamps to be with timezone, since it parses timestamps with timezone and does all the datetime conversions with timezone in mind (it doesn't ignore the timezone if a timestamp string has one specified -- see the snippet after this list). The session local timezone change pushes Spark further in that direction, but the semantics were with timezone before that change. Just run Spark on machines with different timezones and you will know what I'm talking about.
>>>>>>>>
>>>>>>>> 2. CSV/Text is not different. The data type has always been "with timezone". If you put a timezone in the timestamp string, it parses the timezone.
>>>>>>>>
>>>>>>>> 3. We can't change the semantics now, because it'd break all existing Spark apps.
>>>>>>>>
>>>>>>>> 4. We can however introduce a new timestamp-without-timezone type, and have a config flag to specify which one (with tz or without tz) is the default behavior.
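>>>>>>>>
>>>>>>>> Here is the snippet referenced in point 1 (the displayed value is for a session in America/Los_Angeles; your output will differ by your zone's offset):
>>>>>>>>
>>>>>>>>   spark.sql("SELECT CAST('2017-01-01 00:00:00+00:00' AS TIMESTAMP)").show()
>>>>>>>>   // the +00:00 offset is honored during parsing, and the result is
>>>>>>>>   // rendered in the local zone: 2016-12-31 16:00:00
>>>>>>>>
>>>>>>>> That is "with timezone" behavior on both the parsing side and the display side.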
>>>>>>>>
>>>>>>>> On Wed, May 24, 2017 at 5:46 PM, Zoltan Ivanfi <z...@cloudera.com> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> Sorry if you receive this mail twice; it seems that my first attempt did not make it to the list for some reason.
>>>>>>>>>
>>>>>>>>> I would like to start a discussion about SPARK-18350 <https://issues.apache.org/jira/browse/SPARK-18350> before it gets released, because it seems to be going in a different direction than what the other SQL engines of the Hadoop stack do.
>>>>>>>>>
>>>>>>>>> ANSI SQL defines the TIMESTAMP type (also known as TIMESTAMP WITHOUT TIME ZONE) to have timezone-agnostic semantics - basically a type that expresses readings from calendars and clocks and is unaffected by time zone. In the Hadoop stack, Impala has always worked like this, and recently Presto also took steps <https://github.com/prestodb/presto/issues/7122> to become standards compliant. (Presto's design doc <https://docs.google.com/document/d/1UUDktZDx8fGwHZV4VyaEDQURorFbbg6ioeZ5KMHwoCk/edit> also contains a great summary of the different semantics.) Hive has a timezone-agnostic TIMESTAMP type as well (except for Parquet, a major source of incompatibility that is already being addressed <https://issues.apache.org/jira/browse/HIVE-12767>). A TIMESTAMP in SparkSQL, however, has UTC-normalized local time semantics (except for textfile), which is generally the semantics of the TIMESTAMP WITH TIME ZONE type.
>>>>>>>>>
>>>>>>>>> Given that timezone-agnostic TIMESTAMP semantics provide standards compliance and consistency with most SQL engines, I was wondering whether SparkSQL should also consider them in order to become ANSI SQL compliant and interoperable with the other SQL engines of the Hadoop stack. Should SparkSQL adopt these semantics in the future, SPARK-18350 <https://issues.apache.org/jira/browse/SPARK-18350> may turn out to be a source of problems. Please correct me if I'm wrong, but this change seems to explicitly assign TIMESTAMP WITH TIME ZONE semantics to the TIMESTAMP type. I think SPARK-18350 would be a great feature for a separate TIMESTAMP WITH TIME ZONE type, but the plain unqualified TIMESTAMP type would be better off becoming timezone-agnostic instead of gaining further timezone-aware capabilities. (Of course, becoming timezone-agnostic would be a behavior change, so it must be optional and configurable by the user, as in Presto.)
>>>>>>>>>
>>>>>>>>> I would like to hear your opinions about this concern and about TIMESTAMP semantics in general. Does the community agree that a standards-compliant and interoperable TIMESTAMP type is desired? Do you perceive SPARK-18350 as a potential problem in achieving this, or do I misunderstand the effects of this change?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Zoltan
>>>>>>>>>
>>>>>>>>> ---
>>>>>>>>>
>>>>>>>>> List of links in case in-line links do not work:
>>>>>>>>>
>>>>>>>>> - SPARK-18350: https://issues.apache.org/jira/browse/SPARK-18350
>>>>>>>>> - Presto's change: https://github.com/prestodb/presto/issues/7122
>>>>>>>>> - Presto's design doc: https://docs.google.com/document/d/1UUDktZDx8fGwHZV4VyaEDQURorFbbg6ioeZ5KMHwoCk/edit