Hi,

We would like to solve the problem of interoperability of existing data, and that is the main use case for having table-level control. Spark should be able to read timestamps written by Impala or Hive and at the same time read back its own data. These have different semantics, so having a single flag is not enough.
Two separate types would indeed solve this problem, but only once every component involved supports them. Unfortunately, adding these separate SQL types is a larger effort that is only feasible in the long term, and we would like to provide a short-term solution for interoperability in the meantime.

Br,

Zoltan

On Fri, Jun 2, 2017 at 1:32 AM Reynold Xin <r...@databricks.com> wrote:

> Yea I don't see why this needs to be a per-table config. If the user wants to configure it per table, can't they just declare the data type on a per-table basis, once we have separate types for timestamp w/ tz and w/o tz?
>
> On Thu, Jun 1, 2017 at 4:14 PM, Michael Allman <mich...@videoamp.com> wrote:
>
>> I would suggest that making timestamp type behavior configurable and persisted per-table could introduce some real confusion, e.g. in queries involving tables with different timestamp type semantics.
>>
>> I suggest starting with the assumption that timestamp type behavior is a per-session flag that can be set in a global `spark-defaults.conf`, and considering more granular levels of configuration as people identify solid use cases.
>>
>> Cheers,
>>
>> Michael
>>
>> On May 30, 2017, at 7:41 AM, Zoltan Ivanfi <z...@cloudera.com> wrote:
>>
>> Hi,
>>
>> If I remember correctly, the TIMESTAMP type had UTC-normalized local time semantics even before Spark 2, so I can understand that Spark considers it to be the "established" behavior that must not be broken. Unfortunately, this behavior does not provide interoperability with other SQL engines of the Hadoop stack.
>>
>> Let me summarize the findings of this e-mail thread so far:
>>
>> - Timezone-agnostic TIMESTAMP semantics would be beneficial for interoperability and SQL compliance.
>> - Spark cannot make a breaking change. For backward compatibility with existing data, timestamp semantics should be user-configurable on a per-table level.
>>
>> Before going into the specifics of a possible solution, do we all agree on these points?
>>
>> Thanks,
>>
>> Zoltan
>>
>> On Sat, May 27, 2017 at 8:57 PM Imran Rashid <iras...@cloudera.com> wrote:
>>
>>> I had asked Zoltan to bring this discussion to the dev list because I think it's a question that extends beyond a single JIRA (we can't figure out the semantics of timestamp in Parquet if we don't know the overall goal of the timestamp type), and since it's a design question, the entire community should be involved.
>>>
>>> I think a lot of the confusion comes from the fact that we're talking about two different ways time zones affect behavior: (1) parsing, and (2) behavior when changing time zones while processing data.
>>>
>>> It seems we agree that Spark should eventually provide a timestamp type which does conform to the standard. The question is, how do we get there? Has Spark already broken compliance so much that it's impossible to go back without breaking user behavior? Or perhaps Spark already has inconsistent behavior / broken compatibility within the 2.x line, so it's not unthinkable to have another breaking change?
>>>
>>> (Another part of the confusion is on me -- I believed the behavior change was in 2.2, but it actually looks like it's in 2.0.1. That changes how we think about this in the context of what goes into a 2.2 release. SPARK-18350 isn't the origin of the difference in behavior.)
>>>
>>> First: consider processing data that is already stored in tables, and then accessing it from machines in different time zones. The standard is clear that "timestamp" should be just like "timestamp without time zone": it does not represent one instant in time; rather, it is always displayed the same, regardless of time zone. This was the behavior in Spark 2.0.0 (and 1.6) for Hive tables stored as text files, and for Spark's JSON formats.
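A minimal plain-Python sketch of the distinction Imran describes, using naive vs. aware `datetime` values as stand-ins for the two SQL semantics (illustrative only -- this is not Spark code):

```python
from datetime import datetime, timezone, timedelta

# "TIMESTAMP WITHOUT TIME ZONE" semantics: the value is a wall-clock
# reading and renders identically no matter where it is read.
wall_clock = datetime(2017, 6, 1, 12, 0, 0)  # naive: no zone attached
assert wall_clock.isoformat() == "2017-06-01T12:00:00"

# "TIMESTAMP WITH TIME ZONE" semantics: the value is an instant, and its
# rendering depends on the zone of whoever reads it.
instant = datetime(2017, 6, 1, 12, 0, 0, tzinfo=timezone.utc)
in_utc_minus_8 = instant.astimezone(timezone(timedelta(hours=-8)))
assert in_utc_minus_8.isoformat() == "2017-06-01T04:00:00-08:00"

# Same instant, two different displays:
assert instant == in_utc_minus_8
```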
>>>
>>> Spark 2.0.1 changed the behavior of the JSON format (I believe with SPARK-16216) so that it behaves more like timestamp *with* time zone. It also made CSV behave the same way (timestamp in CSV was basically broken in 2.0.0). However, it did *not* change the behavior of a Hive textfile; that still behaves like "timestamp with*out* time zone". Here are some experiments I tried -- there are a bunch of files there for completeness, but mostly focus on the difference between query_output_2_0_0.txt vs. query_output_2_0_1.txt:
>>>
>>> https://gist.github.com/squito/f348508ca7903ec2e1a64f4233e7aa70
>>>
>>> Given that Spark has changed this behavior post 2.0.0, is it still out of the question to change this behavior to bring it back in line with the SQL standard for timestamp (without time zone) in the 2.x line? Or, as Reynold proposes, is the only option at this point to add an off-by-default feature flag to get "timestamp without time zone" semantics?
>>>
>>> Second, there is the question of parsing strings into the timestamp type. I'm far less knowledgeable about this, so I mostly just have questions:
>>>
>>> * Does the standard dictate what the parsing behavior should be for timestamp (without time zone) when a time zone is present?
>>>
>>> * If it does, and Spark violates this standard, is it worth trying to retain the *other* semantics of timestamp without time zone, even if we violate the parsing part?
>>>
>>> I did look at what Postgres does for comparison:
>>>
>>> https://gist.github.com/squito/cb81a1bb07e8f67e9d27eaef44cc522c
>>>
>>> Spark's timestamp certainly does not match Postgres's timestamp for parsing; it seems closer to Postgres's "timestamp with time zone" -- though I don't know if that is standard behavior at all.
>>>
>>> thanks,
>>> Imran
>>>
>>> On Fri, May 26, 2017 at 1:27 AM, Reynold Xin <r...@databricks.com> wrote:
>>>
>>>> That's just my point 4, isn't it?
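The parsing question Imran raises above can be sketched in plain Python (illustrative only; the format string is an assumption, and the "silently ignore the offset" rule matches what Postgres does for timestamp without time zone):

```python
from datetime import datetime, timezone

s = "2017-05-26 10:00:00+0200"
parsed = datetime.strptime(s, "%Y-%m-%d %H:%M:%S%z")

# "timestamp with time zone" reading: honor the offset and keep the
# instant (here normalized to UTC for display).
as_utc = parsed.astimezone(timezone.utc)
assert as_utc.strftime("%Y-%m-%d %H:%M:%S") == "2017-05-26 08:00:00"

# One possible "timestamp without time zone" reading (Postgres's choice):
# silently ignore the offset and keep the wall-clock digits.
as_wall_clock = parsed.replace(tzinfo=None)
assert as_wall_clock.strftime("%Y-%m-%d %H:%M:%S") == "2017-05-26 10:00:00"
```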
>>>>
>>>> On Fri, May 26, 2017 at 1:07 AM, Ofir Manor <ofir.ma...@equalum.io> wrote:
>>>>
>>>>> Reynold,
>>>>> My point is that Spark should aim to follow the SQL standard instead of rolling its own type system.
>>>>> If I understand correctly, the existing implementation is similar to the TIMESTAMP WITH LOCAL TIME ZONE data type in Oracle.
>>>>> In addition, there are the standard TIMESTAMP and TIMESTAMP WITH TIME ZONE data types, which are missing from Spark.
>>>>> So, it would be better (for me) if, instead of extending the existing types, Spark just implemented the additional well-defined types properly. Simply copy-pasting a CREATE TABLE between SQL engines should not be an exercise in flags and incompatibilities.
>>>>>
>>>>> Regarding the current behaviour: if I remember correctly, I had to force our Spark O/S user into UTC so that Spark wouldn't change my timestamps.
>>>>>
>>>>> Ofir Manor
>>>>> Co-Founder & CTO | Equalum
>>>>> Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io
>>>>>
>>>>> On Thu, May 25, 2017 at 1:33 PM, Reynold Xin <r...@databricks.com> wrote:
>>>>>
>>>>>> Zoltan,
>>>>>>
>>>>>> Thanks for raising this again, although I'm a bit confused, since I've communicated with you a few times on JIRA and in private emails to explain that you have some misunderstanding of the timestamp type in Spark and that some of your statements are wrong (e.g. the "except for text files" part). Not sure why you didn't get any of those.
>>>>>>
>>>>>> Here's another try:
>>>>>>
>>>>>> 1. I think you guys misunderstood the semantics of timestamp in Spark before the session local timezone change. IIUC, Spark has always assumed timestamps to be with timezone, since it parses timestamps with timezone and does all the datetime conversions with timezone in mind (it doesn't ignore the timezone if a timestamp string has one specified).
>>>>>> The session local timezone change pushes Spark further in that direction, but the semantics were with timezone even before that change. Just run Spark on machines with different timezones and you will see what I'm talking about.
>>>>>>
>>>>>> 2. CSV/text is not different. The data type has always been "with timezone". If you put a timezone in the timestamp string, it parses the timezone.
>>>>>>
>>>>>> 3. We can't change the semantics now, because it would break all existing Spark apps.
>>>>>>
>>>>>> 4. We can, however, introduce a new timestamp-without-timezone type, and have a config flag to specify which one (with tz or without tz) is the default behavior.
>>>>>>
>>>>>> On Wed, May 24, 2017 at 5:46 PM, Zoltan Ivanfi <z...@cloudera.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> Sorry if you receive this mail twice; it seems that my first attempt did not make it to the list for some reason.
>>>>>>>
>>>>>>> I would like to start a discussion about SPARK-18350 <https://issues.apache.org/jira/browse/SPARK-18350> before it gets released, because it seems to be going in a different direction than the other SQL engines of the Hadoop stack.
>>>>>>>
>>>>>>> ANSI SQL defines the TIMESTAMP type (also known as TIMESTAMP WITHOUT TIME ZONE) to have timezone-agnostic semantics -- basically a type that expresses readings from calendars and clocks and is unaffected by time zone. In the Hadoop stack, Impala has always worked like this, and recently Presto also took steps <https://github.com/prestodb/presto/issues/7122> to become standards compliant. (Presto's design doc <https://docs.google.com/document/d/1UUDktZDx8fGwHZV4VyaEDQURorFbbg6ioeZ5KMHwoCk/edit> also contains a great summary of the different semantics.)
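As a concrete plain-Python sketch of those different semantics: under UTC-normalized ("with time zone") semantics the stored value is an instant and only its display depends on the session time zone, whereas a timezone-agnostic value would print the same digits everywhere. The `render` helper below is hypothetical and uses fixed UTC offsets in place of named session time zones:

```python
from datetime import datetime, timezone, timedelta

# UTC-normalized ("with time zone") semantics: store one instant...
stored = datetime(2017, 6, 1, 0, 0, 0, tzinfo=timezone.utc)

def render(ts, session_offset_hours):
    # Hypothetical helper: display an instant in the session's UTC offset,
    # the way a session-local timezone setting affects output.
    tz = timezone(timedelta(hours=session_offset_hours))
    return ts.astimezone(tz).strftime("%Y-%m-%d %H:%M:%S")

# ...and let the session time zone change only the display:
assert render(stored, 0) == "2017-06-01 00:00:00"
assert render(stored, -7) == "2017-05-31 17:00:00"  # e.g. US Pacific in summer

# A timezone-agnostic value, by contrast, would show the same digits
# in every session.
```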
>>>>>>> Hive has a timezone-agnostic TIMESTAMP type as well (except for Parquet, a major source of incompatibility that is already being addressed <https://issues.apache.org/jira/browse/HIVE-12767>). A TIMESTAMP in SparkSQL, however, has UTC-normalized local time semantics (except for textfile), which is generally the semantics of the TIMESTAMP WITH TIME ZONE type.
>>>>>>>
>>>>>>> Given that timezone-agnostic TIMESTAMP semantics provide standards compliance and consistency with most SQL engines, I was wondering whether SparkSQL should also consider them in order to become ANSI SQL compliant and interoperable with other SQL engines of the Hadoop stack. Should SparkSQL adopt these semantics in the future, SPARK-18350 <https://issues.apache.org/jira/browse/SPARK-18350> may turn out to be a source of problems. Please correct me if I'm wrong, but this change seems to explicitly assign TIMESTAMP WITH TIME ZONE semantics to the TIMESTAMP type. I think SPARK-18350 would be a great feature for a separate TIMESTAMP WITH TIME ZONE type, but the plain unqualified TIMESTAMP type would be better off becoming timezone-agnostic instead of gaining further timezone-aware capabilities. (Of course, becoming timezone-agnostic would be a behavior change, so it must be optional and configurable by the user, as in Presto.)
>>>>>>>
>>>>>>> I would like to hear your opinions about this concern and about TIMESTAMP semantics in general. Does the community agree that a standards-compliant and interoperable TIMESTAMP type is desirable? Do you perceive SPARK-18350 as a potential problem in achieving this, or do I misunderstand the effects of this change?
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Zoltan
>>>>>>>
>>>>>>> ---
>>>>>>>
>>>>>>> List of links in case in-line links do not work:
>>>>>>>
>>>>>>> - SPARK-18350: https://issues.apache.org/jira/browse/SPARK-18350
>>>>>>> - Presto's change: https://github.com/prestodb/presto/issues/7122
>>>>>>> - Presto's design doc: https://docs.google.com/document/d/1UUDktZDx8fGwHZV4VyaEDQURorFbbg6ioeZ5KMHwoCk/edit