Hi, I would like to hear the community's opinion about the short-term solution I described in my previous mail. Do you agree with it or are there any objections? If it sounds good to everyone, I will prepare a more detailed design doc describing the suggested behavior.

Thanks,

Zoltan

On Tue, Jun 6, 2017 at 10:34 PM Zoltan Ivanfi <z...@cloudera.com> wrote:

> Hi Michael,
>
> To answer this I think we should distinguish between the long-term fix and the short-term fix.
>
> If I understand the replies correctly, everyone agrees that the desired long-term fix is to have two separate SQL types (TIMESTAMP [WITH|WITHOUT] TIME ZONE). With separate types, mixing them as you described cannot happen (unless a new feature intentionally allows it). Of course, conversions are still needed, but there are many examples from different database systems that we can follow.
>
> Since having two separate types is a huge effort, as a short-term solution I would suggest allowing the single existing TIMESTAMP type to carry either semantics, configurable per table. The implementation of timezone-agnostic semantics could be similar to Hive's. In Hive, just like in Spark, a timestamp is UTC-normalized internally but is shown as local time when it gets displayed. To achieve timezone-agnostic behavior, Hive still uses UTC-based timestamps in memory and adjusts on-disk data to/from this internal representation as needed. When the on-disk data is UTC-normalized as well, it matches the internal representation, so the on-disk value directly corresponds to the UTC instant of the in-memory representation.
>
> When the on-disk data is supposed to have timezone-agnostic semantics, the on-disk value is made to match the local-time value of the in-memory timestamp, so the value that ultimately gets displayed to the user has timezone-agnostic semantics (although the corresponding UTC value will differ depending on the local time zone). So instead of implementing a separate in-memory representation for timezone-agnostic timestamps, the desired on-disk semantics are simulated on top of the existing representation, and timestamps are adjusted during reading/writing as needed.
>
> Implementing this workaround takes a lot less effort and simplifies some scenarios as well. For example, the situation you described (a union of two queries returning timestamps of different semantics) does not have to be handled explicitly, since the in-memory representations are the same, including their interpretation. Semantics only matter when reading/writing timestamps from/to disk.
>
> A disadvantage of this workaround is that it is not perfect. In most time zones, one hour is skipped by the DST change every year. Timezone-agnostic timestamps from that single hour cannot be emulated this way, because they are invalid in the local time zone, so there is no UTC instant that would ultimately get displayed as the desired timestamp. But that only affects ~0.01% of all timestamps, and adopting this workaround would allow interoperability with 99.99% of the timezone-agnostic timestamps written by Impala and Hive, instead of the current situation in which 0% of these timestamps are interpreted correctly.
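>
> To make this concrete, here is a minimal sketch of the two adjustments in Scala using java.time. This is purely illustrative -- not Hive's or Spark's actual code, just the shape of the conversion:
>
>   import java.time.{Instant, LocalDateTime, ZoneId, ZoneOffset}
>
>   // In memory: a UTC-normalized instant (as in Spark and Hive today).
>   // On disk with timezone-agnostic semantics: the local wall-clock
>   // fields, stored as if they were UTC.
>
>   // Read path: reinterpret the stored wall-clock value in the local
>   // zone, so that displaying the resulting instant locally reproduces
>   // the original wall-clock value.
>   def readAgnostic(onDisk: Instant, zone: ZoneId): Instant = {
>     val wallClock = LocalDateTime.ofInstant(onDisk, ZoneOffset.UTC)
>     // Wall-clock values from the skipped DST hour do not exist in the
>     // local zone and get shifted here -- the ~0.01% imperfection above.
>     wallClock.atZone(zone).toInstant
>   }
>
>   // Write path: the exact inverse of the read path.
>   def writeAgnostic(inMemory: Instant, zone: ZoneId): Instant =
>     LocalDateTime.ofInstant(inMemory, zone).toInstant(ZoneOffset.UTC)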
>
> Please let me know if some parts of my description were unclear and I will gladly elaborate on them.
>
> Thanks,
>
> Zoltan
>
> On Fri, Jun 2, 2017 at 9:41 PM Michael Allman <mich...@videoamp.com> wrote:
>
>> Hi Zoltan,
>>
>> I don't fully understand your proposal for table-specific timestamp type semantics. I think it will be helpful to everyone in this conversation if you can identify the expected behavior for a few concrete scenarios.
>>
>> Suppose we have a Hive metastore table hivelogs with a column named hts with the Hive timestamp type as described here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-timestamp. This table was created by Hive and is usually accessed through Hive or Presto.
>>
>> Suppose again we have a Hive metastore table sparklogs with a column named sts with the Spark SQL timestamp type as described here: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.types.TimestampType$. This table was created by Spark SQL and is usually accessed through Spark SQL.
>>
>> Let's say Spark SQL sets and reads a table property called timestamp_interp to determine timestamp type semantics for that table. Consider a dataframe df defined by sql("SELECT sts as ts FROM sparklogs UNION ALL SELECT hts as ts FROM hivelogs"). Suppose the timestamp_interp table property is absent from hivelogs. For each possible value of timestamp_interp set on the table sparklogs,
>>
>> 1. does df successfully pass analysis (i.e. is it a valid query)?
>> 2. if it's a valid dataframe, what is the type of the ts column?
>> 3. if it's a valid dataframe, what are the semantics of the type of the ts column?
>>
>> Suppose further that Spark SQL sets the timestamp_interp on hivelogs. Can you answer the same three questions for each combination of timestamp_interp on hivelogs and sparklogs?
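>>
>> To be fully explicit, in code form the scenario is something like this (the property value below is a made-up placeholder, since the set of legal values is exactly what I'm asking about):
>>
>>   // hypothetical: mark sparklogs timestamps as timezone-agnostic
>>   spark.sql("ALTER TABLE sparklogs SET TBLPROPERTIES ('timestamp_interp' = 'TIMEZONE_AGNOSTIC')")
>>   // first case: timestamp_interp is absent from hivelogs
>>   val df = spark.sql("SELECT sts as ts FROM sparklogs UNION ALL SELECT hts as ts FROM hivelogs")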
>>
>> Thank you.
>>
>> Michael
>>
>> On Jun 2, 2017, at 8:33 AM, Zoltan Ivanfi <z...@cloudera.com> wrote:
>>
>> Hi,
>>
>> We would like to solve the problem of interoperability of existing data, and that is the main use case for having table-level control. Spark should be able to read timestamps written by Impala or Hive and at the same time read back its own data. These have different semantics, so having a single flag is not enough.
>>
>> Two separate types will indeed solve this problem, but only once every component involved supports them. Unfortunately, adding these separate SQL types is a larger effort that is only feasible in the long term, and we would like to provide a short-term solution for interoperability in the meantime.
>>
>> Br,
>>
>> Zoltan
>>
>> On Fri, Jun 2, 2017 at 1:32 AM Reynold Xin <r...@databricks.com> wrote:
>>
>>> Yea I don't see why this needs to be a per-table config. If the user wants to configure it per table, can't they just declare the data type on a per-table basis, once we have separate types for timestamp w/ tz and w/o tz?
>>>
>>> On Thu, Jun 1, 2017 at 4:14 PM, Michael Allman <mich...@videoamp.com> wrote:
>>>
>>>> I would suggest that making timestamp type behavior configurable and persisted per-table could introduce some real confusion, e.g. in queries involving tables with different timestamp type semantics.
>>>>
>>>> I suggest starting with the assumption that timestamp type behavior is a per-session flag that can be set in a global `spark-defaults.conf`, and considering more granular levels of configuration as people identify solid use cases.
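>>>>
>>>> For illustration, the kind of configuration I have in mind (the flag name and its values are hypothetical):
>>>>
>>>>   # spark-defaults.conf
>>>>   # hypothetical flag: timezone_agnostic vs. utc_normalized
>>>>   spark.sql.timestampSemantics  timezone_agnostic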
>>>>
>>>> Cheers,
>>>>
>>>> Michael
>>>>
>>>> On May 30, 2017, at 7:41 AM, Zoltan Ivanfi <z...@cloudera.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> If I remember correctly, the TIMESTAMP type had UTC-normalized local time semantics even before Spark 2, so I can understand that Spark considers it to be the "established" behavior that must not be broken. Unfortunately, this behavior does not provide interoperability with other SQL engines of the Hadoop stack.
>>>>
>>>> Let me summarize the findings of this e-mail thread so far:
>>>>
>>>> - Timezone-agnostic TIMESTAMP semantics would be beneficial for interoperability and SQL compliance.
>>>> - Spark cannot make a breaking change. For backward-compatibility with existing data, timestamp semantics should be user-configurable on a per-table level.
>>>>
>>>> Before going into the specifics of a possible solution, do we all agree on these points?
>>>>
>>>> Thanks,
>>>>
>>>> Zoltan
>>>>
>>>> On Sat, May 27, 2017 at 8:57 PM Imran Rashid <iras...@cloudera.com> wrote:
>>>>
>>>>> I had asked zoltan to bring this discussion to the dev list because I think it's a question that extends beyond a single jira (we can't figure out the semantics of timestamp in parquet if we don't know the overall goal of the timestamp type) and since it's a design question the entire community should be involved.
>>>>>
>>>>> I think that a lot of the confusion comes because we're talking about different ways time zones affect behavior: (1) parsing and (2) behavior when changing time zones for processing data.
>>>>>
>>>>> It seems we agree that spark should eventually provide a timestamp type which does conform to the standard. The question is, how do we get there? Has spark already broken compliance so much that it's impossible to go back without breaking user behavior? Or perhaps spark already has inconsistent behavior / broken compatibility within the 2.x line, so it's not unthinkable to have another breaking change?
>>>>>
>>>>> (Another part of the confusion is on me -- I believed the behavior change was in 2.2, but actually it looks like it's in 2.0.1. That changes how we think about this in the context of what goes into a 2.2 release. SPARK-18350 isn't the origin of the difference in behavior.)
>>>>>
>>>>> First: consider processing data that is already stored in tables, and then accessing it from machines in different time zones. The standard is clear that "timestamp" should be just like "timestamp without time zone": it does not represent one instant in time, rather it's always displayed the same, regardless of time zone. This was the behavior in spark 2.0.0 (and 1.6), for hive tables stored as text files, and for spark's json formats.
>>>>>
>>>>> Spark 2.0.1 changed the behavior of the json format (I believe with SPARK-16216), so that it behaves more like timestamp *with* time zone. It also made csv behave the same (timestamp in csv was basically broken in 2.0.0). However it did *not* change the behavior of a hive textfile; that still behaves like "timestamp with*out* time zone". Here are some experiments I tried -- there are a bunch of files there for completeness, but mostly focus on the difference between query_output_2_0_0.txt vs. query_output_2_0_1.txt:
>>>>>
>>>>> https://gist.github.com/squito/f348508ca7903ec2e1a64f4233e7aa70
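>>>>>
>>>>> (If you want to try something similar locally -- the path and the flag values here are just placeholders -- run the same read with the JVM pinned to two different zones and compare the output:
>>>>>
>>>>>   spark-shell --conf spark.driver.extraJavaOptions=-Duser.timezone=UTC
>>>>>   spark-shell --conf spark.driver.extraJavaOptions=-Duser.timezone=America/Los_Angeles
>>>>>
>>>>> then in each shell run spark.read.json("/tmp/ts.json").show(). Under "timestamp without time zone" semantics the displayed value is identical in both shells; under "timestamp with time zone" semantics it shifts by the zone offset.)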
>>>>>
>>>>> Given that spark has changed this behavior post 2.0.0, is it still out of the question to change this behavior to bring it back in line with the sql standard for timestamp (without time zone) in the 2.x line? Or, as reynold proposes, is the only option at this point to add an off-by-default feature flag to get "timestamp without time zone" semantics?
>>>>>
>>>>> Second, there is the question of parsing strings into the timestamp type. I'm far less knowledgeable about this, so I mostly just have questions:
>>>>>
>>>>> * does the standard dictate what the parsing behavior should be for timestamp (without time zone) when a time zone is present?
>>>>>
>>>>> * if it does, and spark violates this standard, is it worth trying to retain the *other* semantics of timestamp without time zone, even if we violate the parsing part?
>>>>>
>>>>> I did look at what postgres does for comparison:
>>>>>
>>>>> https://gist.github.com/squito/cb81a1bb07e8f67e9d27eaef44cc522c
>>>>>
>>>>> spark's timestamp certainly does not match postgres's timestamp for parsing; it seems closer to postgres's "timestamp with timezone" -- though I dunno if that is standard behavior at all.
>>>>>
>>>>> thanks,
>>>>> Imran
>>>>>
>>>>> On Fri, May 26, 2017 at 1:27 AM, Reynold Xin <r...@databricks.com> wrote:
>>>>>
>>>>>> That's just my point 4, isn't it?
>>>>>>
>>>>>> On Fri, May 26, 2017 at 1:07 AM, Ofir Manor <ofir.ma...@equalum.io> wrote:
>>>>>>
>>>>>>> Reynold,
>>>>>>>
>>>>>>> my point is that Spark should aim to follow the SQL standard instead of rolling its own type system. If I understand correctly, the existing implementation is similar to the TIMESTAMP WITH LOCAL TIME ZONE data type in Oracle. In addition, there are the standard TIMESTAMP and TIMESTAMP WITH TIME ZONE data types, which are missing from Spark. So, it is better (for me) if instead of extending the existing types, Spark would just implement the additional well-defined types properly. Just trying to copy-paste CREATE TABLE between SQL engines should not be an exercise in flags and incompatibilities.
>>>>>>>
>>>>>>> Regarding the current behaviour, if I remember correctly I had to force our spark O/S user into UTC so Spark won't change my timestamps.
>>>>>>>
>>>>>>> Ofir Manor
>>>>>>>
>>>>>>> Co-Founder & CTO | Equalum
>>>>>>>
>>>>>>> Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io
>>>>>>>
>>>>>>> On Thu, May 25, 2017 at 1:33 PM, Reynold Xin <r...@databricks.com> wrote:
>>>>>>>
>>>>>>>> Zoltan,
>>>>>>>>
>>>>>>>> Thanks for raising this again, although I'm a bit confused, since I've communicated with you a few times on JIRA and in private emails to explain that you have some misunderstanding of the timestamp type in Spark and some of your statements are wrong (e.g. the "except for textfile" part). Not sure why you didn't get any of those.
>>>>>>>>
>>>>>>>> Here's another try:
>>>>>>>>
>>>>>>>> 1. I think you guys misunderstood the semantics of timestamp in Spark before the session local timezone change. IIUC, Spark has always assumed timestamps to be with timezone, since it parses timestamps with timezone and does all the datetime conversions with timezone in mind (it doesn't ignore the timezone if a timestamp string has one specified -- see the snippet after this list). The session local timezone change pushes Spark further in that direction, but the semantics were with timezone before that change. Just run Spark on machines with different timezones and you will know what I'm talking about.
>>>>>>>>
>>>>>>>> 2. CSV/Text is not different. The data type has always been "with timezone". If you put a timezone in the timestamp string, it parses the timezone.
>>>>>>>>
>>>>>>>> 3. We can't change the semantics now, because it'd break all existing Spark apps.
>>>>>>>>
>>>>>>>> 4. We can however introduce a new timestamp-without-timezone type, and have a config flag to specify which one (with tz or without tz) is the default behavior.
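>>>>>>>>
>>>>>>>> Here is the snippet referenced in point 1 (the displayed value is for a session in America/Los_Angeles; your output will differ by your zone's offset):
>>>>>>>>
>>>>>>>>   spark.sql("SELECT CAST('2017-01-01 00:00:00+00:00' AS TIMESTAMP)").show()
>>>>>>>>   // the +00:00 offset is honored during parsing, and the result is
>>>>>>>>   // rendered in the local zone: 2016-12-31 16:00:00
>>>>>>>>
>>>>>>>> That is "with timezone" behavior on both the parsing side and the display side.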
>>>>>>>>
>>>>>>>> On Wed, May 24, 2017 at 5:46 PM, Zoltan Ivanfi <z...@cloudera.com> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> Sorry if you receive this mail twice; it seems that my first attempt did not make it to the list for some reason.
>>>>>>>>>
>>>>>>>>> I would like to start a discussion about SPARK-18350 <https://issues.apache.org/jira/browse/SPARK-18350> before it gets released, because it seems to be going in a different direction than what the other SQL engines of the Hadoop stack do.
>>>>>>>>>
>>>>>>>>> ANSI SQL defines the TIMESTAMP type (also known as TIMESTAMP WITHOUT TIME ZONE) to have timezone-agnostic semantics - basically a type that expresses readings from calendars and clocks and is unaffected by time zone. In the Hadoop stack, Impala has always worked like this, and recently Presto also took steps <https://github.com/prestodb/presto/issues/7122> to become standards compliant. (Presto's design doc <https://docs.google.com/document/d/1UUDktZDx8fGwHZV4VyaEDQURorFbbg6ioeZ5KMHwoCk/edit> also contains a great summary of the different semantics.) Hive has a timezone-agnostic TIMESTAMP type as well (except for Parquet, a major source of incompatibility that is already being addressed <https://issues.apache.org/jira/browse/HIVE-12767>). A TIMESTAMP in SparkSQL, however, has UTC-normalized local time semantics (except for textfile), which is generally the semantics of the TIMESTAMP WITH TIME ZONE type.
>>>>>>>>>
>>>>>>>>> Given that timezone-agnostic TIMESTAMP semantics provide standards compliance and consistency with most SQL engines, I was wondering whether SparkSQL should also consider them in order to become ANSI SQL compliant and interoperable with the other SQL engines of the Hadoop stack. Should SparkSQL adopt these semantics in the future, SPARK-18350 <https://issues.apache.org/jira/browse/SPARK-18350> may turn out to be a source of problems. Please correct me if I'm wrong, but this change seems to explicitly assign TIMESTAMP WITH TIME ZONE semantics to the TIMESTAMP type. I think SPARK-18350 would be a great feature for a separate TIMESTAMP WITH TIME ZONE type, but the plain unqualified TIMESTAMP type would be better off becoming timezone-agnostic instead of gaining further timezone-aware capabilities. (Of course, becoming timezone-agnostic would be a behavior change, so it must be optional and configurable by the user, as in Presto.)
>>>>>>>>>
>>>>>>>>> I would like to hear your opinions about this concern and about TIMESTAMP semantics in general. Does the community agree that a standards-compliant and interoperable TIMESTAMP type is desired? Do you perceive SPARK-18350 as a potential problem in achieving this, or do I misunderstand the effects of this change?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Zoltan
>>>>>>>>>
>>>>>>>>> ---
>>>>>>>>>
>>>>>>>>> List of links in case in-line links do not work:
>>>>>>>>>
>>>>>>>>> - SPARK-18350: https://issues.apache.org/jira/browse/SPARK-18350
>>>>>>>>> - Presto's change: https://github.com/prestodb/presto/issues/7122
>>>>>>>>> - Presto's design doc: https://docs.google.com/document/d/1UUDktZDx8fGwHZV4VyaEDQURorFbbg6ioeZ5KMHwoCk/edit