Hi Zoltan,

I don't fully understand your proposal for table-specific timestamp type 
semantics. I think it will be helpful to everyone in this conversation if you 
can identify the expected behavior for a few concrete scenarios.

Suppose we have a Hive metastore table hivelogs with a column named hts of the 
Hive timestamp type as described here: 
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-timestamp. 
This table was created by Hive and is usually accessed through Hive or Presto.

Suppose also that we have a Hive metastore table sparklogs with a column named sts 
of the Spark SQL timestamp type as described here: 
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.types.TimestampType$. 
This table was created by Spark SQL and is usually accessed through Spark SQL.

Let's say Spark SQL sets and reads a table property called timestamp_interp to 
determine timestamp type semantics for that table. Consider a dataframe df 
defined by sql("SELECT sts as ts FROM sparklogs UNION ALL SELECT hts as ts FROM 
hivelogs"). Suppose the timestamp_interp table property is absent from 
hivelogs. For each possible value of timestamp_interp set on the table 
sparklogs,

1. does df successfully pass analysis (i.e. is it a valid query)?
2. if it's a valid dataframe, what is the type of the ts column?
3. if it's a valid dataframe, what are the semantics of the type of the ts 
column?

Suppose further that Spark SQL sets the timestamp_interp property on hivelogs as 
well. Can you answer the same three questions for each combination of 
timestamp_interp values on hivelogs and sparklogs? A rough sketch of the setup, as 
I picture it, follows below.
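
Here is that sketch, done in spark-shell. The timestamp_interp property, its 
possible values, and the ALTER TABLE call are purely illustrative of your proposal 
as I understand it; nothing like this exists in Spark today:

    // both tables already exist in the Hive metastore as described above
    // '<value>' stands for whichever timestamp_interp values you have in mind
    spark.sql("ALTER TABLE sparklogs SET TBLPROPERTIES ('timestamp_interp' = '<value>')")

    // in the first scenario, no timestamp_interp property is set on hivelogs
    val df = spark.sql(
      "SELECT sts AS ts FROM sparklogs UNION ALL SELECT hts AS ts FROM hivelogs")
    // question 1: does the query above pass analysis at all?

    df.printSchema()   // question 2: what is the type of the ts column?
    df.show()          // question 3: what semantics does that type carry here?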

Thank you.

Michael


> On Jun 2, 2017, at 8:33 AM, Zoltan Ivanfi <z...@cloudera.com> wrote:
> 
> Hi,
> 
> We would like to solve the problem of interoperability of existing data, and 
> that is the main use case for having table-level control. Spark should be 
> able to read timestamps written by Impala or Hive and at the same time read 
> back its own data. These have different semantics, so having a single flag is 
> not enough.
> 
> Two separate types will solve this problem indeed, but only once every 
> component involved supports them. Unfortunately, adding these separate SQL 
> types is a larger effort that is only feasible in the long term and we would 
> like to provide a short-term solution for interoperability in the meantime.
> 
> Br,
> 
> Zoltan
> 
> On Fri, Jun 2, 2017 at 1:32 AM Reynold Xin <r...@databricks.com> wrote:
> Yea, I don't see why this needs to be a per-table config. If the user wants to 
> configure it per table, can't they just declare the data type on a per-table 
> basis, once we have separate types for timestamp w/ tz and w/o tz?
> 
> On Thu, Jun 1, 2017 at 4:14 PM, Michael Allman <mich...@videoamp.com> wrote:
> I would suggest that making timestamp type behavior configurable and 
> persisted per-table could introduce some real confusion, e.g. in queries 
> involving tables with different timestamp type semantics.
> 
> I suggest starting with the assumption that timestamp type behavior is a 
> per-session flag that can be set in a global `spark-defaults.conf` and 
> consider more granular levels of configuration as people identify solid use 
> cases.
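> 
> To make the idea concrete, something along these lines in a global 
> `spark-defaults.conf` (the property name and values are purely illustrative; 
> no such flag exists today):
> 
>     # hypothetical property name and values, for illustration only
>     # e.g. possible values: timezone-agnostic | utc-normalized
>     spark.sql.timestampSemantics   timezone-agnostic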
> 
> Cheers,
> 
> Michael
> 
> 
> 
>> On May 30, 2017, at 7:41 AM, Zoltan Ivanfi <z...@cloudera.com> wrote:
>> 
>> Hi,
>> 
>> If I remember correctly, the TIMESTAMP type had UTC-normalized local time 
>> semantics even before Spark 2, so I can understand that Spark considers it 
>> to be the "established" behavior that must not be broken. Unfortunately, 
>> this behavior does not provide interoperability with other SQL engines of 
>> the Hadoop stack.
>> 
>> Let me summarize the findings of this e-mail thread so far:
>> 
>> 1. Timezone-agnostic TIMESTAMP semantics would be beneficial for 
>> interoperability and SQL compliance.
>> 2. Spark cannot make a breaking change. For backward compatibility with 
>> existing data, timestamp semantics should be user-configurable on a 
>> per-table level.
>> 
>> Before going into the specifics of a possible solution, do we all agree on 
>> these points?
>> 
>> Thanks,
>> 
>> Zoltan
>> 
>> On Sat, May 27, 2017 at 8:57 PM Imran Rashid <iras...@cloudera.com> wrote:
>> I had asked zoltan to bring this discussion to the dev list because I think 
>> it's a question that extends beyond a single jira (we can't figure out the 
>> semantics of timestamp in parquet if we don't know the overall goal of the 
>> timestamp type) and since it's a design question the entire community should 
>> be involved.
>> 
>> I think that a lot of the confusion comes because we're talking about 
>> different ways time zones affect behavior: (1) parsing and (2) behavior when 
>> changing time zones for processing data.
>> 
>> It seems we agree that spark should eventually provide a timestamp type 
>> which does conform to the standard.   The question is, how do we get there?  
>> Has spark already broken compliance so much that it's impossible to go back 
>> without breaking user behavior?  Or perhaps spark already has inconsistent 
>> behavior / broken compatibility within the 2.x line, so it's not unthinkable 
>> to have another breaking change?
>> 
>> (Another part of the confusion is on me -- I believed the behavior change 
>> was in 2.2, but actually it looks like it's in 2.0.1.  That changes how we 
>> think about this in the context of what goes into a 2.2 release.  SPARK-18350 
>> isn't the origin of the difference in behavior.)
>> 
>> First: consider processing data that is already stored in tables, and then 
>> accessing it from machines in different time zones.  The standard is clear 
>> that "timestamp" should be just like "timestamp without time zone": it does 
>> not represent one instant in time, rather it's always displayed the same, 
>> regardless of time zone.  This was the behavior in spark 2.0.0 (and 1.6),  
>> for hive tables stored as text files, and for spark's json formats.
>> 
>> Spark 2.0.1 changed the behavior of the json format (I believe with 
>> SPARK-16216), so that it behaves more like timestamp *with* time zone.  It 
>> also makes csv behave the same (timestamp in csv was basically broken in 
>> 2.0.0).  However it did *not* change the behavior of a hive textfile; it 
>> still behaves like "timestamp with*out* time zone".  Here are some experiments 
>> I tried -- there are a bunch of files there for completeness, but mostly 
>> focus on the difference between query_output_2_0_0.txt vs. 
>> query_output_2_0_1.txt:
>> 
>> https://gist.github.com/squito/f348508ca7903ec2e1a64f4233e7aa70
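>> 
>> (Roughly the kind of comparison involved, heavily simplified -- the table 
>> name, path and values below are illustrative, not copied from the gist:)
>> 
>>     import org.apache.spark.sql.types.{StructType, TimestampType}
>>     import spark.implicits._
>> 
>>     // the same wall-clock timestamp, once in a hive table stored as text ...
>>     spark.sql("CREATE TABLE ts_text (ts TIMESTAMP) STORED AS TEXTFILE")
>>     spark.sql("INSERT INTO ts_text VALUES (CAST('2016-01-01 00:00:00' AS TIMESTAMP))")
>>     // ... and once in spark's json format
>>     Seq(java.sql.Timestamp.valueOf("2016-01-01 00:00:00")).toDF("ts").write.json("/tmp/ts_json")
>> 
>>     // restart the shell with a different JVM timezone (e.g. -Duser.timezone=UTC
>>     // vs -Duser.timezone=America/Los_Angeles) and compare what each one displays:
>>     spark.table("ts_text").show(false)
>>     spark.read.schema(new StructType().add("ts", TimestampType)).json("/tmp/ts_json").show(false)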
>> 
>> Given that spark has changed this behavior post 2.0.0, is it still out of 
>> the question to change this behavior to bring it back in line with the sql 
>> standard for timestamp (without time zone) in the 2.x line?  Or, as reynold 
>> proposes, is the only option at this point to add an off-by-default feature 
>> flag to get "timestamp without time zone" semantics?
>> 
>> 
>> Second, there is the question of parsing strings into timestamp type.   I'm 
>> far less knowledgeable about this, so I mostly just have questions:
>> 
>> * does the standard dictate what the parsing behavior should be for 
>> timestamp (without time zone) when a time zone is present?
>> 
>> * if it does and spark violates this standard is it worth trying to retain 
>> the *other* semantics of timestamp without time zone, even if we violate the 
>> parsing part?
>> 
>> I did look at what postgres does for comparison:
>> 
>> https://gist.github.com/squito/cb81a1bb07e8f67e9d27eaef44cc522c
>> 
>> spark's timestamp certainly does not match postgres's timestamp for parsing; 
>> it seems closer to postgres's "timestamp with timezone" -- though I dunno if 
>> that is standard behavior at all.
>> 
>> thanks,
>> Imran
>> 
>> On Fri, May 26, 2017 at 1:27 AM, Reynold Xin <r...@databricks.com> wrote:
>> That's just my point 4, isn't it?
>> 
>> 
>> On Fri, May 26, 2017 at 1:07 AM, Ofir Manor <ofir.ma...@equalum.io> wrote:
>> Reynold,
>> my point is that Spark should aim to follow the SQL standard instead of 
>> rolling its own type system.
>> If I understand correctly, the existing implementation is similar to the 
>> TIMESTAMP WITH LOCAL TIMEZONE data type in Oracle.
>> In addition, there are the standard TIMESTAMP and TIMESTAMP WITH TIMEZONE 
>> data types which are missing from Spark.
>> So, it is better (for me) if instead of extending the existing types, Spark 
>> would just implement the additional well-defined types properly.
>> Just trying to copy-paste CREATE TABLE statements between SQL engines should 
>> not be an exercise in flags and incompatibilities.
>> 
>> Regarding the current behaviour, if I remember correctly I had to force our 
>> spark O/S user into UTC so Spark won't change my timestamps.
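>> 
>> (For reference, an equivalent way to pin the timezone without touching the 
>> O/S user would be something like this in spark-defaults.conf -- illustrative, 
>> not a copy of our actual config:)
>> 
>>     # pin the JVM timezone to UTC on both the driver and the executors
>>     spark.driver.extraJavaOptions    -Duser.timezone=UTC
>>     spark.executor.extraJavaOptions  -Duser.timezone=UTC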
>> 
>> Ofir Manor
>> 
>> Co-Founder & CTO | Equalum
>> 
>> 
>> Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io
>> On Thu, May 25, 2017 at 1:33 PM, Reynold Xin <r...@databricks.com> wrote:
>> Zoltan,
>> 
>> Thanks for raising this again, although I'm a bit confused since I've 
>> communicated with you a few times on JIRA and in private emails to explain 
>> that you have some misunderstanding of the timestamp type in Spark and that 
>> some of your statements are wrong (e.g. the "except for text file" part). Not 
>> sure why you didn't get any of those.
>> 
>> 
>> Here's another try:
>> 
>> 
>> 1. I think you guys misunderstood the semantics of timestamp in Spark before 
>> the session local timezone change. IIUC, Spark has always assumed timestamps 
>> to be with timezone, since it parses timestamps with timezone and does all the 
>> datetime conversions with timezone in mind (it doesn't ignore the timezone if a 
>> timestamp string has a timezone specified). The session local timezone change 
>> pushes Spark further in that direction, but the semantics were already "with 
>> timezone" before that change. Just run Spark on machines with different 
>> timezones and you will know what I'm talking about.
>> 
>> 2. CSV/Text is no different. The data type has always been "with timezone". 
>> If you put a timezone in the timestamp string, it parses the timezone.
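>> 
>> For example (a sketch; the rendered value depends on the machine's timezone, 
>> here assumed to be America/Los_Angeles):
>> 
>>     spark.sql("SELECT CAST('2017-01-01 00:00:00+00:00' AS TIMESTAMP) AS ts").show(false)
>>     // the +00:00 offset in the string is parsed, not ignored; the resulting
>>     // instant is rendered in the machine's timezone, e.g. 2016-12-31 16:00:00
>>     // under America/Los_Angeles -- run the same query on a machine in a
>>     // different timezone and you get a different wall-clock value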
>> 
>> 3. We can't change semantics now, because it'd break all existing Spark apps.
>> 
>> 4. We can however introduce a new timestamp without timezone type, and have 
>> a config flag to specify which one (with tz or without tz) is the default 
>> behavior.
>> 
>> 
>> 
>> On Wed, May 24, 2017 at 5:46 PM, Zoltan Ivanfi <z...@cloudera.com> wrote:
>> Hi,
>> 
>> Sorry if you receive this mail twice, it seems that my first attempt did not 
>> make it to the list for some reason.
>> 
>> I would like to start a discussion about SPARK-18350 
>> <https://issues.apache.org/jira/browse/SPARK-18350> before it gets released 
>> because it seems to be going in a different direction than what other SQL 
>> engines of the Hadoop stack do.
>> 
>> ANSI SQL defines the TIMESTAMP type (also known as TIMESTAMP WITHOUT TIME 
>> ZONE) to have timezone-agnostic semantics - basically a type that expresses 
>> readings from calendars and clocks and is unaffected by time zone. In the 
>> Hadoop stack, Impala has always worked like this and recently Presto also 
>> took steps <https://github.com/prestodb/presto/issues/7122> to become 
>> standards compliant. (Presto's design doc 
>> <https://docs.google.com/document/d/1UUDktZDx8fGwHZV4VyaEDQURorFbbg6ioeZ5KMHwoCk/edit>
>>  also contains a great summary of the different semantics.) Hive has a 
>> timezone-agnostic TIMESTAMP type as well (except for Parquet, a major source 
>> of incompatibility that is already being addressed 
>> <https://issues.apache.org/jira/browse/HIVE-12767>). A TIMESTAMP in 
>> SparkSQL, however, has UTC-normalized local time semantics (except for 
>> textfile), which is generally the semantics of the TIMESTAMP WITH TIME ZONE 
>> type.
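>> 
>> (To illustrate the difference with made-up numbers, suppose a writer running 
>> in PST (UTC-8) stores the wall-clock value 2017-01-01 12:00:00 into a 
>> TIMESTAMP column:)
>> 
>>     timezone-agnostic TIMESTAMP:   every reader sees 2017-01-01 12:00:00,
>>                                    regardless of their time zone
>>     UTC-normalized local time:     stored as the instant 2017-01-01 20:00:00 UTC;
>>                                    a reader in CET (UTC+1) sees 2017-01-01 21:00:00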
>> 
>> Given that timezone-agnostic TIMESTAMP semantics provide standards 
>> compliance and consistency with most SQL engines, I was wondering whether 
>> SparkSQL should also consider them in order to become ANSI SQL compliant and 
>> interoperable with other SQL engines of the Hadoop stack. Should SparkSQL 
>> adopt these semantics in the future, SPARK-18350 
>> <https://issues.apache.org/jira/browse/SPARK-18350> may turn out to be a 
>> source of problems. Please correct me if I'm wrong, but this change seems to 
>> explicitly assign TIMESTAMP WITH TIME ZONE semantics to the TIMESTAMP type. 
>> I think SPARK-18350 would be a great feature for a separate TIMESTAMP WITH 
>> TIME ZONE type, but the plain unqualified TIMESTAMP type would be better off 
>> becoming timezone-agnostic instead of gaining further timezone-aware 
>> capabilities. (Of course, becoming timezone-agnostic would be a behavior 
>> change, so it must be optional and configurable by the user, as in Presto.)
>> 
>> I would like to hear your opinions about this concern and about TIMESTAMP 
>> semantics in general. Does the community agree that a standards-compliant 
>> and interoperable TIMESTAMP type is desired? Do you perceive SPARK-18350 as 
>> a potential problem in achieving this or do I misunderstand the effects of 
>> this change?
>> 
>> Thanks,
>> 
>> Zoltan
>> 
>> ---
>> 
>> List of links in case in-line links do not work:
>> SPARK-18350: https://issues.apache.org/jira/browse/SPARK-18350
>> Presto's change: https://github.com/prestodb/presto/issues/7122
>> Presto's design doc: 
>> https://docs.google.com/document/d/1UUDktZDx8fGwHZV4VyaEDQURorFbbg6ioeZ5KMHwoCk/edit
>> 
>> 
>> 
>> 
> 
> 
