Hi everyone,

I'm not sure if we should introduce the `auto` mode. Taking all the previous discussions around batch-stream unification into account, batch mode and streaming mode should only influence the runtime efficiency and incremental computation. The final query result should be the same in both modes. Also looking into the long-term future, we might drop the mode property and either derive the mode or use different modes for parts of the pipeline.

"I think we may need to think more from the users' perspective."

I agree here and that's why I actually would like to let the user decide which semantics are needed. The config option proposal was my least favored alternative. We should stick to the standard and bahavior of other systems. For both batch and streaming. And use a simple prefix to let users decide whether the semantics are per-record or per-query:

CURRENT_TIMESTAMP       -- semantics as all other vendors


_CURRENT_TIMESTAMP      -- semantics per record

OR

SYS_CURRENT_TIMESTAMP      -- semantics per record


Please check how other vendors are handling this:

SYSDATE          MySql, Oracle
SYSDATETIME      SQL Server


Regards,
Timo


On 02.02.21 07:02, Jingsong Li wrote:
+1 for the default "auto" to the "table.exec.time-function-evaluation".

From the definition of these functions, in my opinion:
- Batch is the instant execution of all records, which is the meaning of
the word "BATCH", so there is only one time at query-start.
- Stream only executes a single record in a moment, so time is generated by
each record.

On the other hand, we should be more careful about consistency with other
systems.

Best,
Jingsong

On Tue, Feb 2, 2021 at 11:24 AM Jark Wu <imj...@gmail.com> wrote:

Hi Leonard, Timo,

I just did some investigation and found all the other batch processing
systems
  evaluate the time functions at query-start, including Snowflake, Hive,
Spark, Trino.
I'm wondering whether the default 'per-record' mode will still be weird for
batch users.
I know we proposed the option for batch users to change the behavior.
However if 90% users need to set this config before submitting batch jobs,
why not
use this mode for batch by default? For the other 10% special users, they
can still
set the config to per-record before submitting batch jobs. I believe this
can greatly
improve the usability for batch cases.

Therefore, what do you think about using "auto" as the default option
value?

It evaluates time functions per-record in streaming mode and evaluates at
query start in batch mode.
I think this can make both streaming users and batch users happy. IIUC, the
reason why we
proposing the default "per-record" mode is for the batch streaming
consistent.
However, I think time functions are special cases because they are
naturally non-deterministic.
Even if streaming jobs and batch jobs all use "per-record" mode, they still
can't provide consistent
results. Thus, I think we may need to think more from the users'
perspective.

Best,
Jark


On Mon, 1 Feb 2021 at 23:06, Timo Walther <twal...@apache.org> wrote:

Hi Leonard,

thanks for considering this issue as well. +1 for the proposed config
option. Let's start a voting thread once the FLIP document has been
updated if there are no other concerns?

Thanks,
Timo


On 01.02.21 15:07, Leonard Xu wrote:
Hi, all

I’ve discussed with @Timo @Jark about the time function evaluation
further. We reach a consensus that we’d better address the time function
evaluation(function value materialization) in this FLIP as well.

We’re fine with introducing an option
table.exec.time-function-evaluation to control the materialize time point
of time function value. The time function includes
LOCALTIME
LOCALTIMESTAMP
CURRENT_DATE
CURRENT_TIME
CURRENT_TIMESTAMP
NOW()
The default value of table.exec.time-function-evaluation is
'per-record', which means Flink evaluates the function value per record,
we
recommend users config this option value for their streaming pipe lines.
Another valid option value is ’query-start’, which means Flink
evaluates
the function value at the query start, we recommend users config this
option value for their batch pipelines.
In the future, more valid evaluation option value like ‘auto' may be
supported if there’re new requirements, e.g: support ‘auto’ option which
evaluates time function value per-record in streaming mode and evaluates
time function value at query start in batch mode.

Alternative1:
       Introduce function like CURRENT_TIMESTAMP2/CURRENT_TIMESTAMP_NOW
which evaluates function value at query start. This may confuse users a
bit
that we provide two similar functions but with different return value.


Alternative2:
         Do not introduce any configuration/function, control the
function evaluation by pipeline execution mode. This may produce
different
result when user use their  streaming pipeline sql to run a batch
pipeline(e.g backfilling), and user also
can not control these function behavior.


How do you think ?

Thanks,
Leonard


在 2021年2月1日,18:23,Timo Walther <twal...@apache.org> 写道:

Parts of the FLIP can already be implemented without a completed
voting, e.g. there is no doubt that we should support TIME(9).

However, I don't see a benefit of reworking the time functions to
rework them again later. If we lock the time on query-start the
implementation of the previsouly mentioned functions will be completely
different.

Regards,
Timo


On 01.02.21 02:37, Kurt Young wrote:
I also prefer to not expand this FLIP further, but we could open a
discussion thread
right after this FLIP being accepted and start coding & reviewing.
Make
technique
discussion and coding more pipelined will improve efficiency.
Best,
Kurt
On Sat, Jan 30, 2021 at 3:47 PM Leonard Xu <xbjt...@gmail.com>
wrote:
Hi, Timo

I do think that this topic must be part of the FLIP as well. Esp.
if
the
FLIP has the title "time function behavior" and this is clearly a
behavioral aspect. We are performing a heavy refactoring of the SQL
query
semantics in Flink here which will affect a lot of users. We cannot
rework
the time functions a third time after this.
I checked a couple of other vendors. It seems that they all lock
the
timestamp when the query is started. And as you said, in this case
both
mature (Oracle) and less mature systems (Hive, MySQL) have the same
behavior.

FLIP-162> “These problems come from the fact that lots of
time-related
functions like PROCTIME(), NOW(), CURRENT_DATE, CURRENT_TIME and
CURRENT_TIMESTAMP are returning time values based on UTC+0 time
zone."
The motivation of  FLIP-162 is to correct the wrong time-related
function
value which caused by timezone. And after our discussed before, we
found
it's related to the function return type compared to SQL standard
and
other
vendors and thus we proposed make the function return type also
consistent.
This is the exact meaning of the FLIP  title and that the FLIP plans
to do.

But for the function materialization mechanism, we didn't consider
yet as
a part of our plan because we need to fix the timezone and function
type
issues no matter we modify the function materialization mechanism in
the
future or not.
So I think it's not belong to this FLIP scope.

It will have been a great work if we can fix current FLIP's 7
proposals
well, we don't want to expand the scope again Eps it's not part of
our
plan.

What do you think? @Timo

And what’s others' thoughts?  @Jark @Kurt

Best,
Leonard




Flink should not differ. I fear that we have to adopt this behavior
as
well to call us standard compliant. Otherwise it will also not be
possible
to have Hive compatibility with proper semantics. It could lead to
unintended behavior.

I see two options for this topic:

1) Clearly distinguish between query-start and processing time

MySQL offers NOW() and SYSDATE() to distinguish the two semantics.
We
could run all the previously discussed functions that have a meaning
in
other systems in query-start time and use a different name for
processing
time. `SYS_TIMESTAMP`, `SYS_DATE`, `SYS_TIME`, `SYS_LOCALTIMESTAMP`,
`SYS_LOCALDATE`, `SYS_LOCALTIME`?

2) Introduce a config option

We are non-compliant by default and allow typical batch behavior if
needed via a config option. But batch/stream unification should not
mean
that we disable certain unification aspects by default.

What do you think?

Regards,
Timo

On 28.01.21 16:51, Leonard Xu wrote:
Hi, Timo
I'm sorry that I need to open another discussion thread befoe
voting
but I think we should also discuss this in this FLIP before it pops
up at a
later stage.

How do we want our time functions to behave in long running
queries?
It’s okay to open this thread. Although I don’t want to consider
the
function value materialization in this FLIP scope,  I could try
explain
something.
See also:



https://stackoverflow.com/questions/5522656/sql-now-in-long-running-query

I think this was never discussed thoroughly. Actually
CURRENT_TIMESTAMP/NOW/LOCALTIMESTAMP should have slightly different
semantics than PROCTIME(). What it is our current behavior? Are we
materializing those time values during planning?
Currently CURRENT_TIMESTAMP/NOW/LOCALTIMESTAMP  keeps same
behavior
in
both Batch and Stream world,  the function value is materialized for
per
record not the query start(plan phase).
For  PROCTIME(), it also keeps same behavior  in both Batch and
Stream
world, in fact we just supported PROCTIME() in Batch last week[1].
In one word, we keep same semantics/behavior for Batch and Stream.
Esp. long running batch queries might suffer from inconsistencies
here. When a timestamp is produced by one operator using
CURRENT_TIMESTAMP
and a different one might filter relating to CURRENT_TIMESTAMP.
It’s a good question, and I've found some users have asked
simillar
questions in user/user-zh mail-list,  given a fact that many Batch
systems
like Hive/Presto using the value of query start, but it’s not
suitable for
Stream engine, for example user will use CURRENT_TIMESTAMP to define
event
time.
As a unified Batch/Stream SQL engine, keep same semantics/behavior
is
important, and I agree the Batch user case should also be
considered.
But I think this should be discussed in another topic like 'the
unification of Batch/Stream' which is beyond the scope of this FLIP.
This FLIP aims to correct the wrong return type/return value of
current
time functions.
Best,
Leonard
[1] https://issues.apache.org/jira/browse/FLINK-17868 <
https://issues.apache.org/jira/browse/FLINK-17868> <
https://issues.apache.org/jira/browse/FLINK-17868 <
https://issues.apache.org/jira/browse/FLINK-17868>>
Regards,
Timo


On 28.01.21 13:46, Leonard Xu wrote:
Hi, Jark
I have a minor suggestion:
I think we will still suggest users use TIMESTAMP even if we
have
TIMESTAMP_NTZ. Then it seems
introducing TIMESTAMP_NTZ doesn't help much for users, but
introduces more learning costs.
I think your suggestion makes sense, we should suggest users use
TIMESTAMP for TIMESTAMP WITHOUT TIME ZONE as we did now, updated as
following:
     original type name :
                        shortcut type name :
TIMESTAMP / TIMESTAMP WITHOUT TIME ZONE         <=> TIMESTAMP
TIMESTAMP WITH LOCAL TIME ZONE                            <=>
TIMESTAMP_LTZ
TIMESTAMP WITH TIME ZONE
  <=>
TIMESTAMP_TZ     (supports them in the future)
Best,
Leonard





On Thu, 28 Jan 2021 at 18:52, Leonard Xu <xbjt...@gmail.com
<mailto:
xbjt...@gmail.com> <mailto:xbjt...@gmail.com <mailto:
xbjt...@gmail.com>>>
wrote:

Thanks all for sharing your opinions.

Looks like  we’ve reached a consensus about the topic.

@Timo:
1) Are we on the same page that LOCALTIMESTAMP returns
TIMESTAMP
and not
TIMESTAMP_LTZ? Maybe we should quickly list also
LOCALTIME/LOCALDATE and
LOCALTIMESTAMP for completeness.
Yes, LOCALTIMESTAMP returns TIMESTAMP, LOCALTIME returns TIME,
the
behavior of them is clear so I just listed them in the
excel[1]
of
this
FLIP references.

2) Shall we add aliases for the timestamp types as part of
this
FLIP? I
see Snowflake supports TIMESTAMP_LTZ , TIMESTAMP_NTZ ,
TIMESTAMP_TZ
[1]. I
think the discussion was quite cumbersome with the full string
of
`TIMESTAMP WITH LOCAL TIME ZONE`. With this FLIP we are making
this
type
even more prominent. And important concepts should have a
short
name
because they are used frequently. According to the FLIP, we
are
introducing
the abbriviation already in function names like
`TO_TIMESTAMP_LTZ`.
`TIMESTAMP_LTZ` could be treated similar to `STRING` for
`VARCHAR(MAX_INT)`, the serializable string representation
would
not change.

@Timo @Jark
Nice idea, I also suffered from the long name during the
discussions, the
abbreviation will not only help us, but also makes it more
convenient for
users. I list the abbreviation name mapping to support:
TIMESTAMP WITHOUT TIME ZONE         <=> TIMESTAMP_NTZ   (which
synonyms
TIMESTAMP)
TIMESTAMP WITH LOCAL TIME ZONE    <=> TIMESTAMP_LTZ
TIMESTAMP WITH TIME ZONE                 <=> TIMESTAMP_TZ
   (supports
them in the future)
3) I'm fine with supporting all conversion classes like
java.time.LocalDateTime, java.sql.Timestamp that TimestampType
supported
for LocalZonedTimestampType. But we agree that Instant stays
the
default
conversion class right? The default extraction defined in [2]
will
not
change, correct?
Yes, Instant stays the default conversion class. The default

4) I would remove the comment "Flink supports TIME-related
types
with
precision well", because unfortunately this is still not
correct.
We still
have issues with TIME(9), it would be great if someone can
finally
fix that
though. Maybe the implementation of this FLIP would be a good
time
to fix
this issue.
You’re right, TIME(9) is not supported yet, I'll take account
of
TIME(9)
to the scope of this FLIP.


I’ve updated this FLIP[2] according your suggestions @Jark
@Timo
I’ll start the vote soon if there’re no objections.

Best,
Leonard

[1]



https://docs.google.com/spreadsheets/d/1T178krh9xG-WbVpN7mRVJ8bzFnaSJx3l-eg1EWZe_X4/edit?usp=sharing
<



https://docs.google.com/spreadsheets/d/1T178krh9xG-WbVpN7mRVJ8bzFnaSJx3l-eg1EWZe_X4/edit?usp=sharing
<


https://docs.google.com/spreadsheets/d/1T178krh9xG-WbVpN7mRVJ8bzFnaSJx3l-eg1EWZe_X4/edit?usp=sharing


[2]



https://cwiki.apache.org/confluence/display/FLINK/FLIP-162%3A+Consistent+Flink+SQL+time+function+behavior
<


https://cwiki.apache.org/confluence/display/FLINK/FLIP-162%3A+Consistent+Flink+SQL+time+function+behavior

<



https://cwiki.apache.org/confluence/display/FLINK/FLIP-162:+Consistent+Flink+SQL+time+function+behavior
<


https://cwiki.apache.org/confluence/display/FLINK/FLIP-162:+Consistent+Flink+SQL+time+function+behavior




On 28.01.21 03:18, Jark Wu wrote:
Thanks Leonard for the further investigation.
I think we all agree we should correct the return value of
CURRENT_TIMESTAMP.
Regarding the return type of CURRENT_TIMESTAMP, I also agree
TIMESTAMP_LTZ
would be more worldwide useful. This may need more effort,
but if
this
is
the right direction, we should do it.
Regarding the CURRENT_TIME, if CURRENT_TIMESTAMP returns
TIMESTAMP_LTZ, then I think CURRENT_TIME shouldn't return
TIME_TZ.
Otherwise, CURRENT_TIME will be quite special and strange.
Thus I think it has to return TIME type. Given that we
already
have
CURRENT_DATE which returns
DATE WITHOUT TIME ZONE, I think it's fine to return TIME
WITHOUT
TIME
ZONE
for CURRENT_TIME.
In a word, the updated FLIP looks good to me. I especially
like
the
proposed new function TO_TIMESTAMP_LTZ(numeric, [,scale]).
This will be very convenient to define rowtime on a long
value
which is
a
very common case and has been complained a lot in mailing
list.
Best,
Jark
On Mon, 25 Jan 2021 at 21:12, Kurt Young <ykt...@gmail.com>
wrote:
Thanks Leonard for the detailed response and also the bad
case
about
option
1, these all
make sense to me.

Also nice catch about conversion support of
LocalZonedTimestampType, I
think it actually
makes sense to support java.sql.Timestamp as well as
java.time.LocalDateTime. It also has
a slight benefit that we might have a chance to run the udf
which took
them
as input parameter
after we change the return type.

Regarding to the return type of CURRENT_TIME, I also think
timezone
information is not useful.
To not expand this FLIP further, I'm lean to keep it as it
is.

Best,
Kurt


On Mon, Jan 25, 2021 at 8:50 PM Leonard Xu <
xbjt...@gmail.com>
wrote:

Hi, All

Thanks for your comments. I think all of the thread have
agreed
that:
(1) The return values of
CURRENT_TIME/CURRENT_TIMESTAMP/NOW()/PROCTIME()
are wrong.
(2) The LOCALTIME/LOCALTIMESTAMP and
CURRENT_TIME/CURRENT_TIMESTAMP
should
be different whether from SQL standard’s perspective or
mature
systems.
(3) The semantics of three TIMESTAMP types in Flink SQL
follows
the
SQL
standard and also keeps the same with other 'good'
vendors.
     TIMESTAMP                                   =>  A
literal in
‘yyyy-MM-dd HH:mm:ss’ format to describe a time, does not
contain
timezone
info, can not represent an absolute time point.
     TIMESTAMP WITH LOCAL ZONE =>  Records the elapsed time
from
absolute
time point origin, can represent an absolute time point,
requires
local
time zone when expressed with ‘yyyy-MM-dd HH:mm:ss’
format.
     TIMESTAMP WITH TIME ZONE    =>  Consists of time zone
info
and a
literal in ‘yyyy-MM-dd HH:mm:ss’ format to describe time,
can
represent
an
absolute time point.


Currently we've two ways to correct
CURRENT_TIME/CURRENT_TIMESTAMP/NOW()/PROCTIME().

option (1): As the FLIP proposed, change the return value
from
UTC
timezone to local timezone.
         Pros:   (1) The change looks smaller to users and
developers
(2)
There're many SQL engines adopted this way
         Cons:  (1) connector devs may confuse the
underlying
value of
TimestampData which needs to change according to data type
(2)
I
thought
about this weekend. Unfortunately I found a bad case:

The proposal is fine if we only use it in FLINK SQL world,
but
we
need to
consider the conversion between Table/DataStream, assume a
record
produced
in UTC+0 timezone with TIMESTAMP '1970-01-01 08:00:44'
and
the
Flink
SQL
processes the data with session time zone 'UTC+8', if the
sql
program
need
to convert the Table to DataStream, then we need to
calculate
the
timestamp
in StreamRecord with session time zone (UTC+8), then we
will
get 44 in
DataStream program, but it is wrong because the expected
value
should
be
(8
* 60 * 60 + 44). The corner case tell us that the
ROWTIME/PROCTIME in
Flink
are based on UTC+0, when correct the PROCTIME() function,
the
better
way
is
to use TIMESTAMP WITH LOCAL TIME ZONE which keeps same
long
value with
time
based on UTC+0 and can be expressed with  local timezone.

option (2) : As we considered in the FLIP as well as @Timo
suggested,
change the return type to TIMESTAMP WITH LOCAL TIME ZONE,
the
expressed
value depends on the local time zone.
         Pros: (1) Make Flink SQL more close to SQL
standard  (2)
Can
deal
the conversion between Table/DataStream well
         Cons: (1) We need to discuss the return value/type
of
CURRENT_TIME
function (2) The change is bigger to users, we need to
support
TIMESTAMP
WITH LOCAL TIME ZONE in connectors/formats as well as
custom
connectors.
                    (3)The TIMESTAMP WITH LOCAL TIME ZONE
support
is
weak
in Flink, thus we need some improvement,but the workload
does
not
matter
as long as we are doing the right thing ^_^

Due to the above bad case for option (1). I think option 2
should be
adopted,
But we also need to consider some problems:
(1) More conversion classes like LocalDateTime,
sql.Timestamp
should
be
supported for LocalZonedTimestampType to resolve the UDF
compatibility
issue
(2) The timezone offset for window size of one day should
still
be
considered
(3) All connectors/formats should supports TIMESTAMP WITH
LOCAL
TIME
ZONE
well and we also should record in document
I’ll update these sections of FLIP-162.



We also need to discuss the CURRENT_TIME function. I know
the
standard
way
is using TIME WITH TIME ZONE(there's no TIME WITH LOCAL
TIME
ZONE),
but
we
don't support this type yet and I don't see strong
motivation to
support
it
so far.
Compared to CURRENT_TIMESTAMP, the CURRENT_TIME can not
represent an
absolute time point which should be considered as a string
consisting
of
a
time with 'HH:mm:ss' format and time zone info.  We have
several
options
for this:
(1) We can forbid CURRENT_TIME as @Timo proposed to make
all
Flink SQL
functions follow the standard well,  in this way, we need
to
offer
some
guidance for user upgrading Flink versions.
(2) We can also support it from a user's perspective who
has
used
CURRENT_DATE/CURRENT_TIME/CURRENT_TIMESTAMP, btw,Snowflake
also
returns
TIME type.
(3) Returns TIMESTAMP WITH LOCAL TIME ZONE to make it
equal
to
CURRENT_TIMESTAMP as Calcite did.

I can image (1) which we don't want to left a bad smell in
Flink SQL,
and
I also accept (2) because I think users do not consider
time
zone
issues
when they use CURRENT_DATE/CURRENT_TIME, and the timezone
info
in
time is
not very useful.

I don’t have a strong opinion  for them.  What do others
think?


I hope I've addressed your concerns. @Timo @Kurt

Best,
Leonard



Most of the mature systems have a clear difference
between
CURRENT_TIMESTAMP and LOCALTIMESTAMP. I wouldn't take
Spark
or
Hive
as a
good example. Snowflake decided for TIMESTAMP WITH LOCAL
TIME
ZONE.
As I
mentioned in the last comment, I could also imagine this
behavior for
Flink. But in any case, there should be some time zone
information
considered in order to cast to all other types.

The function CURRENT_DATE/CURRENT_TIME is supporting
in
SQL
standard, but
LOCALDATE not, I don’t think it’s a good idea that
dropping
functions which
SQL standard supported and introducing a replacement
which
SQL
standard not
reminded.

We can still add those functions in the future. But since
we
don't
offer
a TIME WITH TIME ZONE, it is better to not support this
function at
all
for
now. And by the way, this is exactly the behavior that
also
Microsoft
SQL
Server does: it also just supports CURRENT_TIMESTAMP (but
it
returns
TIMESTAMP without a zone which completes the confusion).

I also agree returning  TIMESTAMP WITH LOCAL TIME ZONE
for
PROCTIME
has
more clear semantics, but I realized that user didn’t
care
the
type
but
more about the expressed value they saw, and change
the
type from
TIMESTAMP
to TIMESTAMP WITH LOCAL TIME ZONE brings huge refactor
that
we
need
consider all places where the TIMESTAMP type used

  From a UDF perspective, I think nothing will change. The
new
type
system
and type inference were designed to support all these
cases.
There is
a
reason why Java has adopted Joda time, because it is hard
to
come up
with a
good time library. That's why also we and the other Hadoop
ecosystem
folks
have decided for 3 different kinds of LocalDateTime,
ZonedDateTime,
and
Instance. It makes the library more complex, but time is a
complex
topic.

I also doubt that many users work with only one time
zone.
Take the
US
as an example, a country with 3 different timezones.
Somebody
working
with
US data cannot properly see the data points with just
LOCAL
TIME ZONE.
But
on the other hand, a lot of event data is stored using a
UTC
timestamp.


Before jumping into technique details, let's take a
step
back to
discuss
user experience.

The first important question is what kind of date and
time
will
Flink
display when users call
   CURRENT_TIMESTAMP and maybe also PROCTIME (if we
think
they
are
similar).

Should it always display the date and time in UTC or in
the
user's
time
zone?

@Kurt: I think we all agree that the current behavior
with
just
showing
UTC is wrong. Also, we all agree that when calling
CURRENT_TIMESTAMP
or
PROCTIME a user would like to see the time in it's current
time
zone.

As you said, "my wall clock time".

However, the question is what is the data type of what
you
"see". If
you
pass this record on to a different system, operator, or
different
cluster,
should the "my" get lost or materialized into the record?

TIMESTAMP -> completely lost and could cause confusion
in a
different
system

TIMESTAMP WITH LOCAL TIME ZONE -> at least the UTC is
correct,
so you
can provide a new local time zone

TIMESTAMP WITH TIME ZONE -> also "your" location is
persisted

Regards,
Timo




On 22.01.21 09:38, Kurt Young wrote:
Forgot one more thing. Continue with displaying in UTC.
As a
user,
if
Flink
want to display the timestamp
in UTC, why don't we offer something like UTC_TIMESTAMP?
Best,
Kurt
On Fri, Jan 22, 2021 at 4:33 PM Kurt Young <
ykt...@gmail.com>
wrote:
Before jumping into technique details, let's take a
step
back to
discuss
user experience.

The first important question is what kind of date and
time
will
Flink
display when users call
CURRENT_TIMESTAMP and maybe also PROCTIME (if we think
they
are
similar).

Should it always display the date and time in UTC or in
the
user's
time
zone? I think this part is the
reason that surprised lots of users. If we forget about
the
type
and
internal representation of these
two methods, as a user, my instinct tells me that these
two
methods
should
display my wall clock time.

Display time in UTC? I'm not sure, why I should care
about
UTC
time?
I
want to get my current timestamp.
For those users who have never gone abroad, they might
not
even be
able to
realize that this is affected
by the time zone.

Best,
Kurt


On Fri, Jan 22, 2021 at 12:25 PM Leonard Xu <
xbjt...@gmail.com>
wrote:

Thanks @Timo for the detailed reply, let's go on this
topic
on
this
discussion,  I've merged all mails to this discussion.

LOCALDATE / LOCALTIME / LOCALTIMESTAMP

--> uses session time zone, returns
DATE/TIME/TIMESTAMP


CURRENT_DATE/CURRENT_TIME/CURRENT_TIMESTAMP

--> uses session time zone, returns
DATE/TIME/TIMESTAMP

I'm very sceptical about this behavior. Almost all
mature
systems
(Oracle, Postgres) and new high quality systems
(Presto,
Snowflake)
use a
data type with some degree of time zone information
encoded. In a
globalized world with businesses spanning different
regions, I
think
we
should do this as well. There should be a difference
between
CURRENT_TIMESTAMP and LOCALTIMESTAMP. And users should
be
able to
choose
which behavior they prefer for their pipeline.


I know that the two series should be different at
first
glance,
but
different SQL engines can have their own
explanations,for
example,
CURRENT_TIMESTAMP and LOCALTIMESTAMP are synonyms in
Snowflake[1]
and
has
no difference, and Spark only supports the later one
and
doesn’t
support
LOCALTIME/LOCALTIMESTAMP[2].


If we would design this from scatch, I would suggest
the
following:

- drop CURRENT_DATE / CURRENT_TIME and let users pick
LOCALDATE /
LOCALTIME for materialized timestamp parts

The function CURRENT_DATE/CURRENT_TIME is supporting
in
SQL
standard,
but
LOCALDATE not, I don’t think it’s a good idea that
dropping
functions
which
SQL standard supported and introducing a replacement
which
SQL
standard not
reminded.


- CURRENT_TIMESTAMP should return a TIMESTAMP WITH
TIME
ZONE to
materialize all session time information into every
record.
It it
the
most
generic data type and allows to cast to all other
timestamp
data
types.
This generic ability can be used for filter predicates
as
well
either
through implicit or explicit casting.

TIMESTAMP WITH TIME ZONE indeed contains more
information to
describe
a
time point, but the type TIMESTAMP  can cast to all
other
timestamp
data
types combining with session time zone as well, and it
also
can be
used for
filter predicates. For type casting between BIGINT and
TIMESTAMP,
I
think
the function way using
TO_TIMEMTAMP()/FROM_UNIXTIMESTAMP()
is more
clear.

PROCTIME/ROWTIME should be time functions based on a
long
value.
Both
System.currentMillis() and our watermark system work
on
long
values.
Those
should return TIMESTAMP WITH LOCAL TIME ZONE because
the
main
calculation
should always happen based on UTC.
We discussed it in a different thread, but we should
allow
PROCTIME
globally. People need a way to create instances of
TIMESTAMP WITH
LOCAL
TIME ZONE. This is not considered in the current
design
doc.
Many pipelines contain UTC timestamps and thus it
should
be easy
to
create one.
Also, both CURRENT_TIMESTAMP and LOCALTIMESTAMP can
work
with
this
type
because we should remember that TIMESTAMP WITH LOCAL
TIME
ZONE
accepts all
timestamp data types as casting target [1]. We could
allow
TIMESTAMP
WITH
TIME ZONE in the future for ROWTIME.
In any case, windows should simply adapt their
behavior to
the
passed
timestamp type. And with TIMESTAMP WITH LOCAL TIME
ZONE
a
day is
defined by
considering the current session time zone.

I also agree returning  TIMESTAMP WITH LOCAL TIME ZONE
for
PROCTIME
has
more clear semantics, but I realized that user didn’t
care
the
type
but
more about the expressed value they saw, and change
the
type from
TIMESTAMP
to TIMESTAMP WITH LOCAL TIME ZONE brings huge refactor
that
we
need
consider all places where the TIMESTAMP type used, and
many
builtin
functions and UDFs doest not support  TIMESTAMP WITH
LOCAL
TIME
ZONE
type.
That means both user and Flink devs need to refactor
the
code(UDF,
builtin
functions, sql pipeline), to be honest, I didn’t see
strong
motivation that
we have to do the pretty big refactor from user’s
perspective and
developer’s perspective.

In one word, both your suggestion and my proposal can
resolve
almost
all
user problems,the divergence is whether we need to
spend
pretty
energy just
to get a bit more accurate semantics?   I think we
need
a
tradeoff.


Best,
Leonard
[1]




https://trino.io/docs/current/functions/datetime.html#current_timestamp
<




https://trino.io/docs/current/functions/datetime.html#current_timestamp>
[2] https://issues.apache.org/jira/browse/SPARK-30374
<
https://issues.apache.org/jira/browse/SPARK-30374>






2021-01-22,00:53,Timo Walther <twal...@apache.org> :

Hi Leonard,

thanks for working on this topic. I agree that time
handling is
not
easy in Flink at the moment. We added new time data
types
(and
some
are
still not supported which even further complicates
things
like
TIME(9)). We
should definitely improve this situation for users.

This is a pretty opinionated topic and it seems that
the
SQL
standard
is not really deciding this but is at least
supporting.
So
let me
express
my opinion for the most important functions:

LOCALDATE / LOCALTIME / LOCALTIMESTAMP

--> uses session time zone, returns
DATE/TIME/TIMESTAMP

I think those are the most obvious ones because the
LOCAL
indicates
that the locality should be materialized into the
result
and any
time
zone
information (coming from session config or data) is
not
important
afterwards.

CURRENT_DATE/CURRENT_TIME/CURRENT_TIMESTAMP

--> uses session time zone, returns
DATE/TIME/TIMESTAMP

I'm very sceptical about this behavior. Almost all
mature
systems
(Oracle, Postgres) and new high quality systems
(Presto,
Snowflake)
use a
data type with some degree of time zone information
encoded. In a
globalized world with businesses spanning different
regions, I
think
we
should do this as well. There should be a difference
between
CURRENT_TIMESTAMP and LOCALTIMESTAMP. And users should
be
able to
choose
which behavior they prefer for their pipeline.

If we would design this from scatch, I would suggest
the
following:

- drop CURRENT_DATE / CURRENT_TIME and let users pick
LOCALDATE /
LOCALTIME for materialized timestamp parts

- CURRENT_TIMESTAMP should return a TIMESTAMP WITH
TIME
ZONE to
materialize all session time information into every
record.
It it
the
most
generic data type and allows to cast to all other
timestamp
data
types.
This generic ability can be used for filter predicates
as
well
either
through implicit or explicit casting.

PROCTIME/ROWTIME should be time functions based on a
long
value.
Both
System.currentMillis() and our watermark system work
on
long
values.
Those
should return TIMESTAMP WITH LOCAL TIME ZONE because
the
main
calculation
should always happen based on UTC. We discussed it in
a
different
thread,
but we should allow PROCTIME globally. People need a
way to
create
instances of TIMESTAMP WITH LOCAL TIME ZONE. This is
not
considered
in the
current design doc. Many pipelines contain UTC
timestamps
and thus
it
should be easy to create one. Also, both
CURRENT_TIMESTAMP
and
LOCALTIMESTAMP can work with this type because we
should
remember
that
TIMESTAMP WITH LOCAL TIME ZONE accepts all timestamp
data
types as
casting
target [1]. We could allow TIMESTAMP WITH TIME ZONE in
the
future
for
ROWTIME.

In any case, windows should simply adapt their
behavior to
the
passed
timestamp type. And with TIMESTAMP WITH LOCAL TIME
ZONE
a
day is
defined by
considering the current session time zone.

If we would like to design this with less effort
required,
we
could
think about returning TIMESTAMP WITH LOCAL TIME ZONE
also
for
CURRENT_TIMESTAMP.


I will try to involve more people into this
discussion.

Thanks,
Timo

[1]






https://docs.oracle.com/en/database/oracle/oracle-database/21/sqlrf/Data-Types.html#GUID-E7CA339A-2093-4FE4-A36E-1D09593591D3
<






https://docs.oracle.com/en/database/oracle/oracle-database/21/sqlrf/Data-Types.html#GUID-E7CA339A-2093-4FE4-A36E-1D09593591D3




2021-01-21,22:32,Leonard Xu <xbjt...@gmail.com> :
Before the changes, as I am writing this reply, the
local
time
here
is
2021-01-21 12:03:35 (Beijing time, UTC+8).
And I tried these 5 functions in sql client, and
got:

Flink SQL> select now(), PROCTIME(),
CURRENT_TIMESTAMP,
CURRENT_DATE,
CURRENT_TIME;







+-------------------------+-------------------------+-------------------------+--------------+--------------+
|                  EXPR$0 |                  EXPR$1
|
CURRENT_TIMESTAMP | CURRENT_DATE | CURRENT_TIME |







+-------------------------+-------------------------+-------------------------+--------------+--------------+
| 2021-01-21T04:03:35.228 | 2021-01-21T04:03:35.228
|
2021-01-21T04:03:35.228 |   2021-01-21 | 04:03:35.228
|







+-------------------------+-------------------------+-------------------------+--------------+--------------+
After the changes, the expected behavior will change
to:

Flink SQL> select now(), PROCTIME(),
CURRENT_TIMESTAMP,
CURRENT_DATE,
CURRENT_TIME;







+-------------------------+-------------------------+-------------------------+--------------+--------------+
|                  EXPR$0 |                  EXPR$1
|
CURRENT_TIMESTAMP | CURRENT_DATE | CURRENT_TIME |







+-------------------------+-------------------------+-------------------------+--------------+--------------+
| 2021-01-21T12:03:35.228 | 2021-01-21T12:03:35.228
|
2021-01-21T12:03:35.228 |   2021-01-21 | 12:03:35.228
|







+-------------------------+-------------------------+-------------------------+--------------+--------------+
The return type of now(), proctime() and
CURRENT_TIMESTAMP still
be
TIMESTAMP;

To Kurt, thanks  for the intuitive case, it really
clear,
you’re
wright
that I want to propose to change the return value of
these
functions.
It’s
the most important part of the topic from user's
perspective.

I think this definitely deserves a FLIP.
To Jark,  nice suggestion, I prepared a FLIP for this
topic, and
will
start the FLIP discussion soon.

If use the default Flink SQL,&nbsp; the window time
range of
the
statistics is incorrect, then the statistical
results
will
naturally
be
incorrect.
To zhisheng, sorry to hear that this problem
influenced
your
production
jobs,  Could you share your SQL pattern?  we can have
more
inputs
and
try
to resolve them.


Best,
Leonard



2021-01-21,14:19,Jark Wu <imj...@gmail.com> :

Great examples to understand the problem and the
proposed
changes,
@Kurt!

Thanks Leonard for investigating this problem.
The time-zone problems around time functions and
windows
have
bothered a
lot of users. It's time to fix them!

The return value changes sound reasonable to me, and
keeping the
return
type unchanged will minimize the surprise to the
users.
Besides that, I think it would be better to mention
how
this
affects
the
window behaviors, and the interoperability with
DataStream.

I think this definitely deserves a FLIP.

====================================================

Hi zhisheng,

Do you have examples to illustrate which case will
get
the
wrong
window
boundaries?
That will help to verify whether the proposed changes
can
solve
your
problem.

Best,
Jark




2021-01-21,12:54,zhisheng <173855...@qq.com> :

Thanks to Leonard Xu for discussing this tricky
topic.
At
present,
there are many Flink jobs in our production
environment
that are
used
to
count day-level reports (eg: count PV/UV ).&nbsp;

If use the default Flink SQL,&nbsp; the window time
range
of the
statistics is incorrect, then the statistical results
will
naturally
be
incorrect.&nbsp;

The user needs to deal with the time zone manually in
order to
solve
the problem.&nbsp;

If Flink itself can solve these time zone issues,
then
I
think it
will
be user-friendly.

Thank you

Best!;
zhisheng




2021-01-21,12:11,Kurt Young <ykt...@gmail.com> :

cc this to user & user-zh mailing list because this
will
affect
lots
of
users, and also quite a lot of users
were asking questions around this topic.

Let me try to understand this from user's
perspective.

Your proposal will affect five functions, which are:
PROCTIME()
NOW()
CURRENT_DATE
CURRENT_TIME
CURRENT_TIMESTAMP
Before the changes, as I am writing this reply, the
local
time
here
is
2021-01-21 12:03:35 (Beijing time, UTC+8).
And I tried these 5 functions in sql client, and got:

Flink SQL> select now(), PROCTIME(),
CURRENT_TIMESTAMP,
CURRENT_DATE,
CURRENT_TIME;







+-------------------------+-------------------------+-------------------------+--------------+--------------+
|                  EXPR$0 |                  EXPR$1 |
CURRENT_TIMESTAMP | CURRENT_DATE | CURRENT_TIME |







+-------------------------+-------------------------+-------------------------+--------------+--------------+
| 2021-01-21T04:03:35.228 | 2021-01-21T04:03:35.228 |
2021-01-21T04:03:35.228 |   2021-01-21 | 04:03:35.228
|







+-------------------------+-------------------------+-------------------------+--------------+--------------+
After the changes, the expected behavior will change
to:

Flink SQL> select now(), PROCTIME(),
CURRENT_TIMESTAMP,
CURRENT_DATE,
CURRENT_TIME;







+-------------------------+-------------------------+-------------------------+--------------+--------------+
|                  EXPR$0 |                  EXPR$1 |
CURRENT_TIMESTAMP | CURRENT_DATE | CURRENT_TIME |







+-------------------------+-------------------------+-------------------------+--------------+--------------+
| 2021-01-21T12:03:35.228 | 2021-01-21T12:03:35.228 |
2021-01-21T12:03:35.228 |   2021-01-21 | 12:03:35.228
|







+-------------------------+-------------------------+-------------------------+--------------+--------------+
The return type of now(), proctime() and
CURRENT_TIMESTAMP
still
be
TIMESTAMP;

Best,
Kurt











Reply via email to