Thanks for the update Danny. +1 for this proposal.

Regards,
Timo

On 26.03.20 04:51, Danny Chan wrote:
Thanks everyone who engaged in this discussion ~

Our goal is "Supports Dynamic Table Options for Flink SQL". After an
offline discussion with Kurt, Timo and Dawid, we have made the final
conclusion, here is the summary:


    - Use comment style syntax to specify the dynamic table options: "/*+
    *OPTIONS*(k1='v1', k2='v2') */"
    - Have constraint on the options keys: the options that may bring in
    security problems should not be allowed, i.e. Kafka connector zookeeper
    endpoint URL and topic name
    - Use white-list to control the allowed options for each connector,
    which is more safe for future extention
    - We allow to enable/disable this feature globally
    - Implement based on the current code base first, and when FLIP-95 is
    checked in, implement this feature based on new interface

Any suggestions are appreciated ~

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-113%3A+Supports+Dynamic+Table+Options+for+Flink+SQL

Best,
Danny Chan

Jark Wu <imj...@gmail.com> 于2020年3月18日周三 下午10:38写道:

Hi everyone,

Sorry, but I'm not sure about the `supportedHintOptions`. I'm afraid it
doesn't solve the problems but increases some development and learning
burdens.

# increase development and learning burden

According to the discussion so far, we want to support overriding a subset
of options in hints which doesn't affect semantics.
With the `supportedHintOptions`, it's up to the connector developers to
decide which options will not affect semantics, and to be hint options.
However, the question is how to distinguish whether an option will *affect
semantics*? What happens if an option will affect semantics but provided as
hint options?
 From my point of view, it's not easy to distinguish. For example, the
"format.ignore-parse-error" can be a very useful dynamic option but that
will affect semantic, because the result is different (null vs exception).
Another example, the "connector.lookup.cache.*" options are also very
useful to tune jobs, however, it will also affect the job results. I can
come up many more useful options but may affect semantics.

I can see that the community will under endless discussion around "can this
option to be a hint option?",  "wether this option will affect semantics?".
You can also find that we already have different opinions on
"ignore-parse-error". Those discussion is a waste of time! That's not what
users want!
The problem is user need this, this, this options and HOW to expose them?
We should focus on that.

Then there could be two endings in the future:
1) compromise on the usability, we drop the rule that hints don't affect
semantics, allow all the useful options in the hints list.
2) stick on the rule, users will find this is a stumbling feature which
doesn't solve their problems.
     And they will be surprised why this option can't be set, but the other
could. *semantic* is hard to be understood by users.

# doesn't solve the problems

I think the purpose of this FLIP is to allow users to quickly override some
connectors' properties to tune their jobs.
However, `supportedHintOptions` is off track. It only allows a subset
options and for the users it's not *clear* which subset is allowed.

Besides, I'm not sure `supportedHintOptions` can work well for all cases.
How could you support kafka properties (`connector.properties.*`) as hint
options? Some kafka properties may affect semantics (bootstrap.servers),
some may not (max.poll.records). Besides, I think it's not possible to list
all the possible kafka properties [1].

In summary, IMO, `supportedHintOptions`
(1) it increase the complexity to develop a connector
(2) it confuses users which options can be used in hint, which are not,
they have to check the docs again and again.
(3) it doesn't solve the problems which we want to solve by this FLIP.

I think we should avoid introducing some partial solutions. Otherwise, we
will be stuck in a loop that introduce new API -> deprecate API ->
introduce new API....

I personally in favor of an explicit WITH syntax after the table as a part
of the query which is mentioned by Kurt before, e.g. SELECT * from T
WITH('key' = 'value') .
It allows users to dynamically set options which can affect semantics. It
will be very flexible to solve users' problems so far.

Best,
Jark

[1]: https://kafka.apache.org/documentation/#consumerconfigs

On Wed, 18 Mar 2020 at 21:44, Danny Chan <yuzhao....@gmail.com> wrote:

My POC is here for the hints options merge [1].

Personally, I have no strong objections for splitting hints with the
CatalogTable, the only cons is a more complex implementation but the
concept is more clear, and I have updated the WIKI.

I think it would be nice if we can support the format “ignore-parse
error”
option key, the CSV source already has a key [2] and we can use that in
the
supportedHIntOptions, for the common CSV and JSON formats, we cal also
give
a support. This is the only kind of key in formats that “do not change
the
semantics” (somehow), what do you think about this ~

[1]

https://github.com/danny0405/flink/commit/5d925fa16c3c553423c4b7d93001521b8e6e6bee#diff-6e569a6dd124fd2091c18e2790fb49c5
[2]

https://github.com/apache/flink/blob/b83060dff6d403b6994b6646b3f29a374f599530/flink-table/flink-table-api-java-bridge/src/main/java/org/apache/flink/table/sources/CsvTableSourceFactoryBase.java#L92

Best,
Danny Chan
在 2020年3月18日 +0800 PM9:10,Timo Walther <twal...@apache.org>,写道:
Hi everyone,

+1 to Kurt's suggestion. Let's just have it in source and sink
factories
for now. We can still move this method up in the future. Currently, I
don't see a need for catalogs or formats. Because how would you target
a
format in the query?

@Danny: Can you send a link to your PoC? I'm very skeptical about
creating a new CatalogTable in planner. Actually CatalogTable should be
immutable between Catalog and Factory. Because a catalog can return its
own factory and fully control the instantiation. Depending on the
implementation, that means it can be possible that the catalog has
encoded more information in a concrete subclass implementing the
interface. I vote for separating the concerns of catalog information
and
hints in the factory explicitly.

Regards,
Timo


On 18.03.20 05:41, Jingsong Li wrote:
Hi,

I am thinking we can provide hints to *table* related instances.
- TableFormatFactory: of cause we need hints support, there are many
format
options in DDL too.
- catalog and module: I don't know, maybe in future we can provide
some
hints for them.

Best,
Jingsong Lee

On Wed, Mar 18, 2020 at 12:28 PM Danny Chan <yuzhao....@gmail.com>
wrote:

Yes, I think we should move the `supportedHintOptions` from
TableFactory
to TableSourceFactory, and we also need to add the interface to
TableSinkFactory though because sink target table may also have
hints
attached.

Best,
Danny Chan
在 2020年3月18日 +0800 AM11:08,Kurt Young <ykt...@gmail.com>,写道:
Have one question for adding `supportedHintOptions` method to
`TableFactory`. It seems
`TableFactory` is a base factory interface for all *table module*
related
instances, such as
catalog, module, format and so on. It's not created only for
*table*. Is
it
possible to move it
to `TableSourceFactory`?

Best,
Kurt


On Wed, Mar 18, 2020 at 10:59 AM Danny Chan <
yuzhao....@gmail.com>
wrote:

Thanks Timo ~

For the naming itself, I also think the PROPERTIES is not that
concise, so
+1 for OPTIONS (I had thought about that, but there are many
codes in
current Flink called it properties, i.e. the
DescriptorProperties,
#getSupportedProperties), let’s use OPTIONS if this is our new
preference.

+1 to `Set<ConfigOption> supportedHintOptions()` because the
ConfigOption
can take more info. AFAIK, Spark also call their table options
instead
of
properties. [1]

In my local POC, I did create a new CatalogTable, and it works
for
current
connectors well, all the DDL tables would finally yield a
CatalogTable
instance and we can apply the options to that(in the
CatalogSourceTable
when we generating the TableSource), the pros is that we do not
need to
modify the codes of connectors itself. If we split the options
from
CatalogTable, we may need to add some additional logic in each
connector
factories in order to merge these properties (and the logic are
almost
the
same), what do you think about this?

[1]



https://docs.databricks.com/spark/latest/spark-sql/language-manual/create-table.html

Best,
Danny Chan
在 2020年3月17日 +0800 PM10:10,Timo Walther <twal...@apache.org
,写道:
Hi Danny,

thanks for updating the FLIP. I think your current design is
sufficient
to separate hints from result-related properties.

One remark to the naming itself: I would vote for calling the
hints
around table scan `OPTIONS('k'='v')`. We used the term
"properties"
in
the past but since we want to unify the Flink configuration
experience,
we should use consistent naming and classes around
`ConfigOptions`.

It would be nice to use `Set<ConfigOption>
supportedHintOptions();`
to
start using config options instead of pure string properties.
This
will
also allow us to generate documentation in the future around
supported
data types, ranges, etc. for options. At some point we would
also
like
to drop `DescriptorProperties` class. "Options" is also used
in the
documentation [1] and in the SQL/MED standard [2].

Furthermore, I would still vote for separating CatalogTable
and hint
options. Otherwise the planner would need to create a new
CatalogTable
instance which might not always be easy. We should offer them
via:


org.apache.flink.table.factories.TableSourceFactory.Context#getHints:
ReadableConfig

What do you think?

Regards,
Timo

[1]




https://ci.apache.org/projects/flink/flink-docs-master/dev/table/sql/create.html#create-table
[2] https://wiki.postgresql.org/wiki/SQL/MED


On 12.03.20 15:06, Stephan Ewen wrote:
@Danny sounds good.

Maybe it is worth listing all the classes of problems that
you
want to
address and then look at each class and see if hints are a
good
default
solution or a good optional way of simplifying things?
The discussion has grown a lot and it is starting to be
hard
to
distinguish
the parts where everyone agrees from the parts were there
are
concerns.

On Thu, Mar 12, 2020 at 2:31 PM Danny Chan <
danny0...@apache.org>
wrote:

Thanks Stephan ~

We can remove the support for properties that may change
the
semantics of
query if you think that is a trouble.

How about we support the /*+ properties() */ hint only
for
those
optimize
parameters, such as the fetch size of source or something
like
that,
does
that make sense?

Stephan Ewen <se...@apache.org>于2020年3月12日 周四下午7:45写道:

I think Bowen has actually put it very well.

(1) Hints that change semantics looks like trouble
waiting to
happen. For
example Kafka offset handling should be in filters. The
Kafka
source
should
support predicate pushdown.

(2) Hints should not be a workaround for current
shortcomings.
A
lot of
the
suggested above sounds exactly like that. Working
around
catalog/DDL
shortcomings, missing exposure of metadata (offsets),
missing
predicate
pushdown in Kafka. Abusing a feature like hints now as
a
quick
fix
for
these issues, rather than fixing the root causes, will
much
likely
bite
us
back badly in the future.

Best,
Stephan


On Thu, Mar 12, 2020 at 10:43 AM Kurt Young <
ykt...@gmail.com>
wrote:

It seems this FLIP's name is somewhat misleading.
From
my
understanding,
this FLIP is trying to
address the dynamic parameter issue, and table hints
is the
way
we wan
to
choose. I think we should
be focus on "what's the right way to solve dynamic
property"
instead of
discussing "whether table
hints can affect query semantics".

For now, there are two proposed ways to achieve
dynamic
property:
1. FLIP-110: create temporary table xx like xx with
(xxx)
2. use custom "from t with (xxx)" syntax
3. "Borrow" the table hints to have a special
PROPERTIES
hint.

The first one didn't break anything, but the only
problem i
see
is a
little
more verbose than the table hint
approach. I can imagine when someone using SQL CLI to
have a
sql
experience, it's quite often that
he will modify the table property, some use cases i
can
think of:
1. the source contains some corrupted data, i want to
turn
on the
"ignore-error" flag for certain formats.
2. I have a kafka table and want to see some sample
data
from the
beginning, so i change the offset
to "earliest", and then I want to observe the latest
data
which
keeps
coming in. I would write another query
to select from the latest table.
3. I want to my jdbc sink flush data more eagerly
then
i can
observe
the
data from database side.

Most of such use cases are quite ad-hoc. If every
time
I
want to
have a
different experience, i need to create
a temporary table and then also modify my query, it
doesn't
feel
smooth.
Embed such dynamic property into
query would have better user experience.

Both 2 & 3 can make this happen. The cons of #2 is
breaking
SQL
compliant,
and for #3, it only breaks some
unwritten rules, but we can have an explanation on
that. And
I
really
doubt
whether user would complain about
this when they actually have flexible and good
experience
using
this.

My tendency would be #3 > #1 > #2, what do you think?

Best,
Kurt


On Thu, Mar 12, 2020 at 1:11 PM Danny Chan <
yuzhao....@gmail.com

wrote:

Thanks Aljoscha ~

I agree for most of the query hints, they are
optional as
an
optimizer
instruction, especially for the traditional RDBMS.

But, just like BenChao said, Flink as a computation
engine
has
many
different kind of data sources, thus, dynamic
parameters
like
start_offest
can only bind to each table scope, we can not set a
session
config
like
KSQL because they are all about Kafka:
SET ‘auto.offset.reset’=‘earliest’;

Thus the most flexible way to set up these dynamic
params
is
to bind
to
the table scope in the query when we want to
override
something, so
we
have
these solutions above (with pros and cons from my
side):

• 1. Select * from t(offset=123) (from Timo)

Pros:
- Easy to add
- Parameters are part of the main query
Cons:
- Not SQL compliant


• 2. Select * from t /*+ PROPERTIES(offset=123) */
(from
me)

Pros:
- Easy to add
- SQL compliant because it is nested in the
comments

Cons:
- Parameters are not part of the main query
- Cryptic syntax for new users

The biggest problem for hints way may be the “if
hints
must be
optional”,
actually we have though about 1 for a while but
aborted
because it
breaks
the SQL standard too much. And we replace it with
2,
because
the
hints
syntax do not break SQL standard(nested in
comments).

What if we have the special /*+ PROPERTIES */ hint
that
allows
override
some properties of table dynamically, it does not
break
anything, at
lease
for current Flink use cases.

Planner hints are optional just because they are
naturally
enforcers
of
the planner, most of them aim to instruct the
optimizer,
but,
the
table
hints is a little different, table hints can
specify
the
table
meta
like
index column, and it is very convenient to specify
table
properties.

Or shall we not call /*+ PROPERTIES(offset=123) */
table
hint,
we
can
call it table dynamic parameters.

Best,
Danny Chan
在 2020年3月11日 +0800 PM9:20,Aljoscha Krettek <
aljos...@apache.org>,写道:
Hi,

I don't understand this discussion. Hints, as I
understand
them,
should
work like this:

- hints are *optional* advice for the optimizer
to
try
and
help it
to
find a good execution strategy
- hints should not change query semantics, i.e.
they
should
not
change
connector properties executing a query with
taking
into
account the
hints *must* produce the same result as executing
the
query
without
taking into account the hints

 From these simple requirements you can derive a
solution
that makes
sense. I don't have a strong preference for the
syntax
but we
should
strive to be in line with prior work.

Best,
Aljoscha

On 11.03.20 11:53, Danny Chan wrote:
Thanks Timo for summarize the 3 options ~

I agree with Kurt that option2 is too
complicated to
use
because:

• As a Kafka topic consumer, the user must
define both
the
virtual
column for start offset and he must apply a special
filter
predicate
after
each query
• And for the internal implementation, the
metadata
column
push
down
is another hard topic, each kind of message queue
may have
its
offset
attribute, we need to consider the expression type
for
different
kind;
the
source also need to recognize the constant column
as
a
config
option(which
is weird because usually what we pushed down is a
table
column)

For option 1 and option3, I think there is no
difference,
option1
is
also a hint syntax which is introduced in Sybase
and
referenced then
deprecated by MS-SQL in 199X years because of the
ambitiousness.
Personally
I prefer /*+ */ style table hint than WITH keyword
for
these
reasons:

• We do not break the standard SQL, the hints
are
nested
in SQL
comments
• We do not need to introduce additional WITH
keyword
which may
appear
in a query if we use that because a table can be
referenced in
all
kinds
of
SQL contexts: INSERT/DELETE/FROM/JOIN …. That would
make
our
sql
query
break too much of the SQL from standard
• We would have uniform syntax for hints as
query
hint, one
syntax
fits all and more easy to use


And here is the reason why we choose a uniform
Oracle
style query
hint syntax which is addressed by Julian Hyde when
we
design
the
syntax
from the Calcite community:

I don’t much like the MSSQL-style syntax for
table
hints.
It
adds a
new use of the WITH keyword that is unrelated to
the
use of
WITH for
common-table expressions.

A historical note. Microsoft SQL Server
inherited its
hint
syntax
from
Sybase a very long time ago. (See “Transact SQL
Programming”[1], page
632,
“Optimizer hints”. The book was written in 1999,
and
covers
Microsoft
SQL
Server 6.5 / 7.0 and Sybase Adaptive Server 11.5,
but the
syntax very
likely predates Sybase 4.3, from which Microsoft
SQL
Server was
forked
in
1993.)

Microsoft later added the WITH keyword to make
it less
ambiguous,
and
has now deprecated the syntax that does not use
WITH.

They are forced to keep the syntax for
backwards
compatibility
but
that doesn’t mean that we should shoulder their
burden.

I think formatted comments are the right
container for
hints
because
it allows us to change the hint syntax without
changing
the SQL
parser,
and
makes clear that we are at liberty to ignore hints
entirely.

Julian

[1] https://www.amazon.com/s?k=9781565924017 <
https://www.amazon.com/s?k=9781565924017>

Best,
Danny Chan
在 2020年3月11日 +0800 PM4:03,Timo Walther <
twal...@apache.org
,写道:
Hi Danny,

it is true that our DDL is not standard
compliant by
using the
WITH
clause. Nevertheless, we aim for not
diverging
too
much
and the
LIKE
clause is an example of that. It will solve
things
like
overwriting
WATERMARKs, add additional/modifying
properties and
inherit
schema.

Bowen is right that Flink's DDL is mixing 3
types
definition
together.
We are not the first ones that try to solve
this.
There
is also
the
SQL
MED standard [1] that tried to tackle this
problem. I
think it
was
not
considered when designing the current DDL.

Currently, I see 3 options for handling Kafka
offsets. I
will
give
some
examples and look forward to feedback here:

*Option 1* Runtime and semantic parms as part
of the
query

`SELECT * FROM MyTable('offset'=123)`

Pros:
- Easy to add
- Parameters are part of the main query
- No complicated hinting syntax

Cons:
- Not SQL compliant

*Option 2* Use metadata in query

`CREATE TABLE MyTable (id INT, offset AS
SYSTEM_METADATA('offset'))`

`SELECT * FROM MyTable WHERE offset >
TIMESTAMP
'2012-12-12
12:34:22'`

Pros:
- SQL compliant in the query
- Access of metadata in the DDL which is
required
anyway
- Regular pushdown rules apply

Cons:
- Users need to add an additional comlumn in
the DDL

*Option 3*: Use hints for properties

`
SELECT *
FROM MyTable /*+ PROPERTIES('offset'=123) */
`

Pros:
- Easy to add

Cons:
- Parameters are not part of the main query
- Cryptic syntax for new users
- Not standard compliant.

If we go with this option, I would suggest to
make it
available
in
a
separate map and don't mix it with statically
defined
properties.
Such
that the factory can decide which properties
have the
right to
be
overwritten by the hints:
TableSourceFactory.Context.getQueryHints():
ReadableConfig

Regards,
Timo

[1] https://en.wikipedia.org/wiki/SQL/MED

Currently I see 3 options as a


On 11.03.20 07:21, Danny Chan wrote:
Thanks Bowen ~

I agree we should somehow categorize our
connector
parameters.

For type1, I’m already preparing a solution
like
the
Confluent
schema registry + Avro schema inference thing, so
this may
not
be a
problem
in the near future.

For type3, I have some questions:

"SELECT * FROM mykafka WHERE offset >
12pm
yesterday”

Where does the offset column come from, a
virtual
column from
the
table schema, you said that

They change
almost every time a query starts and have
nothing
to
do with
metadata, thus
should not be part of table definition/DDL

But why you can reference it in the query,
I’m
confused for
that,
can you elaborate a little ?

Best,
Danny Chan
在 2020年3月11日 +0800 PM12:52,Bowen Li <
bowenl...@gmail.com
,写道:
Thanks Danny for kicking off the effort

The root cause of too much manual work is
Flink
DDL
has
mixed 3
types of
params together and doesn't handle each
of
them
very
well.
Below
are how I
categorize them and corresponding
solutions in my
mind:

- type 1: Metadata of external data, like
external
endpoint/url,
username/pwd, schemas, formats.

Such metadata are mostly already
accessible in
external
system
as long as
endpoints and credentials are provided.
Flink can
get it
thru
catalogs, but
we haven't had many catalogs yet and thus
Flink
just
hasn't
been
able to
leverage that. So the solution should be
building
more
catalogs.
Such
params should be part of a Flink table
DDL/definition, and
not
overridable
in any means.


- type 2: Runtime params, like jdbc
connector's
fetch size,
elasticsearch
connector's bulk flush size.

Such params don't affect query results,
but
affect
how
results
are produced
(eg. fast or slow, aka performance) -
they
are
essentially
execution and
implementation details. They change often
in
exploration or
development
stages, but not quite frequently in
well-defined
long-running
pipelines.
They should always have default values
and
can be
missing
in
query. They
can be part of a table DDL/definition,
but
should
also be
replaceable in a
query - *this is what table "hints" in
FLIP-113
should
cover*.


- type 3: Semantic params, like kafka
connector's
start
offset.

Such params affect query results - the
semantics.
They'd
better
be as
filter conditions in WHERE clause that
can
be
pushed
down.
They
change
almost every time a query starts and have
nothing to
do
with
metadata, thus
should not be part of table
definition/DDL, nor
be
persisted
in
catalogs.
If they will, users should create views
to
keep
such
params
around (note
this is different from variable
substitution).


Take Flink-Kafka as an example. Once we
get these
params
right,
here're the
steps users need to do to develop and run
a Flink
job:
- configure a Flink
ConfluentSchemaRegistry with
url,
username,
and password
- run "SELECT * FROM mykafka WHERE offset
12pm
yesterday"
(simplified
timestamp) in SQL CLI, Flink
automatically
retrieves
all
metadata of
schema, file format, etc and start the
job
- users want to make the job read Kafka
topic
faster, so it
goes
as "SELECT
* FROM mykafka /* faster_read_key=value*/
WHERE
offset >
12pm
yesterday"
- done and satisfied, users submit it to
production


Regarding "CREATE TABLE t LIKE with
(k1=v1,
k2=v2),
I think
it's
a
nice-to-have feature, but not a
strategically
critical,
long-term solution,
because
1) It may seem promising at the current
stage to
solve the
too-much-manual-work problem, but that's
only
because Flink
hasn't
leveraged catalogs well and handled the 3
types
of
params
above
properly.
Once we get the params types right, the
LIKE
syntax
won't
be
that
important, and will be just an easier way
to
create
tables
without retyping
long fields like username and pwd.
2) Note that only some rare type of
catalog can
store k-v
property pair, so
table created this way often cannot be
persisted. In
the
foreseeable
future, such catalog will only be
HiveCatalog,
and
not
everyone
has a Hive
metastore. To be honest, without
persistence,
recreating
tables
every time
this way is still a lot of keyboard
typing.

Cheers,
Bowen

On Tue, Mar 10, 2020 at 8:07 PM Kurt
Young
<
ykt...@gmail.com

wrote:

If a specific connector want to have
such
parameter and
read
if out of
configuration, then that's fine.
If we are talking about a configuration
for all
kinds of
sources, I would
be super careful about that.
It's true it can solve maybe 80% cases,
but it
will also
make
the left 20%
feels weird.

Best,
Kurt


On Wed, Mar 11, 2020 at 11:00 AM Jark
Wu
<
imj...@gmail.com

wrote:

Hi Kurt,

#3 Regarding to global offset:
I'm not saying to use the global
configuration to
override
connector
properties by the planner.
But the connector should take this
configuration
and
translate into their
client API.
AFAIK, almost all the message queues
support
eariliest
and
latest and a
timestamp value as start point.
So we can support 3 options for this
configuration:
"eariliest", "latest"
and a timestamp string value.
Of course, this can't solve 100%
cases, but I
guess can
sovle 80% or 90%
cases.
And the remaining cases can be
resolved by
LIKE
syntax
which
I guess is
not
very common cases.

Best,
Jark


On Wed, 11 Mar 2020 at 10:33, Kurt
Young <
ykt...@gmail.com

wrote:

Good to have such lovely
discussions. I
also
want to
share
some of my
opinions.

#1 Regarding to error handling: I
also
think
ignore
invalid hints would
be
dangerous, maybe
the simplest solution is just throw
an
exception.

#2 Regarding to property
replacement: I
don't
think
we
should
constraint
ourself to
the meaning of the word "hint", and
forbidden
it
modifying
any
properties
which can effect
query results. IMO `PROPERTIES` is
one of
the
table
hints,
and a
powerful
one. It can
modify properties located in DDL's
WITH
block.
But I
also
see the harm
that
if we make it
too flexible like change the kafka
topic
name
with a
hint.
Such use
case
is
not common and
sounds very dangerous to me. I
would
propose
we have
a
map
of hintable
properties for each
connector, and should validate all
passed
in
properties
are actually
hintable. And combining with
#1 error handling, we can throw an
exception
once
received
invalid
property.

#3 Regarding to global offset: I'm
not sure
it's
feasible.
Different
connectors will have totally
different properties to represent
offset,
some
might
be
timestamps,
some
might be string literals
like "earliest", and others might
be
just
integers.

Best,
Kurt


On Tue, Mar 10, 2020 at 11:46 PM
Jark Wu <
imj...@gmail.com>
wrote:

Hi everyone,

I want to jump in the discussion
about
the
"dynamic
start offset"
problem.
First of all, I share the same
concern
with
Timo
and
Fabian, that the
"start offset" affects the query
semantics,
i.e.
the
query result.
But "hints" is just used for
optimization
which
should
affect the
result?

I think the "dynamic start
offset"
is an
very
important
usability
problem
which will be faced by many
streaming
platforms.
I also agree "CREATE TEMPORARY
TABLE Temp
(LIKE t)
WITH

('connector.startup-timestamp-millis' =
'1578538374471')" is verbose,
what
if we have 10 tables to join?

However, what I want to propose
(should
be
another
thread) is a
global
configuration to reset start
offsets of
all
the
source
connectors
in the query session, e.g.
"table.sources.start-offset".
This is
possible
now because
`TableSourceFactory.Context`
has
`getConfiguration`
method to get the session
configuration,
and
use it
to
create an
adapted
TableSource.
Then we can also expose to SQL
CLI
via
SET
command,
e.g.
`SET

'table.sources.start-offset'='earliest';`,
which is
pretty simple and
straightforward.

This is very similar to KSQL's
`SET
'auto.offset.reset'='earliest'`
which
is very helpful IMO.

Best,
Jark


On Tue, 10 Mar 2020 at 22:29,
Timo
Walther <
twal...@apache.org>
wrote:

Hi Danny,

compared to the hints, FLIP-110
is
fully
compliant
to
the SQL
standard.

I don't think that `CREATE
TEMPORARY
TABLE
Temp
(LIKE
t) WITH
(k=v)`
is
too verbose or awkward for the
power of
basically
changing the
entire
connector. Usually, this
statement
would
just
precede
the query in
a
multiline file. So it can be
change
"in-place"
like
the hints you
proposed.

Many companies have a
well-defined set
of
tables
that
should be
used.
It
would be dangerous if users can
change
the
path
or
topic in a hint.
The
catalog/catalog manager should
be the
entity that
controls which
tables
exist and how they can be
accessed.

what’s the problem there if
we
user
the
table
hints
to support
“start
offset”?

IMHO it violates the meaning of
a hint.
According
to
the
dictionary,
a
hint is "a statement that
expresses
indirectly
what
one prefers not
to
say explicitly". But offsets
are
a
property that
are
very explicit.

If we go with the hint
approach,
it
should
be
expressible in the
TableSourceFactory which
properties are
supported
for
hinting. Or
do
you
plan to offer those hints in a
separate
Map<String,
String> that
cannot
overwrite existing properties?
I
think
this would
be
a
different
story...

Regards,
Timo


On 10.03.20 10:34, Danny Chan
wrote:
Thanks Timo ~

Personally I would say that
offset >
0
and
start
offset = 10 does
not
have the same semantic, so from
the SQL
aspect,
we
can
not
implement
a
“starting offset” hint for
query
with
such
a
syntax.

And the CREATE TABLE LIKE
syntax is a
DDL which
is
just verbose
for
defining such dynamic
parameters
even
if
it could
do
that, shall we
force
users to define a temporal
table
for
each
query
with
dynamic
params,
I
would say it’s an awkward
solution.

"Hints should give "hints"
but
not
affect the
actual
produced
result.”
You mentioned that multiple
times and
could we
give a
reason,
what’s
the
problem there if we user the
table
hints to
support
“start offset”
?
From
my side I saw some benefits for
that:


• It’s very convent to set up
these
parameters,
the
syntax is
very
much
like the DDL definition
• It’s scope is very clear,
right on
the
table
it
attathed
• It does not affect the
table
schema,
which
means
in order to
specify
the offset, there is no need to
define
an
offset
column which is
weird
actually, offset should never
be
a
column,
it’s
more
like a
metadata
or a
start option.

So in total, FLIP-110 uses
the
offset
more
like a
Hive partition
prune,
we can do that if we have an
offset
column, but
most
of the case we
do
not
define that, so there is
actually no
conflict or
overlap.

Best,
Danny Chan
在 2020年3月10日 +0800
PM4:28,Timo
Walther <
twal...@apache.org>,写道:
Hi Danny,

shouldn't FLIP-110[1] solve
most
of the
problems
we have around
defining
table properties more
dynamically
without
manual
schema work?
Also
offset definition is easier
with
such a
syntax.
They must not be
defined
in catalog but could be
temporary
tables that
extend from the
original
table.

In general, we should aim
to
keep
the
syntax
concise and don't
provide
too many ways of doing the
same
thing.
Hints
should give "hints"
but
not
affect the actual produced
result.

Some connector properties
might
also
change
the
plan or schema
in
the
future. E.g. they might
also
define
whether a
table source
supports
certain push-downs (e.g.
predicate
push-down).

Dawid is currently working
a
draft
that might
makes it possible
to
expose a Kafka offset via
the
schema
such
that
`SELECT * FROM
Topic
WHERE offset > 10` would
become
possible and
could
be pushed
down.
But
this is of course, not
planned
initially.

Regards,
Timo


[1]













https://cwiki.apache.org/confluence/display/FLINK/FLIP-110%3A+Support+LIKE+clause+in+CREATE+TABLE



On 10.03.20 08:34, Danny
Chan
wrote:
Thanks Wenlong ~

For PROPERTIES Hint Error
handling

Actually we have no way
to
figure out
whether a
error prone
hint
is a
PROPERTIES hint, for example,
if
use
writes a
hint
like
‘PROPERTIAS’,
we
do
not know if this hint is a
PROPERTIES
hint, what
we
know is that
the
hint
name was not registered in our
Flink.

If the user writes the
hint name
correctly
(i.e.
PROPERTIES),
we
did
can enforce the validation of
the hint
options
though
the pluggable
HintOptionChecker.

For PROPERTIES Hint
Option
Format

For a key value style
hint
option,
the key
can
be either a
simple
identifier or a string literal,
which
means that
it’s
compatible
with
our
DDL syntax. We support simple
identifier
because
many
other hints
do
not
have the component complex keys
like
the
table
properties, and we
want
to
unify the parse block.

Best,
Danny Chan
在 2020年3月10日 +0800
PM3:19,wenlong.lwl <
wenlong88....@gmail.com
,写道:
Hi Danny, thanks for
the
proposal.
+1 for
adding table hints,
it
is
really
a necessary feature for
flink
sql
to
integrate
with a catalog.

For error handling, I
think it
would be
more
natural to throw
an
exception when error
table hint
provided,
because the
properties
in
hint
will be merged and used
to find
the table
factory which would
cause
an
exception when error
properties
provided,
right? On the other
hand,
unlike
other hints which just
affect
the
way to
execute the query,
the
property
table hint actually
affects the
result of
the
query, we should
never
ignore
the given property
hints.

For the format of
property
hints,
currently,
in sql client, we
accept
properties in format of
string
only in
DDL:
'connector.type'='kafka',
I
think the format of
properties
in
hint
should
be the same as
the
format we
defined in ddl. What do
you
think?

Bests,
Wenlong Lyu

On Tue, 10 Mar 2020 at
14:22,
Danny Chan
<
yuzhao....@gmail.com>
wrote:

To Weike: About the
Error
Handing

To be consistent with
other
SQL
vendors,
the
default is to
log
warnings
and if there is any
error
(invalid hint
name
or options), the
hint
is just
ignored. I have
already
addressed in
the
wiki.

To Timo: About the
PROPERTIES
Table
Hint

• The properties
hints
is
also
optional,
user can pass in an
option
to
override the table
properties
but this
does
not mean it is
required.
• They should not
include
semantics:
does
the properties
belong
to
semantic ? I don't
think so,
the
plan
does
not change right ?
The
result
set may be affected,
but
there
are
already
some hints do so,
for
example,
MS-SQL MAXRECURSION
and
SNAPSHOT
hint
[1]
• `SELECT * FROM
t(k=v,
k=v)`:
this
grammar
breaks the SQL
standard
compared to the hints
way(which
is
included
in comments)
• I actually didn't
found any
vendors
to
support such
grammar,
and
there
is no way to override
table
level
properties
dynamically. For
normal
RDBMS,
I think there are no
requests
for such
dynamic parameters
because
all the
table have the same
storage
and
computation
and they are
almost
all
batch
tables.
• While Flink as a
computation
engine
has
many connectors,
especially for
some message queue
like
Kafka,
we would
have
a start_offset
which
is
different each time
we
start
the
query,
such
parameters can
not
be
persisted to catalog,
because
it’s not
static, this is
actually
the
background we propose
the
table
hints
to
indicate such
properties
dynamically.


To Jark and Jinsong:
I
have
removed the
query hints part and
change
the
title.

[1]













https://docs.microsoft.com/en-us/sql/t-sql/queries/hints-transact-sql-query?view=sql-server-ver15

Best,
Danny Chan
在 2020年3月9日 +0800
PM5:46,Timo
Walther <
twal...@apache.org
,写道:
Hi Danny,

thanks for the
proposal. I
agree with
Jark
and Jingsong.
Planner
hints
and table hints are
orthogonal
topics
that
should be
discussed
separately.

I share Jingsong's
opinion
that we
should
not use planner
hints
for
passing connector
properties.
Planner
hints should be
optional
at
any
time. They should
not
include
semantics
but only affect
execution
time.
Connector
properties
are an
important
part
of the query
itself.

Have you thought
about
options
such
as
`SELECT * FROM t(k=v,
k=v)`?
How
are other vendors
deal with
this
problem?

Regards,
Timo


On 09.03.20 10:37,
Jingsong Li
wrote:
Hi Danny, +1 for
table
hints,
thanks
for
driving.

I took a look to
FLIP,
most
of
content
are talking about
query
hints.
It is
hard to
discussion
and
voting. So
+1
to
split it as Jark
said.

Another thing is
configuration that
suitable to config with
table
hints:
"connector.path"
and
"connector.topic",
Are they really
suitable
for
table
hints? Looks
weird
to me.
Because I
think these properties
are
the
core of
table.

Best,
Jingsong Lee

On Mon, Mar 9,
2020 at
5:30
PM Jark
Wu
<
imj...@gmail.com>
wrote:

Thanks Danny
for
starting
the
discussion.
+1 for this
feature.

If we just
focus
on the
table
hints
not the query hints in
this
release,
could you split
the
FLIP
into two
FLIPs?
Because it's
hard to
vote
on
partial
part of a FLIP. You
can
keep
the table
hints proposal
in
FLIP-113
and
move
query hints into
another
FLIP.
So that we can
focuse
on
the
table
hints in the FLIP.

Thanks,
Jark



On Mon, 9 Mar
2020 at
17:14,
DONG,
Weike <
kyled...@connect.hku.hk

wrote:

Hi Danny,

This is a
nice
feature,
+1.

One thing I
am
interested in
but
not
mentioned in the
proposal
is
the
error
handling, as
it is
quite
common
for
users to write
inappropriate
hints in
SQL code, if
illegal
or
"bad"
hints
are given, would the
system
simply
ignore them
or
throw
exceptions?

Thanks : )

Best,
Weike

On Mon, Mar
9,
2020
at
5:02 PM
Danny
Chan <
yuzhao....@gmail.com>
wrote:

Note:
we only
plan
to
support table
hints in Flink release
1.11,
so
please
focus
mainly on
the table
hints
part
and
just ignore the
planner
hints, sorry
for
that
mistake
~

Best,
Danny Chan
在 2020年3月9日
+0800
PM4:36,Danny
Chan <
yuzhao....@gmail.com
,写道:
Hi,
fellows ~

I would
like to
propose the
supports for SQL hints for
our
Flink SQL.

We would
support
hints
syntax
as
following:

select
/*+
NO_HASH_JOIN,
RESOURCE(mem='128mb',
parallelism='24') */
from
emp /*+
INDEX(idx1,
idx2)
*/
join
dept /*+
PROPERTIES(k1='v1',
k2='v2') */
on

emp.deptno
=
dept.deptno

Basically
we
would
support
both
query hints(after the
SELECT
keyword)
and table
hints(after
the
referenced table name), for
1.11,
we
plan to
only
support
table hints
with a
hint
probably named
PROPERTIES:


table_name
/*+
PROPERTIES(k1='v1', k2='v2') *+/

I am
looking
forward
to
your
comments.

You can
access
the
FLIP
here:

















https://cwiki.apache.org/confluence/display/FLINK/FLIP-113%3A+SQL+and+Planner+Hints

Best,
Danny
Chan



































Reply via email to