Re: [DISCUSS]FLIP-113: Support SQL and planner hints

Timo Walther Thu, 26 Mar 2020 01:41:15 -0700

Thanks for the update Danny. +1 for this proposal.

Regards,
Timo


On 26.03.20 04:51, Danny Chan wrote:

Thanks to everyone who engaged in this discussion ~

Our goal is "Supports Dynamic Table Options for Flink SQL" [1]. After an
offline discussion with Kurt, Timo and Dawid, we have reached a final
conclusion. Here is the summary:


    - Use comment-style syntax to specify the dynamic table options:
    "/*+ OPTIONS(k1='v1', k2='v2') */"
    - Constrain the option keys: options that may introduce security
    problems should not be allowed, e.g. the Kafka connector's ZooKeeper
    endpoint URL and topic name
    - Use a whitelist to control the allowed options for each connector,
    which is safer for future extension
    - Allow this feature to be enabled/disabled globally
    - Implement it on the current code base first, and once FLIP-95 is
    checked in, reimplement it on the new interface
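
To make the whitelist and the global switch concrete, here is a minimal
sketch. It is only an illustration under stated assumptions: the class
`DynamicOptionsSketch`, its `merge` method and the option keys used with it
are hypothetical and are not part of Flink's API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper, not a Flink class: merges hint options over DDL options. */
final class DynamicOptionsSketch {

    /**
     * Returns the DDL options overridden by the hint options.
     * Rejects the hints when the feature is disabled globally, and rejects
     * any hint key that is not on the connector's whitelist.
     */
    static Map<String, String> merge(
            Map<String, String> ddlOptions,
            Map<String, String> hintOptions,
            Set<String> allowedKeys,
            boolean featureEnabled) {
        if (!hintOptions.isEmpty() && !featureEnabled) {
            throw new IllegalStateException("Dynamic table options are disabled");
        }
        Map<String, String> merged = new HashMap<>(ddlOptions);
        for (Map.Entry<String, String> hint : hintOptions.entrySet()) {
            if (!allowedKeys.contains(hint.getKey())) {
                throw new IllegalArgumentException(
                        "Option '" + hint.getKey() + "' cannot be overridden by a hint");
            }
            merged.put(hint.getKey(), hint.getValue());
        }
        return merged;
    }
}
```

With a whitelist that contains only a fetch-size key, a hint on that key
overrides the DDL value, while a hint on the topic name fails fast instead
of being silently ignored.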

Any suggestions are appreciated ~

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-113%3A+Supports+Dynamic+Table+Options+for+Flink+SQL

Best,
Danny Chan

On Wed, 18 Mar 2020 at 22:38, Jark Wu <[email protected]> wrote:

Hi everyone,

Sorry, but I'm not sure about the `supportedHintOptions`. I'm afraid it
doesn't solve the problems but increases some development and learning
burdens.

# increase development and learning burden

According to the discussion so far, we want to support overriding a subset
of options in hints which doesn't affect semantics.
With the `supportedHintOptions`, it's up to the connector developers to
decide which options will not affect semantics and can be hint options.
However, the question is how to distinguish whether an option will *affect
semantics*? What happens if an option affects semantics but is provided as
a hint option?
From my point of view, it's not easy to distinguish. For example,
"format.ignore-parse-error" can be a very useful dynamic option, but it
will affect semantics, because the result is different (null vs exception).
Another example: the "connector.lookup.cache.*" options are also very
useful for tuning jobs; however, they will also affect the job results. I
can come up with many more useful options that may affect semantics.

I can see that the community will be stuck in endless discussions around
"can this option be a hint option?", "will this option affect semantics?".
You can also see that we already have different opinions on
"ignore-parse-error". Such discussions are a waste of time! That's not what
users want!
The problem is that users need this, this and this option, and HOW do we
expose them? We should focus on that.

Then there could be two endings in the future:
1) We compromise for the sake of usability: we drop the rule that hints
don't affect semantics and allow all the useful options in the hints list.
2) We stick to the rule, and users will find this a stumbling feature which
doesn't solve their problems.
     And they will be surprised that one option can't be set while another
can. *Semantics* are hard for users to understand.

# doesn't solve the problems

I think the purpose of this FLIP is to allow users to quickly override some
connector properties to tune their jobs.
However, `supportedHintOptions` is off track. It only allows a subset of
options, and for users it's not *clear* which subset is allowed.

Besides, I'm not sure `supportedHintOptions` can work well for all cases.
How could you support Kafka properties (`connector.properties.*`) as hint
options? Some Kafka properties may affect semantics (bootstrap.servers),
some may not (max.poll.records). Besides, I think it's not possible to list
all the possible Kafka properties [1].

In summary, IMO, `supportedHintOptions`:
(1) increases the complexity of developing a connector;
(2) confuses users about which options can be used in hints and which
cannot, so they have to check the docs again and again;
(3) doesn't solve the problems which we want to solve with this FLIP.

I think we should avoid introducing partial solutions. Otherwise, we
will be stuck in a loop of introduce new API -> deprecate API ->
introduce new API....

I am personally in favor of an explicit WITH syntax after the table as a
part of the query, which was mentioned by Kurt before, e.g. SELECT * from T
WITH('key' = 'value') .
It allows users to dynamically set options which can affect semantics. It
will be very flexible in solving users' problems so far.

Best,
Jark

[1]: https://kafka.apache.org/documentation/#consumerconfigs

On Wed, 18 Mar 2020 at 21:44, Danny Chan <[email protected]> wrote:

My POC is here for the hints options merge [1].

Personally, I have no strong objection to separating hints from the
CatalogTable. The only con is a more complex implementation, but the
concept is clearer, and I have updated the wiki.

I think it would be nice if we could support the format
"ignore-parse-error" option key. The CSV source already has a key [2] and
we can use that in the supportedHintOptions; for the common CSV and JSON
formats, we can also add support. This is the only kind of key in formats
that "does not change the semantics" (somehow). What do you think about
this ~

[1]
https://github.com/danny0405/flink/commit/5d925fa16c3c553423c4b7d93001521b8e6e6bee#diff-6e569a6dd124fd2091c18e2790fb49c5

[2]
https://github.com/apache/flink/blob/b83060dff6d403b6994b6646b3f29a374f599530/flink-table/flink-table-api-java-bridge/src/main/java/org/apache/flink/table/sources/CsvTableSourceFactoryBase.java#L92


Best,
Danny Chan
On 18 Mar 2020 at 21:10 +0800, Timo Walther <[email protected]> wrote:

Hi everyone,

+1 to Kurt's suggestion. Let's just have it in source and sink factories
for now. We can still move this method up in the future. Currently, I
don't see a need for catalogs or formats. Because how would you target a
format in the query?

@Danny: Can you send a link to your PoC? I'm very skeptical about
creating a new CatalogTable in planner. Actually CatalogTable should be
immutable between Catalog and Factory. Because a catalog can return its
own factory and fully control the instantiation. Depending on the
implementation, that means it can be possible that the catalog has
encoded more information in a concrete subclass implementing the
interface. I vote for separating the concerns of catalog information and
hints in the factory explicitly.

Regards,
Timo


On 18.03.20 05:41, Jingsong Li wrote:

Hi,

I am thinking we can provide hints to *table* related instances.
- TableFormatFactory: of course we need hints support, there are many
format options in DDL too.
- catalog and module: I don't know, maybe in future we can provide some
hints for them.

Best,
Jingsong Lee

On Wed, Mar 18, 2020 at 12:28 PM Danny Chan <[email protected]> wrote:

Yes, I think we should move the `supportedHintOptions` from TableFactory
to TableSourceFactory, and we also need to add the interface to
TableSinkFactory, because the sink target table may also have hints
attached.

Best,
Danny Chan
On 18 Mar 2020 at 11:08 +0800, Kurt Young <[email protected]> wrote:

I have one question about adding the `supportedHintOptions` method to
`TableFactory`. It seems `TableFactory` is a base factory interface for all
*table module* related instances, such as catalog, module, format and so
on. It's not created only for *table*. Is it possible to move it to
`TableSourceFactory`?

Best,
Kurt


On Wed, Mar 18, 2020 at 10:59 AM Danny Chan <[email protected]> wrote:

Thanks Timo ~

For the naming itself, I also think PROPERTIES is not that concise, so
+1 for OPTIONS (I had thought about that, but a lot of code in current
Flink calls them properties, e.g. DescriptorProperties,
#getSupportedProperties). Let's use OPTIONS if this is our new preference.

+1 to `Set<ConfigOption> supportedHintOptions()` because the ConfigOption
can carry more info. AFAIK, Spark also calls them table options instead of
properties. [1]

In my local POC, I did create a new CatalogTable, and it works well for
current connectors: all the DDL tables would finally yield a CatalogTable
instance and we can apply the options to that (in the CatalogSourceTable
when generating the TableSource). The pro is that we do not need to
modify the code of the connectors themselves. If we split the options from
CatalogTable, we may need to add some additional logic in each connector
factory in order to merge these properties (and the logic is almost the
same). What do you think about this?

[1]
https://docs.databricks.com/spark/latest/spark-sql/language-manual/create-table.html


Best,
Danny Chan
On 17 Mar 2020 at 22:10 +0800, Timo Walther <[email protected]> wrote:

Hi Danny,

thanks for updating the FLIP. I think your current design is sufficient
to separate hints from result-related properties.

One remark to the naming itself: I would vote for calling the hints
around table scan `OPTIONS('k'='v')`. We used the term "properties" in
the past but since we want to unify the Flink configuration experience,
we should use consistent naming and classes around `ConfigOptions`.

It would be nice to use `Set<ConfigOption> supportedHintOptions();` to
start using config options instead of pure string properties. This will
also allow us to generate documentation in the future around supported
data types, ranges, etc. for options. At some point we would also like
to drop the `DescriptorProperties` class. "Options" is also used in the
documentation [1] and in the SQL/MED standard [2].

Furthermore, I would still vote for separating CatalogTable and hint
options. Otherwise the planner would need to create a new CatalogTable
instance which might not always be easy. We should offer them via:

org.apache.flink.table.factories.TableSourceFactory.Context#getHints:
ReadableConfig

What do you think?
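
The separation proposed above could look roughly like the following. This
is a self-contained sketch, not Flink code: `HintOptionSketch` stands in
for the idea of a typed config option and `SourceContextSketch` for a
factory context that exposes hints separately from the catalog table
properties; both names and the resolution order are assumptions made for
illustration.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical typed option; a stand-in for the idea behind ConfigOption. */
final class HintOptionSketch<T> {
    final String key;
    final T defaultValue;
    final Function<String, T> parser;

    HintOptionSketch(String key, T defaultValue, Function<String, T> parser) {
        this.key = key;
        this.defaultValue = defaultValue;
        this.parser = parser;
    }
}

/** Hypothetical factory context: catalog properties and hints are kept apart. */
final class SourceContextSketch {
    private final Map<String, String> catalogProperties;
    private final Map<String, String> hints;

    SourceContextSketch(Map<String, String> catalogProperties, Map<String, String> hints) {
        // Defensive copies: the catalog table itself is never modified.
        this.catalogProperties = Collections.unmodifiableMap(new HashMap<>(catalogProperties));
        this.hints = Collections.unmodifiableMap(new HashMap<>(hints));
    }

    /** The hint wins, then the catalog property, then the option's default. */
    <T> T resolve(HintOptionSketch<T> option) {
        String raw = hints.containsKey(option.key)
                ? hints.get(option.key)
                : catalogProperties.get(option.key);
        return raw == null ? option.defaultValue : option.parser.apply(raw);
    }
}
```

Because the factory decides per option whether to consult the hints, the
planner never has to build a new CatalogTable instance in this sketch.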

Regards,
Timo

[1]
https://ci.apache.org/projects/flink/flink-docs-master/dev/table/sql/create.html#create-table

[2] https://wiki.postgresql.org/wiki/SQL/MED


On 12.03.20 15:06, Stephan Ewen wrote:

@Danny sounds good.

Maybe it is worth listing all the classes of problems that you want to
address and then look at each class and see if hints are a good default
solution or a good optional way of simplifying things?
The discussion has grown a lot and it is starting to be hard to
distinguish the parts where everyone agrees from the parts where there
are concerns.


On Thu, Mar 12, 2020 at 2:31 PM Danny Chan <[email protected]> wrote:

Thanks Stephan ~

We can remove the support for properties that may change the semantics of
the query if you think that is trouble.

How about we support the /*+ properties() */ hint only for optimization
parameters, such as the fetch size of a source or something like that?
Does that make sense?

On Thu, 12 Mar 2020 at 19:45, Stephan Ewen <[email protected]> wrote:

I think Bowen has actually put it very well.

(1) Hints that change semantics look like trouble waiting to happen. For
example Kafka offset handling should be in filters. The Kafka source
should support predicate pushdown.

(2) Hints should not be a workaround for current shortcomings. A lot of
what is suggested above sounds exactly like that: working around
catalog/DDL shortcomings, missing exposure of metadata (offsets), missing
predicate pushdown in Kafka. Abusing a feature like hints now as a quick
fix for these issues, rather than fixing the root causes, will very likely
come back to bite us badly in the future.

Best,
Stephan


On Thu, Mar 12, 2020 at 10:43 AM Kurt Young <[email protected]> wrote:

It seems this FLIP's name is somewhat misleading. From my understanding,
this FLIP is trying to address the dynamic parameter issue, and table
hints is the way we want to choose. I think we should focus on "what's the
right way to solve dynamic property" instead of discussing "whether table
hints can affect query semantics".

For now, there are three proposed ways to achieve dynamic property:
1. FLIP-110: create temporary table xx like xx with (xxx)
2. use custom "from t with (xxx)" syntax
3. "Borrow" the table hints to have a special PROPERTIES hint.

The first one doesn't break anything, but the only problem I see is that
it is a little more verbose than the table hint approach. I can imagine
that when someone uses the SQL CLI to have a SQL experience, it's quite
often that he will modify the table properties. Some use cases I can think
of:
1. The source contains some corrupted data, and I want to turn on the
"ignore-error" flag for certain formats.
2. I have a kafka table and want to see some sample data from the
beginning, so I change the offset to "earliest", and then I want to
observe the latest data which keeps coming in. I would write another query
to select from the latest table.
3. I want my jdbc sink to flush data more eagerly so that I can observe
the data from the database side.

Most of such use cases are quite ad-hoc. If every time I want to have a
different experience, I need to create a temporary table and then also
modify my query, it doesn't feel smooth. Embedding such dynamic properties
into the query would give a better user experience.

Both 2 & 3 can make this happen. The con of #2 is breaking SQL
compliance, and for #3, it only breaks some unwritten rules, but we can
have an explanation on that. And I really doubt whether users would
complain about this when they actually have a flexible and good
experience using this.

My tendency would be #3 > #1 > #2, what do you think?

Best,
Kurt


On Thu, Mar 12, 2020 at 1:11 PM Danny Chan <[email protected]> wrote:

Thanks Aljoscha ~

I agree that most query hints are optional as optimizer instructions,
especially in a traditional RDBMS.

But, just like BenChao said, Flink as a computation engine has many
different kinds of data sources; thus, dynamic parameters like
start_offset can only be bound to each table's scope. We cannot set a
session config like KSQL does, because there everything is about Kafka:

SET 'auto.offset.reset'='earliest';

Thus the most flexible way to set up these dynamic params is to bind them
to the table scope in the query when we want to override something, so we
have these solutions above (with pros and cons from my side):

• 1. Select * from t(offset=123) (from Timo)

Pros:
- Easy to add
- Parameters are part of the main query
Cons:
- Not SQL compliant

• 2. Select * from t /*+ PROPERTIES(offset=123) */ (from me)

Pros:
- Easy to add
- SQL compliant because it is nested in the comments

Cons:
- Parameters are not part of the main query
- Cryptic syntax for new users

The biggest problem with the hints approach may be the question of
"whether hints must be optional". Actually we thought about 1 for a while
but abandoned it because it breaks the SQL standard too much. And we
replaced it with 2, because the hint syntax does not break the SQL
standard (it is nested in comments).

What if we have the special /*+ PROPERTIES */ hint that allows overriding
some properties of a table dynamically? It does not break anything, at
least for current Flink use cases.

Planner hints are optional just because they are naturally enforcers of
the planner; most of them aim to instruct the optimizer. But table hints
are a little different: table hints can specify table metadata like an
index column, and it is very convenient to specify table properties.

Or shall we not call /*+ PROPERTIES(offset=123) */ a table hint? We can
call it table dynamic parameters.

Best,
Danny Chan
On 11 Mar 2020 at 21:20 +0800, Aljoscha Krettek <[email protected]> wrote:

Hi,

I don't understand this discussion. Hints, as I understand them, should
work like this:

- hints are *optional* advice for the optimizer to try and help it to
find a good execution strategy
- hints should not change query semantics, i.e. they should not change
connector properties; executing a query while taking into account the
hints *must* produce the same result as executing the query without
taking into account the hints

From these simple requirements you can derive a solution that makes
sense. I don't have a strong preference for the syntax but we should
strive to be in line with prior work.

Best,
Aljoscha

On 11.03.20 11:53, Danny Chan wrote:

Thanks Timo for summarizing the 3 options ~

I agree with Kurt that option 2 is too complicated to use because:

• As a Kafka topic consumer, the user must define the virtual column for
the start offset and must also apply a special filter predicate after
each query

• For the internal implementation, the metadata column push-down is
another hard topic: each kind of message queue may have its own offset
attribute, so we need to consider the expression type for the different
kinds; the source also needs to recognize the constant column as a config
option (which is weird because usually what we push down is a table
column)

For option 1 and option 3, I think there is no difference. Option 1 is
also a hint syntax, which was introduced in Sybase, adopted and then
deprecated by MS-SQL in the 1990s because of its ambiguity. Personally I
prefer the /*+ */ style table hint over the WITH keyword for these
reasons:

• We do not break standard SQL; the hints are nested in SQL comments

• We do not need to introduce an additional WITH keyword, which could
appear anywhere in a query if we used it, because a table can be
referenced in all kinds of SQL contexts: INSERT/DELETE/FROM/JOIN …. That
would make our SQL deviate too much from the standard

• We would have a uniform syntax for table hints and query hints: one
syntax fits all and is easier to use

And here is the reason why we chose a uniform Oracle-style query hint
syntax, as addressed by Julian Hyde when we designed the syntax in the
Calcite community:

I don't much like the MSSQL-style syntax for table hints. It adds a new
use of the WITH keyword that is unrelated to the use of WITH for
common-table expressions.

A historical note. Microsoft SQL Server inherited its hint syntax from
Sybase a very long time ago. (See "Transact SQL Programming"[1], page 632,
"Optimizer hints". The book was written in 1999, and covers Microsoft SQL
Server 6.5 / 7.0 and Sybase Adaptive Server 11.5, but the syntax very
likely predates Sybase 4.3, from which Microsoft SQL Server was forked in
1993.)

Microsoft later added the WITH keyword to make it less ambiguous, and has
now deprecated the syntax that does not use WITH.

They are forced to keep the syntax for backwards compatibility but that
doesn't mean that we should shoulder their burden.

I think formatted comments are the right container for hints because it
allows us to change the hint syntax without changing the SQL parser, and
makes clear that we are at liberty to ignore hints entirely.

Julian

[1] https://www.amazon.com/s?k=9781565924017


Best,
Danny Chan
On 11 Mar 2020 at 16:03 +0800, Timo Walther <[email protected]> wrote:

Hi Danny,

it is true that our DDL is not standard compliant by using the WITH
clause. Nevertheless, we aim for not diverging too much and the LIKE
clause is an example of that. It will solve things like overwriting
WATERMARKs, adding/modifying properties and inheriting the schema.

Bowen is right that Flink's DDL is mixing 3 types of definitions together.
We are not the first ones that try to solve this. There is also the SQL
MED standard [1] that tried to tackle this problem. I think it was not
considered when designing the current DDL.

Currently, I see 3 options for handling Kafka offsets. I will give some
examples and look forward to feedback here:

*Option 1* Runtime and semantic params as part of the query

`SELECT * FROM MyTable('offset'=123)`

Pros:
- Easy to add
- Parameters are part of the main query
- No complicated hinting syntax

Cons:
- Not SQL compliant

*Option 2* Use metadata in query

`CREATE TABLE MyTable (id INT, offset AS SYSTEM_METADATA('offset'))`

`SELECT * FROM MyTable WHERE offset > TIMESTAMP '2012-12-12 12:34:22'`

Pros:
- SQL compliant in the query
- Access of metadata in the DDL which is required anyway
- Regular pushdown rules apply

Cons:
- Users need to add an additional column in the DDL

*Option 3*: Use hints for properties

`
SELECT *
FROM MyTable /*+ PROPERTIES('offset'=123) */
`

Pros:
- Easy to add

Cons:
- Parameters are not part of the main query
- Cryptic syntax for new users
- Not standard compliant.

If we go with this option, I would suggest making it available in a
separate map and not mixing it with statically defined properties, such
that the factory can decide which properties have the right to be
overwritten by the hints:
TableSourceFactory.Context.getQueryHints(): ReadableConfig


Regards,
Timo

[1] https://en.wikipedia.org/wiki/SQL/MED

Currently I see 3 options as a


On 11.03.20 07:21, Danny Chan wrote:

Thanks Bowen ~

I agree we should somehow categorize our connector parameters.

For type 1, I'm already preparing a solution like the Confluent schema
registry + Avro schema inference thing, so this may not be a problem in
the near future.

For type 3, I have some questions:

"SELECT * FROM mykafka WHERE offset > 12pm yesterday"

Where does the offset column come from? A virtual column from the table
schema? You said that

"They change almost every time a query starts and have nothing to do with
metadata, thus should not be part of table definition/DDL"

But then why can you reference it in the query? I'm confused about that,
can you elaborate a little?


Best,
Danny Chan
On 11 Mar 2020 at 12:52 +0800, Bowen Li <[email protected]> wrote:

Thanks Danny for kicking off the effort

The root cause of too much manual work is that Flink DDL has mixed 3 types
of params together and doesn't handle each of them very well. Below is how
I categorize them and the corresponding solutions in my mind:

- type 1: Metadata of external data, like external endpoint/url,
username/pwd, schemas, formats.

Such metadata are mostly already accessible in external systems as long as
endpoints and credentials are provided. Flink can get it through catalogs,
but we haven't had many catalogs yet and thus Flink just hasn't been able
to leverage that. So the solution should be building more catalogs. Such
params should be part of a Flink table DDL/definition, and not overridable
by any means.

- type 2: Runtime params, like jdbc connector's fetch size, elasticsearch
connector's bulk flush size.

Such params don't affect query results, but affect how results are
produced (e.g. fast or slow, aka performance) - they are essentially
execution and implementation details. They change often in exploration or
development stages, but not quite frequently in well-defined long-running
pipelines. They should always have default values and can be missing in a
query. They can be part of a table DDL/definition, but should also be
replaceable in a query - *this is what table "hints" in FLIP-113 should
cover*.

- type 3: Semantic params, like kafka connector's start offset.

Such params affect query results - the semantics. They'd better be filter
conditions in the WHERE clause that can be pushed down. They change almost
every time a query starts and have nothing to do with metadata, thus
should not be part of table definition/DDL, nor be persisted in catalogs.
If they are, users should create views to keep such params around (note
this is different from variable substitution).

Take Flink-Kafka as an example. Once we get these params right, here are
the steps users need to do to develop and run a Flink job:
- configure a Flink ConfluentSchemaRegistry with url, username, and
password
- run "SELECT * FROM mykafka WHERE offset > 12pm yesterday" (simplified
timestamp) in SQL CLI, Flink automatically retrieves all metadata of
schema, file format, etc. and starts the job
- users want to make the job read the Kafka topic faster, so it goes as
"SELECT * FROM mykafka /* faster_read_key=value*/ WHERE offset > 12pm
yesterday"
- done and satisfied, users submit it to production

Regarding "CREATE TABLE t LIKE with (k1=v1, k2=v2)", I think it's a
nice-to-have feature, but not a strategically critical, long-term
solution, because
1) It may seem promising at the current stage to solve the
too-much-manual-work problem, but that's only because Flink hasn't
leveraged catalogs well and handled the 3 types of params above properly.
Once we get the param types right, the LIKE syntax won't be that
important, and will be just an easier way to create tables without
retyping long fields like username and pwd.
2) Note that only some rare types of catalog can store k-v property pairs,
so tables created this way often cannot be persisted. In the foreseeable
future, such a catalog will only be HiveCatalog, and not everyone has a
Hive metastore. To be honest, without persistence, recreating tables every
time this way is still a lot of keyboard typing.


Cheers,
Bowen

On Tue, Mar 10, 2020 at 8:07 PM Kurt Young <[email protected]> wrote:

If a specific connector wants to have such a parameter and read it out of
the configuration, then that's fine.
If we are talking about a configuration for all kinds of sources, I would
be super careful about that.
It's true it can solve maybe 80% of cases, but it will also make the
remaining 20% feel weird.

Best,
Kurt


On Wed, Mar 11, 2020 at 11:00 AM Jark Wu <[email protected]> wrote:

Hi Kurt,

#3 Regarding the global offset:
I'm not saying we should use the global configuration to override
connector properties in the planner.
But the connector should take this configuration and translate it into
its client API.
AFAIK, almost all the message queues support earliest, latest and a
timestamp value as the start point.
So we can support 3 options for this configuration: "earliest", "latest"
and a timestamp string value.
Of course, this can't solve 100% of cases, but I guess it can solve 80% or
90% of cases.
And the remaining cases can be resolved by the LIKE syntax, which I guess
are not very common cases.
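
A connector could interpret such a three-valued setting along the lines of
the sketch below. It is an illustration only: the class `StartOffsetSketch`
is hypothetical, and the thread does not specify the timestamp format, so
ISO-8601 instants (as accepted by `java.time.Instant.parse`) are assumed
here.

```java
import java.time.Instant;
import java.util.Locale;

/** Hypothetical parser for a global start-offset value; not a Flink class. */
final class StartOffsetSketch {

    enum Mode { EARLIEST, LATEST, TIMESTAMP }

    final Mode mode;
    /** Non-null only when mode is TIMESTAMP. */
    final Instant timestamp;

    private StartOffsetSketch(Mode mode, Instant timestamp) {
        this.mode = mode;
        this.timestamp = timestamp;
    }

    /**
     * Accepts "earliest", "latest" (case-insensitive) or an ISO-8601 instant.
     * Anything else makes Instant.parse throw a DateTimeParseException.
     */
    static StartOffsetSketch parse(String value) {
        String trimmed = value.trim();
        String lower = trimmed.toLowerCase(Locale.ROOT);
        if (lower.equals("earliest")) {
            return new StartOffsetSketch(Mode.EARLIEST, null);
        }
        if (lower.equals("latest")) {
            return new StartOffsetSketch(Mode.LATEST, null);
        }
        return new StartOffsetSketch(Mode.TIMESTAMP, Instant.parse(trimmed));
    }
}
```

Each connector would then translate the parsed mode into its own client
API, which is the part that differs between message queues.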

Best,
Jark


On Wed, 11 Mar 2020 at 10:33, Kurt Young <[email protected]> wrote:

Good to have such lovely discussions. I also want to share some of my
opinions.

#1 Regarding error handling: I also think ignoring invalid hints would be
dangerous; maybe the simplest solution is to just throw an exception.

#2 Regarding property replacement: I don't think we should constrain
ourselves to the meaning of the word "hint" and forbid it from modifying
any properties which can affect query results. IMO `PROPERTIES` is one of
the table hints, and a powerful one. It can modify properties located in
the DDL's WITH block. But I also see the harm if we make it too flexible,
like changing the kafka topic name with a hint. Such a use case is not
common and sounds very dangerous to me. I would propose we have a map of
hintable properties for each connector, and validate that all passed-in
properties are actually hintable. And combining with #1 error handling, we
can throw an exception once we receive an invalid property.

#3 Regarding global offset: I'm not sure it's feasible. Different
connectors will have totally different properties to represent offset:
some might be timestamps, some might be string literals like "earliest",
and others might be just integers.


Best,
Kurt


On Tue, Mar 10, 2020 at 11:46 PM Jark Wu <[email protected]> wrote:

Hi everyone,

I want to jump into the discussion about the "dynamic start offset"
problem.
First of all, I share the same concern as Timo and Fabian, that the
"start offset" affects the query semantics, i.e. the query result.
But "hints" are just used for optimization, which should not affect the
result?

I think the "dynamic start offset" is a very important usability problem
which will be faced by many streaming platforms.
I also agree "CREATE TEMPORARY TABLE Temp (LIKE t) WITH
('connector.startup-timestamp-millis' = '1578538374471')" is verbose, what
if we have 10 tables to join?

However, what I want to propose (should be another thread) is a global
configuration to reset start offsets of all the source connectors in the
query session, e.g. "table.sources.start-offset". This is possible now
because `TableSourceFactory.Context` has a `getConfiguration` method to
get the session configuration, and can use it to create an adapted
TableSource.
Then we can also expose it to the SQL CLI via the SET command, e.g. `SET
'table.sources.start-offset'='earliest';`, which is pretty simple and
straightforward.

This is very similar to KSQL's `SET 'auto.offset.reset'='earliest'` which
is very helpful IMO.

Best,
Jark


On Tue, 10 Mar 2020 at 22:29, Timo Walther <[email protected]> wrote:

Hi Danny,

compared to the hints, FLIP-110 is fully compliant with the SQL standard.

I don't think that `CREATE TEMPORARY TABLE Temp (LIKE t) WITH (k=v)` is
too verbose or awkward for the power of basically changing the entire
connector. Usually, this statement would just precede the query in a
multiline file. So it can be changed "in-place" like the hints you
proposed.

Many companies have a well-defined set of tables that should be used. It
would be dangerous if users can change the path or topic in a hint. The
catalog/catalog manager should be the entity that controls which tables
exist and how they can be accessed.

what's the problem there if we use the table hints to support "start
offset"?

IMHO it violates the meaning of a hint. According to the dictionary, a
hint is "a statement that expresses indirectly what one prefers not to
say explicitly". But offsets are properties that are very explicit.

If we go with the hint approach, it should be expressible in the
TableSourceFactory which properties are supported for hinting. Or do you
plan to offer those hints in a separate Map<String, String> that cannot
overwrite existing properties? I think this would be a different story...


Regards,
Timo


On 10.03.20 10:34, Danny Chan wrote:

Thanks Timo ~

Personally I would say that offset > and start offset = 10 do not have
the same semantics, so from the SQL aspect, we cannot implement a
"starting offset" hint for a query with such a syntax.

And the CREATE TABLE LIKE syntax is a DDL, which is just verbose for
defining such dynamic parameters even if it could do that. Shall we force
users to define a temporary table for each query with dynamic params? I
would say it's an awkward solution.

"Hints should give "hints" but not affect the actual produced result."
You mentioned that multiple times; could you give a reason? What's the
problem if we use the table hints to support "start offset"? From my side
I see some benefits:

• It's very convenient to set up these parameters; the syntax is very much
like the DDL definition

• Its scope is very clear: right on the table it is attached to

• It does not affect the table schema, which means that in order to
specify the offset, there is no need to define an offset column, which is
actually weird: offset should never be a column, it's more like metadata
or a start option.

So in total, FLIP-110 uses the offset more like a Hive partition prune. We
can do that if we have an offset column, but in most cases we do not
define that, so there is actually no conflict or overlap.


Best,
Danny Chan
On March 10, 2020 at 4:28 PM +0800, Timo Walther <[email protected]> wrote:

Hi Danny,

shouldn't FLIP-110[1] solve most of the problems we have around
defining table properties more dynamically without manual schema work?
Offset definition is also easier with such a syntax. The tables do not
have to be defined in the catalog but could be temporary tables that
extend from the original table.

In general, we should aim to keep the syntax concise and not provide
too many ways of doing the same thing. Hints should give "hints" but
not affect the actual produced result.


Some connector properties might also change the plan or schema in the
future. E.g. they might also define whether a table source supports
certain push-downs (e.g. predicate push-down).


Dawid is currently working on a draft that might make it possible to
expose a Kafka offset via the schema such that `SELECT * FROM Topic
WHERE offset > 10` would become possible and could be pushed down. But
this is, of course, not planned initially.


Regards,
Timo


[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-110%3A+Support+LIKE+clause+in+CREATE+TABLE




On 10.03.20 08:34, Danny Chan wrote:

Thanks Wenlong ~

For PROPERTIES Hint Error Handling


Actually we have no way to figure out whether an erroneous hint is a
PROPERTIES hint. For example, if a user writes a hint like
‘PROPERTIAS’, we do not know whether this hint was meant to be a
PROPERTIES hint; what we know is that the hint name was not registered
in Flink.


If the user writes the hint name correctly (i.e. PROPERTIES), we can
enforce the validation of the hint options through the pluggable
HintOptionChecker.
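
The distinction made above (a misspelled name is just an unregistered hint, while a registered hint can have its options validated) can be sketched like this. It is a minimal Python illustration, not the actual planner code: `HINT_CHECKERS`, `check_properties`, `resolve_hint` and the `strict` flag are assumptions made for the example, with `strict` standing in for the "ignore with a warning" versus "throw" choice debated in this thread.

```python
import logging


def check_properties(options):
    """Hypothetical option checker: every value must be a non-empty string."""
    return all(isinstance(v, str) and v for v in options.values())


# Hypothetical registry mapping a hint name to its option checker.
HINT_CHECKERS = {"PROPERTIES": check_properties}


def resolve_hint(name, options, strict=False):
    """Return options if the hint is usable; else None (lenient) or raise (strict)."""
    checker = HINT_CHECKERS.get(name)
    if checker is None:
        # A typo such as PROPERTIAS lands here: all we know is "not registered".
        problem = f"hint {name!r} is not registered"
    elif not checker(options):
        problem = f"invalid options for hint {name!r}"
    else:
        return options
    if strict:
        raise ValueError(problem)
    logging.warning("%s; ignoring it", problem)
    return None
```

With `strict=False` a bad hint is logged and dropped; with `strict=True` the same input fails the query.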


For PROPERTIES Hint Option Format


For a key-value style hint option, the key can be either a simple
identifier or a string literal, which means that it’s compatible with
our DDL syntax. We support simple identifiers because many other hints
do not have complex compound keys like the table properties, and we
want to unify the parse block.
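
As a rough illustration of the key format described above, here is a small Python sketch that accepts either a simple identifier or a single-quoted string literal as the key. It is not the real SQL parser; `OPTION` and `parse_hint_options` are hypothetical names, and the sketch deliberately ignores escaped quotes and commas inside values.

```python
import re

# Key: simple identifier (k1) or single-quoted literal ('connector.type').
# Value: single-quoted literal. Escaped quotes and commas inside values
# are not handled in this sketch.
OPTION = re.compile(r"\s*(?:([A-Za-z_]\w*)|'([^']*)')\s*=\s*'([^']*)'\s*")


def parse_hint_options(text):
    """Parse "k1='v1', 'a.b'='v2'" into a dict; raise ValueError on bad input."""
    options = {}
    for part in text.split(","):
        match = OPTION.fullmatch(part)
        if match is None:
            raise ValueError(f"malformed hint option: {part.strip()!r}")
        key = match.group(1) if match.group(1) is not None else match.group(2)
        options[key] = match.group(3)
    return options
```

In this sketch a dotted key such as connector.type has to be written as a string literal, the same way it is written in DDL, while a plain key like k1 can stay unquoted.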


Best,
Danny Chan
On March 10, 2020 at 3:19 PM +0800, wenlong.lwl <[email protected]> wrote:

Hi Danny, thanks for the proposal. +1 for adding table hints, it is
really a necessary feature for Flink SQL to integrate with a catalog.


For error handling, I think it would be more natural to throw an
exception when an erroneous table hint is provided, because the
properties in the hint will be merged and used to find the table
factory, which would cause an exception when erroneous properties are
provided, right? On the other hand, unlike other hints, which just
affect the way the query is executed, the property table hint actually
affects the result of the query, so we should never ignore the given
property hints.


For the format of property hints: currently, in the SQL client, we
accept properties only in string format in DDL, e.g.
'connector.type'='kafka'. I think the format of properties in hints
should be the same as the format we defined in DDL. What do you think?


Bests,
Wenlong Lyu

On Tue, 10 Mar 2020 at 14:22, Danny Chan <[email protected]> wrote:

To Weike: About the Error Handling


To be consistent with other SQL vendors, the default is to log
warnings, and if there is any error (invalid hint name or options),
the hint is just ignored. I have already addressed this in the wiki.


To Timo: About the PROPERTIES Table Hint


• The properties hint is also optional: a user can pass in an option
  to override the table properties, but this does not mean it is
  required.
• "They should not include semantics": do the properties belong to
  the semantics? I don't think so; the plan does not change, right?
  The result set may be affected, but there are already some hints
  that do so, for example the MS-SQL MAXRECURSION and SNAPSHOT hints
  [1].

• `SELECT * FROM t(k=v, k=v)`: this grammar breaks the SQL standard,
  compared to the hints way (which is enclosed in comments).
• I actually didn't find any vendor that supports such a grammar, and
  there is no way to override table-level properties dynamically. For
  a normal RDBMS, I think there are no requests for such dynamic
  parameters because all the tables have the same storage and
  computation, and they are almost all batch tables.
• Flink, on the other hand, as a computation engine has many
  connectors. Especially for a message queue like Kafka, we would have
  a start_offset which is different each time we start the query. Such
  parameters cannot be persisted to the catalog, because they are not
  static. This is actually the background for proposing the table
  hints to specify such properties dynamically.


To Jark and Jinsong:

have

removed the

query hints part and

change

the

title.

[1]
https://docs.microsoft.com/en-us/sql/t-sql/queries/hints-transact-sql-query?view=sql-server-ver15


Best,
Danny Chan
On March 9, 2020 at 5:46 PM +0800, Timo Walther <[email protected]> wrote:

Hi Danny,

thanks for the proposal. I agree with Jark and Jingsong. Planner hints
and table hints are orthogonal topics that should be discussed
separately.


I share Jingsong's opinion that we should not use planner hints for
passing connector properties. Planner hints should be optional at any
time. They should not include semantics but only affect execution
time. Connector properties are an important part of the query itself.


Have you thought about options such as `SELECT * FROM t(k=v, k=v)`?
How are other vendors dealing with this problem?


Regards,
Timo


On 09.03.20 10:37, Jingsong Li wrote:

Hi Danny, +1 for table hints, thanks for driving.


I took a look at the FLIP; most of the content is talking about query
hints. That makes it hard to discuss and vote on. So +1 to split it as
Jark said.


Another thing is the configuration that is suitable to set with table
hints: "connector.path" and "connector.topic". Are they really
suitable for table hints? Looks weird to me, because I think these
properties are the core of the table.

Best,
Jingsong Lee

On Mon, Mar 9, 2020 at 5:30 PM Jark Wu <[email protected]> wrote:

Thanks Danny for starting the discussion.
+1 for this feature.


If we just focus on the table hints, not the query hints, in this
release, could you split the FLIP into two FLIPs? Because it's hard to
vote on only part of a FLIP. You can keep the table hints proposal in
FLIP-113 and move query hints into another FLIP, so that we can focus
on the table hints in the FLIP.


Thanks,
Jark



On Mon, 9 Mar 2020 at 17:14, DONG, Weike <[email protected]> wrote:

Hi Danny,

This is a nice feature, +1.


One thing I am interested in that is not mentioned in the proposal is
the error handling. It is quite common for users to write
inappropriate hints in SQL code; if illegal or "bad" hints are given,
would the system simply ignore them or throw exceptions?


Thanks : )

Best,
Weike

On Mon, Mar 9, 2020 at 5:02 PM Danny Chan <[email protected]> wrote:

Note:
we only plan to support table hints in Flink release 1.11, so please
focus mainly on the table hints part and just ignore the planner
hints. Sorry for that mistake.


Best,
Danny Chan
On March 9, 2020 at 4:36 PM +0800, Danny Chan <[email protected]> wrote:

Hi, fellows ~


I would like to propose support for SQL hints in our Flink SQL.


We would support hints syntax as follows:


    select /*+ NO_HASH_JOIN, RESOURCE(mem='128mb', parallelism='24') */
    from
      emp /*+ INDEX(idx1, idx2) */
      join
      dept /*+ PROPERTIES(k1='v1', k2='v2') */
      on emp.deptno = dept.deptno


Basically we would support both query hints (after the SELECT keyword)
and table hints (after the referenced table name). For 1.11, we plan
to only support table hints, with a hint probably named PROPERTIES:

    table_name /*+ PROPERTIES(k1='v1', k2='v2') */


I am looking forward to your comments.


You can access the FLIP here:
https://cwiki.apache.org/confluence/display/FLINK/FLIP-113%3A+SQL+and+Planner+Hints


Best,
Danny Chan
