Hi,

I don't understand this discussion. Hints, as I understand them, should work like this:

- hints are *optional* advice for the optimizer, to try and help it find a good execution strategy
- hints should not change query semantics, i.e. they should not change connector properties; executing a query while taking the hints into account *must* produce the same result as executing it without them

From these simple requirements you can derive a solution that makes sense. I don't have a strong preference for the syntax but we should strive to be in line with prior work.

Best,
Aljoscha

On 11.03.20 11:53, Danny Chan wrote:
Thanks Timo for summarizing the 3 options ~

I agree with Kurt that option2 is too complicated to use because:

• As a Kafka topic consumer, the user must both define a virtual column for the start offset and apply a special filter predicate in each query
• For the internal implementation, metadata column push-down is another hard topic: each kind of message queue may have its own offset attribute, so we need to consider the expression type for each kind; the source also needs to recognize the constant column as a config option (which is weird, because usually what we push down is a table column)

For option 1 and option 3, I think there is no real difference: option 1 is also a hint syntax, one that was introduced in Sybase and then inherited and later deprecated by MS-SQL in the 1990s because of its ambiguity. Personally I prefer the /*+ */ style table hint over the WITH keyword for these reasons:

• We do not break standard SQL; the hints are nested in SQL comments
• We do not need to introduce an additional WITH keyword, which could otherwise appear wherever a table is referenced: INSERT/DELETE/FROM/JOIN … That would diverge our SQL too much from the standard
• We would have a uniform syntax for table hints and query hints; one syntax fits all and is easier to use


And here is the reasoning behind choosing a uniform Oracle-style hint syntax, as laid out by Julian Hyde when we designed the syntax in the Calcite community:

I don’t much like the MSSQL-style syntax for table hints. It adds a new use of 
the WITH keyword that is unrelated to the use of WITH for common-table 
expressions.

A historical note. Microsoft SQL Server inherited its hint syntax from Sybase a 
very long time ago. (See “Transact SQL Programming”[1], page 632, “Optimizer 
hints”. The book was written in 1999, and covers Microsoft SQL Server 6.5 / 7.0 
and Sybase Adaptive Server 11.5, but the syntax very likely predates Sybase 
4.3, from which Microsoft SQL Server was forked in 1993.)

Microsoft later added the WITH keyword to make it less ambiguous, and has now 
deprecated the syntax that does not use WITH.

They are forced to keep the syntax for backwards compatibility but that doesn’t 
mean that we should shoulder their burden.

I think formatted comments are the right container for hints because it allows 
us to change the hint syntax without changing the SQL parser, and makes clear 
that we are at liberty to ignore hints entirely.
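Julian's point that comment-style hints can be parsed, or simply ignored, without touching the SQL grammar can be illustrated with a toy sketch (plain Python with a regex of my own; Calcite's actual parser works differently, so this is only an illustration of the idea):

```python
import re

# Toy sketch, not Calcite's implementation: because hints live inside
# formatted comments, a pre-processor can collect them (or strip them)
# without changing the SQL grammar at all.
HINT_RE = re.compile(r"/\*\+\s*(.*?)\s*\*/", re.DOTALL)

def extract_hints(sql):
    """Return (sql with hint comments removed, list of hint bodies)."""
    hints = HINT_RE.findall(sql)
    stripped = HINT_RE.sub(" ", sql)
    return " ".join(stripped.split()), hints

clean, hints = extract_hints(
    "SELECT * FROM MyTable /*+ PROPERTIES('offset'='123') */"
)
# clean: "SELECT * FROM MyTable", hints: ["PROPERTIES('offset'='123')"]
```

An engine that does not understand a hint can drop it and still run the remaining, fully standard SQL.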

Julian

[1] https://www.amazon.com/s?k=9781565924017

Best,
Danny Chan
On Mar 11, 2020 at 4:03 PM +0800, Timo Walther <twal...@apache.org> wrote:
Hi Danny,

it is true that our DDL is not standard compliant because it uses the WITH clause. Nevertheless, we aim not to diverge too much, and the LIKE clause is an example of that. It will solve things like overwriting WATERMARKs, adding or modifying properties, and inheriting the schema.

Bowen is right that Flink's DDL mixes 3 types of definitions together. We are not the first to try to solve this; there is also the SQL/MED standard [1] that tried to tackle the problem. I think it was not considered when designing the current DDL.

Currently, I see 3 options for handling Kafka offsets. I will give some
examples and look forward to feedback here:

*Option 1* Runtime and semantic params as part of the query

`SELECT * FROM MyTable('offset'=123)`

Pros:
- Easy to add
- Parameters are part of the main query
- No complicated hinting syntax

Cons:
- Not SQL compliant

*Option 2* Use metadata in query

`CREATE TABLE MyTable (id INT, offset AS SYSTEM_METADATA('offset'))`

`SELECT * FROM MyTable WHERE offset > TIMESTAMP '2012-12-12 12:34:22'`

Pros:
- SQL compliant in the query
- Access of metadata in the DDL which is required anyway
- Regular pushdown rules apply

Cons:
- Users need to add an additional column in the DDL

*Option 3*: Use hints for properties

`
SELECT *
FROM MyTable /*+ PROPERTIES('offset'=123) */
`

Pros:
- Easy to add

Cons:
- Parameters are not part of the main query
- Cryptic syntax for new users
- Not standard compliant

If we go with this option, I would suggest making the hints available in a separate map and not mixing them with statically defined properties, so that the factory can decide which properties may be overwritten by hints:
TableSourceFactory.Context.getQueryHints(): ReadableConfig

Regards,
Timo

[1] https://en.wikipedia.org/wiki/SQL/MED


On 11.03.20 07:21, Danny Chan wrote:
Thanks Bowen ~

I agree we should somehow categorize our connector parameters.

For type 1, I'm already preparing a solution along the lines of the Confluent schema registry + Avro schema inference, so this may not be a problem in the near future.

For type 3, I have some questions:

"SELECT * FROM mykafka WHERE offset > 12pm yesterday”

Where does the offset column come from? A virtual column in the table schema? You said that

They change
almost every time a query starts and have nothing to do with metadata, thus
should not be part of table definition/DDL

But then why can it be referenced in the query? I'm confused by that; can you elaborate a little?

Best,
Danny Chan
On Mar 11, 2020 at 12:52 PM +0800, Bowen Li <bowenl...@gmail.com> wrote:
Thanks Danny for kicking off the effort

The root cause of too much manual work is that Flink DDL mixes 3 types of params together and doesn't handle each of them very well. Below is how I categorize them, with the corresponding solutions in my mind:

- type 1: Metadata of external data, like external endpoint/url,
username/pwd, schemas, formats.

Such metadata is mostly already accessible in the external system as long as endpoints and credentials are provided. Flink can get it through catalogs, but we haven't had many catalogs yet and thus Flink just hasn't been able to leverage that. So the solution should be building more catalogs. Such params should be part of a Flink table DDL/definition, and not overridable by any means.


- type 2: Runtime params, like jdbc connector's fetch size, elasticsearch
connector's bulk flush size.

Such params don't affect query results, but affect how results are produced (e.g. fast or slow, aka performance) - they are essentially execution and implementation details. They change often in exploration or development stages, but rarely in well-defined, long-running pipelines. They should always have default values and may be omitted from a query. They can be part of a table DDL/definition, but should also be replaceable in a query - *this is what table "hints" in FLIP-113 should cover*.


- type 3: Semantic params, like kafka connector's start offset.

Such params affect query results - the semantics. They are better expressed as filter conditions in the WHERE clause that can be pushed down. They change almost every time a query starts and have nothing to do with metadata, thus should not be part of the table definition/DDL, nor be persisted in catalogs. If they must be, users should create views to keep such params around (note this is different from variable substitution).
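As a rough sketch of this three-way categorization (the concrete property keys and the hintable flags below are my own illustration, not a proposed Flink API):

```python
# Illustrative classification of connector params into the 3 types
# described above. The keys and flags are examples only.
PARAM_TYPES = {
    "connector.url":        {"type": 1, "hintable": False},  # metadata
    "connector.username":   {"type": 1, "hintable": False},  # metadata
    "connector.fetch-size": {"type": 2, "hintable": True},   # runtime
    "connector.offset":     {"type": 3, "hintable": False},  # semantic
}

def may_hint(key):
    """Only type-2 (runtime) params should be replaceable via a hint."""
    return PARAM_TYPES.get(key, {}).get("hintable", False)

assert may_hint("connector.fetch-size")
assert not may_hint("connector.offset")
```

Under this scheme a hint could touch only the middle row; metadata stays in the catalog and semantic params move into the query itself.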


Take Flink-Kafka as an example. Once we get these params right, here are the steps users need to take to develop and run a Flink job:
- configure a Flink ConfluentSchemaRegistry with url, username, and password
- run "SELECT * FROM mykafka WHERE offset > 12pm yesterday" (simplified timestamp) in SQL CLI; Flink automatically retrieves all metadata of schema, file format, etc. and starts the job
- users want the job to read the Kafka topic faster, so the query becomes "SELECT * FROM mykafka /* faster_read_key=value*/ WHERE offset > 12pm yesterday"
- done and satisfied, users submit it to production


Regarding "CREATE TABLE t LIKE with (k1=v1, k2=v2), I think it's a
nice-to-have feature, but not a strategically critical, long-term solution,
because
1) It may seem promising at the current stage to solve the
too-much-manual-work problem, but that's only because Flink hasn't
leveraged catalogs well and handled the 3 types of params above properly.
Once we get the params types right, the LIKE syntax won't be that
important, and will be just an easier way to create tables without retyping
long fields like username and pwd.
2) Note that only some rare type of catalog can store k-v property pair, so
table created this way often cannot be persisted. In the foreseeable
future, such catalog will only be HiveCatalog, and not everyone has a Hive
metastore. To be honest, without persistence, recreating tables every time
this way is still a lot of keyboard typing.

Cheers,
Bowen

On Tue, Mar 10, 2020 at 8:07 PM Kurt Young <ykt...@gmail.com> wrote:

If a specific connector wants to have such a parameter and read it out of the configuration, then that's fine. If we are talking about a configuration for all kinds of sources, I would be super careful about that. It's true it can solve maybe 80% of cases, but it will also make the remaining 20% feel weird.

Best,
Kurt


On Wed, Mar 11, 2020 at 11:00 AM Jark Wu <imj...@gmail.com> wrote:

Hi Kurt,

#3 Regarding the global offset:
I'm not saying the planner should use the global configuration to override connector properties. Rather, the connector should take this configuration and translate it into its client API. AFAIK, almost all message queues support earliest, latest, and a timestamp value as the start point. So we can support 3 options for this configuration: "earliest", "latest", and a timestamp string value. Of course, this can't solve 100% of cases, but I guess it can solve 80% or 90%. The remaining cases can be resolved by the LIKE syntax, and I guess they are not very common.

Best,
Jark


On Wed, 11 Mar 2020 at 10:33, Kurt Young <ykt...@gmail.com> wrote:

Good to have such lovely discussions. I also want to share some of my
opinions.

#1 Regarding error handling: I also think ignoring invalid hints would be dangerous; maybe the simplest solution is just to throw an exception.

#2 Regarding property replacement: I don't think we should constrain ourselves to the meaning of the word "hint" and forbid it from modifying any properties that can affect query results. IMO `PROPERTIES` is one of the table hints, and a powerful one: it can modify properties located in the DDL's WITH block. But I also see the harm if we make it too flexible, like changing the Kafka topic name with a hint. Such a use case is not common and sounds very dangerous to me. I would propose that we have a map of hintable properties for each connector and validate that all passed-in properties are actually hintable. Combined with the #1 error handling, we can throw an exception once we receive an invalid property.
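Kurt's combination of a per-connector map of hintable properties with fail-fast error handling could look roughly like this (connector names, property keys, and the helper are hypothetical illustrations, not actual Flink code):

```python
# Sketch of the proposal above: each connector declares which of its
# properties may be overridden by a hint; any other key in a hint
# raises immediately instead of being silently ignored.
HINTABLE = {
    "kafka": {"fetch.max.bytes", "scan.startup.mode"},
    "jdbc": {"fetch-size"},
}

def apply_hints(connector, props, hints):
    allowed = HINTABLE.get(connector, set())
    for key in hints:
        if key not in allowed:
            raise ValueError(
                f"'{key}' is not a hintable property of '{connector}'")
    merged = dict(props)
    merged.update(hints)  # hints win only for whitelisted keys
    return merged

merged = apply_hints("jdbc", {"url": "jdbc:h2:mem:test"},
                     {"fetch-size": "500"})
# merged now carries the overridden fetch-size
```

Trying to change the Kafka topic through a hint would hit the ValueError branch, which is exactly the dangerous case described above.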

#3 Regarding the global offset: I'm not sure it's feasible. Different connectors will have totally different properties to represent the offset: some might be timestamps, some might be string literals like "earliest", and others might be just integers.

Best,
Kurt


On Tue, Mar 10, 2020 at 11:46 PM Jark Wu <imj...@gmail.com> wrote:

Hi everyone,

I want to jump into the discussion about the "dynamic start offset" problem.
First of all, I share the same concern as Timo and Fabian: the "start offset" affects the query semantics, i.e. the query result, while "hints" are just used for optimization and should not affect the result.

I think the "dynamic start offset" is a very important usability problem that many streaming platforms will face. I also agree that "CREATE TEMPORARY TABLE Temp (LIKE t) WITH ('connector.startup-timestamp-millis' = '1578538374471')" is verbose; what if we have 10 tables to join?

However, what I want to propose (it should be another thread) is a global configuration to reset the start offsets of all source connectors in the query session, e.g. "table.sources.start-offset". This is possible now because `TableSourceFactory.Context` has a `getConfiguration` method to get the session configuration and use it to create an adapted TableSource. Then we can also expose this in the SQL CLI via the SET command, e.g. `SET 'table.sources.start-offset'='earliest';`, which is pretty simple and straightforward.

This is very similar to KSQL's `SET 'auto.offset.reset'='earliest'`, which is very helpful IMO.
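A rough sketch of the resolution order this implies (the key names are borrowed from the mails above; the helper itself is hypothetical, not Flink code):

```python
# Sketch of the idea above: a source factory resolves its start offset
# from the table's own properties first, then from the session-level
# configuration key, then from a hard default.
def resolve_start_offset(table_props, session_conf):
    return table_props.get(
        "connector.startup-timestamp-millis",
        session_conf.get("table.sources.start-offset", "latest"),
    )

session = {"table.sources.start-offset": "earliest"}
assert resolve_start_offset({}, session) == "earliest"
assert resolve_start_offset(
    {"connector.startup-timestamp-millis": "1578538374471"}, session
) == "1578538374471"
```

The table's own DDL still wins when it explicitly sets an offset, so the session setting only fills the gap.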

Best,
Jark


On Tue, 10 Mar 2020 at 22:29, Timo Walther <twal...@apache.org>
wrote:

Hi Danny,

compared to the hints, FLIP-110 is fully compliant with the SQL standard.

I don't think that `CREATE TEMPORARY TABLE Temp (LIKE t) WITH (k=v)` is too verbose or awkward for the power of basically changing the entire connector. Usually, this statement would just precede the query in a multiline file, so it can be changed "in-place" like the hints you proposed.

Many companies have a well-defined set of tables that should be used. It would be dangerous if users could change the path or topic in a hint. The catalog/catalog manager should be the entity that controls which tables exist and how they can be accessed.

"what's the problem there if we use the table hints to support "start offset"?"

IMHO it violates the meaning of a hint. According to the dictionary, a hint is "a statement that expresses indirectly what one prefers not to say explicitly". But offsets are a property that is very explicit.

If we go with the hint approach, the TableSourceFactory should be able to express which properties are supported for hinting. Or do you plan to offer those hints in a separate Map<String, String> that cannot overwrite existing properties? That would be a different story...

Regards,
Timo


On 10.03.20 10:34, Danny Chan wrote:
Thanks Timo ~

Personally I would say that "offset > 0" and "start offset = 10" do not have the same semantics, so from the SQL aspect we cannot implement a "starting offset" hint for a query with such a syntax.

And the CREATE TABLE LIKE syntax is a DDL which is just too verbose for defining such dynamic parameters, even if it could do that. Shall we force users to define a temporary table for each query with dynamic params? I would say that's an awkward solution.

"Hints should give "hints" but not affect the actual produced result."
You mentioned that multiple times; could we give a reason? What's the problem if we use table hints to support "start offset"? From my side I see some benefits:

• It's very convenient to set up these parameters; the syntax is very much like the DDL definition
• Its scope is very clear: right on the table it is attached to
• It does not affect the table schema, which means that in order to specify the offset there is no need to define an offset column (which is weird, actually: an offset should never be a column, it's more like metadata or a start option)

So in total, FLIP-110 uses the offset more like a Hive partition prune; we could do that if we had an offset column, but in most cases we do not define one, so there is actually no conflict or overlap.

Best,
Danny Chan
On Mar 10, 2020 at 4:28 PM +0800, Timo Walther <twal...@apache.org> wrote:
Hi Danny,

shouldn't FLIP-110 [1] solve most of the problems we have around defining table properties more dynamically without manual schema work? Offset definition is also easier with such a syntax. The tables must not be defined in a catalog but could be temporary tables that extend from the original table.

In general, we should aim to keep the syntax concise and not provide too many ways of doing the same thing. Hints should give "hints" but not affect the actual produced result.

Some connector properties might also change the plan or schema in the future. E.g. they might define whether a table source supports certain push-downs (e.g. predicate push-down).

Dawid is currently working on a draft that might make it possible to expose a Kafka offset via the schema, such that `SELECT * FROM Topic WHERE offset > 10` would become possible and could be pushed down. But this is, of course, not planned initially.

Regards,
Timo


[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-110%3A+Support+LIKE+clause+in+CREATE+TABLE



On 10.03.20 08:34, Danny Chan wrote:
Thanks Wenlong ~

For PROPERTIES Hint Error Handling

Actually we have no way to figure out whether an error-prone hint is a PROPERTIES hint. For example, if a user writes a hint like 'PROPERTIAS', we do not know whether it was meant to be a PROPERTIES hint; all we know is that the hint name was not registered in Flink.

If the user writes the hint name correctly (i.e. PROPERTIES), we can indeed enforce validation of the hint options through the pluggable HintOptionChecker.
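The behavior described here can be sketched with a toy registry (`HintOptionChecker` is the real pluggable interface being referenced, but the Python below is only an illustration with made-up names):

```python
# Toy sketch: a misspelled hint name ('PROPERTIAS') is simply unknown
# to the registry, so it cannot be validated as a PROPERTIES hint; a
# correctly named hint is dispatched to its pluggable option checker.
def check_kv_options(options):
    # PROPERTIES expects a string value for every key
    return all(isinstance(v, str) for v in options.values())

HINT_CHECKERS = {"PROPERTIES": check_kv_options}

def validate_hint(name, options):
    checker = HINT_CHECKERS.get(name)
    if checker is None:
        return "unknown hint"            # e.g. the 'PROPERTIAS' typo
    return "ok" if checker(options) else "invalid options"

assert validate_hint("PROPERTIAS", {"offset": "123"}) == "unknown hint"
assert validate_hint("PROPERTIES", {"offset": "123"}) == "ok"
```

This is why the typo case can only be reported as "unregistered hint name" rather than as an invalid PROPERTIES hint.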

For PROPERTIES Hint Option Format

For a key-value style hint option, the key can be either a simple identifier or a string literal, which means it is compatible with our DDL syntax. We support simple identifiers because many other hints do not have compound keys as complex as the table properties, and we want to unify the parse block.

Best,
Danny Chan
On Mar 10, 2020 at 3:19 PM +0800, wenlong.lwl <wenlong88....@gmail.com> wrote:
Hi Danny, thanks for the proposal. +1 for adding table hints; it is really a necessary feature for Flink SQL to integrate with a catalog.

For error handling, I think it would be more natural to throw an exception when an erroneous table hint is provided, because the properties in the hint will be merged and used to find the table factory, which would cause an exception anyway when wrong properties are provided, right? On the other hand, unlike other hints, which just affect the way the query is executed, the property table hint actually affects the result of the query, so we should never ignore the given property hints.

For the format of property hints: currently, in the SQL client, we accept properties only as strings in DDL, e.g. 'connector.type'='kafka'. I think the format of properties in a hint should be the same as the format we defined in DDL. What do you think?

Bests,
Wenlong Lyu

On Tue, 10 Mar 2020 at 14:22, Danny Chan <yuzhao....@gmail.com> wrote:

To Weike: About the Error Handling

To be consistent with other SQL vendors, the default is to log warnings, and if there is any error (invalid hint name or options), the hint is just ignored. I have already addressed this in the wiki.

To Timo: About the PROPERTIES Table Hint

• The PROPERTIES hint is also optional: a user can pass in options to override the table properties, but this does not mean it is required.
• They should not include semantics: do the properties belong to the semantics? I don't think so; the plan does not change, right? The result set may be affected, but there are already some hints that do so, for example MS-SQL's MAXRECURSION and SNAPSHOT hints [1]
• `SELECT * FROM t(k=v, k=v)`: this grammar breaks the SQL standard, compared to the hints way (which is enclosed in comments)
• I actually didn't find any vendor that supports such a grammar, and there is no way to override table-level properties dynamically. For a normal RDBMS, I think there are no requests for such dynamic parameters, because all the tables share the same storage and computation and they are almost all batch tables.
• Flink, however, as a computation engine has many connectors. Especially for a message queue like Kafka, we would have a start_offset that is different each time we start the query. Such parameters cannot be persisted to the catalog because they are not static; this is actually the background against which we propose table hints to indicate such properties dynamically.


To Jark and Jinsong: I have removed the query hints part and changed the title.

[1]
https://docs.microsoft.com/en-us/sql/t-sql/queries/hints-transact-sql-query?view=sql-server-ver15

Best,
Danny Chan
On Mar 9, 2020 at 5:46 PM +0800, Timo Walther <twal...@apache.org> wrote:
Hi Danny,

thanks for the proposal. I agree with Jark and Jingsong: planner hints and table hints are orthogonal topics that should be discussed separately.

I share Jingsong's opinion that we should not use planner hints for passing connector properties. Planner hints should be optional at any time. They should not include semantics but only affect execution time. Connector properties are an important part of the query itself.

Have you thought about options such as `SELECT * FROM t(k=v, k=v)`? How do other vendors deal with this problem?

Regards,
Timo


On 09.03.20 10:37, Jingsong Li wrote:
Hi Danny, +1 for table hints, thanks for driving this.

I took a look at the FLIP; most of the content is about query hints, which makes it hard to discuss and vote on. So +1 to splitting it as Jark said.

Another thing is which configuration is suitable for table hints. "connector.path" and "connector.topic": are they really suitable for table hints? That looks weird to me, because I think these properties are the core of the table.

Best,
Jingsong Lee

On Mon, Mar 9, 2020 at 5:30 PM Jark Wu <imj...@gmail.com> wrote:

Thanks Danny for starting the discussion.
+1 for this feature.

If we just focus on table hints, not query hints, in this release, could you split the FLIP into two FLIPs? It's hard to vote on only part of a FLIP. You can keep the table hints proposal in FLIP-113 and move the query hints into another FLIP, so that we can focus on the table hints there.

Thanks,
Jark



On Mon, 9 Mar 2020 at 17:14, DONG, Weike <kyled...@connect.hku.hk> wrote:

Hi Danny,

This is a nice feature, +1.

One thing I am interested in but that is not mentioned in the proposal is error handling. It is quite common for users to write inappropriate hints in SQL code; if illegal or "bad" hints are given, would the system simply ignore them or throw exceptions?

Thanks : )

Best,
Weike

On Mon, Mar 9, 2020 at 5:02 PM Danny Chan <yuzhao....@gmail.com> wrote:

Note:
we only plan to support table hints in Flink release 1.11, so please focus mainly on the table hints part and just ignore the planner hints. Sorry for that mistake ~

Best,
Danny Chan
On Mar 9, 2020 at 4:36 PM +0800, Danny Chan <yuzhao....@gmail.com> wrote:
Hi, fellows ~

I would like to propose support for SQL hints in our Flink SQL.

We would support the following hint syntax:

select /*+ NO_HASH_JOIN, RESOURCE(mem='128mb', parallelism='24') */ *
from
emp /*+ INDEX(idx1, idx2) */
join
dept /*+ PROPERTIES(k1='v1', k2='v2') */
on
emp.deptno = dept.deptno

Basically we would support both query hints (after the SELECT keyword) and table hints (after the referenced table name). For 1.11 we plan to only support table hints, with a hint probably named PROPERTIES:

table_name /*+ PROPERTIES(k1='v1', k2='v2') */

I am looking forward to your comments.

You can access the FLIP here:
https://cwiki.apache.org/confluence/display/FLINK/FLIP-113%3A+SQL+and+Planner+Hints

Best,
Danny Chan



















