Re: [DISCUSS] FLIP-107: Reading table columns from different parts of source records

Leonard Xu Thu, 10 Sep 2020 06:02:09 -0700

Hi, Timo

Thanks for the update.
I have a minor suggestion about the Debezium metadata keys:
could we use the original Debezium keys rather than introducing new ones?


debezium-json.schema               =>  debezium-json.schema
debezium-json.ingestion-timestamp  =>  debezium-json.ts_ms
debezium-json.source.database      =>  debezium-json.source.db
debezium-json.source.schema        =>  debezium-json.source.schema
debezium-json.source.table         =>  debezium-json.source.table
debezium-json.source.timestamp     =>  debezium-json.source.ts_ms
debezium-json.source.properties    =>  debezium-json.source  MAP<STRING, STRING>
 
Users who are familiar with Debezium will understand the keys more easily, and
the key syntax is more JSON-path-like. What do you think?
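
To illustrate the json-path idea, here is a rough Python sketch. It is not Flink code: `RENAMES`, `resolve`, and the sample event are names and data I made up for illustration, and the event assumes a Debezium payload without the embedded schema part (so the `debezium-json.schema` key is left out).

```python
from typing import Any, Dict

PREFIX = "debezium-json."

# Proposed FLIP key -> key that follows the original Debezium field path.
RENAMES: Dict[str, str] = {
    "debezium-json.ingestion-timestamp": "debezium-json.ts_ms",
    "debezium-json.source.database": "debezium-json.source.db",
    "debezium-json.source.schema": "debezium-json.source.schema",
    "debezium-json.source.table": "debezium-json.source.table",
    "debezium-json.source.timestamp": "debezium-json.source.ts_ms",
    "debezium-json.source.properties": "debezium-json.source",
}


def resolve(event: Dict[str, Any], key: str) -> Any:
    """Walk the dotted path that follows the format prefix through the event."""
    if not key.startswith(PREFIX):
        raise ValueError(f"not a debezium-json key: {key}")
    node: Any = event
    for part in key[len(PREFIX):].split("."):
        node = node[part]  # raises KeyError if the field is absent
    return node


# Hypothetical change event (payload only, schema part omitted).
event = {
    "before": None,
    "after": {"id": 1, "name": "ABC"},
    "source": {"db": "inventory", "schema": "public",
               "table": "customers", "ts_ms": 1599133672000},
    "op": "c",
    "ts_ms": 1599133672123,
}

ingestion_ts = resolve(event, RENAMES["debezium-json.ingestion-timestamp"])
source_table = resolve(event, "debezium-json.source.table")
```

With the original Debezium names, the part after the prefix is exactly the path into the record, so no per-key lookup table would be needed on the reading side.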


The other part looks really good to me.


Regards,
Leonard


> On Sep 10, 2020, at 18:26, Aljoscha Krettek <[email protected]> wrote:
> 
> I've only been watching this from the sidelines but that latest proposal 
> looks very good to me!
> 
> Aljoscha
> 
> On 10.09.20 12:20, Kurt Young wrote:
>> The new syntax looks good to me.
>> Best,
>> Kurt
>> On Thu, Sep 10, 2020 at 5:57 PM Jark Wu <[email protected]> wrote:
>>> Hi Timo,
>>> 
>>> I have one minor suggestion.
>>> Maybe the default data type of `timestamp` could be `TIMESTAMP(3) WITH
>>> LOCAL TIME ZONE`, because this is the type that users want to use; this
>>> avoids unnecessary casting.
>>> Besides, a BIGINT is currently cast to a timestamp in seconds, so the
>>> implicit cast may not work...
>>> 
>>> I don't have other objections. But maybe we should wait for the
>>> opinion from @Kurt for the new syntax.
>>> 
>>> Best,
>>> Jark
>>> 
>>> 
>>> On Thu, 10 Sep 2020 at 16:21, Danny Chan <[email protected]> wrote:
>>> 
>>>> Thanks for driving this Timo, +1 for voting ~
>>>> 
>>>> Best,
>>>> Danny Chan
>>>> On Sep 10, 2020, at 3:47 PM +0800, Timo Walther <[email protected]> wrote:
>>>>> Thanks everyone for this healthy discussion. I updated the FLIP with the
>>>>> outcome. I think the result is very powerful but also very easy to
>>>>> declare. Thanks for all the contributions.
>>>>> 
>>>>> If there are no objections, I would continue with a vote.
>>>>> 
>>>>> What do you think?
>>>>> 
>>>>> Regards,
>>>>> Timo
>>>>> 
>>>>> 
>>>>> On 09.09.20 16:52, Timo Walther wrote:
>>>>>> "If virtual by default, when a user types "timestamp int" ==> persisted
>>>>>> column, then adds a "metadata" after that ==> virtual column, then adds
>>>>>> a "persisted" after that ==> persisted column."
>>>>>> 
>>>>>> Thanks for this nice mental model explanation, Jark. This makes total
>>>>>> sense to me. Also, making the most common case as short as just
>>>>>> adding `METADATA` is a very good idea. Thanks, Danny!
>>>>>> 
>>>>>> Let me update the FLIP again with all these ideas.
>>>>>> 
>>>>>> Regards,
>>>>>> Timo
>>>>>> 
>>>>>> 
>>>>>> On 09.09.20 15:03, Jark Wu wrote:
>>>>>>> I'm also +1 to Danny's proposal: timestamp INT METADATA [FROM
>>>>>>> 'my-timestamp-field'] [VIRTUAL]
>>>>>>> I especially like the shortcut: timestamp INT METADATA. This supports
>>>>>>> the most common case in the simplest way.
>>>>>>>
>>>>>>> I also think the default should be "PERSISTED", so VIRTUAL is optional
>>>>>>> when you are accessing read-only metadata. Because:
>>>>>>> 1. "timestamp INT METADATA" should be a normal column, because
>>>>>>> "METADATA" is just a modifier to indicate that it comes from metadata,
>>>>>>> and a normal column should be persisted.
>>>>>>>      If virtual by default, when a user types "timestamp int" ==>
>>>>>>> persisted column, then adds a "metadata" after that ==> virtual column,
>>>>>>> then adds a "persisted" after that ==> persisted column.
>>>>>>>      I think this flips back and forth several times and confuses users.
>>>>>>> Physical fields are also prefixed with "fieldName TYPE", so it is very
>>>>>>> straightforward that "timestamp INT METADATA" is persisted.
>>>>>>> 2. From the collected user question [1], we can see that "timestamp" is
>>>>>>> the most common use case. "timestamp" is read-write metadata. Persisted
>>>>>>> by default doesn't break the reading behavior.
>>>>>>> 
>>>>>>> Best,
>>>>>>> Jark
>>>>>>> 
>>>>>>> [1]: https://issues.apache.org/jira/browse/FLINK-15869
>>>>>>> 
>>>>>>> On Wed, 9 Sep 2020 at 20:56, Leonard Xu <[email protected]> wrote:
>>>>>>> 
>>>>>>>> Thanks @Dawid for the nice summary. I think you captured all the
>>>>>>>> opinions of the long discussion well.
>>>>>>>> 
>>>>>>>> @Danny
>>>>>>>> “timestamp INT METADATA [FROM 'my-timestamp-field'] [VIRTUAL]
>>>>>>>>   Note that the "FROM 'field name'" is only needed when the name
>>>>>>>>   conflicts with the declared table column name; when there are no
>>>>>>>>   conflicts, we can simplify it to
>>>>>>>>        timestamp INT METADATA"
>>>>>>>> 
>>>>>>>> I really like the proposal: there is no confusion with computed
>>>>>>>> columns any more, and it’s concise enough.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> @Timo @Dawid
>>>>>>>> “We use `SYSTEM_TIME` for temporal tables. I think prefixing with
>>>>>>>> SYSTEM makes it clearer that it comes magically from the system.”
>>>>>>>> “As for the issue of shortening the SYSTEM_METADATA to METADATA.
>>>>>>>> Here I very much prefer the SYSTEM_ prefix.”
>>>>>>>> 
>>>>>>>> I think `SYSTEM_TIME` differs a lot from `SYSTEM_METADATA`.
>>>>>>>> First of all, the word `TIME` has broad meanings, but the word
>>>>>>>> `METADATA` does not; `METADATA` has a specific meaning.
>>>>>>>> Secondly, `FOR SYSTEM_TIME AS OF` exists in the SQL standard, but
>>>>>>>> `SYSTEM_METADATA` does not.
>>>>>>>> Personally, I prefer the simpler way; sometimes less is more.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Best,
>>>>>>>> Leonard
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Wed, Sep 9, 2020 at 6:41 PM Timo Walther <[email protected]> wrote:
>>>>>>>>> 
>>>>>>>>>> Hi everyone,
>>>>>>>>>> 
>>>>>>>>>> "key" and "value" in the properties are a special case because they
>>>>>>>>>> need to configure a format. So key and value are more than just
>>>>>>>>>> metadata. Jark's example for setting a timestamp would work but as the
>>>>>>>>>> FLIP discusses, we have way more metadata fields like headers,
>>>>>>>>>> epoch-leader, etc. Having a property for all of this metadata would
>>>>>>>>>> mess up the WITH section entirely. Furthermore, we also want to deal
>>>>>>>>>> with metadata from the formats. Solving this through properties as
>>>>>>>>>> well would further complicate the property design.
>>>>>>>>>> 
>>>>>>>>>> Personally, I still like the computed column design more because it
>>>>>>>>>> allows full flexibility to compute the final column:
>>>>>>>>>>
>>>>>>>>>> timestamp AS adjustTimestamp(CAST(SYSTEM_METADATA("ts") AS TIMESTAMP(3)))
>>>>>>>>>>
>>>>>>>>>> Instead of having a helper column and a real column in the table:
>>>>>>>>>> 
>>>>>>>>>> helperTimestamp AS CAST(SYSTEM_METADATA("ts") AS TIMESTAMP(3))
>>>>>>>>>> realTimestamp AS adjustTimestamp(helperTimestamp)
>>>>>>>>>> 
>>>>>>>>>> But I see that the discussion leans towards:
>>>>>>>>>> 
>>>>>>>>>> timestamp INT SYSTEM_METADATA("ts")
>>>>>>>>>> 
>>>>>>>>>> Which is fine with me. It is the shortest solution, because we don't
>>>>>>>>>> need an additional CAST. We can discuss the syntax, so that confusion
>>>>>>>>>> with computed columns can be avoided.
>>>>>>>>>> 
>>>>>>>>>> timestamp INT USING SYSTEM_METADATA("ts")
>>>>>>>>>> timestamp INT FROM SYSTEM_METADATA("ts")
>>>>>>>>>> timestamp INT FROM SYSTEM_METADATA("ts") PERSISTED
>>>>>>>>>> 
>>>>>>>>>> We use `SYSTEM_TIME` for temporal tables. I think prefixing with
>>>>>>>>>> SYSTEM makes it clearer that it comes magically from the system.
>>>>>>>>>> 
>>>>>>>>>> What do you think?
>>>>>>>>>> 
>>>>>>>>>> Regards,
>>>>>>>>>> Timo
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On 09.09.20 11:41, Jark Wu wrote:
>>>>>>>>>>> Hi Danny,
>>>>>>>>>>> 
>>>>>>>>>>> This is not Oracle and MySQL computed column syntax, because there is
>>>>>>>>>>> no "AS" after the type.
>>>>>>>>>>>
>>>>>>>>>>> Hi everyone,
>>>>>>>>>>>
>>>>>>>>>>> If we want to use "offset INT SYSTEM_METADATA("offset")", then I think
>>>>>>>>>>> we must further discuss the "PERSISTED" or "VIRTUAL" keyword for the
>>>>>>>>>>> query-sink schema problem.
>>>>>>>>>>> Personally, I think we can use a shorter keyword "METADATA" for
>>>>>>>>>>> "SYSTEM_METADATA", because "SYSTEM_METADATA" sounds like a system
>>>>>>>>>>> function and may confuse users into thinking this is a computed column.
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> Best,
>>>>>>>>>>> Jark
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On Wed, 9 Sep 2020 at 17:23, Danny Chan <[email protected]> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> "offset INT SYSTEM_METADATA("offset")"
>>>>>>>>>>>> 
>>>>>>>>>>>> This is actually Oracle or MySQL style computed column syntax.
>>>>>>>>>>>>
>>>>>>>>>>>> "You are right that one could argue that "timestamp", "headers" are
>>>>>>>>>>>> something like "key" and "value""
>>>>>>>>>>>>
>>>>>>>>>>>> I have the same feeling: key, value, headers, and timestamp are all
>>>>>>>>>>>> *real* data stored in the consumed record; they are not computed or
>>>>>>>>>>>> generated.
>>>>>>>>>>>>
>>>>>>>>>>>> "Trying to solve everything via properties sounds rather like a hack
>>>>>>>>>>>> to me"
>>>>>>>>>>>>
>>>>>>>>>>>> Things are not that hacky if we can unify the routines or the
>>>>>>>>>>>> definitions (all via computed columns or all via table options). I
>>>>>>>>>>>> also think it is hacky to mix two kinds of syntax for different kinds
>>>>>>>>>>>> of metadata (read-only and read-write). In this FLIP, we declare the
>>>>>>>>>>>> Kafka key fields with table options but use SYSTEM_METADATA for other
>>>>>>>>>>>> metadata; that is hacky, or at least inconsistent.
>>>>>>>>>>>> 
>>>>>>>>>>>> On Wed, Sep 9, 2020 at 4:48 PM Kurt Young <[email protected]> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> I would vote for `offset INT SYSTEM_METADATA("offset")`.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I don't think we can stick with the SQL standard in the DDL part
>>>>>>>>>>>>> forever, especially as there are more and more requirements coming
>>>>>>>>>>>>> from different connectors and external systems.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Best,
>>>>>>>>>>>>> Kurt
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Wed, Sep 9, 2020 at 4:40 PM Timo Walther <[email protected]> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Hi Jark,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> now we are back at the original design proposed by Dawid :D
>>>>>>>>>>>>>> Yes, we should be cautious about adding new syntax. But the length
>>>>>>>>>>>>>> of this discussion shows that we are looking for a good long-term
>>>>>>>>>>>>>> solution. In this case I would rather vote for a deep integration
>>>>>>>>>>>>>> into the syntax.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Computed columns are also not SQL standard compliant. And our DDL
>>>>>>>>>>>>>> is neither, so we have some degree of freedom here.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Trying to solve everything via properties sounds rather like a hack
>>>>>>>>>>>>>> to me. You are right that one could argue that "timestamp",
>>>>>>>>>>>>>> "headers" are something like "key" and "value". However, mixing
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> `offset AS SYSTEM_METADATA("offset")`
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> and
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> `'timestamp.field' = 'ts'`
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> looks more confusing to users than an explicit
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> `offset AS CAST(SYSTEM_METADATA("offset") AS INT)`
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> or
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> `offset INT SYSTEM_METADATA("offset")`
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> that is symmetric for both source and sink.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> What do others think?
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>> Timo
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On 09.09.20 10:09, Jark Wu wrote:
>>>>>>>>>>>>>>> Hi everyone,
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> I think we have a conclusion that the writable metadata shouldn't
>>>>>>>>>>>>>>> be defined as a computed column, but a normal column.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> "timestamp STRING SYSTEM_METADATA('timestamp')" is one of the
>>>>>>>>>>>>>>> approaches. However, it is not SQL standard compliant, so we need
>>>>>>>>>>>>>>> to be cautious when adding new syntax.
>>>>>>>>>>>>>>> Besides, we have to introduce the `PERSISTED` or `VIRTUAL` keyword
>>>>>>>>>>>>>>> to resolve the query-sink schema problem if it is read-only
>>>>>>>>>>>>>>> metadata. That adds more stuff to learn for users.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> From my point of view, the "timestamp", "headers" are something
>>>>>>>>>>>>>>> like "key" and "value" that are stored with the real data. So why
>>>>>>>>>>>>>>> not define the "timestamp" in the same way as "key" by using a
>>>>>>>>>>>>>>> "timestamp.field" connector option?
>>>>>>>>>>>>>>> On the other side, the read-only metadata, such as "offset",
>>>>>>>>>>>>>>> shouldn't be defined as a normal column. So why not use the
>>>>>>>>>>>>>>> existing computed column syntax for such metadata? Then we don't
>>>>>>>>>>>>>>> have the query-sink schema problem.
>>>>>>>>>>>>>>> So here is my proposal:
>>>>>>>>>>>>>>> So here is my proposal:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> CREATE TABLE kafka_table (
>>>>>>>>>>>>>>>     id BIGINT,
>>>>>>>>>>>>>>>     name STRING,
>>>>>>>>>>>>>>>     col1 STRING,
>>>>>>>>>>>>>>>     col2 STRING,
>>>>>>>>>>>>>>>     -- ts is a normal field, so it can be read and written.
>>>>>>>>>>>>>>>     ts TIMESTAMP(3) WITH LOCAL TIME ZONE,
>>>>>>>>>>>>>>>     offset AS SYSTEM_METADATA("offset")
>>>>>>>>>>>>>>> ) WITH (
>>>>>>>>>>>>>>>     'connector' = 'kafka',
>>>>>>>>>>>>>>>     'topic' = 'test-topic',
>>>>>>>>>>>>>>>     'key.fields' = 'id, name',
>>>>>>>>>>>>>>>     'key.format' = 'csv',
>>>>>>>>>>>>>>>     'value.format' = 'avro',
>>>>>>>>>>>>>>>     -- define the mapping of the Kafka timestamp
>>>>>>>>>>>>>>>     'timestamp.field' = 'ts'
>>>>>>>>>>>>>>> );
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> INSERT INTO kafka_table
>>>>>>>>>>>>>>> SELECT id, name, col1, col2, rowtime FROM another_table;
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> I think this can solve all the problems without introducing any new
>>>>>>>>>>>>>>> syntax.
>>>>>>>>>>>>>>> The only minor disadvantage is that we separate the definition
>>>>>>>>>>>>>>> way/syntax of read-only metadata and read-write fields.
>>>>>>>>>>>>>>> However, I don't think this is a big problem.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>> Jark
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Wed, 9 Sep 2020 at 15:09, Timo Walther <[email protected]> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Hi Kurt,
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> thanks for sharing your opinion. I'm totally up for not reusing
>>>>>>>>>>>>>>>> computed columns. I think Jark was a big supporter of this syntax,
>>>>>>>>>>>>>>>> @Jark are you fine with this as well? The non-computed column
>>>>>>>>>>>>>>>> approach was only a "slightly rejected alternative".
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Furthermore, we would need to think about how such a new design
>>>>>>>>>>>>>>>> influences the LIKE clause though.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> However, we should still keep the `PERSISTED` keyword as it
>>>>>>>>>>>>>>>> influences the query->sink schema. If you look at the list of
>>>>>>>>>>>>>>>> metadata for existing connectors and formats, we currently offer
>>>>>>>>>>>>>>>> only two writable metadata fields. Otherwise, one would need to
>>>>>>>>>>>>>>>> declare two tables whenever a metadata column is read (one for the
>>>>>>>>>>>>>>>> source, one for the sink). This can be quite inconvenient e.g. for
>>>>>>>>>>>>>>>> just reading the topic.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>> Timo
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On 09.09.20 08:52, Kurt Young wrote:
>>>>>>>>>>>>>>>>> I also share the concern that reusing the computed column syntax
>>>>>>>>>>>>>>>>> but with different semantics would confuse users a lot.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Besides, I think metadata fields are conceptually not the same as
>>>>>>>>>>>>>>>>> computed columns. The metadata field is a connector-specific thing
>>>>>>>>>>>>>>>>> and it only contains the information about where the field comes
>>>>>>>>>>>>>>>>> from (during source) or where the field needs to be written to
>>>>>>>>>>>>>>>>> (during sink). It's more similar to normal fields, with the
>>>>>>>>>>>>>>>>> assumption that all these fields need to go to the data part.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thus I lean more towards the rejected alternative that Timo
>>>>>>>>>>>>>>>>> mentioned. And I think we don't need the PERSISTED keyword;
>>>>>>>>>>>>>>>>> SYSTEM_METADATA should be enough.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> During implementation, the framework only needs to pass such
>>>>>>>>>>>>>>>>> <field, metadata field> information to the connector, and the
>>>>>>>>>>>>>>>>> logic of handling such fields inside the connector should be
>>>>>>>>>>>>>>>>> straightforward.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Regarding the downside Timo mentioned:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> The disadvantage is that users cannot call UDFs or parse
>>>>>>>>>>>>>>>>>> timestamps.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I think this is fairly simple to solve. Since the metadata field
>>>>>>>>>>>>>>>>> isn't a computed column anymore, we can support referencing such
>>>>>>>>>>>>>>>>> fields in the computed column. For example:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> CREATE TABLE kafka_table (
>>>>>>>>>>>>>>>>>         id BIGINT,
>>>>>>>>>>>>>>>>>         name STRING,
>>>>>>>>>>>>>>>>>         -- get the timestamp field from metadata
>>>>>>>>>>>>>>>>>         timestamp STRING SYSTEM_METADATA("timestamp"),
>>>>>>>>>>>>>>>>>         -- normal computed column: parse the string to TIMESTAMP
>>>>>>>>>>>>>>>>>         -- type by using the metadata field
>>>>>>>>>>>>>>>>>         ts AS to_timestamp(timestamp)
>>>>>>>>>>>>>>>>> ) WITH (
>>>>>>>>>>>>>>>>>        ...
>>>>>>>>>>>>>>>>> )
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>> Kurt
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On Tue, Sep 8, 2020 at 11:57 PM Timo Walther <[email protected]> wrote:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Hi Leonard,
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> the only alternative I see is that we introduce a concept that is
>>>>>>>>>>>>>>>>>> completely different to computed columns. This is also mentioned
>>>>>>>>>>>>>>>>>> in the rejected alternative section of the FLIP. Something like:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> CREATE TABLE kafka_table (
>>>>>>>>>>>>>>>>>>         id BIGINT,
>>>>>>>>>>>>>>>>>>         name STRING,
>>>>>>>>>>>>>>>>>>         timestamp INT SYSTEM_METADATA("timestamp") PERSISTED,
>>>>>>>>>>>>>>>>>>         headers MAP<STRING, BYTES> SYSTEM_METADATA("headers") PERSISTED
>>>>>>>>>>>>>>>>>> ) WITH (
>>>>>>>>>>>>>>>>>>        ...
>>>>>>>>>>>>>>>>>> )
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> This way we would avoid confusion entirely and can easily map
>>>>>>>>>>>>>>>>>> columns to metadata columns. The disadvantage is that users cannot
>>>>>>>>>>>>>>>>>> call UDFs or parse timestamps. This would need to be done in a
>>>>>>>>>>>>>>>>>> real computed column.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> I'm happy about better alternatives.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>> Timo
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> On 08.09.20 15:37, Leonard Xu wrote:
>>>>>>>>>>>>>>>>>>> HI, Timo
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Thanks for driving this FLIP.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Sorry, but I have a concern about the "Writing metadata via
>>>>>>>>>>>>>>>>>>> DynamicTableSink" section:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> CREATE TABLE kafka_table (
>>>>>>>>>>>>>>>>>>>       id BIGINT,
>>>>>>>>>>>>>>>>>>>       name STRING,
>>>>>>>>>>>>>>>>>>>       timestamp AS CAST(SYSTEM_METADATA("timestamp") AS BIGINT) PERSISTED,
>>>>>>>>>>>>>>>>>>>       headers AS CAST(SYSTEM_METADATA("headers") AS MAP<STRING, BYTES>) PERSISTED
>>>>>>>>>>>>>>>>>>> ) WITH (
>>>>>>>>>>>>>>>>>>>       ...
>>>>>>>>>>>>>>>>>>> )
>>>>>>>>>>>>>>>>>>> An insert statement could look like:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> INSERT INTO kafka_table VALUES (
>>>>>>>>>>>>>>>>>>>       (1, "ABC", 1599133672, MAP('checksum', computeChecksum(...)))
>>>>>>>>>>>>>>>>>>> )
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> The proposed INSERT syntax does not make sense to me, because it
>>>>>>>>>>>>>>>>>>> contains computed (generated) columns.
>>>>>>>>>>>>>>>>>>> Neither SQL Server nor PostgreSQL allows inserting values into
>>>>>>>>>>>>>>>>>>> computed columns, even if they are persisted. This breaks the
>>>>>>>>>>>>>>>>>>> generated column semantics and may confuse users a lot.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> For SQL Server computed columns [1]:
>>>>>>>>>>>>>>>>>>>> column_name AS computed_column_expression [ PERSISTED [ NOT NULL ] ]...
>>>>>>>>>>>>>>>>>>>> NOTE: A computed column cannot be the target of an INSERT or
>>>>>>>>>>>>>>>>>>>> UPDATE statement.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> For PostgreSQL generated columns [2]:
>>>>>>>>>>>>>>>>>>>>      height_in numeric GENERATED ALWAYS AS (height_cm / 2.54) STORED
>>>>>>>>>>>>>>>>>>>> NOTE: A generated column cannot be written to directly. In INSERT
>>>>>>>>>>>>>>>>>>>> or UPDATE commands, a value cannot be specified for a generated
>>>>>>>>>>>>>>>>>>>> column, but the keyword DEFAULT may be specified.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> After looking up SQL:2016, setting/updating a value for a generated
>>>>>>>>>>>>>>>>>>> column shouldn't be allowed either:
>>>>>>>>>>>>>>>>>>>> <insert statement> ::=
>>>>>>>>>>>>>>>>>>>> INSERT INTO <insertion target> <insert columns and source>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> If <contextually typed table value constructor> CTTVC is
>>>>>>>>>>>>>>>>>>>> specified, then every <contextually typed row value constructor
>>>>>>>>>>>>>>>>>>>> element> simply contained in CTTVC whose positionally
>>>>>>>>>>>>>>>>>>>> corresponding <column name> in <insert column list> references a
>>>>>>>>>>>>>>>>>>>> column of which some underlying column is a generated column
>>>>>>>>>>>>>>>>>>>> shall be a <default specification>.
>>>>>>>>>>>>>>>>>>>> A <default specification> specifies the default value of some
>>>>>>>>>>>>>>>>>>>> associated item.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> [1] https://docs.microsoft.com/en-US/sql/t-sql/statements/alter-table-computed-column-definition-transact-sql?view=sql-server-ver15
>>>>>>>>>>>>>>>>>>> [2] https://www.postgresql.org/docs/12/ddl-generated-columns.html
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> On Sep 8, 2020, at 17:31, Timo Walther <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Hi Jark,
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> according to Flink's and Calcite's casting definition in [1][2],
>>>>>>>>>>>>>>>>>>>> TIMESTAMP WITH LOCAL TIME ZONE should be castable from BIGINT. If
>>>>>>>>>>>>>>>>>>>> not, we will make it possible ;-)
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I'm aware of DeserializationSchema.getProducedType but I think
>>>>>>>>>>>>>>>>>>>> that this method is actually misplaced. The type should rather be
>>>>>>>>>>>>>>>>>>>> passed to the source itself.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> For our Kafka SQL source, we will also not use this method because
>>>>>>>>>>>>>>>>>>>> the Kafka source will add its own metadata in addition to the
>>>>>>>>>>>>>>>>>>>> DeserializationSchema. So DeserializationSchema.getProducedType
>>>>>>>>>>>>>>>>>>>> will never be read.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> For now I suggest leaving out the `DataType` from
>>>>>>>>>>>>>>>>>>>> DecodingFormat.applyReadableMetadata, also because the format's
>>>>>>>>>>>>>>>>>>>> physical type is passed later in `createRuntimeDecoder`. If
>>>>>>>>>>>>>>>>>>>> necessary, it can be computed manually by consumedType + metadata
>>>>>>>>>>>>>>>>>>>> types. We will provide a metadata utility class for that.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>> Timo
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> [1] https://github.com/apache/flink/blob/master/flink-table/flink-table-common/src/main/java/org/apache/flink/table/types/logical/utils/LogicalTypeCasts.java#L200
>>>>>>>>>>>>>>>>>>>> [2] https://github.com/apache/calcite/blob/master/core/src/main/java/org/apache/calcite/sql/type/SqlTypeCoercionRule.java#L254
>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> On 08.09.20 10:52, Jark Wu wrote:
>>>>>>>>>>>>>>>>>>>>> Hi Timo,
>>>>>>>>>>>>>>>>>>>>> The updated CAST SYSTEM_METADATA behavior sounds good to me. I just
>>>>>>>>>>>>>>>>>>>>> noticed that a BIGINT can't be converted to "TIMESTAMP(3) WITH
>>>>>>>>>>>>>>>>>>>>> LOCAL TIME ZONE".
>>>>>>>>>>>>>>>>>>>>> So maybe we need to support this, or use "TIMESTAMP(3) WITH LOCAL
>>>>>>>>>>>>>>>>>>>>> TIME ZONE" as the defined type of the Kafka timestamp? I think
>>>>>>>>>>>>>>>>>>>>> this makes sense, because it represents the milliseconds since
>>>>>>>>>>>>>>>>>>>>> epoch.
>>>>>>>>>>>>>>>>>>>>> Regarding "DeserializationSchema doesn't need TypeInfo", I don't
>>>>>>>>>>>>>>>>>>>>> think so.
>>>>>>>>>>>>>>>>>>>>> The DeserializationSchema implements ResultTypeQueryable, thus the
>>>>>>>>>>>>>>>>>>>>> implementation needs to return an output TypeInfo.
>>>>>>>>>>>>>>>>>>>>> Besides, FlinkKafkaConsumer also calls
>>>>>>>>>>>>>>>>>>>>> DeserializationSchema.getProducedType as the produced type of the
>>>>>>>>>>>>>>>>>>>>> source function [1].
>>>>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>>>> Jark
>>>>>>>>>>>>>>>>>>>>> [1]: https://github.com/apache/flink/blob/master/flink-connectors/flink-connector-kafka-base/src/main/java/org/apache/flink/streaming/connectors/kafka/FlinkKafkaConsumerBase.java#L1066
>>>>>>>>>>>>>>>>>>>>> On Tue, 8 Sep 2020 at 16:35, Timo Walther <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>> Hi everyone,
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> I updated the FLIP again and hope that I could address the
>>>>>>>>>>>>>>>>>>>>>> mentioned concerns.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> @Leonard: Thanks for the explanation. I wasn't aware that ts_ms
>>>>>>>>>>>>>>>>>>>>>> and source.ts_ms have different semantics. I updated the FLIP and
>>>>>>>>>>>>>>>>>>>>>> expose the most commonly used properties separately. So frequently
>>>>>>>>>>>>>>>>>>>>>> used properties are not hidden in the MAP anymore:
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> debezium-json.ingestion-timestamp
>>>>>>>>>>>>>>>>>>>>>> debezium-json.source.timestamp
>>>>>>>>>>>>>>>>>>>>>> debezium-json.source.database
>>>>>>>>>>>>>>>>>>>>>> debezium-json.source.schema
>>>>>>>>>>>>>>>>>>>>>> debezium-json.source.table
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> However, since other properties depend on the used
>>>>>>>>>>>>>>>>>>>>>> connector/vendor, the remaining options are stored in:
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> debezium-json.source.properties
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> And accessed with:
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> CAST(SYSTEM_METADATA('debezium-json.source.properties') AS MAP<STRING, STRING>)['table']
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Otherwise it is not possible to figure out the value and column type
>>>>>>>>>>>>>>>>>>>>>> during validation.
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> @Jark: You convinced me to relax the CAST constraints. I added a
>>>>>>>>>>>>>>>>>>>>>> dedicated sub-section to the FLIP:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> To make the use of SYSTEM_METADATA easier and avoid nested casting,
>>>>>>>>>>>>>>>>>>>>>> we allow explicit casting to a target data type:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> rowtime AS CAST(SYSTEM_METADATA("timestamp") AS TIMESTAMP(3) WITH LOCAL TIME ZONE)
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> A connector still produces and consumes the data type returned by
>>>>>>>>>>>>>>>>>>>>>> `listMetadata()`. The planner will insert necessary explicit casts.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> In any case, the user must provide a CAST such that the computed
>>>>>>>>>>>>>>>>>>>>>> column receives a valid data type when constructing the table schema.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> "I don't see a reason why `DecodingFormat#applyReadableMetadata`
>>>>>>>>>>>>>>>>>>>>>> needs a DataType argument."
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Correct, the DeserializationSchema doesn't need TypeInfo; it is
>>>>>>>>>>>>>>>>>>>>>> always executed locally. It is the source that needs TypeInfo for
>>>>>>>>>>>>>>>>>>>>>> serializing the record to the next operator. And that is what we
>>>>>>>>>>>>>>>>>>>>>> provide.
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> @Danny:
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> “`SYSTEM_METADATA("offset")` returns the NULL type by default”
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> We can also use some other means to represent an UNKNOWN data type.
>>>>>>>>>>>>>>>>>>>>>> In the Flink type system, we use the NullType for it. The important
>>>>>>>>>>>>>>>>>>>>>> part is that the final data type is known for the entire computed
>>>>>>>>>>>>>>>>>>>>>> column. As I mentioned before, I would avoid the suggested option b)
>>>>>>>>>>>>>>>>>>>>>> that would be similar to your suggestion. The CAST should be enough
>>>>>>>>>>>>>>>>>>>>>> and allows for complex expressions in the computed column. Option b)
>>>>>>>>>>>>>>>>>>>>>> would need parser changes.
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>> Timo
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> On 08.09.20 06:21, Leonard Xu wrote:
>>>>>>>>>>>>>>>>>>>>>>> Hi, Timo
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Thanks for your explanation and update. I have only one question
>>>>>>>>>>>>>>>>>>>>>>> about the latest FLIP.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> About the MAP<STRING, STRING> DataType of key 'debezium-json.source':
>>>>>>>>>>>>>>>>>>>>>>> if users want to use the table name metadata, they need to write:
>>>>>>>>>>>>>>>>>>>>>>> tableName STRING AS CAST(SYSTEM_METADATA('debezium-json.source') AS MAP<STRING, STRING>)['table']
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> The expression is a little complex for users. Could we only support
>>>>>>>>>>>>>>>>>>>>>>> the necessary metadata with simple DataTypes, as follows?
>>>>>>>>>>>>>>>>>>>>>>> tableName STRING AS CAST(SYSTEM_METADATA('debezium-json.source.table') AS STRING),
>>>>>>>>>>>>>>>>>>>>>>> transactionTime LONG AS CAST(SYSTEM_METADATA('debezium-json.source.ts_ms') AS BIGINT),
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> In this way, we can simplify the expression. From my side, the most
>>>>>>>>>>>>>>>>>>>>>>> commonly used metadata in changelog formats are 'database', 'table',
>>>>>>>>>>>>>>>>>>>>>>> 'source.ts_ms' and 'ts_ms'; maybe we could support only them in the
>>>>>>>>>>>>>>>>>>>>>>> first version.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Both Debezium and Canal have the above four metadata fields, and I'm
>>>>>>>>>>>>>>>>>>>>>>> willing to take some subtasks in the next development if necessary.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Debezium:
>>>>>>>>>>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>>>>>>>>>>        "before": null,
>>>>>>>>>>>>>>>>>>>>>>>        "after": { "id": 101, "name": "scooter" },
>>>>>>>>>>>>>>>>>>>>>>>        "source": {
>>>>>>>>>>>>>>>>>>>>>>>          "db": "inventory",          # 1. database name the changelog belongs to.
>>>>>>>>>>>>>>>>>>>>>>>          "table": "products",        # 2. table name the changelog belongs to.
>>>>>>>>>>>>>>>>>>>>>>>          "ts_ms": 1589355504100,     # 3. timestamp at which the change happened in the database system, i.e. the transaction time in the database.
>>>>>>>>>>>>>>>>>>>>>>>          "connector": "mysql",
>>>>>>>>>>>>>>>>>>>>>>>          ….
>>>>>>>>>>>>>>>>>>>>>>>        },
>>>>>>>>>>>>>>>>>>>>>>>        "ts_ms": 1589355606100,       # 4. timestamp at which Debezium processed the changelog.
>>>>>>>>>>>>>>>>>>>>>>>        "op": "c",
>>>>>>>>>>>>>>>>>>>>>>>        "transaction": null
>>>>>>>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Canal:
>>>>>>>>>>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>>>>>>>>>>        "data": [{ "id": "102", "name": "car battery" }],
>>>>>>>>>>>>>>>>>>>>>>>        "database": "inventory",      # 1. database name the changelog belongs to.
>>>>>>>>>>>>>>>>>>>>>>>        "table": "products",          # 2. table name the changelog belongs to.
>>>>>>>>>>>>>>>>>>>>>>>        "es": 1589374013000,          # 3. execution time of the change in the database system, i.e. the transaction time in the database.
>>>>>>>>>>>>>>>>>>>>>>>        "ts": 1589374013680,          # 4. timestamp at which Canal processed the changelog.
>>>>>>>>>>>>>>>>>>>>>>>        "isDdl": false,
>>>>>>>>>>>>>>>>>>>>>>>        "mysqlType": {},
>>>>>>>>>>>>>>>>>>>>>>>        ....
>>>>>>>>>>>>>>>>>>>>>>> }
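
The correspondence between the two sample records above can be written down as a small lookup table. This is only a sketch: the class name, the unified key names and the format identifiers ("debezium-json", "canal-json") are assumptions made for illustration, and only the JSON paths are taken from the samples.

```java
import java.util.Map;

/** Hypothetical helper for illustration; not part of Flink or of FLIP-107. */
final class ChangelogMetadataPaths {

    // Assumed unified key -> JSON path in a Debezium record (paths from the sample above).
    static final Map<String, String> DEBEZIUM = Map.of(
            "database", "source.db",
            "table", "source.table",
            "source-timestamp", "source.ts_ms",
            "ingestion-timestamp", "ts_ms");

    // Assumed unified key -> JSON path in a Canal record (paths from the sample above).
    static final Map<String, String> CANAL = Map.of(
            "database", "database",
            "table", "table",
            "source-timestamp", "es",
            "ingestion-timestamp", "ts");

    /** Returns the JSON path that holds the given metadata key in the given format. */
    static String pathFor(String format, String key) {
        Map<String, String> paths;
        switch (format) {
            case "debezium-json": paths = DEBEZIUM; break;
            case "canal-json": paths = CANAL; break;
            default: throw new IllegalArgumentException("Unknown format: " + format);
        }
        String path = paths.get(key);
        if (path == null) {
            throw new IllegalArgumentException("Unknown metadata key: " + key);
        }
        return path;
    }

    public static void main(String[] args) {
        System.out.println(pathFor("debezium-json", "source-timestamp"));
        System.out.println(pathFor("canal-json", "source-timestamp"));
    }
}
```

The point of the table is that all four fields exist in both formats under different names, which is the basis for exposing them as dedicated keys rather than through a vendor-specific MAP.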
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Best
>>>>>>>>>>>>>>>>>>>>>>> Leonard
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> On Sep 8, 2020, at 11:57, Danny Chan <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Thanks Timo ~
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> The FLIP is already in pretty good shape, I have only 2 questions
>>>>>>>>>>>>>>>>>>>>>>>> here:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> 1. “`CAST(SYSTEM_METADATA("offset") AS INT)` would be a valid
>>>>>>>>>>>>>>>>>>>>>>>> read-only computed column for Kafka and can be extracted by the
>>>>>>>>>>>>>>>>>>>>>>>> planner.”
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> What are the pros of following the SQL Server syntax here? Usually
>>>>>>>>>>>>>>>>>>>>>>>> an expression's return type can be inferred automatically. But I
>>>>>>>>>>>>>>>>>>>>>>>> guess SQL Server does not have a function like SYSTEM_METADATA,
>>>>>>>>>>>>>>>>>>>>>>>> which actually does not have a specific return type.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> And why not use the Oracle or MySQL syntax there?
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> column_name [datatype] [GENERATED ALWAYS] AS (expression) [VIRTUAL]
>>>>>>>>>>>>>>>>>>>>>>>> Which is more straightforward.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> 2. “`SYSTEM_METADATA("offset")` returns the NULL type by default”
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> The default type should not be NULL because only the NULL literal
>>>>>>>>>>>>>>>>>>>>>>>> does that. Usually we use ANY as the type if we do not know the
>>>>>>>>>>>>>>>>>>>>>>>> specific type in the SQL context. ANY means the physical value can
>>>>>>>>>>>>>>>>>>>>>>>> be any Java object.
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> [1] https://oracle-base.com/articles/11g/virtual-columns-11gr1
>>>>>>>>>>>>>>>>>>>>>>>> [2] https://dev.mysql.com/doc/refman/5.7/en/create-table-generated-columns.html
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>>>>>>> Danny Chan
>>>>>>>>>>>>>>>>>>>>>>>> On Sep 4, 2020 at 4:48 PM +0800, Timo Walther <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>> Hi everyone,
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> I completely reworked FLIP-107. It now covers the full story of how
>>>>>>>>>>>>>>>>>>>>>>>>> to read and write metadata from/to connectors and formats. It
>>>>>>>>>>>>>>>>>>>>>>>>> considers all of the latest FLIPs, namely FLIP-95, FLIP-132 and
>>>>>>>>>>>>>>>>>>>>>>>>> FLIP-122. It introduces the concept of PERSISTED computed columns
>>>>>>>>>>>>>>>>>>>>>>>>> and leaves out partitioning for now.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Looking forward to your feedback.
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>>> Timo
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> On 04.03.20 09:45, Kurt Young wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>> Sorry, forgot one question.
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> 4. Can we make the value.fields-include more orthogonal? For
>>>>>>>>>>>>>>>>>>>>>>>>>> example, one could specify it as "EXCEPT_KEY, EXCEPT_TIMESTAMP".
>>>>>>>>>>>>>>>>>>>>>>>>>> With the current EXCEPT_KEY and EXCEPT_KEY_TIMESTAMP, users cannot
>>>>>>>>>>>>>>>>>>>>>>>>>> configure it to ignore just the timestamp but keep the key.
>>>>>>>>>>>>>>>>>>>>>>>>>> 
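
One way to read the "more orthogonal" suggestion is as a comma-separated set of independent exclusions. The following is a minimal sketch under that assumption; the class `FieldsInclude`, the enum and the parsing rules are hypothetical and are not the option's actual implementation.

```java
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical parser for illustration; not the actual Flink option handling. */
final class FieldsInclude {

    /** Independent exclusions, so any combination can be expressed. */
    enum Exclusion { EXCEPT_KEY, EXCEPT_TIMESTAMP }

    /** Parses a comma-separated option value such as "EXCEPT_KEY, EXCEPT_TIMESTAMP". */
    static Set<Exclusion> parse(String option) {
        EnumSet<Exclusion> result = EnumSet.noneOf(Exclusion.class);
        for (String token : option.split(",")) {
            String name = token.trim();
            if (!name.isEmpty()) {
                // valueOf throws IllegalArgumentException for unknown names.
                result.add(Exclusion.valueOf(name));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(parse("EXCEPT_TIMESTAMP"));
        System.out.println(parse("EXCEPT_KEY, EXCEPT_TIMESTAMP"));
    }
}
```

With independent flags, "ignore the timestamp but keep the key" becomes the single value EXCEPT_TIMESTAMP, which the combined EXCEPT_KEY_TIMESTAMP value cannot express.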
>>>>>>>>>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>>>>>>>>> Kurt
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Mar 4, 2020 at 4:42 PM Kurt Young <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Dawid,
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> I have a couple of questions around key fields. Actually I also
>>>>>>>>>>>>>>>>>>>>>>>>>>> have some other questions, but I want to focus on key fields first.
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> 1. I don't fully understand the usage of "key.fields". Is this
>>>>>>>>>>>>>>>>>>>>>>>>>>> option only valid during write operations? Because for reading, I
>>>>>>>>>>>>>>>>>>>>>>>>>>> can't imagine how such options can be applied. I would expect that
>>>>>>>>>>>>>>>>>>>>>>>>>>> there might be a SYSTEM_METADATA("key") to read and assign the key
>>>>>>>>>>>>>>>>>>>>>>>>>>> to a normal field?
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> 2. If "key.fields" is only valid in write operations, I want to
>>>>>>>>>>>>>>>>>>>>>>>>>>> propose that we simplify the options by not introducing
>>>>>>>>>>>>>>>>>>>>>>>>>>> key.format.type and other related options. I think a single
>>>>>>>>>>>>>>>>>>>>>>>>>>> "key.field" (not fields) would be enough; users can use a UDF to
>>>>>>>>>>>>>>>>>>>>>>>>>>> calculate whatever key they want before the sink.
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> 3. Also I don't want to introduce "value.format.type" and
>>>>>>>>>>>>>>>>>>>>>>>>>>> "value.format.xxx" with the "value" prefix. Not every connector has
>>>>>>>>>>>>>>>>>>>>>>>>>>> a concept of key and values. The old parameter "format.type" is
>>>>>>>>>>>>>>>>>>>>>>>>>>> already good enough to use.
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>>>>>>>>>> Kurt
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> On Tue, Mar 3, 2020 at 10:40 PM Jark Wu <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks Dawid,
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> I have two more questions.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> SupportsMetadata
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Introducing SupportsMetadata sounds good to me. But I have some
>>>>>>>>>>>>>>>>>>>>>>>>>>>> questions regarding this interface.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) How does the source know the expected return type of each
>>>>>>>>>>>>>>>>>>>>>>>>>>>> metadata?
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Where to put the metadata fields? Append to the existing
>>>>>>>>>>>>>>>>>>>>>>>>>>>> physical fields?
>>>>>>>>>>>>>>>>>>>>>>>>>>>> If yes, I would suggest changing the signature to
>>>>>>>>>>>>>>>>>>>>>>>>>>>> `TableSource appendMetadataFields(String[] metadataNames, DataType[] metadataTypes)`
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> SYSTEM_METADATA("partition")
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Can the SYSTEM_METADATA() function be used nested in a computed
>>>>>>>>>>>>>>>>>>>>>>>>>>>> column expression? If yes, how to specify the return type of
>>>>>>>>>>>>>>>>>>>>>>>>>>>> SYSTEM_METADATA?
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Jark
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Tue, 3 Mar 2020 at 17:06, Dawid Wysakowicz <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1. I thought a bit more on how the source would emit the columns
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and I now see it's not exactly the same as regular columns. I see
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> a need to elaborate a bit more on that in the FLIP as you asked,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Jark.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I do agree mostly with Danny on how we should do that. One
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> additional thing I would introduce is an
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> interface SupportsMetadata {
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     boolean supportsMetadata(Set<String> metadataFields);
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>     TableSource generateMetadataFields(Set<String> metadataFields);
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> This way the source would have to declare/emit only the requested
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> metadata fields. In order not to clash with user-defined fields,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> when emitting the metadata field I would prefix the column name
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> as __system_{property_name}. Therefore, when
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> SYSTEM_METADATA("partition") is requested, the source would append
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> a field __system_partition to the schema. This would never be
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> visible to the user, as it would be used only for the subsequent
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> computed columns. If that makes sense to you, I will update the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> FLIP with this description.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
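
The naming rule described above can be sketched in a few lines. Only the `__system_` prefix comes from the message; the class `SystemColumns`, its methods and the clash check are assumptions added for illustration and are not a Flink API.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of Flink. */
final class SystemColumns {

    static final String PREFIX = "__system_";

    /** Internal column name for a requested metadata property. */
    static String columnNameFor(String property) {
        return PREFIX + property;
    }

    /** Appends one internal column per requested property to the physical schema. */
    static List<String> appendMetadata(List<String> physicalColumns, List<String> requested) {
        List<String> schema = new ArrayList<>(physicalColumns);
        for (String property : requested) {
            String name = columnNameFor(property);
            if (schema.contains(name)) {
                // Assumed behaviour: fail loudly instead of shadowing an existing column.
                throw new IllegalStateException("Column name clash: " + name);
            }
            schema.add(name);
        }
        return schema;
    }

    public static void main(String[] args) {
        System.out.println(appendMetadata(List.of("id", "name"), List.of("partition")));
    }
}
```

Because the appended columns are consumed only by the computed columns that follow, they would not need to appear in the user-facing schema.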
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2. CAST vs explicit type in computed columns
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Here I agree with Danny. It is also the current state of the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> proposal.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 3. Partitioning on computed column vs function
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Here I also agree with Danny. I also think those are orthogonal.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I would leave the STORED computed columns out of the discussion.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I don't see how they relate to the partitioning. I already put
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> both of those cases in the document. We can either partition on a
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> computed column or use a udf in a partitioned by clause. I am fine
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> with leaving out the partitioning by udf in the first version if
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> you still have some concerns.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> As for your question, Danny: it depends on which partitioning
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> strategy you use.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> For the HASH partitioning strategy I thought it would work as you
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> explained. It would be N = MOD(expr, num). I am not sure though if
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> we should introduce the PARTITIONS clause. Usually Flink does not
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> own the data and the partitions are already an intrinsic property
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> of the underlying source, e.g. for Kafka we do not create topics,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> but we just describe a pre-existing pre-partitioned topic.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
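
As a worked example of N = MOD(expr, num), here is a minimal sketch. The class and method names are hypothetical, and the use of `Math.floorMod` is an assumption made so that negative expression values still map to a partition in [0, num); it is not necessarily what MySQL, Oracle or Flink do for negative inputs.

```java
/** Hypothetical illustration of HASH partition selection; not a Flink API. */
final class HashPartitioning {

    /** Returns partition number N = MOD(exprValue, numPartitions), kept non-negative. */
    static int partitionFor(long exprValue, int numPartitions) {
        if (numPartitions <= 0) {
            throw new IllegalArgumentException("numPartitions must be positive");
        }
        // floorMod (an assumption of this sketch) never returns a negative remainder
        // when the divisor is positive.
        return (int) Math.floorMod(exprValue, (long) numPartitions);
    }

    public static void main(String[] args) {
        System.out.println(partitionFor(12345L, 4));
        System.out.println(partitionFor(-1L, 4));
    }
}
```

The sketch needs `num` as an input, which is exactly what a PARTITIONS clause would supply; for a pre-partitioned Kafka topic that number is already fixed by the topic.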
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 4. timestamp vs timestamp.field vs connector.field vs ...
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I am fine with changing it to timestamp.field to be consistent
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> with other value.fields and key.fields. Actually that was also my
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> initial proposal in a first draft I prepared. I changed it
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> afterwards to shorten the key.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Dawid
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On 03/03/2020 09:00, Danny Chan wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks Dawid for bringing up this discussion, I think it is a
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> useful feature ~
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> About how the metadata outputs from source
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I think it is completely orthogonal. Computed column push down
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> is another topic; it should not be a blocker but an improvement.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> If we do not have any filters on the computed column, there is
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> no need to push anything down: the source node just emits the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> complete record with full metadata with the declared physical
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> schema, and then, when generating the virtual columns, we would
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> extract the metadata info and output it as full columns (with
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> full schema).
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> About the type of metadata column
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Personally I prefer an explicit type instead of CAST. They are
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> semantically equivalent, though the explicit type is more
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> straightforward and we can declare the nullable attribute there.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> About option A: partitioning based on a computed column VS
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> option B: partitioning with just a function
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> From the FLIP, it seems that B's partitioning is just a strategy
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> when writing data; the partition column is not included in the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> table schema, so it's just useless when reading from that.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> - Compared to A, we do not need to generate the partition column
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> when selecting from the table (but insert into)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> - For A we can also mark the column as STORED when we want to
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> persist that
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> So in my opinion they are orthogonal, and we can support both. I
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> saw that MySQL/Oracle[1][2] would suggest to also define the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> PARTITIONS num, and the partitions are managed under a
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> "tablenamespace"; the partition in which the record is stored is
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> partition number N, where N = MOD(expr, num). For your design,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> in which partition would the record be persisted?
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> [1] https://dev.mysql.com/doc/refman/5.7/en/partitioning-hash.html
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> [2] https://docs.oracle.com/database/121/VLDBG/GUID-F023D3ED-262F-4B19-950A-D3C8F8CDB4F4.htm#VLDBG1270
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Danny Chan
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Mar 2, 2020 at 6:16 PM +0800, Dawid Wysakowicz <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Jark,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Ad. 2 I added a section to discuss relation to FLIP-63
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Ad. 3 Yes, I also tried to somewhat keep the hierarchy of
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> properties. Therefore you have the key.format.type.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I also considered exactly what you are suggesting (prefixing
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> with connector or kafka). I should've put that into an
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Option/Rejected alternatives.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I agree timestamp, key.*, value.* are connector properties.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Why I wanted to suggest not adding that prefix in the first
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> version is that actually all the properties in the WITH section
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> are connector properties. Even format is in the end a connector
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> property as some of the sources might not have a format, imo.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> The benefit of not adding the prefix is that it makes the keys
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> a bit shorter. Imagine prefixing all the properties with
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> connector (or if we go with FLINK-12557: elasticsearch):
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> elasticsearch.key.format.type: csv
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> elasticsearch.key.format.field: ....
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> elasticsearch.key.format.delimiter: ....
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> elasticsearch.key.format.*: ....
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I am fine with doing it though if this is a preferred approach
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> in the community.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Ad in-line comments:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I forgot to update the `value.fields.include` property. It
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> should be value.fields-include. Which I think you also
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> suggested in the comment, right?
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> As for the cast vs declaring the output type of a computed
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> column: I think it's better not to use CAST, but to declare the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> type of an expression and later on infer the output type of
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> SYSTEM_METADATA. The reason is that I think this way it will be
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> easier to implement e.g. filter push downs when working with
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the native types of the source. E.g. in case of Kafka's offset,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I think it's better to push down a long rather than a string.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> This could let us push an expression like
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> offset > 12345 && offset < 59382. Otherwise we would have to
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> push down cast(offset, long) > 12345 && cast(offset, long) < 59382.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Moreover I think we need to introduce the type for computed
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> columns anyway to support functions that infer the output type
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> based on the expected return type.
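
The difference between the two pushed-down filters can be illustrated with plain predicates. This is a sketch only: the class `OffsetFilter` and both fields are hypothetical, and real filter push down in Flink works on planner expressions rather than on Java lambdas.

```java
import java.util.function.LongPredicate;
import java.util.function.Predicate;

/** Hypothetical illustration; not how the Flink planner represents filters. */
final class OffsetFilter {

    // Offset exposed in its native type: the range check applies directly.
    static final LongPredicate NATIVE =
            offset -> offset > 12345L && offset < 59382L;

    // Offset exposed as a string: every evaluation needs a conversion first,
    // which corresponds to cast(offset, long) in the pushed-down expression.
    static final Predicate<String> WITH_CAST = text -> {
        long offset = Long.parseLong(text);
        return offset > 12345L && offset < 59382L;
    };

    public static void main(String[] args) {
        System.out.println(NATIVE.test(20000L));
        System.out.println(WITH_CAST.test("20000"));
    }
}
```

Both predicates accept the same offsets; the native variant is simply the form a source can evaluate without first undoing a cast.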
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> As for the computed
>>>> column push down. Yes,
>>>>>>>>>>>>>> SYSTEM_METADATA
>>>>>>>>>>>>>>>>>> would
>>>>>>>>>>>>>>>>>>>>>> have
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> to be pushed down to
>>>> the source. If it is not
>>>>>>>> possible
>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>> planner
>>>>>>>>>>>>>>>>>>>>>>>>>>>> should
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> fail. As far as I know
>>>> computed columns push down
>>>>>>>> will
>>>>>>>>>>>> be
>>>>>>>>>>>>>>>> part
>>>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>>>>> source
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> rework, won't it? ;)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> As for the persisted computed column. I think it is completely
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> orthogonal. In my current proposal you can also partition by a computed
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> column. The difference between using a udf in partitioned by vs
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> partitioned by a computed column is that when you partition by a
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> computed column this column must be also computed when reading the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> table. If you use a udf in the partitioned by, the expression is
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> computed only when inserting into the table.
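
The difference described above can be modelled with a small toy table in Python. This is only a sketch under stated assumptions: `ToyTable` and its parameters are hypothetical and do not reflect how Flink stores or plans anything. It shows that the partition expression runs on insert in both cases, but is evaluated again on read only when it is declared as a computed column.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Optional

Row = Dict[str, object]


class ToyTable:
    """Hypothetical table partitioned by an expression over the row."""

    def __init__(self, partition_expr: Callable[[Row], object],
                 computed_column: Optional[str] = None) -> None:
        self.partition_expr = partition_expr
        # Set only when the partition expression is declared as a computed column.
        self.computed_column = computed_column
        self.partitions: Dict[object, List[Row]] = defaultdict(list)

    def insert(self, row: Row) -> None:
        # In both variants the expression is evaluated when writing.
        self.partitions[self.partition_expr(row)].append(dict(row))

    def read(self) -> List[Row]:
        result = []
        for rows in self.partitions.values():
            for stored in rows:
                row = dict(stored)
                if self.computed_column is not None:
                    # Computed column: the expression is evaluated again on read.
                    row[self.computed_column] = self.partition_expr(row)
                result.append(row)
        return result
```

With a plain function used only for partitioning, `read()` returns the stored rows unchanged; with a computed column name, each row read back also carries the recomputed value.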
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hope this answers some of your questions. Looking forward to further
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> suggestions.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Dawid
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On 02/03/2020 05:18, Jark Wu wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks Dawid for starting such a great discussion. Reading metadata and
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> key-part information from source is an important feature for streaming
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> users.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> In general, I agree with the proposal of the FLIP.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I will leave my thoughts and comments here:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) +1 to use connector properties instead of introducing HEADER keyword,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> for the reason you mentioned in the FLIP.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) We already introduced PARTITIONED BY in FLIP-63. Maybe we should add a
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> section to explain what's the relationship between them. Do their
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> concepts conflict? Could INSERT PARTITION be used on the PARTITIONED
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> table in this FLIP?
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 3) Currently, properties are hierarchical in Flink SQL. Shall we make the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> newly introduced properties more hierarchical? For example,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> "timestamp" => "connector.timestamp"? (Actually, I prefer
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> "kafka.timestamp", which is another improvement for properties,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> FLINK-12557.) A single "timestamp" in properties may mislead users
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> into thinking that the field is a rowtime attribute.
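
To make the idea of hierarchical property keys in point 3 concrete, here is a minimal Python sketch that groups flat dotted keys into a nested structure. The function name `nest_properties` and the example keys and values are illustrative assumptions, not actual Flink options, and the sketch assumes no key is both a leaf and a prefix of another key.

```python
from typing import Any, Dict


def nest_properties(flat: Dict[str, str]) -> Dict[str, Any]:
    """Group dotted keys such as "kafka.timestamp" under their prefix."""
    nested: Dict[str, Any] = {}
    for key, value in flat.items():
        *parents, leaf = key.split(".")
        node = nested
        for part in parents:
            # Create the intermediate level on first use.
            node = node.setdefault(part, {})
        node[leaf] = value
    return nested
```

A bare "timestamp" key would land at the top level with no owner, whereas "kafka.timestamp" is grouped under "kafka", which makes it clearer that the property belongs to the connector rather than to a rowtime definition.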
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I also left some minor comments in the FLIP.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Jark
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Sun, 1 Mar 2020 at 22:30, Dawid Wysakowicz <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I would like to propose an improvement that would enable reading table
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> columns from different parts of source records. Besides the main payload,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the majority (if not all) of the sources expose additional information.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> It can be simply read-only metadata such as offset or ingestion time, or
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> read-and-write parts of the record that contain data but additionally
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> serve different purposes (partitioning, compaction etc.), e.g. key or
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> timestamp in Kafka.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> We should make it possible to read and write data from all of those
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> locations. In this proposal I discuss reading partitioning data; for
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> completeness, this proposal also discusses the partitioning when writing
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> data out.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I am looking forward to your comments.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> You can access the FLIP here:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-107%3A+Reading+table+columns+from+different+parts+of+source+records?src=contextnavpagetreemode
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Dawid
