Re: [VOTE] FLIP-95: New TableSource and TableSink interfaces

Dawid Wysakowicz Fri, 03 Apr 2020 01:43:10 -0700

Hi all,

Just to make it clear. I don't want to block the whole effort. I'm big
+1 on the whole document and -0 on using the TableSchema for a
projection pushdown.


My personal opinion was that TableSchema is misleading for Projection
pushdown and I still think that way. That's why I wanted to bring it up
to see if I am the only one. If that's the case, let's just proceed with
the TableSchema.

Best,

Dawid

On 02/04/2020 17:19, Jark Wu wrote:
> Hi Timo,
>
> I don't think source should work with `CatalogTableSchema`. So far, a table
> source doesn't need to know the logic information of computed column and
> watermark.
> IMO, we should provide a method to convert from `CatalogTableSchema` into
> `TableSchema` without computed columns in source factory,
> and a source should just hold the `TableSchema`.
>
> I agree doing the intersection/diff logic is trivial, but maybe we can
> provide utilities to do that? So that we can keep the interface clean.
>
>
> Best,
> Jark
>
>
> On Thu, 2 Apr 2020 at 20:17, Timo Walther <[email protected]> wrote:
>
>> Hi Jark,
>>
>> if catalogs use `CatalogTableSchema` in the future. The source would
>> internally also work with `CatalogTableSchema`. I'm fine with cleaning
>> up the `TableSchema` class but should a source deal with two different
>> schema classes then?
>>
>> Another problem that I see is that connectors usually need to perform
>> some index arithmetics. Dealing with TableSchema and additionally within
>> a field with DataType might be a bit inconvenient. A dedicated class
>> with utilities might be helpful such that not every source needs to
>> implement the same intersection/diff logic again.
>>
>> Regards,
>> Timo
>>
>>
>> On 02.04.20 14:06, Jark Wu wrote:
>>> Hi Dawid,
>>>
>>>> How to express projections with TableSchema?
>>> The TableSource holds the original TableSchema (i.e. from DDL) and the
>>> pushed TableSchema represents the schema after projection.
>>> Thus the table source can compare them to figure out changed field orders
>>> or not matched types.
>>> For most sources who maps physical storage by field names (e.g. jdbc,
>>> hbase, json) they can just simply apply the pushed TableSchema.
>>> But sources who maps by field indexes (e.g. csv), they need to figure out
>>> the projected indexes by comparing the original and projected schema.
>>> For example, the original schema is [a: String, b: Int, c: Timestamp],
>> and
>>> b is pruned, then the pushed schema is [a: String, c: Timestamp]. So the
>>> source can figure out index=1 is pruned.
>>>
>>>> How do we express projection of a nested field with TableSchema?
>>> This is the same to the above one. For example, the original schema is
>> [rk:
>>> String, f1 Row<q1 Int, q2 Double>].
>>> If `f1.q1` is pruned, the pushed schema will be [rk: String, f1 Row<q2
>>> Double>].
>>>
>>>> TableSchema might be used at too many different places for different
>>> responsibilities.
>>> Agree. We have recognized that a structure and builder for pure table
>>> schema is required in many places. But we mixed many concepts of catalog
>>> table schema in TableSchema.
>>> IIRC, in an offline discussion of FLIP-84, we want to introduce a new
>>> `CatalogTableSchema` to represent the schema part of a DDL,
>>> and remove all the watermark, computed column information from
>> TableSchema?
>>> Then `TableSchema` can continue to serve as a pure table schema and it
>>> stays in a good package.
>>>
>>> Best,
>>> Jark
>>>
>>>
>>>
>>>
>>> On Thu, 2 Apr 2020 at 19:39, Timo Walther <[email protected]> wrote:
>>>
>>>> Hi Dawid,
>>>>
>>>> thanks for your feedback. I agree with your concerns. I also observed
>>>> that TableSchema might be used at too many different places for
>>>> different responsibilities.
>>>>
>>>> How about we introduce a helper class for `SupportsProjectionPushDown`
>>>> and also `LookupTableSource#Context#getKeys()` to represent nested
>>>> structure of names. Data types, constraints, or computed columns are not
>>>> necessary at those locations.
>>>>
>>>> We can also add utility methods for connectors to this helper class
>>>> there to quickly figuring out differences between the original table
>>>> schema and the new one.
>>>>
>>>> SelectedFields {
>>>>
>>>>          private LogicalType orignalRowType; // set by the planner
>>>>
>>>>          private int[][] indices;
>>>>
>>>>          getNames(int... at): String[]
>>>>
>>>>          getNames(String... at): String[]
>>>>
>>>>          getIndices(int... at): int[]
>>>>
>>>>          getNames(String... at): String[]
>>>>
>>>>          toTableSchema(): TableSchema
>>>> }
>>>>
>>>> What do others think?
>>>>
>>>> Thanks,
>>>> Timo
>>>>
>>>>
>>>>
>>>> On 02.04.20 12:28, Dawid Wysakowicz wrote:
>>>>> Generally +1
>>>>>
>>>>> One slight concern I have is about the |SupportsProjectionPushDown.|I
>>>>> don't necessarily understand how can we express projections with
>>>>> TableSchema. It's unclear for me what happens when a type of a field
>>>>> changes, fields are in a different order, when types do not match. How
>>>>> do we express projection of a nested field with TableSchema?
>>>>>
>>>>> I don't think this changes the core design presented in the FLIP,
>>>>> therefore I'm fine with accepting the FLIP. I wanted to mention my
>>>>> concerns, so that maybe we can adjust the passed around structures
>>>> slightly.
>>>>> Best,
>>>>>
>>>>> Dawid
>>>>> ||
>>>>>
>>>>> On 30/03/2020 14:42, Leonard Xu wrote:
>>>>>> +1(non-binding)
>>>>>>
>>>>>> Best,
>>>>>> Leonard Xu
>>>>>>
>>>>>>> 在 2020年3月30日，16:43，Jingsong Li<[email protected]>  写道：
>>>>>>>
>>>>>>> +1
>>>>>>>
>>>>>>> Best,
>>>>>>> Jingsong Lee
>>>>>>>
>>>>>>> On Mon, Mar 30, 2020 at 4:41 PM Kurt Young<[email protected]>  wrote:
>>>>>>>
>>>>>>>> +1
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Kurt
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Mar 30, 2020 at 4:08 PM Benchao Li<[email protected]>
>>>> wrote:
>>>>>>>>> +1 (non-binding)
>>>>>>>>>
>>>>>>>>> Jark Wu<[email protected]>  于2020年3月30日周一 下午3:57写道：
>>>>>>>>>
>>>>>>>>>> +1 from my side.
>>>>>>>>>>
>>>>>>>>>> Thanks Timo for driving this.
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Jark
>>>>>>>>>>
>>>>>>>>>> On Mon, 30 Mar 2020 at 15:36, Timo Walther<[email protected]>
>>>> wrote:
>>>>>>>>>>> Hi all,
>>>>>>>>>>>
>>>>>>>>>>> I would like to start the vote for FLIP-95 [1], which is
>> discussed
>>>>>>>> and
>>>>>>>>>>> reached a consensus in the discussion thread [2].
>>>>>>>>>>>
>>>>>>>>>>> The vote will be open until April 2nd (72h), unless there is an
>>>>>>>>>>> objection or not enough votes.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Timo
>>>>>>>>>>>
>>>>>>>>>>> [1]
>>>>>>>>>>>
>>>>>>>>>>>
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-95%3A+New+TableSource+and+TableSink+interfaces
>>>>>>>>>>> [2]
>>>>>>>>>>>
>>>>>>>>>>>
>> https://lists.apache.org/thread.html/r03cbce8996fd06c9b0406c9ddc0d271bd456f943f313b9261fa061f9%40%3Cdev.flink.apache.org%3E
>>>>>>>>> --
>>>>>>>>>
>>>>>>>>> Benchao Li
>>>>>>>>> School of Electronics Engineering and Computer Science, Peking
>>>> University
>>>>>>>>> Tel:+86-15650713730
>>>>>>>>> Email:[email protected];[email protected]
>>>>>>>>>
>>>>>>> --
>>>>>>> Best, Jingsong Lee
>>>>
>>

signature.asc
Description: OpenPGP digital signature

Re: [VOTE] FLIP-95: New TableSource and TableSink interfaces

Reply via email to