Re: [DISCUSS] FLIP-155: Introduce a few convenient operations in Table API

Dian Fu Mon, 04 Jan 2021 19:58:53 -0800

Thanks a lot for your comments!

Regarding Python Table API examples: I thought it would be straightforward 
to use these operations in the Python Table API, so I had not added any. 
However, the suggestion makes sense to me, and I have now added some examples 
of how to use them in the Python Table API to make this clearer.


Regarding dropDuplicates vs deduplicate: +1 to use deduplicate. It's more 
consistent with the feature/concept, which is already documented clearly in 
Flink.
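
To illustrate Seth's point quoted below (the operator issues updates and 
retractions rather than simply filtering records), here is a rough 
plain-Python model. It is not Flink code: the helper name 
`deduplicate_keep_last` is hypothetical, it assumes a "keep the last row per 
key" policy, and the "+I" / "-U" / "+U" tags are only labels for insert, 
retraction of the old row, and the updated row.

```python
from typing import Any, Dict, Hashable, Iterable, List, Tuple

Change = Tuple[str, Dict[str, Any]]


def deduplicate_keep_last(rows: Iterable[Dict[str, Any]], key: str) -> List[Change]:
    """Model of keep-last deduplication that emits a changelog, not a filtered list."""
    latest: Dict[Hashable, Dict[str, Any]] = {}
    changelog: List[Change] = []
    for row in rows:
        k = row[key]
        if k in latest:
            # A newer row for a known key: retract the old row, then emit the new one.
            changelog.append(("-U", latest[k]))
            changelog.append(("+U", row))
        else:
            # First row seen for this key: plain insert.
            changelog.append(("+I", row))
        latest[k] = row
    return changelog


events = [
    {"id": 1, "v": "a"},
    {"id": 2, "v": "b"},
    {"id": 1, "v": "c"},
]
changes = deduplicate_keep_last(events, "id")
for kind, row in changes:
    print(kind, row)
```

In this model no input row is silently dropped; the second row for `id` 1 
replaces the first one through a retraction followed by an update.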

Regarding `myTable.coalesce($("a"), 1).as("a")`: I'm still in favor of 
fillna for now. Compared to coalesce, fillna can handle multiple columns in 
one method call. As for the naming convention, the names 
"fillna/dropna/replace" come from Pandas [1][2][3].
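
The difference can be sketched in plain Python over rows represented as 
dicts. This is only a model of the semantics, not the proposed Table API: the 
helpers `coalesce` and `fillna` below are hypothetical stand-ins, and None 
plays the role of NULL.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def coalesce(*values: Any) -> Any:
    """Return the first argument that is not None (like SQL COALESCE)."""
    for value in values:
        if value is not None:
            return value
    return None


def fillna(rows: List[Row], defaults: Dict[str, Any]) -> List[Row]:
    """Replace None in every column listed in `defaults`, in a single call."""
    return [
        {
            col: coalesce(val, defaults[col]) if col in defaults else val
            for col, val in row.items()
        }
        for row in rows
    ]


rows = [
    {"a": None, "b": 2, "c": None},
    {"a": 1, "b": None, "c": "x"},
]

# Column-at-a-time style: one coalesce expression per column to be filled.
per_column = [
    {**r, "a": coalesce(r["a"], 0), "b": coalesce(r["b"], 0)} for r in rows
]

# fillna style: all target columns are handled by one call.
filled = fillna(rows, {"a": 0, "b": 0})
```

Both variants produce the same rows here; with hundreds of columns, the 
single `defaults` mapping stays one call, while the per-column style needs 
one expression for each column.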

Regarding `event-time/processing-time temporal join, SQL Hints, window TVF`: 
Good catch! We should definitely support them in the Table API. I will update 
the FLIP to cover these features.

[1] https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html
[2] https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html
[3] https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html
> On Jan 4, 2021, at 10:59 PM, Timo Walther <[email protected]> wrote:
> 
> Hi Dian,
> 
> thanks for the proposed FLIP. I haven't taken a deep look at the proposal yet 
> but will do so shortly. In general, we should aim to make the Table API as 
> concise and self-explaining as possible. E.g. `dropna` does not sound obvious 
> to me.
> 
> Regarding `myTable.coalesce($("a"), 1).as("a")`: Instead of introducing more 
> top-level functions, maybe we should also consider introducing more building 
> blocks e.g. for applying an expression to every column. A more functional 
> approach (e.g. with lambda functions) could solve more use cases.
> 
> Regards,
> Timo
> 
> On 04.01.21 15:35, Seth Wiesman wrote:
>> This makes sense, I have some questions about method names.
>> What do you think about renaming `dropDuplicates` to `deduplicate`? I don't
>> think that drop is the right word to use for this operation; it implies
>> records are filtered, whereas this operator actually issues updates and
>> retractions. Also, deduplicate is already how we talk about this feature in
>> the docs so I think it would be easier for users to find.
>> For null handling, I don't know how close we want to stick with SQL
>> conventions but what about making `coalesce` a top-level method? Something
>> like:
>> myTable.coalesce($("a"), 1).as("a")
>> We can require the next method to be an `as`. There is already precedent
>> for this sort of thing, `GroupedTable#aggregate` can only be followed by
>> `select`.
>> Seth
>> On Mon, Jan 4, 2021 at 6:27 AM Wei Zhong <[email protected]> wrote:
>>> Hi Dian,
>>> 
>>> Big +1 for making the Table API easier to use. Java users and Python users
>>> can both benefit from it. I think it would be better if we add some Python
>>> API examples.
>>> 
>>> Best,
>>> Wei
>>> 
>>> 
>>>> On Jan 4, 2021, at 20:03, Dian Fu <[email protected]> wrote:
>>>> 
>>>> Hi all,
>>>> 
>>>> I'd like to start a discussion about introducing a few convenient
>>> operations in Table API from the perspective of ease of use.
>>>> 
>>>> Currently, some tasks are not easy to express in the Table API, e.g.
>>> deduplication and top-n, or become hard to express when a table has
>>> hundreds of columns, e.g. null data handling.
>>>> 
>>>> I'd like to propose introducing a few operations in the Table API with
>>> the following purposes:
>>>> - Let Table API users easily leverage the powerful features already
>>> in SQL, e.g. deduplication, top-n, etc.
>>>> - Provide some convenient operations, e.g. introducing a series of
>>> operations for null data handling (it may become a problem when there are
>>> hundreds of columns), data sampling and splitting (which is a very common
>>> use case in ML which usually needs to split a table into multiple tables
>>> for training and validation separately).
>>>> 
>>>> Please refer to FLIP-155 [1] for more details.
>>>> 
>>>> Looking forward to your feedback!
>>>> 
>>>> Regards,
>>>> Dian
>>>> 
>>>> [1]
>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-155%3A+Introduce+a+few+convenient+operations+in+Table+API
>>> 
>>> 
>
