Re: [DISCUSS] FLIP-120: Support conversion between PyFlink Table and Pandas DataFrame

Dian Fu Mon, 06 Apr 2020 19:51:26 -0700

Thanks you all for the discussion. It seems that we have reached consensus on 
the design. I will start a VOTE thread if there are no other feedbacks.


Regards,
Dian

> 在 2020年4月3日，下午2:58，Wei Zhong <weizhong0...@gmail.com> 写道：
> 
> Hi Dian,
> 
> Thanks for driving this. Big +1 for supporting from/to pandas in PyFlink!
> 
> Best,
> Wei
> 
>> 在 2020年4月3日，13:46，jincheng sun <sunjincheng...@gmail.com> 写道：
>> 
>> +1, Thanks for bring up this discussion @Dian Fu <dian0511...@gmail.com>
>> 
>> Best,
>> Jincheng
>> 
>> 
>> Jeff Zhang <zjf...@gmail.com> 于2020年4月1日周三 下午1:27写道：
>> 
>>> Thanks for the reply, Dian, that make sense to me.
>>> 
>>> Dian Fu <dian0511...@gmail.com> 于2020年4月1日周三 上午11:53写道：
>>> 
>>>> Hi Jeff,
>>>> 
>>>> Thanks for your feedback.
>>>> 
>>>> ArrowTableSink is a Flink sink which is responsible for collecting the
>>>> data of the table. It will serialize the data of the table to Arrow
>>> format
>>>> to make sure that it could be deserialized to pandas dataframe
>>> efficiently.
>>>> You are right that pandas dataframe is constructed at the client side and
>>>> so there needs a way to transfer the table data from the ArrowTableSink
>>> to
>>>> the client. It shares the same design as Table.collect on how to transfer
>>>> the data to the client. This is still under lively discussion in
>>>> FLINK-14807. I think we can discuss it there on this aspect and so it's
>>> not
>>>> touched in this design(already mentioned in the design doc). Then we can
>>>> focus on table/dataframe conversion in this design. Does that make sense
>>> to
>>>> you?
>>>> 
>>>> Thanks,
>>>> Dian
>>>> 
>>>> [1] https://issues.apache.org/jira/browse/FLINK-14807 <
>>>> https://issues.apache.org/jira/browse/FLINK-14807>
>>>>> 在 2020年4月1日，上午11:14，Jeff Zhang <zjf...@gmail.com> 写道：
>>>>> 
>>>>> Thanks Dian for driving this, definitely +1
>>>>> 
>>>>> Here's my 2 cents:
>>>>> 
>>>>> 1. I would pay more attention on to_pandas than from_pandas.  Because
>>>>> to_pandas will be used more frequently I believe
>>>>> 2. I think ArrowTableSink may not be enough for to_pandas, because
>>> pandas
>>>>> dataframe is on client side, it is not a table sink. We still need to
>>>>> convert ArrowTableSink to pandas dataframe if I understand correctly.
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> Dian Fu <dian0511...@gmail.com> 于2020年4月1日周三 上午10:49写道：
>>>>> 
>>>>>> Hi everyone,
>>>>>> 
>>>>>> I'd like to start a discussion about supporting conversion between
>>>> PyFlink
>>>>>> Table and Pandas DataFrame.
>>>>>> 
>>>>>> Pandas dataframe is the de-facto standard to work with tabular data in
>>>>>> Python community. PyFlink table is Flink’s representation of the
>>> tabular
>>>>>> data in Python language. It would be nice to provide the functionality
>>>> to
>>>>>> convert between the PyFlink table and Pandas dataframe in PyFlink
>>> Table
>>>>>> API. It provides users the ability to switch between PyFlink and
>>> Pandas
>>>>>> seamlessly when processing data in Python language without an extra
>>>>>> intermediate connectors.
>>>>>> 
>>>>>> Jincheng Sun and I have discussed offline and have drafted the
>>>>>> FLIP-120[1]. Looking forward to your feedback!
>>>>>> 
>>>>>> Regards,
>>>>>> Dian
>>>>>> 
>>>>>> [1]
>>>>>> 
>>>> 
>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-120%3A+Support+conversion+between+PyFlink+Table+and+Pandas+DataFrame
>>>>> 
>>>>> 
>>>>> 
>>>>> --
>>>>> Best Regards
>>>>> 
>>>>> Jeff Zhang
>>>> 
>>>> 
>>> 
>>> --
>>> Best Regards
>>> 
>>> Jeff Zhang
>>> 
>

Re: [DISCUSS] FLIP-120: Support conversion between PyFlink Table and Pandas DataFrame

Reply via email to