Re: [DISCUSS] What parts of the Python API should we focus on next ?

Bowen Li Thu, 19 Dec 2019 08:55:43 -0800

- integrate PyFlink with Jupyter notebook
   - Description: users should be able to run PyFlink seamlessly in Jupyter
   - Benefits: Jupyter is the industry-standard notebook for data
scientists. I’ve talked to a few companies in North America, and they think
Jupyter is the #1 way to empower internal data scientists with Flink



On Wed, Dec 18, 2019 at 19:05 jincheng sun <sunjincheng...@gmail.com> wrote:

> Also CC user-zh.
>
> Best,
> Jincheng
>
>
> jincheng sun <sunjincheng...@gmail.com> wrote on Thu, Dec 19, 2019 at 10:20 AM:
>
>> Hi folks,
>>
>> As release-1.10 is under feature freeze (the stateless Python UDF is
>> already supported), it is time for us to plan the features of PyFlink for
>> the next release.
>>
>> To make sure the features supported in PyFlink are the ones most in demand
>> in the community, we'd like to get more people involved, i.e., it would be
>> better if all of the devs and users join the discussion of which
>> features are more important and urgent.
>>
>> We have already listed some features from different aspects, which you can
>> find below; however, it is not the final plan. We appreciate any
>> suggestions from the community, whether on functionality or
>> performance improvements. It would be great to have the following
>> information if you want to suggest a feature:
>>
>> ---------
>> - Feature description: xxxx
>> - Benefits of the feature: xxxx
>> - Use cases (optional): xxxx
>> ----------
>>
>> ----Features in my mind----
>>
>> 1. Integration with the most popular Python libraries
>>     - fromPandas/toPandas API
>>        Description:
>>           Support converting between Table and pandas.DataFrame.
>>        Benefits:
>>           Users could switch between the Flink and Pandas APIs, for example,
>> do some analysis using Flink and then continue the analysis using the Pandas
>> API if the result data is small enough to fit into memory, and vice versa.
>>
>>     - Support Scalar Pandas UDF
>>        Description:
>>           Support scalar Pandas UDF in Python Table API & SQL. Both the
>> input and output of the UDF are pandas.Series.
>>        Benefits:
>>           1) Scalar Pandas UDF performs better than row-at-a-time UDF,
>> with speedups ranging from 3x to over 100x (numbers from PySpark)
>>           2) Users could use Pandas/Numpy API in the Python UDF
>> implementation if the input/output data type is pandas.Series
>>
>>     - Support Pandas UDAF in batch GroupBy aggregation
>>        Description:
>>           Support Pandas UDAF in batch GroupBy aggregation of Python
>> Table API & SQL. Both the input and output of the UDAF are pandas.DataFrame.
>>        Benefits:
>>           1) Pandas UDAF can be more than 10x faster than row-at-a-time
>> UDAF in certain scenarios
>>           2) Users could use Pandas/Numpy API in the Python UDAF
>> implementation if the input/output data type is pandas.DataFrame
>>
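[Inline illustration] A minimal sketch of the difference between row-at-a-time and Pandas-style functions described in item 1. It uses plain pandas only and makes no claim about the final PyFlink API; all function names here (add_one_row, add_one_series, mean_value) are hypothetical.

```python
import pandas as pd


def add_one_row(x: int) -> int:
    # Row-at-a-time style: the function is invoked once per input value.
    return x + 1


def add_one_series(s: pd.Series) -> pd.Series:
    # Scalar Pandas UDF style: pandas.Series in, pandas.Series out,
    # so the function is invoked once per batch of values.
    return s + 1


def mean_value(group: pd.DataFrame) -> pd.DataFrame:
    # Pandas UDAF style: pandas.DataFrame in (all rows of one group),
    # pandas.DataFrame out (a single aggregated row).
    return pd.DataFrame(
        {"key": [group["key"].iloc[0]], "mean_value": [group["value"].mean()]}
    )


batch = pd.Series([1, 2, 3])
row_result = [add_one_row(x) for x in batch]  # three calls
vec_result = add_one_series(batch)            # one call

table = pd.DataFrame({"key": ["a", "a", "b"], "value": [1.0, 3.0, 10.0]})
# Emulate a batch GroupBy aggregation by applying the function per group.
aggregated = pd.concat(
    [mean_value(g) for _, g in table.groupby("key")], ignore_index=True
)
```

Both variants compute the same values; the Pandas-style function simply amortizes the per-call overhead over a whole batch, which is where the claimed speedups would come from.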
>> 2. Fully support all kinds of Python UDF
>>     - Support Python UDAF (stateful) in GroupBy aggregation (NOTE: Please
>> give us some use cases if you want this feature to be included in the next
>> release)
>>       Description:
>>         Support UDAF in GroupBy aggregation.
>>       Benefits:
>>         Users could define and use Python UDAF and use it in GroupBy
>> aggregation. Without it, users have to use Java/Scala UDAF.
>>
>>     - Support Python UDTF
>>       Description:
>>         Support Python UDTF in Python Table API & SQL
>>       Benefits:
>>         Users could define and use Python UDTF in Python Table API & SQL.
>> Without it, users have to use Java/Scala UDTF.
>>
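[Inline illustration] A rough sketch of what the two function kinds in item 2 compute, written in plain Python without PyFlink. The method names create_accumulator / accumulate / get_value are an assumption that mirrors the Java AggregateFunction; WeightedAvg, split_words and group_by_aggregate are hypothetical helpers, not a proposed API.

```python
from typing import Dict, Iterable, Iterator, List, Optional, Tuple


class WeightedAvg:
    """Hypothetical stateful UDAF: weighted average kept in an accumulator."""

    def create_accumulator(self) -> List[float]:
        # [weighted sum, total weight]
        return [0.0, 0.0]

    def accumulate(self, acc: List[float], value: float, weight: float) -> None:
        acc[0] += value * weight
        acc[1] += weight

    def get_value(self, acc: List[float]) -> Optional[float]:
        return acc[0] / acc[1] if acc[1] else None


def split_words(line: str) -> Iterator[Tuple[str, int]]:
    # Hypothetical UDTF: one input row produces zero or more output rows.
    for word in line.split():
        yield word, len(word)


def group_by_aggregate(
    rows: Iterable[Tuple[str, float, float]], udaf: WeightedAvg
) -> Dict[str, Optional[float]]:
    # Emulates a GroupBy aggregation: one accumulator (state) per key.
    accs: Dict[str, List[float]] = {}
    for key, value, weight in rows:
        if key not in accs:
            accs[key] = udaf.create_accumulator()
        udaf.accumulate(accs[key], value, weight)
    return {key: udaf.get_value(acc) for key, acc in accs.items()}


rows = [("a", 10.0, 1.0), ("a", 20.0, 3.0), ("b", 5.0, 2.0)]
averages = group_by_aggregate(rows, WeightedAvg())
words = list(split_words("hello py flink"))
```

The accumulator is the state the runtime would have to manage per group, which is what distinguishes this from the stateless UDFs already supported.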
>> 3. Debugging and Monitoring of Python UDF
>>    - Support User-Defined Metrics
>>      Description:
>>        Allow users to define custom metrics and global job
>> parameters in Python UDFs.
>>      Benefits:
>>        UDFs need metrics to monitor business or technical
>> indicators; this is a common requirement for UDFs.
>>
>>    - Make the log level configurable
>>      Description:
>>        Allow users to configure the log level of Python UDF.
>>      Benefits:
>>        Users could configure different log levels when debugging and
>> deploying.
>>
>> 4. Enrich the Python execution environment
>>    - Docker Mode Support
>>      Description:
>>          Support running Python UDFs in Docker workers.
>>      Benefits:
>>          Support a variety of deployments to meet more users' requirements.
>>
>> 5. Expand the usage scope of Python UDF
>>    - Support using Python UDF via SQL client
>>      Description:
>>          Support registering and using Python UDF via SQL client
>>      Benefits:
>>          SQL client is a very important interface for SQL users. This
>> feature allows SQL users to use Python UDFs via SQL client.
>>
>>    - Integrate Python UDF with Notebooks
>>      Description:
>>          Notebooks such as Zeppelin (especially handling of Python dependencies)
>>
>>    - Support registering Python UDF into catalog
>>       Description:
>>           Support registering Python UDF into catalog
>>       Benefits:
>>           1) Catalog is the centralized place to manage metadata such as
>> tables, UDFs, etc. With it, users could register UDFs once and use them
>> anywhere.
>>           2) It's an important part of the SQL functionality. If Python
>> UDFs cannot be registered and used in a catalog, they
>> cannot be shared between jobs.
>>
>> 6. Performance Improvements of Python UDF
>>    - Cython improvements
>>       Description:
>>           Cython improvements in coder & operations
>>       Benefits:
>>           Initial tests show that Cython gives a 3x+ speedup in coder
>> serialization/deserialization.
>>
>> 7. Add Python ML API
>>    - Add Python ML Pipeline API
>>      Description:
>>          Align Python ML Pipeline API with Java/Scala
>>      Benefits:
>>        1) Currently, we already have the Pipeline APIs for ML. It would
>> be good to also have the related Python APIs.
>>        2) In many cases, algorithm engineers prefer the Python language.
>>
>>
>> BTW, PyFlink is a new component, and there is still a lot of work
>> to do. Thus, everybody is cordially welcome to contribute
>> to PyFlink, including asking questions, filing bug reports, proposing new
>> features, joining discussions, contributing code or documentation ...
>>
>> Hope to see your feedback!
>>
>> Best,
>> Jincheng
>>
>
