- Integrate PyFlink with Jupyter notebook
  - Description: users should be able to run PyFlink seamlessly in Jupyter.
  - Benefits: Jupyter is the industry-standard notebook for data scientists. I've talked to a few companies in North America, and they think Jupyter is the #1 way to empower their internal data scientists with Flink.

On Wed, Dec 18, 2019 at 19:05 jincheng sun <sunjincheng...@gmail.com> wrote:

> Also CC user-zh.
>
> Best,
> Jincheng
>
>
> jincheng sun <sunjincheng...@gmail.com> wrote on Thu, Dec 19, 2019 at 10:20 AM:
>
>> Hi folks,
>>
>> As release-1.10 is under feature freeze (stateless Python UDFs are
>> already supported), it is time for us to plan the PyFlink features for
>> the next release.
>>
>> To make sure the features supported in PyFlink are the ones most in
>> demand in the community, we'd like to get more people involved, i.e., it
>> would be better if all devs and users joined the discussion of which
>> features are most important and urgent.
>>
>> We have already listed some features from different aspects, which you can
>> find below; however, it is not the final plan. We appreciate any
>> suggestions from the community, whether on functionality or
>> performance improvements, etc. It would be great to have the following
>> information if you want to suggest a feature:
>>
>> ---------
>> - Feature description: xxxx
>> - Benefits of the feature: xxxx
>> - Use cases (optional): xxxx
>> ----------
>>
>> ----Features in my mind----
>>
>> 1. Integration with the most popular Python libraries
>>   - fromPandas/toPandas API
>>     Description:
>>       Support converting between Table and pandas.DataFrame.
>>     Benefits:
>>       Users could switch between the Flink and Pandas APIs, for example,
>>       do some analysis using Flink and then continue the analysis with the
>>       Pandas API if the result data is small enough to fit into memory,
>>       and vice versa.
>>
>>   - Support Scalar Pandas UDF
>>     Description:
>>       Support scalar Pandas UDFs in the Python Table API & SQL. Both the
>>       input and output of the UDF are pandas.Series.
>>     Benefits:
>>       1) Scalar Pandas UDFs perform better than row-at-a-time UDFs,
>>          by 3x to over 100x (numbers from PySpark).
>>       2) Users could use the Pandas/NumPy APIs in the Python UDF
>>          implementation if the input/output data type is pandas.Series.
>>
>>   - Support Pandas UDAF in batch GroupBy aggregation
>>     Description:
>>       Support Pandas UDAFs in batch GroupBy aggregation of the Python
>>       Table API & SQL. Both the input and output of the UDF are
>>       pandas.DataFrame.
>>     Benefits:
>>       1) Pandas UDAFs perform more than 10x better than row-at-a-time
>>          UDAFs in certain scenarios.
>>       2) Users could use the Pandas/NumPy APIs in the Python UDAF
>>          implementation if the input/output data type is pandas.DataFrame.
>>
>> 2. Fully support all kinds of Python UDF
>>   - Support Python UDAF (stateful) in GroupBy aggregation (NOTE: Please
>>     give us some use cases if you want this feature to be included in the
>>     next release)
>>     Description:
>>       Support UDAFs in GroupBy aggregation.
>>     Benefits:
>>       Users could define Python UDAFs and use them in GroupBy
>>       aggregation. Without this, users have to use Java/Scala UDAFs.
>>
>>   - Support Python UDTF
>>     Description:
>>       Support Python UDTFs in the Python Table API & SQL.
>>     Benefits:
>>       Users could define and use Python UDTFs in the Python Table API & SQL.
>>       Without this, users have to use Java/Scala UDTFs.
>>
>> 3. Debugging and Monitoring of Python UDF
>>   - Support User-Defined Metrics
>>     Description:
>>       Allow users to define user-defined metrics and global job
>>       parameters with Python UDFs.
>>     Benefits:
>>       UDFs need metrics to monitor business or technical
>>       indicators; this is a common requirement for UDFs.
>>
>>   - Make the log level configurable
>>     Description:
>>       Allow users to configure the log level of Python UDFs.
>>     Benefits:
>>       Users could configure different log levels for debugging and
>>       for deployment.
>>
>> 4. Enrich the Python execution environment
>>   - Docker Mode Support
>>     Description:
>>       Support running Python UDFs in Docker workers.
>>     Benefits:
>>       Support a variety of deployments to meet more users' requirements.
>>
>> 5. Expand the usage scope of Python UDF
>>   - Support using Python UDFs via the SQL client
>>     Description:
>>       Support registering and using Python UDFs via the SQL client.
>>     Benefits:
>>       The SQL client is a very important interface for SQL users. This
>>       feature allows SQL users to use Python UDFs via the SQL client.
>>
>>   - Integrate Python UDF with Notebooks
>>     Description:
>>       Such as Zeppelin, etc. (especially Python dependencies)
>>
>>   - Support registering Python UDFs into the catalog
>>     Description:
>>       Support registering Python UDFs into the catalog.
>>     Benefits:
>>       1) The catalog is the centralized place to manage metadata such as
>>          tables, UDFs, etc. With it, users could register UDFs once and
>>          use them anywhere.
>>       2) It's an important part of the SQL functionality. If Python
>>          UDFs cannot be registered and used in the catalog, they
>>          cannot be shared between jobs.
>>
>> 6. Performance Improvements of Python UDF
>>   - Cython improvements
>>     Description:
>>       Cython improvements in coders & operations.
>>     Benefits:
>>       Initial tests show that Cython gives a 3x+ speedup in coder
>>       serialization/deserialization.
>>
>> 7. Add Python ML API
>>   - Add Python ML Pipeline API
>>     Description:
>>       Align the Python ML Pipeline API with Java/Scala.
>>     Benefits:
>>       1) We already have the Pipeline APIs for ML. It would
>>          be good to also have the corresponding Python APIs.
>>       2) In many cases, algorithm engineers prefer the Python language.
>>
>>
>> BTW, PyFlink is a new component, and there is still a lot of work
>> to do. Thus, everybody is cordially welcome to contribute
>> to PyFlink, including asking questions, filing bug reports, proposing new
>> features, joining discussions, contributing code or documentation ...
>>
>> Hope to see your feedback!
>>
>> Best,
>> Jincheng
>>
>
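
For readers less familiar with the UDF kinds named in item 2 of the quoted list (scalar UDF, UDTF, stateful UDAF), here is a minimal pure-Python sketch of how the three shapes differ. It is only an illustration under stated assumptions: none of the names below (add_one, split_words, MeanAgg, create_accumulator, accumulate, get_value, group_by_aggregate) are PyFlink APIs. They are hypothetical and are modeled loosely on the accumulator style of the existing Java/Scala aggregate functions.

```python
from typing import Dict, Iterable, Iterator, List, Optional, Tuple


def add_one(x: int) -> int:
    """Scalar UDF shape: one input row produces exactly one value."""
    return x + 1


def split_words(line: str) -> Iterator[str]:
    """UDTF shape: one input row produces zero or more output rows."""
    for word in line.split():
        yield word


class MeanAgg:
    """UDAF shape: many input rows per group produce one value.

    The state lives in an explicit accumulator, which is what makes
    the function "stateful". Method names are hypothetical.
    """

    def create_accumulator(self) -> List[float]:
        # [running sum, row count]
        return [0.0, 0]

    def accumulate(self, acc: List[float], value: float) -> None:
        acc[0] += value
        acc[1] += 1

    def get_value(self, acc: List[float]) -> Optional[float]:
        return acc[0] / acc[1] if acc[1] else None


def group_by_aggregate(
    rows: Iterable[Tuple[str, float]], agg: MeanAgg
) -> Dict[str, Optional[float]]:
    """Toy stand-in for a GroupBy: keep one accumulator per key."""
    accs: Dict[str, List[float]] = {}
    for key, value in rows:
        if key not in accs:
            accs[key] = agg.create_accumulator()
        agg.accumulate(accs[key], value)
    return {key: agg.get_value(acc) for key, acc in accs.items()}


rows = [("a", 1.0), ("b", 4.0), ("a", 3.0)]
result = group_by_aggregate(rows, MeanAgg())
words = list(split_words("hello py flink"))
```

The accumulator is kept separate from the function object so that a runtime could, in principle, store it in managed state per key; that is an assumption about the eventual design, not something stated in the thread.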