[ https://issues.apache.org/jira/browse/FLINK-16114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dian Fu closed FLINK-16114. --------------------------- Resolution: Resolved > Support Scalar Vectorized Python UDF in PyFlink > ----------------------------------------------- > > Key: FLINK-16114 > URL: https://issues.apache.org/jira/browse/FLINK-16114 > Project: Flink > Issue Type: New Feature > Components: API / Python > Reporter: Dian Fu > Assignee: Dian Fu > Priority: Major > Fix For: 1.11.0 > > > Scalar Python UDF has already been supported in Flink 1.10 > ([FLIP-58|https://cwiki.apache.org/confluence/display/FLINK/FLIP-58%3A+Flink+Python+User-Defined+Stateless+Function+for+Table]) > and it operates one row at a time. It works in the way that the Java > operator serializes one input row to bytes and sends them to the Python > worker; the Python worker deserializes the input row and evaluates the Python > UDF with it; the result row is serialized and sent back to the Java operator. > It suffers from the following problems: > # High serialization/deserialization overhead > # It’s difficult to leverage the popular Python libraries used by data > scientists, such as Pandas, Numpy, etc which provide high performance data > structure and functions. > We want to introduce vectorized Python UDF to address this problem. For > vectorized Python UDF, a batch of rows are transferred between JVM and Python > VM in columnar format. The batch of rows will be converted to a collection of > Pandas.Series and given to the vectorized Python UDF which could then > leverage the popular Python libraries such as Pandas, Numpy, etc for the > Python UDF implementation. > More details could be found in > [FLIP-97.|https://cwiki.apache.org/confluence/display/FLINK/FLIP-97%3A+Support+Scalar+Vectorized+Python+UDF+in+PyFlink] -- This message was sent by Atlassian Jira (v8.3.4#803005)