This question is cross-posted on Stack Overflow:
https://stackoverflow.com/questions/67122265/pyflink-udf-when-to-use-vectorized-vs-scalar

Is there a simple set of rules to follow when deciding between vectorized
and scalar PyFlink UDFs?

According to the [docs](
https://ci.apache.org/projects/flink/flink-docs-stable/dev/python/table-api-users-guide/udfs/vectorized_python_udfs.html),
vectorized UDFs have two advantages: (1) smaller serialization/deserialization
and invocation overhead, and (2) vector calculations are highly optimized
thanks to libraries such as Numpy.

> Vectorized Python user-defined functions are functions which are executed
by transferring a batch of elements between JVM and Python VM in Arrow
columnar format. The performance of vectorized Python user-defined
functions are usually much higher than non-vectorized Python user-defined
functions as the serialization/deserialization overhead and invocation
overhead are much reduced. Besides, users could leverage the popular Python
libraries such as Pandas, Numpy, etc for the vectorized Python user-defined
functions implementation. These Python libraries are highly optimized and
provide high-performance data structures and functions.
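
For concreteness, here is a minimal sketch of the same addition written both
ways (assuming Flink 1.12+ decorator syntax, where `func_type="pandas"` marks
a UDF as vectorized; the function names are mine):

```python
from pyflink.table import DataTypes
from pyflink.table.udf import udf

# Scalar UDF: invoked once per row with plain Python values.
@udf(result_type=DataTypes.BIGINT())
def add_scalar(i, j):
    return i + j

# Vectorized UDF: invoked once per batch; i and j arrive as
# pandas.Series transferred in Arrow columnar format, so the
# addition runs element-wise over the whole batch at once.
@udf(result_type=DataTypes.BIGINT(), func_type="pandas")
def add_vectorized(i, j):
    return i + j
```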

**QUESTION 1**: Is a vectorized UDF ALWAYS preferred?

Say, in my use case, I simply want to extract some fields from a JSON
column, which is not yet supported by Flink's [built-in functions](
https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/functions/systemFunctions.html),
so I need to define my own UDF, along these lines (assuming a STRING result
for illustration):

```python
import json

from pyflink.table import DataTypes
from pyflink.table.udf import udf

# Result type assumed to be STRING for illustration.
@udf(result_type=DataTypes.STRING())
def extract_field_from_json(json_value, field_name):
    return json.loads(json_value)[field_name]
```
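
For comparison, I imagine the vectorized flavor of the same extraction would
look roughly like this (a sketch on my part; the inputs arrive as
pandas.Series, but the JSON parsing is still per-element Python):

```python
import json

from pyflink.table import DataTypes
from pyflink.table.udf import udf

@udf(result_type=DataTypes.STRING(), func_type="pandas")
def extract_field_from_json_vectorized(json_values, field_names):
    # Both arguments are pandas.Series of equal length; combine
    # applies the lambda element-wise across the two series.
    return json_values.combine(
        field_names, lambda v, f: json.loads(v)[f])
```

My intuition is that the batched Arrow transfer would still reduce ser-de
overhead, but `json.loads` itself gains nothing from Pandas/Numpy, hence
the question below.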

**QUESTION 2**: Will I also benefit from a vectorized UDF in this case?

Best,
Yik San
