This question is cross-posted on Stack Overflow: https://stackoverflow.com/questions/67122265/pyflink-udf-when-to-use-vectorized-vs-scalar
Is there a simple set of rules to follow when deciding between a vectorized and a scalar PyFlink UDF? According to the [docs](https://ci.apache.org/projects/flink/flink-docs-stable/dev/python/table-api-users-guide/udfs/vectorized_python_udfs.html), vectorized UDFs have two advantages: (1) lower serialization/deserialization and invocation overhead, and (2) vector calculations are highly optimized thanks to libraries such as NumPy.

> Vectorized Python user-defined functions are functions which are executed by transferring a batch of elements between JVM and Python VM in Arrow columnar format. The performance of vectorized Python user-defined functions are usually much higher than non-vectorized Python user-defined functions as the serialization/deserialization overhead and invocation overhead are much reduced. Besides, users could leverage the popular Python libraries such as Pandas, Numpy, etc for the vectorized Python user-defined functions implementation. These Python libraries are highly optimized and provide high-performance data structures and functions.

**QUESTION 1**: Is a vectorized UDF ALWAYS preferred?

Let's say, in my use case, I want to simply extract some fields from a JSON column, which is not yet supported by Flink's [built-in functions](https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/functions/systemFunctions.html), so I need to define my own UDF, like:

```python
import json

from pyflink.table import DataTypes
from pyflink.table.udf import udf

@udf(result_type=DataTypes.STRING())
def extract_field_from_json(json_value, field_name):
    return json.loads(json_value)[field_name]
```

**QUESTION 2**: Will I also benefit from a vectorized UDF in this case?

Best,
Yik San
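P.S. For reference, below is the vectorized (Pandas) version I would write for the same extraction, based on my reading of the docs. This is only a minimal sketch, assuming Flink 1.12+, where a vectorized UDF is declared with `func_type="pandas"`; the `result_type` here is my own choice:

```python
import json

import pandas as pd
from pyflink.table import DataTypes
from pyflink.table.udf import udf

# Vectorized: both arguments arrive as pandas.Series spanning a whole
# Arrow batch, and a pandas.Series of the same length must be returned.
@udf(result_type=DataTypes.STRING(), func_type="pandas")
def extract_field_from_json(json_values, field_names):
    # json.loads still runs once per element; only the ser-de and
    # invocation overhead is amortized over the batch.
    return pd.Series(
        [json.loads(v)[f] for v, f in zip(json_values, field_names)]
    )
```

If I understand correctly, `json.loads` still executes once per element inside the batch, so I am unsure how much of the vectorized speedup would actually apply here, hence QUESTION 2.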