It would be possible to use Arrow for regular Python UDFs and avoid Pandas,
and there would probably be some performance improvement. The difficult
part will be ensuring that the data remains consistent in the conversions
between Arrow and Python; timestamps, for example, are a bit tricky.
I was thinking of implementing that, but I quickly realized that the
Spark -> Pandas -> Python conversion causes errors.
A quick example is None in numeric data types:
Pandas supports only NaN, while Spark supports both NULL and NaN.
This is just one of the issues I ran into.
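A minimal sketch of that mismatch, using plain pandas (no Spark session; the example data is hypothetical): a None placed in a numeric column is coerced to NaN, so the NULL-vs-NaN distinction Spark maintains is lost on the pandas side.

```python
import pandas as pd

# Spark can represent both NULL and NaN in a numeric column.
# Once the data lands in a pandas float Series, None is coerced
# to NaN, so NULL and NaN become indistinguishable.
s = pd.Series([1.0, None, float("nan")])
print(s.isna().tolist())  # -> [False, True, True]
```

Round-tripping such a Series back to Spark cannot tell which missing value was originally a NULL and which was a NaN.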
Regular Python UDFs don't use PyArrow under the hood.
Yes, they could potentially benefit, but this can easily be worked around
with Pandas UDFs.
For instance, the two below are virtually identical:
@udf(...)
def func(col):
    return col

@pandas_udf(...)
def pandas_func(col):
    return col
Hi,
In Spark 2.3+ I saw that PyArrow is being used in a bunch of places in
Spark, and I was trying to understand the benefit it provides in terms of
serialization/deserialization.
I understand that the new pandas_udf works only if PyArrow is installed,
but what about the plain old Python UDF?