Re: Usage of PyArrow in Spark

2019-07-18 Thread Bryan Cutler
It would be possible to use Arrow with regular Python UDFs and avoid pandas, and there would probably be some performance improvement. The difficult part will be ensuring that the data remains consistent in the conversions between Arrow and Python; e.g., timestamps are a bit tricky. Given that we
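
To make the timestamp concern concrete, here is a small illustrative sketch (values are my own, not from the thread) of two fidelity gaps that show up when data crosses between Arrow, pandas, and plain Python:

    import datetime

    import pandas as pd
    import pyarrow as pa

    # A naive Python datetime round-trips through Arrow intact, but the
    # Arrow type carries no timezone, so which instant it denotes depends
    # on whatever session timezone the engine assumes.
    ts = datetime.datetime(2019, 7, 18, 12, 0, 0)
    arr = pa.array([ts], type=pa.timestamp("us"))
    print(arr.to_pylist())  # [datetime.datetime(2019, 7, 18, 12, 0)]

    # In the other direction, pandas timestamps carry nanoseconds that a
    # plain Python datetime cannot hold; pandas warns and truncates.
    pts = pd.Timestamp("2019-07-18 12:00:00.123456789")
    print(pts.to_pydatetime())  # ...12:00:00.123456, nanoseconds dropped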

Re: Usage of PyArrow in Spark

2019-07-18 Thread Abdeali Kothari
I was thinking of implementing that, but quickly realized that doing a conversion of Spark -> Pandas -> Python causes errors. A quick example is "None" in numeric data types: Pandas supports only NaN, while Spark supports both NULL and NaN. This is just one of the issues I ran into. I'm not sure about
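
A minimal sketch of the NULL-vs-NaN collapse described above (assumes a running SparkSession; the column name and values are made up):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # A numeric column containing a real NULL and a real NaN -- Spark
    # keeps them distinct.
    df = spark.createDataFrame([(1.0,), (None,), (float("nan"),)], ["x"])
    df.show()  # rows: 1.0, null, NaN

    # pandas has no NULL for float columns, so after conversion both the
    # NULL row and the NaN row come back as NaN; the distinction is gone.
    pdf = df.toPandas()
    print(pdf["x"])  # 1.0, NaN, NaN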

Re: Usage of PyArrow in Spark

2019-07-17 Thread Hyukjin Kwon
Regular Python UDFs don't use PyArrow under the hood. Yes, they could potentially benefit, but the gap can easily be worked around via Pandas UDFs. For instance, both below are virtually identical:

    @udf(...)
    def func(col):
        return col

    @pandas_udf(...)
    def pandas_func(col):
        return
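
A complete, runnable version of that pair might look like the following (the identity body and the double return type are illustrative choices of mine, not from the message):

    from pyspark.sql.functions import pandas_udf, udf
    from pyspark.sql.types import DoubleType

    # Plain Python UDF: invoked once per row; values cross the
    # JVM/Python boundary through the pickle-based serializer.
    @udf(DoubleType())
    def func(col):
        return col

    # Pandas UDF: invoked once per Arrow record batch with a
    # pandas.Series; values cross the boundary in columnar form.
    @pandas_udf(DoubleType())
    def pandas_func(col):
        return col

    # Both produce the same result column:
    # df.select(func("v"), pandas_func("v"))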

Usage of PyArrow in Spark

2019-07-16 Thread Abdeali Kothari
Hi, in Spark 2.3+ I saw that PyArrow is being used in a bunch of places in Spark, and I was trying to understand the benefit it provides in terms of serialization / deserialization. I understand that the new pandas_udf works only if PyArrow is installed. But what about the plain old PythonUDF
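
For context, one concrete place the PyArrow dependency shows up in 2.3+ is toPandas(), which can be switched between the default pickle path and the Arrow path with a session config. A hedged sketch (flag name as it exists in Spark 2.3/2.4; the data is synthetic and timings will vary):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import rand

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(1000 * 1000).withColumn("v", rand())

    # Default path: rows are serialized one by one via pickle.
    spark.conf.set("spark.sql.execution.arrow.enabled", "false")
    pdf_pickle = df.toPandas()

    # Arrow path: whole columnar record batches cross the boundary,
    # which is typically much cheaper to serialize and deserialize.
    # Requires pyarrow to be installed on the driver.
    spark.conf.set("spark.sql.execution.arrow.enabled", "true")
    pdf_arrow = df.toPandas()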