That's about right, but the iterator UDF is executed per partition, not
per worker.
Series-to-series is just simpler for the cases where init does not matter.
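For comparison, a minimal sketch of what a series-to-series UDF body looks like (the function name `plus_one` and the return type are made up for illustration; in Spark 3.x you would wrap this with the `pandas_udf` decorator):

```python
import pandas as pd

# Series-to-series form: the function sees one pandas Series per Arrow
# batch, so there is nowhere to hoist one-time setup out of the
# per-batch path -- any init here would rerun for every batch.
def plus_one(s: pd.Series) -> pd.Series:
    return s + 1
```

With Spark this would be registered as `pandas_udf(plus_one, "long")` and applied per Arrow batch.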
On Tue, Jan 4, 2022, 12:25 PM Nitin Siwach wrote:
> I think I have an understanding now.
> 1. In iterator to iterator the pandasUDF is called
Yes, the UDF gets an iterator of pandas DataFrames, so your UDF will
process them one at a time.
The idea is to perform any expensive init once per partition, before the
many DataFrames in it are processed, rather than once per DataFrame.
The Arrow conversion is the same in either case. The benefit comes from
amortizing the expensive init across the whole partition.
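The per-partition init pattern described above can be sketched without Spark: the UDF body is just a generator over batches. Here `expensive_init` and `scale_batches` are hypothetical names, and the iterator form itself requires Spark 3.0+, where this body would be decorated with `@pandas_udf("double")`:

```python
from typing import Iterator

import pandas as pd

# Hypothetical expensive one-time setup, e.g. loading a model or
# opening a connection; it stands in for whatever the UDF needs.
def expensive_init():
    return {"factor": 2.0}

# Iterator-to-iterator UDF body: init runs once per partition, then
# each incoming Arrow batch (a pandas Series) is processed in turn.
def scale_batches(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
    state = expensive_init()      # once, before any batch is seen
    for batch in batches:         # one pandas Series per Arrow batch
        yield batch * state["factor"]
```

Because `scale_batches` is a generator, `expensive_init()` runs exactly once no matter how many batches the partition yields, which is the amortization discussed above.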
Hi Team,
I am using Spark 2.2, so can I use Kafka version 2.5 in my Spark Streaming
application?
Thanks & Regards,
Renu Yadav
I understand pandasUDF as follows:
1. There are multiple partitions per worker
2. Multiple Arrow batches are converted per partition
3. These are sent to the Python process
4. In the case of Series to Series, the pandasUDF is applied to each Arrow
batch one after the other? **(So, is it that (a) - The