Re: understanding iterator of series to iterator of series pandasUDF

2022-01-04 Thread Sean Owen
That's about right, but the iterator UDF is executed per partition, not per worker. Series-to-Series is just simpler for cases where init does not matter. On Tue, Jan 4, 2022, 12:25 PM Nitin Siwach wrote: > I think I have an understanding now. > 1. In iterator to iterator the pandasUDF is called
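
A minimal sketch of the two shapes being compared, assuming Spark 3.x with pyarrow available; the "+ 1" logic is just a placeholder:

    from typing import Iterator
    import pandas as pd
    from pyspark.sql.functions import pandas_udf

    # Series to Series: Spark invokes the UDF once per Arrow batch; no natural place for shared setup
    @pandas_udf("long")
    def plus_one(v: pd.Series) -> pd.Series:
        return v + 1

    # Iterator of Series to iterator of Series: Spark invokes the UDF once per partition and
    # feeds that partition's Arrow batches through the iterator
    @pandas_udf("long")
    def plus_one_iter(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
        for v in batches:
            yield v + 1

Both produce the same results; the iterator form only pays off when there is per-partition setup to amortize.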

Re: understanding iterator of series to iterator of series pandasUDF

2022-01-04 Thread Sean Owen
Yes, the UDF gets an iterator of pandas DataFrames, so your UDF will process them one at a time. The idea is to perform any expensive init once per partition, before many DataFrames are processed, rather than before each DataFrame. The Arrow conversion is the same in either case. The benefit
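
A rough sketch of that pattern, assuming Spark 3.x; the sleep and the constant stand in for whatever the expensive init really is (loading a model, opening a connection, etc.):

    import time
    from typing import Iterator
    import pandas as pd
    from pyspark.sql.functions import pandas_udf

    @pandas_udf("long")
    def add_state(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
        # Stand-in for expensive init; runs once per partition, not once per Arrow batch
        time.sleep(5)
        state = 100
        for batch in batches:
            # Each item from the iterator is one Arrow batch, already converted to a pandas Series
            yield batch + state

    # Hypothetical usage: with 4 partitions, the init cost is paid 4 times in total
    # spark.range(1_000_000).repartition(4).select(add_state("id")).show()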

Query regarding kafka version

2022-01-04 Thread Renu Yadav
Hi Team, I am using Spark 2.2, so can I use Kafka version 2.5 in my Spark Streaming application? Thanks & Regards, Renu Yadav

understanding iterator of series to iterator of series pandasUDF

2022-01-04 Thread Nitin Siwach
I understand pandasUDF as follows: 1. There are multiple partitions per worker. 2. Multiple Arrow batches are converted per partition. 3. These are sent to the Python process. 4. In the case of Series to Series, the pandasUDF is applied to each Arrow batch one after the other? **(So, is it that (a) - The
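
One way to observe point 4, assuming a live SparkSession bound to the name spark (spark.sql.execution.arrow.maxRecordsPerBatch defaults to 10000 and controls the Arrow batch size):

    import pandas as pd
    from pyspark.sql.functions import pandas_udf

    # Shrink the Arrow batch size so a single partition is split into several batches
    spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", "1000")

    @pandas_udf("long")
    def batch_len(v: pd.Series) -> pd.Series:
        # A Series-to-Series pandas UDF is invoked once per Arrow batch; v holds that batch's rows
        return pd.Series([len(v)] * len(v))

    # With 10,000 rows squeezed into one partition, each invocation sees at most 1,000 rows
    # spark.range(10_000).coalesce(1).select(batch_len("id")).distinct().show()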