I understand pandas UDFs as follows:

1. There are multiple partitions per worker.
2. Each partition is converted into multiple Arrow batches.
3. These are sent to the Python process.
4. In the Series-to-Series case, is the pandas UDF applied to each Arrow batch one after the other? **(That is, is it (a) the vectorisation is at the Arrow-batch level, but each batch in turn is processed sequentially by the worker; or (b) the Arrow batches are combined after all have arrived, and then the pandas UDF is applied to the whole?)**

I think it is (b), i.e. the Arrow batches are combined. I have given my reasoning below.
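To make option (a) vs. (b) concrete, here is a minimal sketch of the Series-to-Series shape in plain pandas (no Spark session needed); the function name and values are illustrative, and the list of Series stands in for the Arrow batches Spark would hand to the Python worker:

```python
import pandas as pd

# A Series-to-Series function: in Spark this would be wrapped with
# @pandas_udf("double") and invoked by the worker on pandas Series.
def plus_one(s: pd.Series) -> pd.Series:
    return s + 1  # vectorised over everything it receives at once

# Interpretation (a): the UDF is called once per Arrow batch, sequentially.
batches = [pd.Series([1.0, 2.0]), pd.Series([3.0, 4.0])]
per_batch = [plus_one(b) for b in batches]

# Interpretation (b): the batches are concatenated first, one call total.
combined = plus_one(pd.concat(batches, ignore_index=True))
```

Either way the arithmetic inside the function is vectorised; the question is only whether it runs once per batch or once over the concatenation.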
Given this understanding and blackbishop's answer, I have the following further questions: *How exactly do the Iterator versions of pandas UDFs work?*

1. If there is some expensive initialization, why can we not do that in the Series-to-Series case as well? With Iterator-of-Series to Iterator-of-Series, the initialization is done once, shared across all the workers, and used for all the Arrow batches. Why can't the same process be followed for a Series-to-Series pandas UDF: initialize → share to workers → once all the Arrow batches are combined on a worker, apply?
2. I can see that we might want to overlap the I/O and the Python execution on the Arrow batches, so that while one batch is being read, the pandas UDF is running on the previous batch. Why is this not done in the Series-to-Series case? **This is why I think all the Arrow batches are combined before being run through the pandas UDF: otherwise, the same I/O-parallelization benefits would be available for a Series-to-Series pandas UDF as well.**

One more question:

1. Since the output is an Iterator of Series, where is the vectorisation then? Is the pandas UDF run on an entire Arrow batch and the result then emitted row by row? Or is the pandas UDF processing the Arrow batches row by row and then emitting the result (which loses vectorisation, as I see it)?
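For reference, this is the Iterator-of-Series shape my questions are about, again sketched in plain pandas rather than through a live Spark session; the `model_offset` setup is a hypothetical stand-in for the expensive initialization (e.g. loading a model), and the list of Series stands in for the stream of Arrow batches:

```python
from typing import Iterator

import pandas as pd

# Iterator-of-Series to Iterator-of-Series shape: in Spark this body
# would be wrapped with @pandas_udf("double") and run once per partition,
# so the setup below happens once, not once per Arrow batch.
def scored(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
    model_offset = 100.0  # expensive one-time initialization (illustrative)
    for batch in batches:
        # Each yielded Series is still a whole batch, computed with one
        # vectorised pandas operation -- not row by row.
        yield batch + model_offset

batches = [pd.Series([1.0, 2.0]), pd.Series([3.0])]
results = list(scored(iter(batches)))
```

Note that the output iterator yields one Series per input batch, which is where I would expect the vectorisation to live if interpretation (a) is correct.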