Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
bodywiki.com/Mich_Talebzadeh
>
> On Sun, 7 May 2023
> again in the second run. You can
> also confirm it in other metrics from Spark UI.
>
> That is my personal understanding based on what I have read and seen in my
> job runs. If there is any mistake, feel free to correct me.
>
> Thank You & Best Regards
> Winston Lai
> --
PySpark version: 3.1.3
*Question 1:* What is DataFilters in the Spark physical plan? How is it
different from PushedFilters?
*Question 2:* When joining two datasets, why is the filter isnotnull
applied twice on the joining key column? In the physical plan it is once
applied as a PushedFilter and then again as a Filter after the scan?
I am reading two datasets that I saved to disk with the ```bucketBy```
option on the same key with the same number of buckets. When I read them
back and join them, they should not result in a shuffle.
But that is not what I am seeing.
*The following code demonstrates the alleged behaviour:*
I understand pandas UDFs as follows:
1. There are multiple partitions per worker.
2. Multiple Arrow batches are converted per partition.
3. These batches are sent to the Python worker process.
4. In the case of Series to Series, is the pandas UDF applied to each Arrow
batch one after the other? **(So, is it that (a) - The
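Not from the original thread, but a sketch of the Series-to-Series shape being described in the list above. Locally the function is plain pandas; under PySpark it would be wrapped with `pandas_udf` and invoked once per Arrow batch, batch after batch. The function and column names are made up.

```python
import pandas as pd

# A Series-to-Series function: takes one pandas Series (roughly one Arrow
# batch's worth of a column) and returns a Series of the same length.
def plus_one(s: pd.Series) -> pd.Series:
    return s + 1

# Under PySpark this would be registered roughly as (not executed here):
#   from pyspark.sql.functions import pandas_udf
#   plus_one_udf = pandas_udf(plus_one, returnType="long")
#   df.select(plus_one_udf("x"))
# and Spark would call plus_one once per Arrow batch it sends to Python.

# Locally it is just a pandas transform:
result = plus_one(pd.Series([1, 2, 3]))
print(result.tolist())  # [2, 3, 4]
```

The batch size Spark uses here is controlled by `spark.sql.execution.arrow.maxRecordsPerBatch`, so the UDF sees the column in chunks of at most that many rows, not the whole partition at once.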