Re: Does spark read the same file twice, if two stages are using the same DataFrame?

2023-05-09 Thread Nitin Siwach
Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from s

Re: Does spark read the same file twice, if two stages are using the same DataFrame?

2023-05-07 Thread Nitin Siwach

Re: Does spark read the same file twice, if two stages are using the same DataFrame?

2023-05-07 Thread Nitin Siwach

Re: Does spark read the same file twice, if two stages are using the same DataFrame?

2023-05-07 Thread Nitin Siwach
ain in the second run. You can also confirm it in other metrics from Spark UI. That is my personal understanding based on what I have read and seen on my job runs. If there is any mistake, feel free to correct me. Thank You & Best Regards, Winston Lai

What is DataFilters and while joining why is the filter isnotnull[joinKey] applied twice

2023-01-31 Thread Nitin Siwach
PySpark version: 3.1.3. *Question 1:* What is DataFilters in the Spark physical plan? How is it different from PushedFilters? *Question 2:* When joining two datasets, why is the filter isnotnull applied twice on the joining key column? In the physical plan, it is once applied as a PushedFilter and then

bucketBy in pyspark not retaining partition information

2022-01-31 Thread Nitin Siwach
I am reading two datasets that I saved to disk with the ```bucketBy``` option on the same key with the same number of partitions. When I read them back and join them, the join should not result in a shuffle. But that is not what I am seeing. *The following code demonstrates the alleged
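
(The excerpt above is cut off in this listing, and the poster's own code is not shown. What follows is a separate illustration, not that code.) A minimal plain-Python sketch of why a join over two identically bucketed datasets is expected to need no shuffle: if both sides place a key in a bucket using the same function and the same bucket count, matching keys always land in the same bucket number, so each bucket pair can be joined locally. The names `bucket_of`, `write_bucketed` and `bucket_join` are hypothetical, and the modulo hash is an assumption standing in for Spark's real hash function.

```python
from collections import defaultdict

NUM_BUCKETS = 4  # assumed bucket count, identical for both datasets


def bucket_of(key, num_buckets=NUM_BUCKETS):
    # Stand-in bucketing function (assumption): Spark uses its own hash,
    # but the argument only needs the function to be the same on both sides.
    return key % num_buckets


def write_bucketed(rows, num_buckets=NUM_BUCKETS):
    # Group (key, value) rows by bucket number, as a bucketed write would.
    buckets = defaultdict(list)
    for key, value in rows:
        buckets[bucket_of(key, num_buckets)].append((key, value))
    return buckets


def bucket_join(left, right, num_buckets=NUM_BUCKETS):
    # Join bucket i of the left side only with bucket i of the right side.
    # No row ever has to move to another bucket, which models "no shuffle".
    joined = []
    for b in range(num_buckets):
        right_by_key = defaultdict(list)
        for key, value in right.get(b, []):
            right_by_key[key].append(value)
        for key, left_value in left.get(b, []):
            for right_value in right_by_key.get(key, []):
                joined.append((key, left_value, right_value))
    return joined


left = write_bucketed([(1, "a"), (2, "b"), (5, "c")])
right = write_bucketed([(1, "x"), (5, "y"), (6, "z")])
result = sorted(bucket_join(left, right))
```

This only models the expectation stated in the post. Whether Spark actually skips the shuffle depends on how the data was written and read back, which is the subject of the thread.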

understanding iterator of series to iterator of series pandasUDF

2022-01-04 Thread Nitin Siwach
I understand pandasUDF as follows:
1. There are multiple partitions per worker
2. Multiple arrow batches are converted per partition
3. Sent to python process
4. In the case of Series to Series the pandasUDF is applied to each arrow batch one after the other? **(So, is it that (a) - The
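
(The question above is cut off in this listing.) As a rough picture of the distinction being asked about, here is a minimal plain-Python model, not Spark code: lists stand in for Arrow batches / pandas Series, and `expensive_setup`, `series_to_series_udf` and `iterator_udf` are hypothetical names. It assumes the commonly documented behaviour that a Series-to-Series function is invoked once per batch, while an Iterator-of-Series function is invoked once and pulls batches from an iterator, so per-call setup can be shared across batches.

```python
def make_batches(partition, batch_size):
    # Split one partition into fixed-size batches (stand-ins for Arrow batches).
    return [partition[i:i + batch_size] for i in range(0, len(partition), batch_size)]


# Counts how often the (simulated) expensive initialisation runs per style.
setup_calls = {"series_to_series": 0, "iterator": 0}


def expensive_setup(kind):
    # Placeholder for costly state such as loading a model (assumption).
    setup_calls[kind] += 1
    return 10  # an offset used by the toy transformation below


def series_to_series_udf(batch):
    # Called once per batch, so the setup is repeated for every batch.
    offset = expensive_setup("series_to_series")
    return [x + offset for x in batch]


def iterator_udf(batches):
    # Called once; the setup runs a single time and is reused for each batch.
    offset = expensive_setup("iterator")
    for batch in batches:
        yield [x + offset for x in batch]


partition = list(range(6))
batches = make_batches(partition, 2)

s2s_result = [series_to_series_udf(b) for b in batches]
iter_result = list(iterator_udf(iter(batches)))
```

Both styles produce the same per-batch output in this model; the difference is only in how many times the setup runs (three times versus once for three batches).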