Please find the answers inline.

1) Can I apply predicate pushdown filters if I have data stored in S3 or it should be used only while reading from DBs?

It can be applied on S3 if you store the data in Parquet, CSV, JSON or Avro format. It does not depend on a DB; it is supported on object stores like S3 as well.
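As a rough illustration of why the file format matters more than where the file lives, here is a toy model in plain Python (not Spark code) of how a Parquet-style reader can use per-chunk min/max statistics to skip data that cannot match a filter. RowGroup, make_row_group and scan_with_pushdown are made-up names for this sketch, and the data is invented.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RowGroup:
    """A chunk of rows plus the min/max statistics stored alongside it."""
    min_year: int
    max_year: int
    rows: List[dict]


def make_row_group(rows):
    # Statistics are computed once, when the file is written.
    years = [r["year"] for r in rows]
    return RowGroup(min(years), max(years), rows)


def scan_with_pushdown(row_groups, year):
    """Return rows with row["year"] == year and the number of groups actually read."""
    groups_read = 0
    matches = []
    for group in row_groups:
        # The filter is checked against the statistics first; a group that
        # cannot contain the value is skipped without reading its rows.
        if year < group.min_year or year > group.max_year:
            continue
        groups_read += 1
        matches.extend(r for r in group.rows if r["year"] == year)
    return matches, groups_read


# Hypothetical file made of three row groups.
file_on_s3 = [
    make_row_group([{"year": 2019, "amount": 10}, {"year": 2020, "amount": 20}]),
    make_row_group([{"year": 2021, "amount": 30}, {"year": 2021, "amount": 40}]),
    make_row_group([{"year": 2022, "amount": 50}, {"year": 2022, "amount": 60}]),
]

matches, groups_read = scan_with_pushdown(file_on_s3, 2021)
print(matches, groups_read)
```

In Spark itself you would just write the filter on the DataFrame; the physical plan printed by explain() lists the filters that were pushed to the scan under PushedFilters.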

2) While running the data in distributed form, is my code copied to each and every executor. As per me, it should be the case since code.zip would be smaller in size to be copied on each worker node.

If you are trying to join two datasets, one of which is small, Spark by default will try to broadcast the smaller dataset to the executors rather than go for a sort-merge join. This is controlled by the spark.sql.autoBroadcastJoinThreshold property: by default a DataFrame of up to 10 MB is broadcast, and the limit can be raised through config.

3) Also my understanding of shuffling of data is " It is moving one partition to another partition or moving data(keys) of one partition to another partition of those keys. It increases memory since before shuffling it copies the data in the memory and then transfers to another partition". Is it correct? If not, please correct me.

It depends on the context of distributed computing, as your data does not sit on one machine, nor on one disk. A shuffle is involved when you trigger operations like group by or sort, because all rows with the same key have to be brought to the same executor to do the computation. It also happens when a sort-merge join is triggered: both datasets are shuffled so that matching keys end up in the same partition, and each partition is then sorted by the join key.

Yes, it is a memory-intensive operation. If you see a lot of shuffle involved, it is best to use SSDs (M5d-based machines in AWS), because for really big jobs where TBs of data have to be joined it is not possible to do the whole operation in RAM, and the shuffle data spills to disk.

Hope that helps.

Best
Tufan

On Sat, 25 Jun 2022 at 08:43, Sid <flinkbyhe...@gmail.com> wrote:

> Hi Team,
>
> I have various doubts as below:
>
> 1) Can I apply predicate pushdown filters if I have data stored in S3 or
> it should be used only while reading from DBs?
>
> 2) While running the data in distributed form, is my code copied to each
> and every executor.
> As per me, it should be the case since code.zip would
> be smaller in size to be copied on each worker node.
>
> 3) Also my understanding of shuffling of data is " It is moving one
> partition to another partition or moving data(keys) of one partition to
> another partition of those keys. It increases memory since before shuffling
> it copies the data in the memory and then transfers to another partition".
> Is it correct? If not, please correct me.
>
> Please help me to understand these things in layman's terms if my
> assumptions are not correct.
>
> Thanks,
> Sid
>
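
PS: to make the shuffle description in answer 3 a bit more concrete, here is a toy sketch in plain Python (again, not Spark code). It redistributes (key, value) records so that every record with the same key lands in the same output partition, which is what a group by needs before it can aggregate. The function names, the use of CRC32 as the partitioning hash and the sample data are all assumptions made for this illustration; Spark's own partitioner works differently in detail.

```python
import zlib
from collections import defaultdict


def target_partition(key, num_partitions):
    # Deterministic hash, so the same key always maps to the same partition.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def shuffle_by_key(input_partitions, num_partitions):
    """Move each (key, value) record to the output partition chosen by its key."""
    output = [[] for _ in range(num_partitions)]
    for partition in input_partitions:
        for key, value in partition:
            output[target_partition(key, num_partitions)].append((key, value))
    return output


def sum_by_key(partitions):
    """Aggregate each partition on its own; valid only after the shuffle."""
    totals = {}
    for partition in partitions:
        local = defaultdict(int)
        for key, value in partition:
            local[key] += value
        # No key appears in two partitions, so plain update() cannot clash.
        totals.update(local)
    return totals


# Before the shuffle, "apple" is spread over all three input partitions.
input_partitions = [
    [("apple", 1), ("pear", 2)],
    [("apple", 3), ("plum", 4)],
    [("pear", 5), ("apple", 6)],
]

shuffled = shuffle_by_key(input_partitions, 2)
totals = sum_by_key(shuffled)
print(totals)
```

In a real cluster the append in shuffle_by_key is a transfer over the network, usually written to local disk first, which is why heavy shuffles benefit from fast local storage.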