Re: Spark Doubts

Tufan Rakshit Sat, 25 Jun 2022 00:04:45 -0700

Please find the answers inline.
1) Can I apply predicate pushdown filters if I have data stored in S3 or it
should be used only while reading from DBs?
It can be applied on S3 if you store the data in Parquet, CSV, JSON or
Avro format. Pushdown depends on the file format and data source, not on
a DB; it is supported on object stores like S3 as well.
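
To illustrate the idea for Parquet, here is a toy sketch in plain Python.
It is not Spark's code: the RowGroup class and scan_with_pushdown function
are hypothetical names, and it only assumes that each row group stores
min/max statistics for a column, as Parquet footers do.

```python
# Toy model of predicate pushdown on a columnar file (not Spark's code).
# Each row group carries min/max statistics for the `year` column; all
# names here are hypothetical.
from dataclasses import dataclass


@dataclass
class RowGroup:
    min_year: int
    max_year: int
    rows: list  # list of (year, value) tuples


def scan_with_pushdown(row_groups, year):
    """Return rows where year matches, plus how many groups were read."""
    matched, groups_read = [], 0
    for group in row_groups:
        # The filter is "pushed down": a group whose statistics rule out
        # a match is skipped without reading its rows.
        if year < group.min_year or year > group.max_year:
            continue
        groups_read += 1
        matched.extend(row for row in group.rows if row[0] == year)
    return matched, groups_read


row_groups = [
    RowGroup(2019, 2020, [(2019, "a"), (2020, "b")]),
    RowGroup(2021, 2022, [(2021, "c"), (2022, "d"), (2022, "e")]),
]
rows, groups_read = scan_with_pushdown(row_groups, 2022)
print(rows, groups_read)
```

Only the second row group is read here, which is why pushdown saves I/O
no matter whether the file sits on S3 or on a local disk.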

2) While running the data in distributed form, is my code copied to each
and every executor. As per me, it should be the case since code.zip would
be smaller in size to be copied on each worker node.
If you are trying to join two datasets, one of which is small, Spark will
by default try to broadcast the smaller dataset to every executor rather
than going for a sort-merge join. This is controlled by a property
(spark.sql.autoBroadcastJoinThreshold) that is enabled by default; the
size limit for the smaller DataFrame to be broadcast is 10 MB, and it can
be changed to a higher value through that config.
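
As a rough sketch of that decision, here is a toy model in plain Python.
It is not Spark's planner: choose_join_strategy and broadcast_hash_join
are hypothetical helpers, and the only thing taken from Spark is the
10 MB default of spark.sql.autoBroadcastJoinThreshold.

```python
# Toy model of the broadcast vs sort-merge decision (not Spark's planner).
# The helper names are hypothetical; the 10 MB value mirrors the default
# of spark.sql.autoBroadcastJoinThreshold.
BROADCAST_THRESHOLD_BYTES = 10 * 1024 * 1024


def choose_join_strategy(small_side_bytes,
                         threshold=BROADCAST_THRESHOLD_BYTES):
    """Pick a broadcast join when the smaller side fits the threshold."""
    return "broadcast" if small_side_bytes <= threshold else "sort_merge"


def broadcast_hash_join(large_partitions, small_rows):
    """Join each partition of the large side against the small side."""
    lookup = dict(small_rows)  # built once, then copied to every executor
    joined = []
    for partition in large_partitions:  # the large side is not moved
        joined.append(
            [(k, v, lookup[k]) for k, v in partition if k in lookup]
        )
    return joined


orders = [[(1, "book"), (2, "pen")], [(1, "lamp"), (3, "desk")]]
customers = [(1, "Ana"), (2, "Raj")]
print(choose_join_strategy(2 * 1024 * 1024))
print(broadcast_hash_join(orders, customers))
```

The point of the sketch is that the large side stays in its partitions;
only the small lookup table travels, so no shuffle of the large dataset
is needed.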

3) Also my understanding of shuffling of data is " It is moving one
partition to another partition or moving data(keys) of one partition to
another partition of those keys. It increases memory since before shuffling
it copies the data in the memory and then transfers to another partition".
Is it correct? If not, please correct me.

It depends on the context of distributed computing: your data does not
sit on one machine, nor on one disk. A shuffle is involved when you
trigger operations like group by or sort, since they require bringing all
records with the same key (or key range) onto one executor to do the
computation. It also happens when a sort-merge join is triggered: both
datasets are shuffled by the join key and then sorted within each
partition (a global sort is what an explicit sort, such as order by,
performs). Yes, it is a memory-intensive operation. If you see that a lot
of shuffle is involved, it is best to use machines with local SSDs (M5d
based machines in AWS), because for really big jobs where terabytes of
data have to be joined it is not possible to do the whole operation in
RAM, and shuffle data spills to disk.
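
To make the group-by case concrete, here is a toy sketch in plain Python.
It is not how Spark implements shuffle: shuffle_by_key and sum_by_key are
hypothetical names, partitions are plain lists, and integer keys are used
so that the hash-based routing is easy to follow.

```python
# Toy model of a shuffle for a group-by (not Spark's implementation).
from collections import defaultdict


def shuffle_by_key(input_partitions, num_output_partitions):
    """Route every (key, value) record to the partition picked by its key."""
    output = [[] for _ in range(num_output_partitions)]
    for partition in input_partitions:
        for key, value in partition:
            # Same key -> same target partition, whichever input
            # partition the record started in.
            target = hash(key) % num_output_partitions
            output[target].append((key, value))
    return output


def sum_by_key(partition):
    """Aggregate one partition locally once its keys are co-located."""
    totals = defaultdict(int)
    for key, value in partition:
        totals[key] += value
    return dict(totals)


inputs = [[(1, 10), (2, 5)], [(1, 7), (3, 1)], [(2, 4)]]
shuffled = shuffle_by_key(inputs, 2)
print([sum_by_key(p) for p in shuffled])
```

Before the shuffle, key 1 lives in two input partitions; afterwards all
of its records sit in one output partition, so the sum can be computed
without talking to any other partition.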

Hope that helps.

Best
Tufan

On Sat, 25 Jun 2022 at 08:43, Sid <flinkbyhe...@gmail.com> wrote:

> Hi Team,
>
> I have various doubts as below:
>
> 1) Can I apply predicate pushdown filters if I have data stored in S3 or
> it should be used only while reading from DBs?
>
> 2) While running the data in distributed form, is my code copied to each
> and every executor. As per me, it should be the case since code.zip would
> be smaller in size to be copied on each worker node.
>
> 3) Also my understanding of shuffling of data is " It is moving one
> partition to another partition or moving data(keys) of one partition to
> another partition of those keys. It increases memory since before shuffling
> it copies the data in the memory and then transfers to another partition".
> Is it correct? If not, please correct me.
>
> Please help me to understand these things in layman's terms if my
> assumptions are not correct.
>
> Thanks,
> Sid
>
