Hi Tufan,

Thanks for the answers. However, by the second point, I mean to say where
would my code reside? Will it be copied to all the executors since the code
size would be small or will it be maintained on the driver's side? I know
that driver converts the code to DAG and when an action is called it is
submitted to the DAG scheduler and so on...

Thanks,
Sid

On Sat, Jun 25, 2022 at 12:34 PM Tufan Rakshit <tufan...@gmail.com> wrote:

> Please find the answers inline please .
> 1) Can I apply predicate pushdown filters if I have data stored in S3 or
> it should be used only while reading from DBs?
> it can be applied in s3 if you store parquet , csv, json or in avro format
> .It does not depend on the DB , its supported in object store like s3 as
> well .
>
> 2) While running the data in distributed form, is my code copied to each
> and every executor. As per me, it should be the case since code.zip would
> be smaller in size to be copied on each worker node.
> if  you are trying to join two datasets out of which one is small , Spark
> by default would try to broadcast the smaller data set to the other
> executor , rather going for a Sort merge Join , There is property which is
> enabled by default from spark 3.1 , the limit for smaller dataframe to be
> broadcasted is 10 MB , it can also be changed  to higher value with config .
>
> 3) Also my understanding of shuffling of data is " It is moving one
> partition to another partition or moving data(keys) of one partition to
> another partition of those keys. It increases memory since before shuffling
> it copies the data in the memory and then transfers to another partition".
> Is it correct? If not, please correct me.
>
> It depends on the context of Distributed computing as Your data does not
> sit in one machine , neither in one Disk . Shuffle is involved when you try
> to trigger actions like Group by or Sort as it involves bringing all the
> keys into one executor Do the computation , or when Sort merge Join is
> triggered then both the dataset Sorted and this sort is Global sort not
> partition wise sort . yes its memory intensive operation as , if you see a
> lot of shuffle to be involved best to use SSD (M5d based machine in AWS ) .
> As for really big jobs where TB worth of data has to be joined its not
> possible to do all the operation in memory in RAM
>
>
> Hope that helps .
>
> Best
> Tufan
>
>
>
> On Sat, 25 Jun 2022 at 08:43, Sid <flinkbyhe...@gmail.com> wrote:
>
>> Hi Team,
>>
>> I have various doubts as below:
>>
>> 1) Can I apply predicate pushdown filters if I have data stored in S3 or
>> it should be used only while reading from DBs?
>>
>> 2) While running the data in distributed form, is my code copied to each
>> and every executor. As per me, it should be the case since code.zip would
>> be smaller in size to be copied on each worker node.
>>
>> 3) Also my understanding of shuffling of data is " It is moving one
>> partition to another partition or moving data(keys) of one partition to
>> another partition of those keys. It increases memory since before shuffling
>> it copies the data in the memory and then transfers to another partition".
>> Is it correct? If not, please correct me.
>>
>> Please help me to understand these things in layman's terms if my
>> assumptions are not correct.
>>
>> Thanks,
>> Sid
>>
>

Reply via email to