Code is always distributed for any operation on a DataFrame or RDD. The size 
of your code is irrelevant except with respect to JVM memory limits. For most 
jobs the entire application jar and all of its dependencies are placed on the 
classpath of every executor. 

There are some exceptions, but generally you should think of all data 
processing as occurring on executor JVMs.
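
A hedged sketch of how that distribution typically happens at submit time (the paths, class name, and jar names below are invented for illustration):

```shell
# Everything passed via --jars, plus the application jar itself, is shipped
# to each executor and placed on its classpath before any task runs.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.MyJob \
  --jars deps/lib-a.jar,deps/lib-b.jar \
  my-application.jar

# For PySpark, --py-files plays the analogous role for Python code:
#   spark-submit --py-files code.zip my_job.py
```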

> On Jun 25, 2022, at 2:18 AM, Sid <flinkbyhe...@gmail.com> wrote:
> 
> 
> Hi Tufan,
> 
> Thanks for the answers. However, by the second point I meant: where 
> would my code reside? Will it be copied to all the executors, since the code 
> size would be small, or will it be maintained on the driver's side? I know 
> that the driver converts the code to a DAG, and when an action is called it is 
> submitted to the DAG scheduler, and so on...
> 
> Thanks,
> Sid
> 
>> On Sat, Jun 25, 2022 at 12:34 PM Tufan Rakshit <tufan...@gmail.com> wrote:
>> Please find the answers inline.
>> 1) Can I apply predicate pushdown filters if I have data stored in S3, or 
>> should they be used only while reading from DBs?
>> It can be applied to data in S3 if you store it in Parquet, CSV, JSON, or 
>> Avro format. It does not depend on a DB; predicate pushdown is supported 
>> against object stores like S3 as well.
>> 
>> 2) While running the data in distributed form, is my code copied to each and 
>> every executor? As per my understanding, it should be, since code.zip would 
>> be small enough to be copied to each worker node.
>> If you are trying to join two datasets, one of which is small, Spark by 
>> default will try to broadcast the smaller dataset to every executor 
>> rather than going for a sort-merge join. The property that controls this is 
>> enabled by default from Spark 3.1; the limit for the smaller DataFrame to be 
>> broadcast is 10 MB, and it can be raised with a config setting.
>> 
>> 3) Also, my understanding of shuffling of data is: "It is moving one 
>> partition to another partition, or moving the data (keys) of one partition to 
>> another partition by those keys. It increases memory usage, since before 
>> shuffling it copies the data in memory and then transfers it to another 
>> partition." Is that correct? If not, please correct me.
>> 
>> It depends on the context of distributed computing, as your data does not sit 
>> on one machine, nor on one disk. A shuffle is involved when you trigger 
>> operations like groupBy or sort, as they involve bringing all rows with the 
>> same key onto one executor to do the computation; likewise, when a sort-merge 
>> join is triggered, both datasets are sorted, and this sort is a global sort, 
>> not a partition-wise sort. Yes, it is a memory-intensive operation; if you 
>> see a lot of shuffle involved, it is best to use SSDs (e.g. m5d-based 
>> machines in AWS), because for really big jobs, where TBs worth of data have 
>> to be joined, it is not possible to do all of the operations in memory (RAM).
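
Mechanically, a shuffle is just re-bucketing rows so that every row with the same key lands in the same output partition. Here is a toy, Spark-free sketch of the hash partitioning that groupBy-style operations rely on (the data is invented):

```python
# Toy model of a shuffle: rows start out spread across input partitions
# ("machines"), and each row is routed to an output partition by
# hash(key) % num_partitions, so each key ends up wholly in one partition.

def shuffle(partitions, num_out):
    out = [[] for _ in range(num_out)]
    for part in partitions:                    # data sitting on many machines
        for key, value in part:
            out[hash(key) % num_out].append((key, value))  # "network" copy
    return out

inputs = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)]]
shuffled = shuffle(inputs, num_out=2)

# Both ("a", 1) and ("a", 3) now sit in the same partition, so a
# groupBy on "a" can be computed locally by a single executor.
target = hash("a") % 2
print(sorted(v for k, v in shuffled[target] if k == "a"))  # [1, 3]
```

In real Spark, the copy into the output buckets is the in-memory buffering the question describes; when the buffered data exceeds memory it spills to local disk, which is why shuffle-heavy jobs benefit from fast SSDs.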
>> 
>> 
>> Hope that helps .
>> 
>> Best 
>> Tufan
>> 
>> 
>> 
>>> On Sat, 25 Jun 2022 at 08:43, Sid <flinkbyhe...@gmail.com> wrote:
>>> Hi Team,
>>> 
>>> I have various doubts as below:
>>> 
>>> 1) Can I apply predicate pushdown filters if I have data stored in S3, or 
>>> should they be used only while reading from DBs?
>>> 
>>> 2) While running the data in distributed form, is my code copied to each 
>>> and every executor? As per my understanding, it should be, since code.zip 
>>> would be small enough to be copied to each worker node.
>>> 
>>> 3) Also, my understanding of shuffling of data is: "It is moving one 
>>> partition to another partition, or moving the data (keys) of one partition to 
>>> another partition by those keys. It increases memory usage, since before 
>>> shuffling it copies the data in memory and then transfers it to another 
>>> partition." Is that correct? If not, please correct me.
>>> 
>>> Please help me to understand these things in layman's terms if my 
>>> assumptions are not correct.
>>> 
>>> Thanks,
>>> Sid
