Hi All,

I have to join two files. Neither is very big, say a few MB each, but the 
result can be huge, anywhere from 500 GB to several TB. I tried Spark's join() 
function, but I noticed that the join executes on only 1 or 2 nodes at most. 
Since I have a 5-node cluster, I tried passing numTasks=10 via 
"join(otherDataset, [numTasks])", but what I saw was that 9 of the tasks 
finished instantly and a single executor processed all the data.

I searched the internet and found that we can use a broadcast variable to send 
the data from one file to all nodes and then use a map function to do the join. 
That way I should be able to run multiple tasks on different executors.
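
To make the broadcast idea concrete, here is a minimal sketch of what I 
understand it to mean, in Python. The helper names build_lookup and 
join_partition are my own, and the Spark wiring in the comment uses placeholder 
names (sc, left_rdd, right_rdd) and an arbitrary partition count; I have not run 
that part on the cluster, so please treat it as an assumption.

```python
from collections import defaultdict


def build_lookup(small_rows):
    """Index one dataset by key: key -> list of values."""
    lookup = defaultdict(list)
    for key, value in small_rows:
        lookup[key].append(value)
    return dict(lookup)


def join_partition(rows, lookup):
    """Inner-join one partition of (key, value) rows against the lookup.

    Yields (key, (left_value, right_value)) pairs.
    """
    for key, left_value in rows:
        for right_value in lookup.get(key, ()):
            yield key, (left_value, right_value)


# Hypothetical wiring in PySpark (not executed here):
#   lookup_b = sc.broadcast(build_lookup(right_rdd.collect()))
#   joined = left_rdd.repartition(10).mapPartitions(
#       lambda rows: join_partition(rows, lookup_b.value))

if __name__ == "__main__":
    left = [("a", 1), ("a", 2), ("b", 3), ("c", 4)]
    right = [("a", "x"), ("a", "y"), ("b", "z")]
    for pair in join_partition(left, build_lookup(right)):
        print(pair)
```

The intent is that each task joins only its own slice of the first file against 
the full copy of the second, so the work no longer depends on how the join keys 
are distributed.
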
Now my question: since Spark provides join functionality, I had assumed it 
would handle data parallelism automatically. Does Spark provide something I can 
use directly for this join, rather than implementing a map-side join with a 
broadcast variable on my own? Any other, better approach is also welcome.

I assume this is a very common problem, and I am looking for suggestions.

Thanks & Regards
Stuti Awasthi


