I suspect your input files yield only a few small partitions, so only a
small number of tasks are started. Can you provide some more details, such
as how you load the files and how the result size reaches around 500 GB?

Regards,
Rishitesh Mishra,
SnappyData (http://www.snappydata.io/)

https://in.linkedin.com/in/rishiteshmishra

On Thu, Mar 17, 2016 at 12:12 PM, Stuti Awasthi <stutiawas...@hcl.com>
wrote:

> Hi All,
>
>
>
> I have to join 2 files, neither very big (a few MBs each), but the result
> can be huge, from 500 GB up to TBs of data. I have tried Spark's join()
> function, but I notice that the join executes on only 1 or 2 nodes at
> most. Since I have a cluster of 5 nodes, I tried passing numTasks=10 to
> "join(otherDataset, [numTasks])", but then 9 of the tasks finished
> instantly and only 1 executor processed all the data.
>
>
>
> I searched the internet and found that we can use a broadcast variable to
> send the data from 1 file to all nodes and then use a map function to do
> the join. This way I should be able to run multiple tasks on different
> executors.
>
> Now my question is: since Spark provides join functionality, I had
> assumed it would handle the data parallelism automatically. Does Spark
> provide something I can use directly for this join, rather than
> implementing a map-side join with a broadcast variable on my own? Any
> other better approach is also welcome.
>
>
>
> I assume this is a very common problem, and I am looking for
> suggestions.
>
>
>
> Thanks & Regards
>
> Stuti Awasthi
>
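The map-side join with a broadcast variable mentioned in the quoted message can be sketched in plain Python, without Spark, to show the logic. This is only an illustration under stated assumptions: the helper names (build_lookup, join_partition, split_round_robin, map_side_join) are hypothetical, records are assumed to be (key, value) pairs, and the smaller dataset is assumed to fit in memory on every node. The point is that the larger side is split by record rather than by key, so a heavily repeated key does not pile onto a single task.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Record = Tuple[str, str]
Joined = Tuple[str, Tuple[str, str]]


def build_lookup(small_side: Iterable[Record]) -> Dict[str, List[str]]:
    """Group the small dataset by key; this is the table that would be broadcast."""
    lookup: Dict[str, List[str]] = defaultdict(list)
    for key, value in small_side:
        lookup[key].append(value)
    return dict(lookup)


def join_partition(partition: Iterable[Record],
                   lookup: Dict[str, List[str]]) -> Iterator[Joined]:
    """Inner-join one partition of the other dataset against the lookup table."""
    for key, left_value in partition:
        # A key missing from the lookup produces no output (inner join).
        for right_value in lookup.get(key, []):
            yield key, (left_value, right_value)


def split_round_robin(records: List[Record],
                      num_partitions: int) -> List[List[Record]]:
    """Spread records evenly across partitions, regardless of their keys."""
    return [records[i::num_partitions] for i in range(num_partitions)]


def map_side_join(left: List[Record], right: List[Record],
                  num_partitions: int) -> List[List[Joined]]:
    """Join each partition of `left` independently; no shuffle by key is needed."""
    lookup = build_lookup(right)
    partitions = split_round_robin(left, num_partitions)
    return [list(join_partition(p, lookup)) for p in partitions]
```

In Spark the same idea would roughly correspond to shipping the lookup table with SparkContext.broadcast and applying the per-partition function with RDD.mapPartitions after a repartition of the other dataset. The DataFrame API also has a broadcast function (pyspark.sql.functions.broadcast) that hints a join should be executed this way; whether either fits depends on how the files are loaded and on the key distribution, which the thread does not state.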
