RE: Any "Replicated" RDD in Spark?

Shuai Zheng Thu, 06 Nov 2014 08:14:24 -0800

Matei,

Thanks for reply.


I don't worry that much about more code because I migrate from mapreduce, so
I have existing code to handle it. But if I want to use a new tech, I will
always prefer "right" way not a temporary easy way!. I will go with RDD
first to test the performance.

Thanks!

Shuai

-----Original Message-----
From: Matei Zaharia [mailto:matei.zaha...@gmail.com] 
Sent: Wednesday, November 05, 2014 6:27 PM
To: Shuai Zheng
Cc: user@spark.apache.org
Subject: Re: Any "Replicated" RDD in Spark?

If you start with an RDD, you do have to collect to the driver and broadcast
to do this. Between the two options you listed, I think this one is simpler
to implement, and there won't be a huge difference in performance, so you
can go for it. Opening InputStreams to a distributed file system by hand can
be a lot of code.

Matei

> On Nov 5, 2014, at 12:37 PM, Shuai Zheng <szheng.c...@gmail.com> wrote:
> 
> And another similar case:
> 
> If I have get a RDD from previous step, but for next step it should be 
> a map side join (so I need to broadcast this RDD to every nodes). What 
> is the best way for me to do that? Collect RDD in driver first and 
> create broadcast? Or any shortcut in spark for this?
> 
> Thanks!
> 
> -----Original Message-----
> From: Shuai Zheng [mailto:szheng.c...@gmail.com]
> Sent: Wednesday, November 05, 2014 3:32 PM
> To: 'Matei Zaharia'
> Cc: 'user@spark.apache.org'
> Subject: RE: Any "Replicated" RDD in Spark?
> 
> Nice.
> 
> Then I have another question, if I have a file (or a set of files: 
> part-0, part-1, might be a few hundreds MB csv to 1-2 GB, created by 
> other program), need to create hashtable from it, later broadcast it 
> to each node to allow query (map side join). I have two options to do it:
> 
> 1, I can just load the file in a general code (open a inputstream, 
> etc), parse content and then create the broadcast from it.
> 2, I also can use a standard way to create the RDD from these file, 
> run the map to parse it, then collect it as map, wrap the result as 
> broadcast to push to all nodes again.
> 
> I think the option 2 might be more consistent with spark's concept 
> (and less code?)? But how about the performance? The gain is can 
> parallel load and parse the data, penalty is after load we need to 
> collect and broadcast result again? Please share your opinion. I am 
> not sure what is the best practice here (in theory, either way works, 
> but in real world, which one is better?).
> 
> Regards,
> 
> Shuai
> 
> -----Original Message-----
> From: Matei Zaharia [mailto:matei.zaha...@gmail.com]
> Sent: Monday, November 03, 2014 4:15 PM
> To: Shuai Zheng
> Cc: user@spark.apache.org
> Subject: Re: Any "Replicated" RDD in Spark?
> 
> You need to use broadcast followed by flatMap or mapPartitions to do 
> map-side joins (in your map function, you can look at the hash table 
> you broadcast and see what records match it). Spark SQL also does it 
> by default for tables smaller than the 
> spark.sql.autoBroadcastJoinThreshold setting (by default 10 KB, which 
> is really small, but you can bump this up with set
> spark.sql.autoBroadcastJoinThreshold=1000000 for example).
> 
> Matei
> 
>> On Nov 3, 2014, at 1:03 PM, Shuai Zheng <szheng.c...@gmail.com> wrote:
>> 
>> Hi All,
>> 
>> I have spent last two years on hadoop but new to spark.
>> I am planning to move one of my existing system to spark to get some
> enhanced features.
>> 
>> My question is:
>> 
>> If I try to do a map side join (something similar to "Replicated" key 
>> word
> in Pig), how can I do it? Is it anyway to declare a RDD as "replicated"
> (means distribute it to all nodes and each node will have a full copy)?
>> 
>> I know I can use accumulator to get this feature, but I am not sure 
>> what
> is the best practice. And if I accumulator to broadcast the data set, 
> can then (after broadcast) convert it into a RDD and do the join?
>> 
>> Regards,
>> 
>> Shuai
> 
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For 
> additional commands, e-mail: user-h...@spark.apache.org
> 



---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

RE: Any "Replicated" RDD in Spark?

Reply via email to