RE: Any Replicated RDD in Spark?

2014-11-06 Thread Shuai Zheng
Matei,

Thanks for the reply.

I'm not too worried about extra code because I'm migrating from MapReduce, so
I already have code to handle it. But when I adopt a new technology, I always
prefer to do it the right way rather than take a temporary shortcut. I will go
with the RDD approach first and test the performance.

Thanks!

Shuai




RE: Any Replicated RDD in Spark?

2014-11-05 Thread Shuai Zheng
Nice.

Then I have another question. I have a file (or a set of files: part-0,
part-1, ..., a few hundred MB of CSV up to 1-2 GB, created by another program),
and I need to build a hash table from it and later broadcast that table to each
node so it can be queried there (a map-side join). I see two options:

1. Load the files with plain code (open an InputStream, etc.), parse the
content, and create the broadcast variable directly from the result.
2. Create an RDD from these files in the standard way, run a map to parse
them, collect the result as a map, and wrap it in a broadcast variable to
push back out to all the nodes.

I think option 2 might be more consistent with Spark's model (and less code?),
but what about performance? The gain is that loading and parsing happen in
parallel; the penalty is that after loading we have to collect and broadcast
the result again. Please share your opinion. I am not sure what the best
practice is here (in theory either way works, but which one is better in
practice?). A rough sketch of what I mean by option 2 is below.
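
Just to make option 2 concrete, something along these lines (only a sketch:
the input path, the two-column CSV layout, and the variable names are my own
assumptions):

// Option 2 (illustrative): parse the files as an RDD, collect the result to
// the driver, then broadcast it to every executor.
import org.apache.spark.SparkContext._      // pair-RDD implicits (collectAsMap)

val table = sc.textFile("hdfs:///data/part-*")   // hypothetical input path
  .map { line =>
    val cols = line.split(",")
    (cols(0), cols(1))                           // (key, value) per CSV line
  }
  .collectAsMap()                       // pull the parsed table to the driver

val tableBc = sc.broadcast(table)       // replicate it to every executor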

Regards,

Shuai




RE: Any Replicated RDD in Spark?

2014-11-05 Thread Shuai Zheng
And another similar case:

If I already have an RDD from a previous step, but the next step should be a
map-side join (so I need to broadcast this RDD to every node), what is the best
way to do that? Collect the RDD on the driver first and create a broadcast
variable from it? Or is there any shortcut in Spark for this?

Thanks!




Re: Any Replicated RDD in Spark?

2014-11-05 Thread Matei Zaharia
If you start with an RDD, you do have to collect it to the driver and broadcast
it to do this. Between the two options you listed, I think this one is simpler
to implement, and there won't be a huge difference in performance, so you can go
for it. Opening InputStreams to a distributed file system by hand can be a lot
of code.
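
For what it's worth, a rough sketch of the collect-then-broadcast pattern
(smallRdd, bigRdd, and their (String, String) element types are assumptions for
illustration; the pair-RDD implicits from org.apache.spark.SparkContext._ are
assumed to be in scope):

// Collect the small pair RDD to the driver as a Map, broadcast it, and join
// the big RDD against it partition by partition (a map-side inner join).
val smallMap = smallRdd.collectAsMap()      // driver-side hash table
val smallBc  = sc.broadcast(smallMap)       // one copy shipped to each executor

val joined = bigRdd.mapPartitions { iter =>
  val lookup = smallBc.value                // local reference on this executor
  iter.flatMap { case (k, v) =>
    lookup.get(k).map(other => (k, (v, other)))   // keep only matching keys
  }
}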

Matei




Re: Any Replicated RDD in Spark?

2014-11-03 Thread Matei Zaharia
You need to use broadcast followed by flatMap or mapPartitions to do map-side 
joins (in your map function, you can look at the hash table you broadcast and 
see what records match it). Spark SQL also does it by default for tables 
smaller than the spark.sql.autoBroadcastJoinThreshold setting (by default 10 
KB, which is really small, but you can bump this up with set 
spark.sql.autoBroadcastJoinThreshold=100 for example).
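
Concretely, a minimal sketch of the broadcast-plus-flatMap version (smallTable,
records, and their contents are made-up names for illustration; records is
assumed to be a pair RDD of (key, value)):

// Broadcast a small driver-side hash table, then look each record up inside
// flatMap, dropping records with no match (an inner map-side join).
val smallTable: Map[String, String] = Map("a" -> "1", "b" -> "2")
val tableBc = sc.broadcast(smallTable)

val joined = records.flatMap { case (key, value) =>
  tableBc.value.get(key).map(matched => (key, (value, matched)))
}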

Matei

 On Nov 3, 2014, at 1:03 PM, Shuai Zheng szheng.c...@gmail.com wrote:
 
 Hi All,
 
 I have spent the last two years on Hadoop but am new to Spark.
 I am planning to move one of my existing systems to Spark to get some enhanced
 features.
 
 My question is:
 
 If I want to do a map-side join (something similar to the Replicated keyword
 in Pig), how can I do it? Is there any way to declare an RDD as replicated
 (meaning it is distributed to all nodes and each node has a full copy)?
 
 I know I can use an accumulator to get this feature, but I am not sure what
 the best practice is. And if I use an accumulator to broadcast the data set,
 can I then (after the broadcast) convert it into an RDD and do the join?
 
 Regards,
 
 Shuai

