[ https://issues.apache.org/jira/browse/SPARK-3621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142466#comment-14142466 ]
Xuefu Zhang commented on SPARK-3621:
------------------------------------

I understand that an RDD is a concept that exists only in the driver. However, accessing the data in a Spark job doesn't have to be in the form of an RDD; an iterator over the underlying data is sufficient, as long as the data has already been shipped to the node when the job starts to run. One way to identify the shipped RDD, and the iterator afterwards, could be a UUID.

Hive on Spark isn't using Spark's transformations to do map-join, or join in general. Hive's own implementation builds hash maps for the small tables when the join starts, and then does key lookups while streaming through the big table (see the sketch after the quoted issue below). For this, the small-table data (which can be the result RDD of another Spark job) needs to be shipped to all nodes that do the join.

> Provide a way to broadcast an RDD (instead of just a variable made of the
> RDD) so that a job can access
> -------------------------------------------------------------------------
>
>                 Key: SPARK-3621
>                 URL: https://issues.apache.org/jira/browse/SPARK-3621
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 1.0.0, 1.1.0
>            Reporter: Xuefu Zhang
>
> In some cases, such as Hive's way of doing map-side join, it would be
> beneficial to allow the client program to broadcast RDDs rather than just
> variables made of these RDDs. Broadcasting a variable made of RDDs requires
> all RDD data to be collected to the driver, and the variable to be shipped
> to the cluster after being made. It would be more efficient if the driver
> just broadcast the RDDs and jobs used the corresponding data directly (such
> as building hash maps at executors).
> Tez has a broadcast edge which can ship data from the previous stage to the
> next stage without driver-side processing.
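To make the map-join pattern concrete, here is a minimal Scala sketch of how it looks with today's API, where the small table has to be collected to the driver and re-broadcast before executors can build their hash maps. This is exactly the driver round trip the issue proposes to eliminate. The table names and contents are hypothetical.

{code:scala}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits (pre-1.3 style)

object MapJoinSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("map-join-sketch").setMaster("local[*]"))

    // Hypothetical small and big tables as (key, value) RDDs.
    val smallTable = sc.parallelize(Seq((1, "dim-a"), (2, "dim-b")))
    val bigTable   = sc.parallelize(Seq((1, "row-x"), (2, "row-y"), (3, "row-z")))

    // Today: the small table must round-trip through the driver
    // (collectAsMap) before it can be broadcast to the executors.
    val smallMap = sc.broadcast(smallTable.collectAsMap())

    // Executors do key lookups against the broadcast hash map while
    // streaming through the big table (Hive-style map-join).
    val joined = bigTable.flatMap { case (key, bigVal) =>
      smallMap.value.get(key).map(smallVal => (key, (smallVal, bigVal)))
    }

    joined.collect().foreach(println)
    sc.stop()
  }
}
{code}

With the facility proposed in this issue, the collectAsMap() round trip would disappear: the driver would broadcast the small-table RDD itself, and executors would build the hash map from data shipped directly to them, much like Tez's broadcast edge.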