[jira] [Commented] (SPARK-3621) Provide a way to broadcast an RDD (instead of just a variable made of the RDD) so that a job can access

Xuefu Zhang (JIRA) Mon, 02 Mar 2015 11:03:18 -0800

    [ 
https://issues.apache.org/jira/browse/SPARK-3621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14343582#comment-14343582
 ]


Xuefu Zhang commented on SPARK-3621:
------------------------------------

For Hive's map join, we build a hash table from a small table (a HadoopRDD in
Spark's terms) after some transformations. We want to broadcast the hash table
(which is written to HDFS) so that each executor can access it to do the join.
We considered Spark's broadcast variable for this purpose. However, a broadcast
variable requires the data to be shipped to the driver first and then broadcast
to every executor. We wanted to avoid this extra trip since the hash table is
already in HDFS. Thus, we wanted a mechanism to broadcast the dataset and make
it available (even better if in memory) at each executor, without shipping it
back to the driver. Referring to this dataset as an RDD might have caused the
confusion in the first place.
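
To make the extra trip concrete, here is a minimal PySpark sketch of the
broadcast-variable path described above. It is not Hive's code: the names
probe_partition and broadcast_map_join are hypothetical, and the hash table is
assumed to be a plain dict of key/value pairs.

```python
def probe_partition(rows, table):
    """Map-side join: emit big-table rows whose key is in the small table."""
    for key, value in rows:
        if key in table:
            yield key, (value, table[key])


def broadcast_map_join(sc, small_rdd, big_rdd):
    """Join via a broadcast variable; the small table passes through the driver."""
    # collectAsMap() pulls every record of the small table to the driver.
    table = small_rdd.collectAsMap()
    # The driver then ships the dict to the executors.
    shared = sc.broadcast(table)
    return big_rdd.mapPartitions(lambda rows: probe_partition(rows, shared.value))
```

The collectAsMap() call is the round trip we would like to avoid when the
table already sits in HDFS.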

Currently, we work around the problem by calling SparkContext.addFile() on the
driver and accessing the file using SparkFiles.get() on the executors.
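
A rough PySpark sketch of this workaround follows. It rests on several
assumptions: the hash table is stored as a pickled dict, and table_uri,
table_name, load_hash_table, join_with_local_table and addfile_map_join are
hypothetical names chosen for illustration. Only SparkContext.addFile() and
SparkFiles.get() are actual Spark calls.

```python
import pickle


def load_hash_table(local_path):
    """Load the pre-built hash table; the pickle format is an assumption."""
    with open(local_path, "rb") as handle:
        return pickle.load(handle)


def join_with_local_table(rows, local_path):
    """Probe the locally available hash table with one partition's rows."""
    table = load_hash_table(local_path)
    for key, value in rows:
        if key in table:
            yield key, (value, table[key])


def addfile_map_join(sc, big_rdd, table_uri, table_name):
    """Register the file on the driver, then read it on each executor."""
    # Imported lazily so the helpers above can be used without Spark installed.
    from pyspark import SparkFiles

    # The driver only registers the URI; it does not collect the table's rows.
    sc.addFile(table_uri)
    # SparkFiles.get() runs inside the task and returns the executor-local path.
    return big_rdd.mapPartitions(
        lambda rows: join_with_local_table(rows, SparkFiles.get(table_name))
    )
```

Here table_name is the file-name part of table_uri. Each task reloads the table
in this sketch; caching it per executor would be a natural refinement.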

> Provide a way to broadcast an RDD (instead of just a variable made of the 
> RDD) so that a job can access
> -------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-3621
>                 URL: https://issues.apache.org/jira/browse/SPARK-3621
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 1.0.0, 1.1.0
>            Reporter: Xuefu Zhang
>
> In some cases, such as Hive's way of doing map-side join, it would be
> beneficial to allow the client program to broadcast RDDs rather than just
> variables made of these RDDs. Broadcasting a variable made of RDDs requires
> that all RDD data be collected to the driver and that the variable be shipped
> to the cluster after being made. It would perform better if the driver just
> broadcast the RDDs and used the corresponding data in jobs (such as building
> hashmaps at executors).
> Tez has a broadcast edge which can ship data from a previous stage to the next
> stage, which doesn't require driver-side processing.



