[ https://issues.apache.org/jira/browse/PIG-5024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15469771#comment-15469771 ]
Xianda Ke commented on PIG-5024:
--------------------------------

To enable the broadcast mechanism, add a broadcast physical operator.
BroadcastConverter simply broadcasts the predecessor RDD and saves the broadcast variable to a map, which can be referenced by other functions/closures.
Now, RDDConverter.convert() will take three parameters: the predecessor RDDs, the map of broadcast variables, and a PhysicalOperator.

> add a physical operator to broadcast small RDDs
> -----------------------------------------------
>
>                 Key: PIG-5024
>                 URL: https://issues.apache.org/jira/browse/PIG-5024
>             Project: Pig
>          Issue Type: Sub-task
>          Components: spark
>            Reporter: Xianda Ke
>            Assignee: Xianda Ke
>             Fix For: spark-branch
>
>
> Currently, when optimizing some kinds of JOIN, the index or sampling files
> are saved to HDFS. With the replication set to a larger number, HDFS serves
> as a distributed cache.
> Spark's broadcast mechanism is suitable for this. It seems that we can add a
> physical operator to broadcast small RDDs.
> This will benefit the optimization of some specialized joins, such as Skewed
> Join, Replicated Join, and so on.
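
The converter contract described in the comment can be sketched roughly as follows. This is a Spark-free illustration, not the actual patch: a plain List<String> stands in for an RDD, an unmodifiable List stands in for a Spark broadcast variable, and SketchOperator, SketchConverter, BroadcastConverterSketch and MembershipJoinSketch are hypothetical names that do not exist in Pig. The downstream step is a simple membership filter standing in for a join, only to show how a later converter might look up what an earlier one registered in the shared map.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a physical operator: it only carries the key
// under which the broadcast variable is stored in the shared map.
final class SketchOperator {
    final String broadcastKey;
    SketchOperator(String broadcastKey) { this.broadcastKey = broadcastKey; }
}

// Hypothetical three-parameter contract: predecessors, broadcast map, operator.
interface SketchConverter {
    List<String> convert(List<List<String>> predecessors,
                         Map<String, List<String>> broadcastedVars,
                         SketchOperator op);
}

// Registers a read-only copy of its single predecessor in the shared map.
final class BroadcastConverterSketch implements SketchConverter {
    @Override
    public List<String> convert(List<List<String>> predecessors,
                                Map<String, List<String>> broadcastedVars,
                                SketchOperator op) {
        List<String> small = predecessors.get(0);
        broadcastedVars.put(op.broadcastKey,
                Collections.unmodifiableList(new ArrayList<>(small)));
        return small;
    }
}

// Downstream converter: keeps rows of the large input that also appear in
// the broadcast side (a membership filter, not a full join).
final class MembershipJoinSketch implements SketchConverter {
    @Override
    public List<String> convert(List<List<String>> predecessors,
                                Map<String, List<String>> broadcastedVars,
                                SketchOperator op) {
        List<String> small = broadcastedVars.get(op.broadcastKey);
        if (small == null) {
            throw new IllegalStateException(
                    "no broadcast variable registered for " + op.broadcastKey);
        }
        List<String> out = new ArrayList<>();
        for (String row : predecessors.get(0)) {
            if (small.contains(row)) {
                out.add(row);
            }
        }
        return out;
    }
}

class BroadcastSketchDemo {
    static List<String> run() {
        Map<String, List<String>> vars = new HashMap<>();
        SketchOperator op = new SketchOperator("small-side");
        List<String> small = Arrays.asList("b", "c");
        List<String> big = Arrays.asList("a", "b", "c", "d");
        // First the broadcast step, then the step that consumes the map.
        new BroadcastConverterSketch().convert(Collections.singletonList(small), vars, op);
        return new MembershipJoinSketch().convert(Collections.singletonList(big), vars, op);
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

In a real Spark-backed version the map values would presumably be broadcast handles created through the Spark context rather than in-memory lists; the sketch only shows the data flow through the shared map.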