[ https://issues.apache.org/jira/browse/PIG-5024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xianda Ke updated PIG-5024:
---------------------------
    Attachment: PIG-5024_4.patch

> add a physical operator to broadcast small RDDs
> -----------------------------------------------
>
>                 Key: PIG-5024
>                 URL: https://issues.apache.org/jira/browse/PIG-5024
>             Project: Pig
>          Issue Type: Sub-task
>          Components: spark
>            Reporter: Xianda Ke
>            Assignee: Xianda Ke
>             Fix For: spark-branch
>
>         Attachments: PIG-5024.patch, PIG-5024_2.patch, PIG-5024_3.patch, PIG-5024_4.patch
>
>
> Currently, when optimizing some kinds of JOIN, the index or sample files
> are saved to HDFS. With the replication factor set to a larger number, HDFS
> serves as a distributed cache.
> Spark's broadcast mechanism is suitable for this. It seems that we can add a
> physical operator to broadcast small RDDs.
> This will benefit the optimization of some specialized joins, such as Skewed
> Join, Replicated Join and so on.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
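
The broadcast idea in the issue description can be sketched roughly as follows. This is only an illustration in plain Python, not the code in the attached patches: the names `build_lookup` and `replicated_join` are hypothetical, and the list of lists stands in for the partitions of a large RDD. In a real Spark job the lookup table would presumably be collected on the driver and shipped once to each executor with `SparkContext.broadcast`, with tasks reading it through the broadcast variable's `value`, instead of being read back from a highly replicated HDFS file.

```python
from collections import defaultdict


def build_lookup(small_rows):
    """Index the small relation by join key.

    This dict is the piece that would be broadcast to every executor
    (assumption: it is small enough to fit in memory).
    """
    lookup = defaultdict(list)
    for key, value in small_rows:
        lookup[key].append(value)
    return dict(lookup)


def replicated_join(large_partition, lookup):
    """Inner-join one partition of the large relation against the lookup.

    No shuffle is needed: each partition only consults the local copy
    of the small relation.
    """
    for key, value in large_partition:
        for small_value in lookup.get(key, []):
            yield (key, value, small_value)


# Hypothetical data: a small relation and a large one split into two partitions.
small = [(1, "a"), (2, "b"), (2, "c")]
partitions = [[(1, "x"), (3, "y")], [(2, "z")]]

lookup = build_lookup(small)
joined = [row for part in partitions for row in replicated_join(part, lookup)]
print(joined)
```

Each partition is processed independently against the same read-only table, which is the property that makes a broadcast variable a natural fit for Replicated Join and for the sampling data used by Skewed Join.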