[ https://issues.apache.org/jira/browse/SPARK-13136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15133104#comment-15133104 ]
Apache Spark commented on SPARK-13136: -------------------------------------- User 'hvanhovell' has created a pull request for this issue: https://github.com/apache/spark/pull/11083 > Data exchange (shuffle, broadcast) should only be handled by the exchange > operator > ---------------------------------------------------------------------------------- > > Key: SPARK-13136 > URL: https://issues.apache.org/jira/browse/SPARK-13136 > Project: Spark > Issue Type: Improvement > Components: SQL > Reporter: Reynold Xin > > In an ideal architecture, we have a very small number of physical operators > that handle data exchanges, and the rest simply declare the input data > distribution needed and let the planner inject the right Exchange operators. > We have almost that, except the following few operators: > 1. Limit: does its own shuffle or collect to get data to a single partition. > 2. Except: does its own shuffle; note that this operator is going away and > will be replaced by anti-join (SPARK-12660). > 3. broadcast joins: broadcast joins do its own broadcast, which is a form of > data exchange. > Here are a straw man for limit. Split the current Limit operator into two: a > partition-local limit and a terminal limit. Partition-local limit is just a > normal unary operator. The terminal limit requires the input data > distribution to be a single partition, and then takes its own limit. We then > update the planner (strategies) to turn a logical limit into a partition > local limit and a terminal limit. > For broadcast join, it is more involved. We would need to design the > interface for the physical operators (e.g. we are no longer taking an > iterator as input on the probe side), and allow Exchange to handle data > broadcast. > Note that this is an important step towards creating a clear delineation > between distributed query execution and single-threaded query execution. -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org