[ https://issues.apache.org/jira/browse/SPARK-17556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15529476#comment-15529476 ]
Liang-Chi Hsieh edited comment on SPARK-17556 at 9/28/16 12:41 PM: ------------------------------------------------------------------- For example, assume we have 3 executors and the broadcasting object is split to 3 pieces too. You think executor side broadcast will end to 3 executors connecting to 3 executors. Let us see. t1: E1(p1) E2(p2) E3(p3) t2: E1 going to fetch p2 from E2, E2 going to fetch p3 from E3, E3 going to fetch p2 from E2 E1(p1, p2) E2(p2, p3) E3(p2, p3) t3: E1 going to fetch p3 from E2, E2 going to fetch p1 from E1, E3 going to fetch p1 from E1 E1(p1, p2, p3) E2(p1, p2, p3) E3(p1, p2, p3) Now all executors get all pieces of data. During the broadcast, E1 connected to E2, E2 connected to E1, E3, E3 connected to E1, E2. In above, E1 doesn't connect to E3. The simple analysis is based on the assumption that the operations is synchronized. But it already shows that the BitTorrent approach can relieve the all-to-all transferring problem. was (Author: viirya): For example, assume we have 3 executors and the broadcasting object is split to 3 pieces too. You think executor side broadcast will end to 3 executors connecting to 3 executors. Let us see. t1: E1(p1) E2(p2) E3(p3) t2: E1 going to fetch p2 from E2, E2 going to fetch p3 from E3, E3 going to fetch p2 from E2 E1(p1, p2) E2(p2, p3) E3(p2, p3) t3: E1 going to fetch p3 from E2, E2 going to fetch p1 from E1, E3 going to fetch p1 from E1 E1(p1, p2, p3) E2(p1, p2, p3) E3(p1, p2, p3) Now all executors get all pieces of data. During the broadcast, E1 connected to E2, E2 connected to E1, E3, E3 connected to E1, E2. In above, E1 doesn't connect to E3. The simple analysis is based on the assumption that the operations is synchronize. But it already shows that the BitTorrent approach can relieve the all-to-all transferring problem. > Executor side broadcast for broadcast joins > ------------------------------------------- > > Key: SPARK-17556 > URL: https://issues.apache.org/jira/browse/SPARK-17556 > Project: Spark > Issue Type: New Feature > Components: Spark Core, SQL > Reporter: Reynold Xin > Attachments: executor broadcast.pdf, executor-side-broadcast.pdf > > > Currently in Spark SQL, in order to perform a broadcast join, the driver must > collect the result of an RDD and then broadcast it. This introduces some > extra latency. It might be possible to broadcast directly from executors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org