For torrent broadcast, data are read directly through the block manager: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/broadcast/TorrentBroadcast.scala#L167
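
For the location tracking described in the original question, here is a minimal sketch in plain Scala of a local-then-remote block lookup that records where the block came from. It is a toy model, not Spark code: ToyBlockStore, LocatedBlock, Source, the block ids, and the host name are all hypothetical names made up for illustration, and the real BlockManager and BlockResult look different.

```scala
// Toy model only: these types are hypothetical and are NOT Spark's API.
sealed trait Source
case object Local extends Source
final case class Remote(host: String) extends Source

// A fetched block together with the place it was read from.
final case class LocatedBlock(data: Seq[Int], source: Source)

// Checks its own blocks first, then asks each peer in order.
final class ToyBlockStore(
    local: Map[String, Seq[Int]],
    peers: Seq[(String, Map[String, Seq[Int]])]) {

  def get(blockId: String): Option[LocatedBlock] =
    local.get(blockId).map(LocatedBlock(_, Local)).orElse {
      // First peer that holds the block wins; its host is recorded.
      peers.collectFirst {
        case (host, blocks) if blocks.contains(blockId) =>
          LocatedBlock(blocks(blockId), Remote(host))
      }
    }
}

object Demo {
  def main(args: Array[String]): Unit = {
    val store = new ToyBlockStore(
      local = Map("rdd_0_0" -> Seq(1, 2, 3)),
      peers = Seq("host-b" -> Map("rdd_0_1" -> Seq(4, 5))))

    // Print only the recorded source for a local hit, a remote hit, and a miss.
    println(store.get("rdd_0_0").map(_.source))
    println(store.get("rdd_0_1").map(_.source))
    println(store.get("rdd_0_2").map(_.source))
  }
}
```

Carrying the source alongside the data, rather than logging inside the store, lets the calling task decide what to do with the location, which is the shape of the BlockResult change described in the question.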

On Thu, Apr 9, 2015 at 7:27 AM, Zoltán Zvara <zoltan.zv...@gmail.com> wrote:

> Thanks! I've found the fetcher! Are there any other places or cases where
> blocks travel over the network?
>
> Zvara Zoltán
>
> 2015-04-09 10:24 GMT+02:00 Reynold Xin <r...@databricks.com>:
>
>> Take a look at the following two files:
>>
>> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/shuffle/hash/BlockStoreShuffleFetcher.scala
>>
>> and
>>
>> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala
>>
>> On Thu, Apr 9, 2015 at 1:15 AM, Zoltán Zvara <zoltan.zv...@gmail.com> wrote:
>>
>>> Dear Developers,
>>>
>>> I'm trying to investigate the communication pattern of the data flow
>>> during execution of a Spark program defined by an RDD chain. I'm
>>> investigating from the Task point of view, and found that the task type
>>> ResultTask (when retrieving the iterator for its RDD for a given
>>> partition) effectively asks the BlockManager to get the block from a
>>> local or remote location. What I do there is include the actual location
>>> data in BlockResult so the task can tell where it retrieved the data
>>> from. I've found that ResultTask can trigger a data flow only in this
>>> case.
>>>
>>> What's the case with the ShuffleMapTask? What happens there? I'm trying
>>> to log the locations that are involved in the shuffle process. I would
>>> be happy to receive a few hints on where remote communication is managed
>>> in the case of ShuffleMapTask.
>>>
>>> Thanks!
>>>
>>> Zoltán