[ https://issues.apache.org/jira/browse/SPARK-2288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Raymond Liu updated SPARK-2288:
-------------------------------
    Attachment: shuffleblockmanager.pdf

> Hide ShuffleBlockManager behind ShuffleManager
> ----------------------------------------------
>
>                 Key: SPARK-2288
>                 URL: https://issues.apache.org/jira/browse/SPARK-2288
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Block Manager, Shuffle
>            Reporter: Raymond Liu
>            Assignee: Raymond Liu
>         Attachments: shuffleblockmanager.pdf
>
>
> This is a sub-task of SPARK-2275.
> At present, in the shuffle write path, the shuffle block manager manages the
> mapping from some blockIds to FileSegments for the benefit of consolidated
> shuffle, and in doing so it bypasses the block store's blockId-based access
> mode. Then, in the read path, when reading shuffle block data, the disk store
> queries shuffleBlockManager to hack the normal blockId-to-file mapping in
> order to read the data from the file correctly. This leads to a lot of
> bi-directional dependencies between modules, and the code logic is somewhat
> messed up. Neither the shuffle block manager nor blockManager/DiskStore fully
> controls the read path; they are tightly coupled in low-level code modules.
> It also makes it hard to implement other shuffle manager logic, e.g. a
> sort-based shuffle that might merge all output from one map partition into a
> single file. That would require hacking further into
> diskStore/diskBlockManager etc. to find the right data to read.
>
> Possible approach:
> I think it might be better to expose a FileSegment-based read interface on
> DiskStore in addition to the current blockId-based interface. Then the code
> that maps a blockId to a FileSegment can all reside in the specific shuffle
> manager, if it does need to merge data into one single object. The shuffle
> managers take care of the mapping logic in both the read and write paths and
> take responsibility for reading and writing shuffle data.
> The BlockStore itself should just perform reads and writes as requested; it
> should not be involved in the data mapping logic at all. This might make the
> interfaces between modules clearer and decouple the modules from each other
> more cleanly.
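
To make the proposed split a little more concrete, here is a minimal sketch of how it might look. It is not Spark's actual code: `SegmentStore`, `ConsolidatingShuffleStore`, the simplified `FileSegment` case class, and the plain-string block ids are all hypothetical stand-ins chosen for illustration. The store only reads the byte range it is handed, while the shuffle side owns the blockId-to-FileSegment mapping on both the write and the read path.

```scala
import java.io.{File, RandomAccessFile}
import java.nio.ByteBuffer
import java.nio.file.{Files, StandardOpenOption}
import scala.collection.mutable

// Hypothetical, simplified stand-in for a (file, offset, length) triple.
final case class FileSegment(file: File, offset: Long, length: Long)

// Store side (assumed name): reads exactly the bytes it is asked for and
// knows nothing about how shuffle blocks are laid out in files.
object SegmentStore {
  def getBytes(segment: FileSegment): ByteBuffer = {
    val raf = new RandomAccessFile(segment.file, "r")
    try {
      val bytes = new Array[Byte](segment.length.toInt)
      raf.seek(segment.offset)
      raf.readFully(bytes)
      ByteBuffer.wrap(bytes)
    } finally raf.close()
  }
}

// Shuffle side (assumed name): appends many blocks to one file and keeps
// the blockId -> FileSegment mapping for both the write and read paths.
final class ConsolidatingShuffleStore(dataFile: File) {
  private val segments = mutable.Map.empty[String, FileSegment]

  def write(blockId: String, data: Array[Byte]): Unit = {
    val offset = dataFile.length() // the new block starts at the current end
    Files.write(dataFile.toPath, data,
      StandardOpenOption.CREATE, StandardOpenOption.APPEND)
    segments(blockId) = FileSegment(dataFile, offset, data.length.toLong)
  }

  // Resolve the mapping here, then delegate the raw read to the store.
  def read(blockId: String): Option[ByteBuffer] =
    segments.get(blockId).map(SegmentStore.getBytes)
}

object Demo extends App {
  val file = File.createTempFile("shuffle", ".data")
  file.deleteOnExit()
  val store = new ConsolidatingShuffleStore(file)
  store.write("shuffle_0_0_0", "abc".getBytes("UTF-8"))
  store.write("shuffle_0_0_1", "defgh".getBytes("UTF-8"))
  // Reads only the second block's byte range from the shared file.
  println(store.read("shuffle_0_0_1").map(b => new String(b.array(), "UTF-8")))
}
```

With this shape, a sort-based shuffle that merges a whole map partition's output into a single file would only need its own mapping class; the store-side read would stay unchanged.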