[ https://issues.apache.org/jira/browse/TEZ-2442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208888#comment-15208888 ]
Bikas Saha commented on TEZ-2442:
---------------------------------

IIRC, this is the same for both kinds of shuffle, because consumers can fetch and merge spills in a pipelined manner as they receive the DME for each spilled output. The physical fetch method (HTTP or FS) is likely not relevant. [~rajesh.balamohan] can correct me if this is inaccurate.

> Support DFS based shuffle in addition to HTTP shuffle
> -----------------------------------------------------
>
>                 Key: TEZ-2442
>                 URL: https://issues.apache.org/jira/browse/TEZ-2442
>             Project: Apache Tez
>          Issue Type: Improvement
>   Affects Versions: 0.5.3
>            Reporter: Kannan Rajah
>            Assignee: Kannan Rajah
>         Attachments: HDFS_based_shuffle_v2.pdf, Tez Shuffle using DFS.pdf, hdfs_broadcast_hack.txt, tez_hdfs_shuffle.patch
>
>
> In Tez, Shuffle is a mechanism by which intermediate data can be shared
> between stages. Shuffle data is written to local disk and fetched from any
> remote node using HTTP. A DFS such as the MapR file system can support writing
> this shuffle data directly to the DFS using a notion of local volumes, and
> retrieving it from a remote node using the HDFS API. The current Shuffle
> implementation assumes local data can only be managed by LocalFileSystem, so
> it uses RawLocalFileSystem and LocalDirAllocator. If we can remove this
> assumption and introduce an abstraction to manage local disks, then we can
> reuse most of the shuffle logic (store, sort) and inject an HDFS API based
> retrieval instead of HTTP.
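
To illustrate the idea of keeping the consumer logic independent of the physical fetch method, here is a minimal sketch. It is not Tez code: SpillFetcher, HttpSpillFetcher, FsSpillFetcher and consume are hypothetical names invented for this example, java.nio stands in for a DFS client, and the "merge" is plain concatenation rather than a sorted merge.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ShuffleFetchSketch {

    /** Hypothetical abstraction: how a consumer retrieves one spilled output. */
    interface SpillFetcher {
        byte[] fetch(String location) throws IOException;
    }

    /** Retrieval over HTTP, standing in for the existing shuffle path. */
    static class HttpSpillFetcher implements SpillFetcher {
        @Override
        public byte[] fetch(String location) throws IOException {
            try (InputStream in = URI.create(location).toURL().openStream()) {
                return in.readAllBytes();
            }
        }
    }

    /** Retrieval through a file system API (assumption: java.nio stands in for a DFS client). */
    static class FsSpillFetcher implements SpillFetcher {
        @Override
        public byte[] fetch(String location) throws IOException {
            return Files.readAllBytes(Path.of(location));
        }
    }

    /**
     * Consumer logic that does not know which transport is used.
     * Spills are fetched one at a time, in the order they are announced,
     * and appended to the result (a stand-in for a real merge).
     */
    static String consume(SpillFetcher fetcher, String... spillLocations) throws IOException {
        StringBuilder merged = new StringBuilder();
        for (String location : spillLocations) {
            merged.append(new String(fetcher.fetch(location), StandardCharsets.UTF_8));
        }
        return merged.toString();
    }

    public static void main(String[] args) throws IOException {
        // Write two small "spills" to a temporary directory and read them back.
        Path dir = Files.createTempDirectory("spills");
        Path spill0 = Files.writeString(dir.resolve("spill_0"), "a=1;");
        Path spill1 = Files.writeString(dir.resolve("spill_1"), "b=2;");
        System.out.println(consume(new FsSpillFetcher(), spill0.toString(), spill1.toString()));
    }
}
```

Swapping FsSpillFetcher for HttpSpillFetcher changes only the argument passed to consume, which is the property the comment above relies on.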