[ https://issues.apache.org/jira/browse/TEZ-2442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208898#comment-15208898 ]
Hitesh Shah commented on TEZ-2442:
----------------------------------

[~rkannan82] Are you still looking to contribute/implement this feature? It seems like [~shanyu] has also started taking a look at this. Would you mind if we re-assign this jira to him if you are not planning to look at this in the near future?

> Support DFS based shuffle in addition to HTTP shuffle
> -----------------------------------------------------
>
>                 Key: TEZ-2442
>                 URL: https://issues.apache.org/jira/browse/TEZ-2442
>             Project: Apache Tez
>          Issue Type: Improvement
>    Affects Versions: 0.5.3
>            Reporter: Kannan Rajah
>            Assignee: Kannan Rajah
>         Attachments: HDFS_based_shuffle_v2.pdf, Tez Shuffle using DFS.pdf,
> hdfs_broadcast_hack.txt, tez_hdfs_shuffle.patch
>
>
> In Tez, Shuffle is a mechanism by which intermediate data can be shared
> between stages. Shuffle data is written to local disk and fetched from any
> remote node using HTTP. A DFS like MapR file system can support writing this
> shuffle data directly to its DFS using a notion of local volumes and retrieve
> it using HDFS API from remote node. The current Shuffle implementation
> assumes local data can only be managed by LocalFileSystem. So it uses
> RawLocalFileSystem and LocalDirAllocator. If we can remove this assumption
> and introduce an abstraction to manage local disks, then we can reuse most of
> the shuffle logic (store, sort) and inject a HDFS API based retrieval instead
> of HTTP.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)