[ https://issues.apache.org/jira/browse/SPARK-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14493251#comment-14493251 ]
Kannan Rajah commented on SPARK-1529:
-------------------------------------

You can use the Compare functionality to see a single page of diffs across commits. Here is the link: https://github.com/rkannan82/spark/compare/4aaf48d46d13129f0f9bdafd771dd80fe568a7dc...rkannan82:7195353a31f7cfb087ec804b597b01fb362bc3f6

A few clarifications.

1. There are two reasons for introducing a FileSystem abstraction in Spark instead of using the Hadoop FileSystem directly.
- There are Spark shuffle-specific APIs that needed abstraction. Please take a look at this code: https://github.com/rkannan82/spark/blob/dfs_shuffle/core/src/main/scala/org/apache/spark/storage/FileSystem.scala
- For local file system access, we can choose to bypass Hadoop's local file system implementation if it is not efficient. If you look at LocalFileSystem.scala, most of its APIs simply delegate to the old code paths (Spark's disk block manager, etc.). In fact, this single class is enough to determine whether the default Apache shuffle code path will suffer any performance degradation: https://github.com/rkannan82/spark/blob/dfs_shuffle/core/src/main/scala/org/apache/spark/storage/LocalFileSystem.scala

2. During the write phase, we shuffle to HDFS instead of the local file system. While reading back, we do not use the Netty-based transport that the Apache shuffle uses; instead, a new implementation called DFSShuffleClient reads from HDFS. That is the main difference.
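To illustrate the shape of point 1, here is a minimal sketch of such an abstraction. The trait name, method signatures, and the java.io-backed local implementation are assumptions for illustration only; they are not the actual API in the branch's FileSystem.scala, and the real LocalFileSystem delegates to Spark's disk block manager rather than raw java.io.

```scala
import java.io.{File, FileInputStream, FileOutputStream, InputStream, OutputStream}

// Hypothetical abstraction standing in for the branch's
// org.apache.spark.storage.FileSystem: the shuffle code talks only to
// this trait, so an HDFS-backed implementation can be swapped in.
trait ShuffleFileSystem {
  def open(path: String): InputStream
  def create(path: String): OutputStream
  def exists(path: String): Boolean
  def delete(path: String): Boolean
}

// Local implementation delegating straight to java.io, mirroring how
// LocalFileSystem.scala delegates to the existing local-disk code path
// instead of going through Hadoop's LocalFileSystem.
class LocalShuffleFileSystem extends ShuffleFileSystem {
  def open(path: String): InputStream = new FileInputStream(path)
  def create(path: String): OutputStream = new FileOutputStream(path)
  def exists(path: String): Boolean = new File(path).exists()
  def delete(path: String): Boolean = new File(path).delete()
}
```

Because the default path keeps its own thin local implementation, any performance question for the stock Apache shuffle reduces to reviewing that one class, as noted above.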
https://github.com/rkannan82/spark/blob/dfs_shuffle/network/shuffle/src/main/java/org/apache/spark/network/shuffle/DFSShuffleClient.java

> Support setting spark.local.dirs to a hadoop FileSystem
> --------------------------------------------------------
>
>                 Key: SPARK-1529
>                 URL: https://issues.apache.org/jira/browse/SPARK-1529
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>            Reporter: Patrick Wendell
>            Assignee: Kannan Rajah
>         Attachments: Spark Shuffle using HDFS.pdf
>
> In some environments, like with MapR, local volumes are accessed through the
> Hadoop filesystem interface. We should allow setting spark.local.dir to a
> Hadoop filesystem location.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org