[ 
https://issues.apache.org/jira/browse/SPARK-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14493251#comment-14493251
 ] 

Kannan Rajah commented on SPARK-1529:
-------------------------------------

You can use the Compare functionality to see a single page of diffs across 
commits. Here is the link: 
https://github.com/rkannan82/spark/compare/4aaf48d46d13129f0f9bdafd771dd80fe568a7dc...rkannan82:7195353a31f7cfb087ec804b597b01fb362bc3f6

A few clarifications.
1. There are two reasons for introducing a FileSystem abstraction in Spark 
instead of directly using the Hadoop FileSystem.
  - There are shuffle-specific APIs in Spark that needed an abstraction. 
Please take a look at this code:
https://github.com/rkannan82/spark/blob/dfs_shuffle/core/src/main/scala/org/apache/spark/storage/FileSystem.scala

  - For local file system access, we can choose to bypass Hadoop's local 
file system implementation if it's not efficient. If you look at 
LocalFileSystem.scala, most of its APIs simply delegate to the existing 
code that uses Spark's disk block manager. In fact, we can look at this 
single class alone to determine whether the default Apache shuffle code 
path will see any performance degradation.
https://github.com/rkannan82/spark/blob/dfs_shuffle/core/src/main/scala/org/apache/spark/storage/LocalFileSystem.scala
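To make the delegation idea concrete, here is a minimal, self-contained 
sketch in Java. The interface and method names are illustrative assumptions 
only; the actual abstraction lives in the FileSystem.scala and 
LocalFileSystem.scala files linked above.

```java
import java.io.*;

// Hypothetical sketch of the abstraction; names are illustrative,
// not the actual APIs in the linked branch.
interface ShuffleFileSystem {
    OutputStream create(File file) throws IOException;  // write a shuffle file
    InputStream open(File file) throws IOException;     // read it back
    boolean delete(File file);
}

// A local implementation simply delegates to plain java.io, much as
// LocalFileSystem.scala delegates to the existing disk block manager code.
class LocalShuffleFileSystem implements ShuffleFileSystem {
    public OutputStream create(File file) throws IOException {
        return new FileOutputStream(file);
    }
    public InputStream open(File file) throws IOException {
        return new FileInputStream(file);
    }
    public boolean delete(File file) {
        return file.delete();
    }
}
```

The point of the interface is that the shuffle writer/reader code only sees 
ShuffleFileSystem, so swapping in an HDFS-backed implementation does not 
touch the default local code path.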

2. During the write phase, we shuffle to HDFS instead of the local file 
system. While reading back, we don't use the Netty-based transport that the 
Apache shuffle uses. Instead, a new implementation called DFSShuffleClient 
reads the shuffle output from HDFS. That is the main difference.
https://github.com/rkannan82/spark/blob/dfs_shuffle/network/shuffle/src/main/java/org/apache/spark/network/shuffle/DFSShuffleClient.java
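The read-side idea can be sketched as follows. This is an illustrative 
stand-in only: the class name and path layout are assumptions, and java.nio 
is used in place of the Hadoop FileSystem API so the example runs on its 
own.

```java
import java.io.IOException;
import java.nio.file.*;

// Hypothetical sketch of the read side: instead of fetching blocks over
// Netty from the executor that wrote them, resolve the block to a path on
// the shared filesystem and read it directly. The actual DFSShuffleClient
// uses the Hadoop FileSystem API; the path naming scheme here is an
// assumption for illustration.
class DfsShuffleReader {
    private final Path shuffleRoot;

    DfsShuffleReader(Path shuffleRoot) {
        this.shuffleRoot = shuffleRoot;
    }

    // Map a (shuffleId, mapId, reduceId) triple to a file on shared storage.
    private Path blockPath(int shuffleId, int mapId, int reduceId) {
        return shuffleRoot.resolve(
            String.format("shuffle_%d_%d_%d.data", shuffleId, mapId, reduceId));
    }

    byte[] readBlock(int shuffleId, int mapId, int reduceId) throws IOException {
        return Files.readAllBytes(blockPath(shuffleId, mapId, reduceId));
    }
}
```

Because every executor can resolve the same path on shared storage, no 
block transfer between executors is needed on the read path.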

> Support setting spark.local.dirs to a hadoop FileSystem 
> --------------------------------------------------------
>
>                 Key: SPARK-1529
>                 URL: https://issues.apache.org/jira/browse/SPARK-1529
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>            Reporter: Patrick Wendell
>            Assignee: Kannan Rajah
>         Attachments: Spark Shuffle using HDFS.pdf
>
>
> In some environments, like with MapR, local volumes are accessed through the 
> Hadoop filesystem interface. We should allow setting spark.local.dir to a 
> Hadoop filesystem location. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
