[ 
https://issues.apache.org/jira/browse/SPARK-6313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14360896#comment-14360896
 ] 

Josh Rosen commented on SPARK-6313:
-----------------------------------

Thanks for the pointer to the Lucene lock factory code.

It's fine for the locks to be advisory in the sense that things shouldn't break 
if multiple executors acquire the lock and try to download the same file, but 
there's potentially a problem if the lock isn't released after the JVM that 
acquired it exits abnormally, since this could cause other executors to block 
indefinitely while waiting for the original lock owner to download the file.  
One approach might be to write the PID of the original lock owner into the lock 
file, which would allow blocked executors to time out and re-attempt the lock 
acquisition if they detect that the original lock holder has died.  This might 
face its own portability challenges, though, and seems complex.
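
To make that idea concrete, here's a minimal sketch (all names are hypothetical, 
and the /proc-based staleness check is Linux-only, which is exactly the 
portability concern above):

{code:scala}
import java.io.File
import java.lang.management.ManagementFactory
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

object PidLockSketch {
  // Best-effort PID of the current JVM; RuntimeMXBean.getName is "pid@host" on
  // most JVMs, but this format is not guaranteed by the spec.
  private def currentPid: String =
    ManagementFactory.getRuntimeMXBean.getName.split("@").head

  // Try to create the lock file; on success, record our PID inside it.
  // (File.createNewFile is itself only best-effort atomic on NFS.)
  def tryAcquire(lockFile: File): Boolean = {
    if (lockFile.createNewFile()) {
      Files.write(lockFile.toPath, currentPid.getBytes(StandardCharsets.UTF_8))
      true
    } else {
      false
    }
  }

  // Linux-only staleness check: if /proc/<pid> no longer exists, the original
  // owner has died and a blocked executor could delete the file and retry.
  def isStale(lockFile: File): Boolean = {
    val pid =
      new String(Files.readAllBytes(lockFile.toPath), StandardCharsets.UTF_8).trim
    pid.nonEmpty && !Files.exists(Paths.get("/proc", pid))
  }
}
{code}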

A simple hotfix might be to add a SparkConf setting that forces this caching to 
always be bypassed (this would be a two-line change to Executor.scala).  This 
would lose the performance benefits of the caching, though.
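
As a sketch of what that setting could look like (the key name 
"spark.files.useFetchCache" is just illustrative here, not an existing config):

{code:scala}
import org.apache.spark.SparkConf

// Sketch only: "spark.files.useFetchCache" is an illustrative key name, not an
// existing setting. Executor.scala would read it once and pass the result as
// the useCache argument when it calls Utils.fetchFile.
object FetchCacheFlag {
  def useFetchCache(conf: SparkConf): Boolean =
    conf.getBoolean("spark.files.useFetchCache", true)
}
{code}

Executor.scala would then consult this flag when deciding whether to call 
Utils.fetchFile with useCache = true.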

If you're using NFS and the shared filesystem is mounted at the same path on 
all nodes, I think that you should be able to use {{local://path/to/nfs/}} 
to specify the paths to your files / JARs, which will cause them to be read 
from the executor-local filesystem rather than fetched remotely.  In this case 
they would be read from NFS, so you may be able to use this technique to 
recover any performance benefits for large files that would otherwise be lost 
by disabling the caching.
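
For example (the paths below are made up; adjust them to wherever the NFS mount 
lives on your nodes):

{code:scala}
import org.apache.spark.SparkConf

object LocalSchemeExample {
  // Illustrative only: the paths are made up. With the "local:" scheme the
  // listed jars/files are expected to already exist at that path on every
  // node, so executors open them from their own filesystem (here, the NFS
  // mount) instead of fetching them from the driver.
  val conf = new SparkConf()
    .setAppName("nfs-local-deps")
    .set("spark.jars", "local:/mnt/nfs/libs/app.jar")
    .set("spark.files", "local:/mnt/nfs/conf/lookup.csv")
}
{code}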

I'd be happy to review patches for this issue.

> Fetch File Lock file creation doesn't work when Spark working dir is on an 
> NFS mount
> ----------------------------------------------------------------------------------
>
>                 Key: SPARK-6313
>                 URL: https://issues.apache.org/jira/browse/SPARK-6313
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.2.0, 1.3.0, 1.2.1
>            Reporter: Nathan McCarthy
>            Priority: Critical
>
> When running in cluster mode and mounting the Spark work dir on an NFS volume 
> (or some volume which doesn't support file locking), the fetchFile method 
> (used for downloading JARs etc. on the executors) in Spark's Utils class will fail. 
> This file locking was introduced as an improvement with SPARK-2713. 
> See 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L415
>  
> Introduced in 1.2 in commit: 
> https://github.com/apache/spark/commit/7aacb7bfad4ec73fd8f18555c72ef696 
> As this locking is an optimisation for fetching files, could we take a 
> different approach here and create a temp/advisory lock file? 
> Typically you would just mount local disks (in, say, ext4 format) and provide 
> these as a comma-separated list; however, we are trying to run Spark on MapR. 
> With MapR we can do a loopback mount to a volume on the local node and take 
> advantage of MapR's disk pools. This also means we don't need specific mounts 
> for Spark, which improves the generic nature of the cluster. 
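
For context, the locking the description above refers to follows roughly this 
java.nio FileLock pattern (a sketch, not the actual Utils.fetchFile code); it 
is the lock() call that fails on mounts without locking support:

{code:scala}
import java.io.{File, RandomAccessFile}

// Rough sketch of the FileLock pattern that SPARK-2713 added around the cached
// fetch (not the actual Utils.fetchFile code). On filesystems without locking
// support (e.g. some NFS mounts) the lock() call can throw an IOException,
// which is the failure described above.
object FetchLockSketch {
  def withFetchLock(lockFile: File)(downloadIntoCache: => Unit): Unit = {
    val raf = new RandomAccessFile(lockFile, "rw")
    try {
      val lock = raf.getChannel.lock() // may fail with "No locks available" on NFS
      try {
        downloadIntoCache              // only one JVM at a time runs the download
      } finally {
        lock.release()
      }
    } finally {
      raf.close()
    }
  }
}
{code}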



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
