[ https://issues.apache.org/jira/browse/SPARK-16299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-16299:
------------------------------------

    Assignee: Apache Spark

> Capture errors from R workers in daemon.R to avoid deletion of R session 
> temporary directory
> --------------------------------------------------------------------------------------------
>
>                 Key: SPARK-16299
>                 URL: https://issues.apache.org/jira/browse/SPARK-16299
>             Project: Spark
>          Issue Type: Bug
>          Components: SparkR
>    Affects Versions: 1.6.2
>            Reporter: Sun Rui
>            Assignee: Apache Spark
>
> Running SparkR unit tests randomly has the following error:
> Failed 
> -------------------------------------------------------------------------
> 1. Error: pipeRDD() on RDDs (@test_rdd.R#428) 
> ----------------------------------
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 792.0 failed 1 times, most recent failure: Lost task 0.0 in stage 792.0 
> (TID 1493, localhost): org.apache.spark.SparkException: R computation failed 
> with
>  [1] 1
> [1] 1
> [1] 2
> [1] 2
> [1] 3
> [1] 3
> [1] 2
> [1] 2
> [1] 2
> [1] 2
> [1] 2
> [1] 2
> ignoring SIGPIPE signal
> Calls: source ... <Anonymous> -> lapply -> lapply -> FUN -> writeRaw -> 
> writeBin
> Execution halted
> cannot open the connection
> Calls: source ... computeFunc -> FUN -> system2 -> writeLines -> file
> In addition: Warning message:
> In file(con, "w") :
>   cannot open file '/tmp/Rtmp0Gr1aU/file2de3efc94b3': No such file or 
> directory
> Execution halted
>       at org.apache.spark.api.r.RRunner.compute(RRunner.scala:108)
>       at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:49)
>       at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>       at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>       at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>       at org.apache.spark.scheduler.Task.run(Task.scala:85)
>       at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>       at java.lang.Thread.run(Thread.java:745)
> This is related to the daemon R worker mode. By default, SparkR launches one 
> R daemon worker per executor and forks R workers from the daemon when 
> necessary.
> The problem with forking R workers is that all forked R processes share a 
> single session temporary directory, as documented at 
> https://stat.ethz.ch/R-manual/R-devel/library/base/html/tempfile.html.
> When any forked R worker exits, whether normally or because of an error, R's 
> cleanup procedure deletes that temporary directory. This affects the 
> still-running forked R workers, because any temporary files they created 
> under that directory are removed along with it. All future R workers forked 
> from the daemon are also affected: if they call tempdir() or tempfile() to 
> obtain temporary files, they will fail to create them under the 
> already-deleted session temporary directory.
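> The shared-directory behavior can be observed with a small R sketch 
> (hypothetical illustration only, not part of the fix; it uses 
> parallel:::mcfork/mcexit as daemon.R does, so it assumes a Unix platform 
> where forking is available):
> {code}
> # Sketch: a forked child shares the parent's session temporary directory.
> parent_tmp <- tempdir()
> child <- parallel:::mcfork()
> if (inherits(child, "masterProcess")) {
>   # We are in the forked child: tempdir() returns the same path as in the
>   # parent, so a normal exit here would run R's cleanup and delete the
>   # directory out from under the parent and any sibling workers.
>   stopifnot(identical(tempdir(), parent_tmp))
>   parallel:::mcexit(0L)   # exit directly, skipping R's cleanup procedure
> }
> {code}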
> So for the daemon mode to work, this problem must be circumvented. In the 
> current daemon.R, R workers exit directly, skipping R's cleanup procedure, so 
> that the shared temporary directory is not deleted.
> {code}
>       source(script)
>       # Set SIGUSR1 so that child can exit
>       tools::pskill(Sys.getpid(), tools::SIGUSR1)
>       parallel:::mcexit(0L)
> {code}
> However, there is a bug in daemon.R: when any execution error occurs in an R 
> worker, R's error handling eventually falls through to the cleanup 
> procedure. try() should therefore be used in daemon.R to catch any error in 
> the R workers, so that they still exit directly. 
> {code}
>       try(source(script))
>       # Set SIGUSR1 so that child can exit
>       tools::pskill(Sys.getpid(), tools::SIGUSR1)
>       parallel:::mcexit(0L)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
