[jira] [Commented] (SPARK-9559) Worker redundancy/failover in spark stand-alone mode
[ https://issues.apache.org/jira/browse/SPARK-9559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14651984#comment-14651984 ]

partha bishnu commented on SPARK-9559:
--

Thanks. If I understand correctly, --num-executors is for deploying on a YARN cluster and --total-executor-cores is for a Spark standalone cluster. I am using a Spark standalone cluster.

Worker redundancy/failover in spark stand-alone mode

Key: SPARK-9559
URL: https://issues.apache.org/jira/browse/SPARK-9559
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 1.3.0
Reporter: partha bishnu

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
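For reference, the flag distinction discussed here can be sketched as follows (a hedged example: the master URL, class name, and jar are placeholders, not taken from this issue):

```shell
# On YARN, the executor count is requested directly:
spark-submit \
  --master yarn \
  --num-executors 4 \
  --executor-cores 2 \
  --class com.example.MyApp myapp.jar

# On a standalone cluster, you instead cap the total cores across the
# cluster; the master spreads executors over workers up to that cap:
spark-submit \
  --master spark://n-1:7077 \
  --total-executor-cores 1 \
  --executor-memory 1G \
  --class com.example.MyApp myapp.jar
```

With --total-executor-cores 1, at most one core (and therefore at most one executor) is ever allocated, which matches the single-executor setup described in this issue.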
[ https://issues.apache.org/jira/browse/SPARK-9559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14651907#comment-14651907 ]

partha bishnu commented on SPARK-9559:
--

The expected behavior should be that the Spark master on n-1 restarts the jobs with one new executor under the worker JVM still up and running on the other worker node (n-3) after n-2 went down. Isn't that the expected behavior? But that does not happen. Thanks for your comments.
[ https://issues.apache.org/jira/browse/SPARK-9559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14651923#comment-14651923 ]

Sean Owen commented on SPARK-9559:
--

OK, so you have requested 1 total executor. Did the job fail then? Or are you talking about the state after it completed?
[ https://issues.apache.org/jira/browse/SPARK-9559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14651924#comment-14651924 ]

Sean Owen commented on SPARK-9559:
--

PS: you should try reproducing this on master rather than 1.3, which is relatively old at this stage.
[ https://issues.apache.org/jira/browse/SPARK-9559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14652933#comment-14652933 ]

partha bishnu commented on SPARK-9559:
--

We tested on 1.4.1 and got the same result, i.e. a new executor JVM did not get started on the other worker node after the node running the jobs stopped. So it seems like a major defect.
[ https://issues.apache.org/jira/browse/SPARK-9559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14651965#comment-14651965 ]

partha bishnu commented on SPARK-9559:
--

Hi, yes, I requested 1 executor, as I mentioned in the original description (I used --total-executor-cores 1 with spark-submit). We have been using 1.3 so far; as you suggested, we will try to reproduce on 1.4 and report back. Thanks again for looking into it.

To recap, with --total-executor-cores 1 and checkpointing enabled, I have:

node-1: Spark master running.
node-2: 1 worker JVM running; can start at most one executor.
node-3: 1 worker JVM running; can start at most one executor.

I launched jobs using spark-submit, which started them in one executor on node-2. I then killed node-2 (both the worker JVM and the executor).

Expected behavior: the Spark master should ask the worker JVM on node-3 to launch a new executor and restart the jobs in that executor.
Observed behavior: the jobs got stuck.
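The recap above can be sketched as a reproduction script (hedged: host names, the master port, and the app jar/class are placeholders, and the kill step assumes shell access to node-2 — none of these commands are taken verbatim from the report):

```shell
# Reproduction sketch for the failover test described above.
# Assumptions: standalone master at spark://node-1:7077, one worker
# each on node-2 and node-3, and a placeholder application jar.

# 1. Submit with a single total core, so exactly one executor is used:
spark-submit \
  --master spark://node-1:7077 \
  --total-executor-cores 1 \
  --executor-memory 1G \
  --class com.example.StreamingApp app.jar

# 2. On the node that received the executor (node-2 here), kill both
#    the worker JVM and the executor JVM:
ssh node-2 'jps | awk "/Worker|CoarseGrainedExecutorBackend/ {print \$1}" | xargs kill -9'

# 3. Expected: the master notices the lost executor and asks node-3's
#    worker to launch a replacement. Observed (per this report): no new
#    executor is launched and the driver hangs.
```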
[ https://issues.apache.org/jira/browse/SPARK-9559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14651968#comment-14651968 ]

Sean Owen commented on SPARK-9559:
--

total-executor-cores isn't the same as num-executors, but 1 total core must mean 1 executor, yes. Use master (which is nearly 1.5), not 1.4, just to be most useful.
[ https://issues.apache.org/jira/browse/SPARK-9559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14651882#comment-14651882 ]

partha bishnu commented on SPARK-9559:
--

Hi, I am running some tests on Spark in standalone mode with a 3-node cluster. The Spark master is running on n-1, and the slaves are on n-2 and n-3. Each machine has 8G RAM and a 4-core CPU. I am trying to test worker redundancy.

I set up the cluster so that there would be two worker JVMs, one on each slave (n-2 and n-3), after startup. One slave's worker JVM then launches an executor JVM to process the tasks when I submit the job with the flags --total-executor-cores 1 and --executor-memory 1G.

(1) The job submitted successfully in client mode. The worker JVM on n-2 launched an executor JVM, so n-2 had one worker JVM and one executor JVM running, while n-3 just had its worker JVM running as before.
(2) I killed the worker JVM and the executor JVM on n-2.
(3) I expected the Spark master on n-1 to then ask the worker JVM on n-3 to launch a new executor and resume processing the jobs, but that did not happen. The driver just hung on the screen. n-2 disappeared from the Spark cluster as expected; n-3 still had just its worker JVM running, and no new executor was launched after n-2 disappeared, contrary to expectation.
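A minimal standalone-cluster configuration matching the layout described above could look like the following (hedged: the host names, memory headroom, and sizing are illustrative assumptions, not settings quoted from this issue):

```shell
# conf/spark-env.sh on each slave (n-2 and n-3), 8G RAM / 4-core boxes.
# One worker JVM per slave is the default (SPARK_WORKER_INSTANCES=1).
export SPARK_WORKER_CORES=4      # cores this worker may hand to executors
export SPARK_WORKER_MEMORY=6g    # leave headroom for the OS and worker JVM

# conf/slaves on n-1 lists the worker hosts; then run sbin/start-all.sh
# from n-1 to start the master plus one worker on each listed host:
#   n-2
#   n-3
```

With this layout, each worker advertises 4 cores, so a --total-executor-cores 1 submission should be satisfiable by either worker, which is why the reporter expects the surviving worker to pick up a replacement executor.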
[ https://issues.apache.org/jira/browse/SPARK-9559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14651891#comment-14651891 ]

Sean Owen commented on SPARK-9559:
--

You should see 1 executor per worker. You lost an entire worker, so your jobs now use 1 executor each. I think this is expected behavior?