[jira] [Commented] (SPARK-7736) Exception not failing Python applications (in yarn cluster mode)
[ https://issues.apache.org/jira/browse/SPARK-7736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14952263#comment-14952263 ]

Shay Rojansky commented on SPARK-7736:
--------------------------------------

Have just tested this with Spark 1.5.1 on Yarn 2.7.1 and the problem is still there - an exception thrown after the SparkContext has been created terminates the application, but Yarn reports it as succeeded.

> Exception not failing Python applications (in yarn cluster mode)
> ----------------------------------------------------------------
>
> Key: SPARK-7736
> URL: https://issues.apache.org/jira/browse/SPARK-7736
> Project: Spark
> Issue Type: Bug
> Components: YARN
> Environment: Spark 1.3.1, Yarn 2.7.0, Ubuntu 14.04
> Reporter: Shay Rojansky
> Assignee: Marcelo Vanzin
> Fix For: 1.5.1, 1.6.0
>
> It seems that exceptions thrown in Python spark apps after the SparkContext
> is instantiated don't cause the application to fail, at least in Yarn: the
> application is marked as SUCCEEDED.
> Note that any exception right before the SparkContext correctly places the
> application in FAILED state.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8119) HeartbeatReceiver should not adjust application executor resources
[ https://issues.apache.org/jira/browse/SPARK-8119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630799#comment-14630799 ]

Shay Rojansky commented on SPARK-8119:
--------------------------------------

Thanks Andrew!

> HeartbeatReceiver should not adjust application executor resources
> ------------------------------------------------------------------
>
> Key: SPARK-8119
> URL: https://issues.apache.org/jira/browse/SPARK-8119
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.4.0
> Reporter: SaintBacchus
> Assignee: Andrew Or
> Priority: Critical
> Fix For: 1.5.0
>
> DynamicAllocation will set the total executor to a little number when it
> wants to kill some executors.
> But in no-DynamicAllocation scenario, Spark will also set the total executor.
> So it will cause such problem: sometimes an executor fails down, there is no
> more executor which will be pull up by spark.
>
> === EDIT by andrewor14 ===
>
> The issue is that the AM forgets about the original number of executors it
> wants after calling sc.killExecutor. Even if dynamic allocation is not
> enabled, this is still possible because of heartbeat timeouts.
> I think the problem is that sc.killExecutor is used incorrectly in
> HeartbeatReceiver. The intention of the method is to permanently adjust the
> number of executors the application will get. In HeartbeatReceiver, however,
> this is used as a best-effort mechanism to ensure that the timed out executor
> is dead.
[jira] [Commented] (SPARK-8119) HeartbeatReceiver should not adjust application executor resources
[ https://issues.apache.org/jira/browse/SPARK-8119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630380#comment-14630380 ]

Shay Rojansky commented on SPARK-8119:
--------------------------------------

Will this really not be fixed before 1.5? This issue makes Spark 1.4 unusable in a Yarn environment where preemption may happen.

> HeartbeatReceiver should not adjust application executor resources
> ------------------------------------------------------------------
>
> Key: SPARK-8119
> URL: https://issues.apache.org/jira/browse/SPARK-8119
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.4.0
> Reporter: SaintBacchus
> Assignee: Andrew Or
> Priority: Critical
>
> DynamicAllocation will set the total executor to a little number when it
> wants to kill some executors.
> But in no-DynamicAllocation scenario, Spark will also set the total executor.
> So it will cause such problem: sometimes an executor fails down, there is no
> more executor which will be pull up by spark.
>
> === EDIT by andrewor14 ===
>
> The issue is that the AM forgets about the original number of executors it
> wants after calling sc.killExecutor. Even if dynamic allocation is not
> enabled, this is still possible because of heartbeat timeouts.
> I think the problem is that sc.killExecutor is used incorrectly in
> HeartbeatReceiver. The intention of the method is to permanently adjust the
> number of executors the application will get. In HeartbeatReceiver, however,
> this is used as a best-effort mechanism to ensure that the timed out executor
> is dead.
[jira] [Commented] (SPARK-7736) Exception not failing Python applications (in yarn cluster mode)
[ https://issues.apache.org/jira/browse/SPARK-7736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14621832#comment-14621832 ]

Shay Rojansky commented on SPARK-7736:
--------------------------------------

Neelesh, not sure I understood what you're saying exactly... I agree with Esben that at the end of the day, if a Spark application fails (by throwing an exception), and does so on all Yarn application attempts, the Yarn status of that application definitely should be FAILED...

> Exception not failing Python applications (in yarn cluster mode)
> ----------------------------------------------------------------
>
> Key: SPARK-7736
> URL: https://issues.apache.org/jira/browse/SPARK-7736
> Project: Spark
> Issue Type: Bug
> Components: YARN
> Environment: Spark 1.3.1, Yarn 2.7.0, Ubuntu 14.04
> Reporter: Shay Rojansky
>
> It seems that exceptions thrown in Python spark apps after the SparkContext
> is instantiated don't cause the application to fail, at least in Yarn: the
> application is marked as SUCCEEDED.
> Note that any exception right before the SparkContext correctly places the
> application in FAILED state.
[jira] [Commented] (SPARK-8374) Job frequently hangs after YARN preemption
[ https://issues.apache.org/jira/browse/SPARK-8374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605425#comment-14605425 ]

Shay Rojansky commented on SPARK-8374:
--------------------------------------

Thanks for your comment and sure, I can help test. I may need a bit of hand-holding since I haven't built Spark yet.

> Job frequently hangs after YARN preemption
> ------------------------------------------
>
> Key: SPARK-8374
> URL: https://issues.apache.org/jira/browse/SPARK-8374
> Project: Spark
> Issue Type: Bug
> Components: YARN
> Affects Versions: 1.4.0
> Environment: YARN 2.7.0, Spark 1.4.0, Ubuntu 14.04
> Reporter: Shay Rojansky
> Priority: Critical
>
> After upgrading to Spark 1.4.0, jobs that get preempted very frequently will
> not reacquire executors and will therefore hang. To reproduce:
> 1. I run Spark job A that acquires all grid resources
> 2. I run Spark job B in a higher-priority queue that acquires all grid
> resources. Job A is fully preempted.
> 3. Kill job B, releasing all resources
> 4. Job A should at this point reacquire all grid resources, but occasionally
> doesn't. Repeating the preemption scenario makes the bad behavior occur
> within a few attempts.
> (see logs at bottom).
> Note issue SPARK-7451 that was supposed to fix some Spark YARN preemption
> issues; maybe the work there is related to the new issues.
> The 1.4.0 preemption situation is considerably worse than 1.3.1 (we've
> downgraded to 1.3.1 just because of this issue).
>
> Logs
> ----
> When job B (the preemptor) first acquires an application master, the following
> is logged by job A (the preemptee):
> {noformat}
> ERROR YarnScheduler: Lost executor 447 on g023.grid.eaglerd.local: remote Rpc client disassociated
> INFO TaskSetManager: Re-queueing tasks for 447 from TaskSet 0.0
> WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkexecu...@g023.grid.eaglerd.local:54167] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
> WARN TaskSetManager: Lost task 15.0 in stage 0.0 (TID 15, g023.grid.eaglerd.local): ExecutorLostFailure (executor 447 lost)
> INFO DAGScheduler: Executor lost: 447 (epoch 0)
> INFO BlockManagerMasterEndpoint: Trying to remove executor 447 from BlockManagerMaster.
> INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(447, g023.grid.eaglerd.local, 41406)
> INFO BlockManagerMaster: Removed 447 successfully in removeExecutor
> {noformat}
> (It's strange for errors/warnings to be logged for preemption)
> Later, when job B's AM starts requesting its resources, I get lots of the
> following in job A:
> {noformat}
> ERROR YarnScheduler: Lost executor 415 on g033.grid.eaglerd.local: remote Rpc client disassociated
> INFO TaskSetManager: Re-queueing tasks for 415 from TaskSet 0.0
> WARN TaskSetManager: Lost task 231.0 in stage 0.0 (TID 231, g033.grid.eaglerd.local): ExecutorLostFailure (executor 415 lost)
> WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkexecu...@g023.grid.eaglerd.local:34357] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
> {noformat}
> Finally, when I kill job B, job A emits lots of the following:
> {noformat}
> INFO YarnClientSchedulerBackend: Requesting to kill executor(s) 31
> WARN YarnClientSchedulerBackend: Executor to kill 31 does not exist!
> {noformat}
> And finally after some time:
> {noformat}
> WARN HeartbeatReceiver: Removing executor 466 with no recent heartbeats: 165964 ms exceeds timeout 12 ms
> ERROR YarnScheduler: Lost an executor 466 (already removed): Executor heartbeat timed out after 165964 ms
> {noformat}
> At this point the job never requests/acquires more resources and hangs.
[jira] [Commented] (SPARK-8374) Job frequently hangs after YARN preemption
[ https://issues.apache.org/jira/browse/SPARK-8374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605117#comment-14605117 ]

Shay Rojansky commented on SPARK-8374:
--------------------------------------

Any chance someone can look at this bug, at least to confirm it? This is a pretty serious issue preventing Spark 1.4 use in YARN where preemption may happen...

> Job frequently hangs after YARN preemption
> ------------------------------------------
>
> Key: SPARK-8374
> URL: https://issues.apache.org/jira/browse/SPARK-8374
> Project: Spark
> Issue Type: Bug
> Components: YARN
> Affects Versions: 1.4.0
> Environment: YARN 2.7.0, Spark 1.4.0, Ubuntu 14.04
> Reporter: Shay Rojansky
> Priority: Critical
>
> After upgrading to Spark 1.4.0, jobs that get preempted very frequently will
> not reacquire executors and will therefore hang. To reproduce:
> 1. I run Spark job A that acquires all grid resources
> 2. I run Spark job B in a higher-priority queue that acquires all grid
> resources. Job A is fully preempted.
> 3. Kill job B, releasing all resources
> 4. Job A should at this point reacquire all grid resources, but occasionally
> doesn't. Repeating the preemption scenario makes the bad behavior occur
> within a few attempts.
> (see logs at bottom).
> Note issue SPARK-7451 that was supposed to fix some Spark YARN preemption
> issues; maybe the work there is related to the new issues.
> The 1.4.0 preemption situation is considerably worse than 1.3.1 (we've
> downgraded to 1.3.1 just because of this issue).
>
> Logs
> ----
> When job B (the preemptor) first acquires an application master, the following
> is logged by job A (the preemptee):
> {noformat}
> ERROR YarnScheduler: Lost executor 447 on g023.grid.eaglerd.local: remote Rpc client disassociated
> INFO TaskSetManager: Re-queueing tasks for 447 from TaskSet 0.0
> WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkexecu...@g023.grid.eaglerd.local:54167] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
> WARN TaskSetManager: Lost task 15.0 in stage 0.0 (TID 15, g023.grid.eaglerd.local): ExecutorLostFailure (executor 447 lost)
> INFO DAGScheduler: Executor lost: 447 (epoch 0)
> INFO BlockManagerMasterEndpoint: Trying to remove executor 447 from BlockManagerMaster.
> INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(447, g023.grid.eaglerd.local, 41406)
> INFO BlockManagerMaster: Removed 447 successfully in removeExecutor
> {noformat}
> (It's strange for errors/warnings to be logged for preemption)
> Later, when job B's AM starts requesting its resources, I get lots of the
> following in job A:
> {noformat}
> ERROR YarnScheduler: Lost executor 415 on g033.grid.eaglerd.local: remote Rpc client disassociated
> INFO TaskSetManager: Re-queueing tasks for 415 from TaskSet 0.0
> WARN TaskSetManager: Lost task 231.0 in stage 0.0 (TID 231, g033.grid.eaglerd.local): ExecutorLostFailure (executor 415 lost)
> WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkexecu...@g023.grid.eaglerd.local:34357] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
> {noformat}
> Finally, when I kill job B, job A emits lots of the following:
> {noformat}
> INFO YarnClientSchedulerBackend: Requesting to kill executor(s) 31
> WARN YarnClientSchedulerBackend: Executor to kill 31 does not exist!
> {noformat}
> And finally after some time:
> {noformat}
> WARN HeartbeatReceiver: Removing executor 466 with no recent heartbeats: 165964 ms exceeds timeout 12 ms
> ERROR YarnScheduler: Lost an executor 466 (already removed): Executor heartbeat timed out after 165964 ms
> {noformat}
> At this point the job never requests/acquires more resources and hangs.
[jira] [Commented] (SPARK-7736) Exception not failing Python applications (in yarn cluster mode)
[ https://issues.apache.org/jira/browse/SPARK-7736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14601923#comment-14601923 ]

Shay Rojansky commented on SPARK-7736:
--------------------------------------

The problem is simply with the YARN status for the application. If a Spark application throws an exception after having instantiated the SparkContext, the application obviously terminates but YARN lists the job as SUCCEEDED. This makes it hard for users to see what happened to their jobs in the YARN UI. Let me know if this is still unclear.

> Exception not failing Python applications (in yarn cluster mode)
> ----------------------------------------------------------------
>
> Key: SPARK-7736
> URL: https://issues.apache.org/jira/browse/SPARK-7736
> Project: Spark
> Issue Type: Bug
> Components: YARN
> Environment: Spark 1.3.1, Yarn 2.7.0, Ubuntu 14.04
> Reporter: Shay Rojansky
>
> It seems that exceptions thrown in Python spark apps after the SparkContext
> is instantiated don't cause the application to fail, at least in Yarn: the
> application is marked as SUCCEEDED.
> Note that any exception right before the SparkContext correctly places the
> application in FAILED state.
[jira] [Created] (SPARK-8374) Job frequently hangs after YARN preemption
Shay Rojansky created SPARK-8374:
------------------------------------

             Summary: Job frequently hangs after YARN preemption
                 Key: SPARK-8374
                 URL: https://issues.apache.org/jira/browse/SPARK-8374
             Project: Spark
          Issue Type: Bug
          Components: YARN
    Affects Versions: 1.4.0
         Environment: YARN 2.7.0, Spark 1.4.0, Ubuntu 14.04
            Reporter: Shay Rojansky
            Priority: Critical

After upgrading to Spark 1.4.0, jobs that get preempted very frequently will not reacquire executors and will therefore hang. To reproduce:

1. I run Spark job A that acquires all grid resources
2. I run Spark job B in a higher-priority queue that acquires all grid resources. Job A is fully preempted.
3. Kill job B, releasing all resources
4. Job A should at this point reacquire all grid resources, but occasionally doesn't. Repeating the preemption scenario makes the bad behavior occur within a few attempts.

(see logs at bottom).

Note issue SPARK-7451 that was supposed to fix some Spark YARN preemption issues; maybe the work there is related to the new issues.

The 1.4.0 preemption situation is considerably worse than 1.3.1 (we've downgraded to 1.3.1 just because of this issue).

Logs
----

When job B (the preemptor) first acquires an application master, the following is logged by job A (the preemptee):

{noformat}
ERROR YarnScheduler: Lost executor 447 on g023.grid.eaglerd.local: remote Rpc client disassociated
INFO TaskSetManager: Re-queueing tasks for 447 from TaskSet 0.0
WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkexecu...@g023.grid.eaglerd.local:54167] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
WARN TaskSetManager: Lost task 15.0 in stage 0.0 (TID 15, g023.grid.eaglerd.local): ExecutorLostFailure (executor 447 lost)
INFO DAGScheduler: Executor lost: 447 (epoch 0)
INFO BlockManagerMasterEndpoint: Trying to remove executor 447 from BlockManagerMaster.
INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(447, g023.grid.eaglerd.local, 41406)
INFO BlockManagerMaster: Removed 447 successfully in removeExecutor
{noformat}

(It's strange for errors/warnings to be logged for preemption)

Later, when job B's AM starts requesting its resources, I get lots of the following in job A:

{noformat}
ERROR YarnScheduler: Lost executor 415 on g033.grid.eaglerd.local: remote Rpc client disassociated
INFO TaskSetManager: Re-queueing tasks for 415 from TaskSet 0.0
WARN TaskSetManager: Lost task 231.0 in stage 0.0 (TID 231, g033.grid.eaglerd.local): ExecutorLostFailure (executor 415 lost)
WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkexecu...@g023.grid.eaglerd.local:34357] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
{noformat}

Finally, when I kill job B, job A emits lots of the following:

{noformat}
INFO YarnClientSchedulerBackend: Requesting to kill executor(s) 31
WARN YarnClientSchedulerBackend: Executor to kill 31 does not exist!
{noformat}

And finally after some time:

{noformat}
WARN HeartbeatReceiver: Removing executor 466 with no recent heartbeats: 165964 ms exceeds timeout 12 ms
ERROR YarnScheduler: Lost an executor 466 (already removed): Executor heartbeat timed out after 165964 ms
{noformat}

At this point the job never requests/acquires more resources and hangs.
[jira] [Commented] (SPARK-7725) --py-files doesn't seem to work in YARN cluster mode
[ https://issues.apache.org/jira/browse/SPARK-7725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14564421#comment-14564421 ]

Shay Rojansky commented on SPARK-7725:
--------------------------------------

Here you go: https://mail-archives.apache.org/mod_mbox/spark-user/201505.mbox/%3CCADT4RqDTwTmR_vRCz5THXbitfA%2BCOc-zBc3j6o7H-qQHKk--5w%40mail.gmail.com%3E

> --py-files doesn't seem to work in YARN cluster mode
> ----------------------------------------------------
>
> Key: SPARK-7725
> URL: https://issues.apache.org/jira/browse/SPARK-7725
> Project: Spark
> Issue Type: Bug
> Components: Deploy, YARN
> Affects Versions: 1.3.1
> Environment: Ubuntu 14.04, YARN 2.7.0 on local filesystem
> Reporter: Shay Rojansky
>
> I'm having issues with submitting a Spark Yarn job in cluster mode when the
> cluster filesystem is file:///. It seems that additional resources
> (--py-files) are simply being skipped and not being added into the
> PYTHONPATH. The same issue may also exist for --jars, --files, etc. (I
> haven't checked)
> (I sent a mail to the Spark users list and Marcelo Vanzin confirms it's a
> bug, unrelated to the local filesystem)
[jira] [Created] (SPARK-7736) Exception not failing Python applications (in yarn cluster mode)
Shay Rojansky created SPARK-7736:
------------------------------------

             Summary: Exception not failing Python applications (in yarn cluster mode)
                 Key: SPARK-7736
                 URL: https://issues.apache.org/jira/browse/SPARK-7736
             Project: Spark
          Issue Type: Bug
          Components: YARN
         Environment: Spark 1.3.1, Yarn 2.7.0, Ubuntu 14.04
            Reporter: Shay Rojansky

It seems that exceptions thrown in Python spark apps after the SparkContext is instantiated don't cause the application to fail, at least in Yarn: the application is marked as SUCCEEDED.

Note that any exception right before the SparkContext correctly places the application in FAILED state.
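The reported behavior can be sketched as follows. This is a hedged illustration only: `FakeSparkContext` is a hypothetical stand-in for `pyspark.SparkContext` so the driver shape is runnable without a cluster, and the explicit exit-status handling is an assumed workaround, not something from the report.

```python
# Hedged sketch of the reported behavior: an exception raised *after* the
# SparkContext is created should fail the job, yet YARN marked it SUCCEEDED.
# FakeSparkContext is a hypothetical stand-in for pyspark.SparkContext.
class FakeSparkContext:
    def __init__(self, appName=None):
        self.appName = appName

    def stop(self):
        pass

def driver():
    sc = FakeSparkContext(appName="repro")  # real code: pyspark.SparkContext(...)
    raise RuntimeError("failure after context creation")  # should mean FAILED

# A driver that wants the failure to be visible must translate the exception
# into a nonzero exit status (a real script would call sys.exit(1) here).
try:
    driver()
    exit_status = 0
except Exception:
    exit_status = 1

print(exit_status)
```

The point of the bug is that the PySpark/YARN plumbing did not do this translation for the final application status, so the exception terminated the process while YARN still recorded SUCCEEDED.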
[jira] [Closed] (SPARK-7709) spark-submit option to quit after submitting in cluster mode
[ https://issues.apache.org/jira/browse/SPARK-7709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shay Rojansky closed SPARK-7709.
--------------------------------
    Resolution: Duplicate

Oops, it seems this was already implemented...

> spark-submit option to quit after submitting in cluster mode
> ------------------------------------------------------------
>
> Key: SPARK-7709
> URL: https://issues.apache.org/jira/browse/SPARK-7709
> Project: Spark
> Issue Type: New Feature
> Components: Deploy
> Affects Versions: 1.3.1
> Reporter: Shay Rojansky
> Priority: Minor
>
> When deploying in cluster mode, spark-submit continues polling the
> application every second. While this is a useful feature, there should be an
> option to have spark-submit exit immediately after submission completes. This
> would allow scripts to figure out that a job was successfully (or
> unsuccessfully) submitted.
[jira] [Created] (SPARK-7725) --py-files doesn't seem to work in YARN cluster mode
Shay Rojansky created SPARK-7725:
------------------------------------

             Summary: --py-files doesn't seem to work in YARN cluster mode
                 Key: SPARK-7725
                 URL: https://issues.apache.org/jira/browse/SPARK-7725
             Project: Spark
          Issue Type: Bug
          Components: Deploy, YARN
    Affects Versions: 1.3.1
         Environment: Ubuntu 14.04, YARN 2.7.0 on local filesystem
            Reporter: Shay Rojansky

I'm having issues with submitting a Spark Yarn job in cluster mode when the cluster filesystem is file:///. It seems that additional resources (--py-files) are simply being skipped and not being added into the PYTHONPATH. The same issue may also exist for --jars, --files, etc. (I haven't checked)

(I sent a mail to the Spark users list and Marcelo Vanzin confirms it's a bug, unrelated to the local filesystem)
[jira] [Created] (SPARK-7709) spark-submit option to quit after submitting in cluster mode
Shay Rojansky created SPARK-7709:
------------------------------------

             Summary: spark-submit option to quit after submitting in cluster mode
                 Key: SPARK-7709
                 URL: https://issues.apache.org/jira/browse/SPARK-7709
             Project: Spark
          Issue Type: New Feature
          Components: Deploy
    Affects Versions: 1.3.1
            Reporter: Shay Rojansky
            Priority: Minor

When deploying in cluster mode, spark-submit continues polling the application every second. While this is a useful feature, there should be an option to have spark-submit exit immediately after submission completes. This would allow scripts to figure out that a job was successfully (or unsuccessfully) submitted.
[jira] [Commented] (SPARK-3644) REST API for Spark application info (jobs / stages / tasks / storage info)
[ https://issues.apache.org/jira/browse/SPARK-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547073#comment-14547073 ]

Shay Rojansky commented on SPARK-3644:
--------------------------------------

+1 on this, our main use would be to get progress information on a running spark job. SPARK-5925 exists for exposing Spark progress through the generic Yarn progress, but as the commenter there points out, it isn't clear how to expose the complicated multi-stage Spark progress as a simple progress bar. Hence full REST access to the state of a job would be necessary.

> REST API for Spark application info (jobs / stages / tasks / storage info)
> --------------------------------------------------------------------------
>
> Key: SPARK-3644
> URL: https://issues.apache.org/jira/browse/SPARK-3644
> Project: Spark
> Issue Type: New Feature
> Components: Spark Core, Web UI
> Reporter: Josh Rosen
>
> This JIRA is a forum to draft a design proposal for a REST interface for
> accessing information about Spark applications, such as job / stage / task /
> storage status.
> There have been a number of proposals to serve JSON representations of the
> information displayed in Spark's web UI. Given that we might redesign the
> pages of the web UI (and possibly re-implement the UI as a client of a REST
> API), the API endpoints and their responses should be independent of what we
> choose to display on particular web UI pages / layouts.
> Let's start a discussion of what a good REST API would look like from
> first-principles. We can discuss what urls / endpoints expose access to
> data, how our JSON responses will be formatted, how fields will be named, how
> the API will be documented and tested, etc.
> Some links for inspiration:
> https://developer.github.com/v3/
> http://developer.netflix.com/docs/REST_API_Reference
> https://helloreverb.com/developers/swagger
[jira] [Commented] (SPARK-3470) Have JavaSparkContext implement Closeable/AutoCloseable
[ https://issues.apache.org/jira/browse/SPARK-3470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14128285#comment-14128285 ]

Shay Rojansky commented on SPARK-3470:
--------------------------------------

Good point about AutoCloseable. Yes, the idea is for Closeable to call stop(). I'd submit a PR myself but I don't know any Scala whatsoever...

> Have JavaSparkContext implement Closeable/AutoCloseable
> -------------------------------------------------------
>
> Key: SPARK-3470
> URL: https://issues.apache.org/jira/browse/SPARK-3470
> Project: Spark
> Issue Type: New Feature
> Components: Spark Core
> Affects Versions: 1.0.2
> Reporter: Shay Rojansky
> Priority: Minor
>
> After discussion in SPARK-2972, it seems like a good idea to allow Java
> developers to use Java 7 automatic resource management with JavaSparkContext,
> like so:
> {code:java}
> try (JavaSparkContext ctx = new JavaSparkContext(...)) {
>     return br.readLine();
> }
> {code}
[jira] [Closed] (SPARK-2972) APPLICATION_COMPLETE not created in Python unless context explicitly stopped
[ https://issues.apache.org/jira/browse/SPARK-2972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shay Rojansky closed SPARK-2972.
--------------------------------
    Resolution: Won't Fix

> APPLICATION_COMPLETE not created in Python unless context explicitly stopped
> ----------------------------------------------------------------------------
>
> Key: SPARK-2972
> URL: https://issues.apache.org/jira/browse/SPARK-2972
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 1.0.2
> Environment: Cloudera 5.1, yarn master on ubuntu precise
> Reporter: Shay Rojansky
>
> If you don't explicitly stop a SparkContext at the end of a Python
> application with sc.stop(), an APPLICATION_COMPLETE file isn't created and
> the job doesn't get picked up by the history server.
> This can be easily reproduced with pyspark (but affects scripts as well).
> The current workaround is to wrap the entire script with a try/finally and
> stop manually.
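The try/finally workaround mentioned in the quoted description looks roughly like this. A sketch only: `StandInContext` and `process` are hypothetical names, and in a real job the context would be `pyspark.SparkContext`.

```python
# Sketch of the try/finally workaround from the description above.
# StandInContext is a hypothetical stand-in for pyspark.SparkContext.
class StandInContext:
    def __init__(self):
        self.stopped = False

    def stop(self):
        # In PySpark, stop() is what finalizes the event log so the
        # history server picks the application up.
        self.stopped = True

def process(sc):
    pass  # the actual job body would go here

sc = StandInContext()
try:
    process(sc)
finally:
    sc.stop()  # runs even if process() raised, so the log is finalized

print(sc.stopped)
```

The design point is simply that `stop()` must run on every exit path, which is exactly what the later context-manager and Closeable proposals (SPARK-3458, SPARK-3470) automate.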
[jira] [Created] (SPARK-3471) Automatic resource manager for SparkContext in Scala?
Shay Rojansky created SPARK-3471:
------------------------------------

             Summary: Automatic resource manager for SparkContext in Scala?
                 Key: SPARK-3471
                 URL: https://issues.apache.org/jira/browse/SPARK-3471
             Project: Spark
          Issue Type: New Feature
          Components: Spark Core
    Affects Versions: 1.0.2
            Reporter: Shay Rojansky
            Priority: Minor

After discussion in SPARK-2972, it seems like a good idea to add "automatic resource management" semantics to SparkContext (i.e. "with" in Python (SPARK-3458), Closeable/AutoCloseable in Java (SPARK-3470)).

I have no knowledge of Scala whatsoever, but a quick search seems to indicate that there isn't a standard mechanism for this - someone with real Scala knowledge should take a look and make a decision...
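For comparison, the Python side of this (SPARK-3458) amounts to giving the context `__enter__`/`__exit__` so it works in a `with` block. A hedged sketch with a hypothetical stand-in class, since the real change would live on `pyspark.SparkContext`:

```python
# Hypothetical sketch of 'with' support for a Spark-like context (SPARK-3458).
# ManagedContext stands in for pyspark.SparkContext; only the context-manager
# protocol is the point here.
class ManagedContext:
    def __init__(self):
        self.stopped = False

    def stop(self):
        self.stopped = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.stop()      # always stop, even if the body raised
        return False     # don't swallow the exception

with ManagedContext() as ctx:
    pass  # job body

print(ctx.stopped)
```

This is the same guarantee as Java's try-with-resources in SPARK-3470: the cleanup call runs on every exit path without the user writing try/finally by hand.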
[jira] [Created] (SPARK-3470) Have JavaSparkContext implement Closeable/AutoCloseable
Shay Rojansky created SPARK-3470:
------------------------------------

             Summary: Have JavaSparkContext implement Closeable/AutoCloseable
                 Key: SPARK-3470
                 URL: https://issues.apache.org/jira/browse/SPARK-3470
             Project: Spark
          Issue Type: New Feature
          Components: Spark Core
    Affects Versions: 1.0.2
            Reporter: Shay Rojansky
            Priority: Minor

After discussion in SPARK-2972, it seems like a good idea to allow Java developers to use Java 7 automatic resource management with JavaSparkContext, like so:

{code:java}
try (JavaSparkContext ctx = new JavaSparkContext(...)) {
    return br.readLine();
}
{code}
[jira] [Commented] (SPARK-2972) APPLICATION_COMPLETE not created in Python unless context explicitly stopped
[ https://issues.apache.org/jira/browse/SPARK-2972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14127171#comment-14127171 ]

Shay Rojansky commented on SPARK-2972:
--------------------------------------

I'd love to help on this, but I know 0 Scala (I could have helped with the Python though :)). A quick search shows that Scala has no equivalent of Python's 'with' or Java's Closeable. There are several third-party implementations out there, but it doesn't seem right to bring in a non-core library for this kind of thing. I think someone with real Scala knowledge should take a look at this. We can close this issue and open a separate one for the Scala closeability if you want.

> APPLICATION_COMPLETE not created in Python unless context explicitly stopped
> ----------------------------------------------------------------------------
>
> Key: SPARK-2972
> URL: https://issues.apache.org/jira/browse/SPARK-2972
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 1.0.2
> Environment: Cloudera 5.1, yarn master on ubuntu precise
> Reporter: Shay Rojansky
>
> If you don't explicitly stop a SparkContext at the end of a Python
> application with sc.stop(), an APPLICATION_COMPLETE file isn't created and
> the job doesn't get picked up by the history server.
> This can be easily reproduced with pyspark (but affects scripts as well).
> The current workaround is to wrap the entire script with a try/finally and
> stop manually.
[jira] [Updated] (SPARK-3457) ConcurrentModificationException starting up pyspark
[ https://issues.apache.org/jira/browse/SPARK-3457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shay Rojansky updated SPARK-3457: - Description: Just downloaded Spark 1.1.0-rc4. Launching pyspark for the very first time in yarn-client mode (no additional params or anything), I got the exception below. Rerunning pyspark 5 times afterwards did not reproduce the issue. {code} 14/09/09 18:07:58 INFO YarnClientSchedulerBackend: Application report from ASM: appMasterRpcPort: 0 appStartTime: 1410275267606 yarnAppState: RUNNING 14/09/09 18:07:58 INFO YarnClientSchedulerBackend: Add WebUI Filter. org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter, PROXY_HOST=master. grid.eaglerd.local,PROXY_URI_BASE=http://master.grid.eaglerd.local:8088/proxy/application_1410268447887_0011, /proxy/application_1410268447887_0011 Traceback (most recent call last): File "/opt/spark/python/pyspark/shell.py", line 44, in 14/09/09 18:07:58 INFO JettyUtils: Adding filter: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter sc = SparkContext(appName="PySparkShell", pyFiles=add_files) File "/opt/spark/python/pyspark/context.py", line 107, in __init__ conf) File "/opt/spark/python/pyspark/context.py", line 155, in _do_init self._jsc = self._initialize_context(self._conf._jconf) File "/opt/spark/python/pyspark/context.py", line 201, in _initialize_context return self._jvm.JavaSparkContext(jconf) File "/opt/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 701, in __call__ File "/opt/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext. 
: java.util.ConcurrentModificationException at java.util.Hashtable$Enumerator.next(Hashtable.java:1167) at scala.collection.convert.Wrappers$JPropertiesWrapper$$anon$3.next(Wrappers.scala:458) at scala.collection.convert.Wrappers$JPropertiesWrapper$$anon$3.next(Wrappers.scala:454) at scala.collection.Iterator$class.toStream(Iterator.scala:1143) at scala.collection.AbstractIterator.toStream(Iterator.scala:1157) at scala.collection.Iterator$$anonfun$toStream$1.apply(Iterator.scala:1143) at scala.collection.Iterator$$anonfun$toStream$1.apply(Iterator.scala:1143) at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1085) at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1077) at scala.collection.immutable.Stream$$anonfun$filteredTail$1.apply(Stream.scala:1149) at scala.collection.immutable.Stream$$anonfun$filteredTail$1.apply(Stream.scala:1149) at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1085) at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1077) at scala.collection.immutable.Stream.length(Stream.scala:284) at scala.collection.SeqLike$class.sorted(SeqLike.scala:608) at scala.collection.AbstractSeq.sorted(Seq.scala:40) at org.apache.spark.SparkEnv$.environmentDetails(SparkEnv.scala:324) at org.apache.spark.SparkContext.postEnvironmentUpdate(SparkContext.scala:1297) at org.apache.spark.SparkContext.(SparkContext.scala:334) at org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:53) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:526) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379) at py4j.Gateway.invoke(Gateway.java:214) at 
py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79) at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68) at py4j.GatewayConnection.run(GatewayConnection.java:207) at java.lang.Thread.run(Thread.java:745) {code}
[jira] [Created] (SPARK-3457) ConcurrentModificationException starting up pyspark
Shay Rojansky created SPARK-3457: Summary: ConcurrentModificationException starting up pyspark Key: SPARK-3457 URL: https://issues.apache.org/jira/browse/SPARK-3457 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Environment: Hadoop 2.3 (CDH 5.1) on Ubuntu precise Reporter: Shay Rojansky Just downloaded Spark 1.1.0-rc4. Launching pyspark for the very first time in yarn-client mode (no additional params or anything), I got the exception below. Rerunning pyspark 5 times afterwards did not reproduce the issue. 14/09/09 18:07:58 INFO YarnClientSchedulerBackend: Application report from ASM: appMasterRpcPort: 0 appStartTime: 1410275267606 yarnAppState: RUNNING 14/09/09 18:07:58 INFO YarnClientSchedulerBackend: Add WebUI Filter. org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter, PROXY_HOST=master. grid.eaglerd.local,PROXY_URI_BASE=http://master.grid.eaglerd.local:8088/proxy/application_1410268447887_0011, /proxy/application_1410268447887_0011 Traceback (most recent call last): File "/opt/spark/python/pyspark/shell.py", line 44, in 14/09/09 18:07:58 INFO JettyUtils: Adding filter: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter sc = SparkContext(appName="PySparkShell", pyFiles=add_files) File "/opt/spark/python/pyspark/context.py", line 107, in __init__ conf) File "/opt/spark/python/pyspark/context.py", line 155, in _do_init self._jsc = self._initialize_context(self._conf._jconf) File "/opt/spark/python/pyspark/context.py", line 201, in _initialize_context return self._jvm.JavaSparkContext(jconf) File "/opt/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 701, in __call__ File "/opt/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext. 
: java.util.ConcurrentModificationException at java.util.Hashtable$Enumerator.next(Hashtable.java:1167) at scala.collection.convert.Wrappers$JPropertiesWrapper$$anon$3.next(Wrappers.scala:458) at scala.collection.convert.Wrappers$JPropertiesWrapper$$anon$3.next(Wrappers.scala:454) at scala.collection.Iterator$class.toStream(Iterator.scala:1143) at scala.collection.AbstractIterator.toStream(Iterator.scala:1157) at scala.collection.Iterator$$anonfun$toStream$1.apply(Iterator.scala:1143) at scala.collection.Iterator$$anonfun$toStream$1.apply(Iterator.scala:1143) at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1085) at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1077) at scala.collection.immutable.Stream$$anonfun$filteredTail$1.apply(Stream.scala:1149) at scala.collection.immutable.Stream$$anonfun$filteredTail$1.apply(Stream.scala:1149) at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1085) at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1077) at scala.collection.immutable.Stream.length(Stream.scala:284) at scala.collection.SeqLike$class.sorted(SeqLike.scala:608) at scala.collection.AbstractSeq.sorted(Seq.scala:40) at org.apache.spark.SparkEnv$.environmentDetails(SparkEnv.scala:324) at org.apache.spark.SparkContext.postEnvironmentUpdate(SparkContext.scala:1297) at org.apache.spark.SparkContext.(SparkContext.scala:334) at org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:53) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:526) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379) at py4j.Gateway.invoke(Gateway.java:214) at 
py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79) at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68) at py4j.GatewayConnection.run(GatewayConnection.java:207) at java.lang.Thread.run(Thread.java:745) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2972) APPLICATION_COMPLETE not created in Python unless context explicitly stopped
[ https://issues.apache.org/jira/browse/SPARK-2972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14126686#comment-14126686 ] Shay Rojansky commented on SPARK-2972: -- > you're right! imho, this means your program is written better than the > examples. it would be good to enhance the examples w/ try/finally semantics. > however, Then I can submit a pull request for that, no problem. > getting the shutdown semantics right is difficult, and may not apply broadly > across applications. for instance, your application may want to catch a > failure in stop() and retry to make sure that a history record is written. > another application may be ok w/ best effort writing history events. still > another application may want to exit w/o stop() to avoid having a history > event written. I don't think explicit stop() should be removed - of course users may choose to manually manage stop(), catch exceptions and retry, etc. For me it's just a question of what to do with a context that *didn't* get explicitly closed at the end of the application. As to apps that need to exit without a history event - it's a requirement that's hard to imagine (for me). At least with YARN/Mesos you will be leaving traces anyway, and these traces will be partial and difficult to understand, since the corresponding Spark traces haven't been produced. > asking the context creator to do context destruction shifts burden to the > application writer and maintains flexibility for applications. I guess it's a question of how high-level a tool you want Spark to be. It seems a bit strange for Spark to handle so much of the troublesome low-level details, while forcing the user to boilerplate-wrap all their programs with try/finally. But I do understand the points you're making and it can be argued both ways. As a minimum, I suggest having the context implement the language-specific dispose patterns (try-with-resources in Java, 'with' in Python), so at least the code looks better? 
> APPLICATION_COMPLETE not created in Python unless context explicitly stopped > > > Key: SPARK-2972 > URL: https://issues.apache.org/jira/browse/SPARK-2972 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.0.2 > Environment: Cloudera 5.1, yarn master on ubuntu precise >Reporter: Shay Rojansky > > If you don't explicitly stop a SparkContext at the end of a Python > application with sc.stop(), an APPLICATION_COMPLETE file isn't created and > the job doesn't get picked up by the history server. > This can be easily reproduced with pyspark (but affects scripts as well). > The current workaround is to wrap the entire script with a try/finally and > stop manually. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2972) APPLICATION_COMPLETE not created in Python unless context explicitly stopped
[ https://issues.apache.org/jira/browse/SPARK-2972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14124873#comment-14124873 ] Shay Rojansky commented on SPARK-2972: -- Thanks for answering. I guess it's a debatable question. I admit I expected the context to shut itself down at application exit, a bit in the way that files and other resources get closed. Note that the way the examples are currently written (pi.py), an exception anywhere in the code would bypass sc.stop() and the Spark application disappears without leaving a trace in the history server. For this reason, my scripts all contain try/finally blocks around the application code, which seems like needless boilerplate that complicates life and can easily be forgotten. Is there any specific reason not to use the application shutdown hooks available in python/java to close the context(s)? > APPLICATION_COMPLETE not created in Python unless context explicitly stopped > > > Key: SPARK-2972 > URL: https://issues.apache.org/jira/browse/SPARK-2972 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.0.2 > Environment: Cloudera 5.1, yarn master on ubuntu precise >Reporter: Shay Rojansky > > If you don't explicitly stop a SparkContext at the end of a Python > application with sc.stop(), an APPLICATION_COMPLETE file isn't created and > the job doesn't get picked up by the history server. > This can be easily reproduced with pyspark (but affects scripts as well). > The current workaround is to wrap the entire script with a try/finally and > stop manually. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
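The shutdown-hook idea raised in this comment can be sketched with Python's atexit module. Again a stand-in class replaces the real SparkContext, so only the registration mechanics are shown:

```python
import atexit

class StubSparkContext:
    """Stand-in for pyspark.SparkContext; records whether stop() ran."""
    def __init__(self):
        self.stopped = False

    def stop(self):
        self.stopped = True

sc = StubSparkContext()
# Registered handlers run at normal interpreter exit, including after an
# uncaught exception -- which is exactly the pi.py failure mode described above.
atexit.register(sc.stop)
```

One caveat that may explain framework reluctance: atexit handlers do not run on `os._exit()` or a hard kill, so a hook like this is best-effort rather than a guarantee.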
[jira] [Commented] (SPARK-3183) Add option for requesting full YARN cluster
[ https://issues.apache.org/jira/browse/SPARK-3183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14114368#comment-14114368 ] Shay Rojansky commented on SPARK-3183: -- +1. As a current workaround for cores, we specify a number well beyond the YARN cluster capacity. This gets handled well by Spark/YARN, and we get the entire cluster. > Add option for requesting full YARN cluster > --- > > Key: SPARK-3183 > URL: https://issues.apache.org/jira/browse/SPARK-3183 > Project: Spark > Issue Type: Improvement > Components: YARN >Reporter: Sandy Ryza > > This could possibly be in the form of --executor-cores ALL --executor-memory > ALL --num-executors ALL. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-2972) APPLICATION_COMPLETE not created in Python unless context explicitly stopped
Shay Rojansky created SPARK-2972: Summary: APPLICATION_COMPLETE not created in Python unless context explicitly stopped Key: SPARK-2972 URL: https://issues.apache.org/jira/browse/SPARK-2972 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.2 Environment: Cloudera 5.1, yarn master on ubuntu precise Reporter: Shay Rojansky If you don't explicitly stop a SparkContext at the end of a Python application with sc.stop(), an APPLICATION_COMPLETE file isn't created and the job doesn't get picked up by the history server. This can be easily reproduced with pyspark (but affects scripts as well). The current workaround is to wrap the entire script with a try/finally and stop manually. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-2971) Orphaned YARN ApplicationMaster lingers forever
Shay Rojansky created SPARK-2971: Summary: Orphaned YARN ApplicationMaster lingers forever Key: SPARK-2971 URL: https://issues.apache.org/jira/browse/SPARK-2971 Project: Spark Issue Type: Bug Affects Versions: 1.0.2 Environment: Python yarn client mode, Cloudera 5.1.0 on Ubuntu precise Reporter: Shay Rojansky We have cases where if CTRL-C is hit during a Spark job startup, a YARN ApplicationMaster is created but cannot connect to the driver (presumably because the driver has terminated). Once an AM enters this state it never exits it, and has to be manually killed in YARN. Here's an excerpt from the AM logs: {noformat} SLF4J: Class path contains multiple SLF4J bindings. SLF4J: Found binding in [jar:file:/yarn/nm/usercache/roji/filecache/40/spark-assembly-1.0.2-hadoop2.2.0.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/lib/zookeeper/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation. SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory] 14/08/11 16:29:39 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... 
using builtin-java classes where applicable 14/08/11 16:29:39 INFO SecurityManager: Changing view acls to: roji 14/08/11 16:29:39 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(roji) 14/08/11 16:29:40 INFO Slf4jLogger: Slf4jLogger started 14/08/11 16:29:40 INFO Remoting: Starting remoting 14/08/11 16:29:40 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkyar...@g024.grid.eaglerd.local:34075] 14/08/11 16:29:40 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sparkyar...@g024.grid.eaglerd.local:34075] 14/08/11 16:29:40 INFO RMProxy: Connecting to ResourceManager at master.grid.eaglerd.local/192.168.41.100:8030 14/08/11 16:29:40 INFO ExecutorLauncher: ApplicationAttemptId: appattempt_1407759736957_0014_01 14/08/11 16:29:40 INFO ExecutorLauncher: Registering the ApplicationMaster 14/08/11 16:29:40 INFO ExecutorLauncher: Waiting for Spark driver to be reachable. 14/08/11 16:29:40 ERROR ExecutorLauncher: Failed to connect to driver at master.grid.eaglerd.local:44911, retrying ... 14/08/11 16:29:40 ERROR ExecutorLauncher: Failed to connect to driver at master.grid.eaglerd.local:44911, retrying ... 14/08/11 16:29:40 ERROR ExecutorLauncher: Failed to connect to driver at master.grid.eaglerd.local:44911, retrying ... 14/08/11 16:29:40 ERROR ExecutorLauncher: Failed to connect to driver at master.grid.eaglerd.local:44911, retrying ... 14/08/11 16:29:40 ERROR ExecutorLauncher: Failed to connect to driver at master.grid.eaglerd.local:44911, retrying ... {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-2960) Spark executables failed to start via symlinks
Shay Rojansky created SPARK-2960: Summary: Spark executables failed to start via symlinks Key: SPARK-2960 URL: https://issues.apache.org/jira/browse/SPARK-2960 Project: Spark Issue Type: Bug Reporter: Shay Rojansky Fix For: 1.0.2 The current scripts (e.g. pyspark) fail to run when they are executed via symlinks. A common Linux scenario would be to have Spark installed somewhere (e.g. /opt) and have a symlink to it in /usr/bin. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
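The usual fix for launchers like this is to resolve the symlink before deriving any relative paths. The Spark scripts in question are bash (where `readlink` does the resolution), but the technique is language-independent; a sketch in Python, with a hypothetical directory layout built in a temp dir:

```python
import os
import tempfile

def spark_home_from(script_path):
    """Resolve symlinks first, then walk up from <home>/bin/<script> to <home>."""
    real = os.path.realpath(script_path)  # follows the whole symlink chain
    return os.path.dirname(os.path.dirname(real))

# Hypothetical layout: <tmp>/spark/bin/pyspark, symlinked from <tmp>/usr-bin/pyspark
tmp = tempfile.mkdtemp()
os.makedirs(os.path.join(tmp, "spark", "bin"))
os.makedirs(os.path.join(tmp, "usr-bin"))
target = os.path.join(tmp, "spark", "bin", "pyspark")
open(target, "w").close()
link = os.path.join(tmp, "usr-bin", "pyspark")
os.symlink(target, link)

home = spark_home_from(link)  # points at <tmp>/spark, not <tmp>/usr-bin
```

Without the `realpath` step, the launcher would compute its home relative to the symlink's directory and fail to find its jars and libraries, which is the failure mode reported here.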
[jira] [Updated] (SPARK-2960) Spark executables fail to start via symlinks
[ https://issues.apache.org/jira/browse/SPARK-2960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shay Rojansky updated SPARK-2960: - Summary: Spark executables fail to start via symlinks (was: Spark executables failed to start via symlinks) > Spark executables fail to start via symlinks > > > Key: SPARK-2960 > URL: https://issues.apache.org/jira/browse/SPARK-2960 > Project: Spark > Issue Type: Bug >Reporter: Shay Rojansky > Fix For: 1.0.2 > > > The current scripts (e.g. pyspark) fail to run when they are executed via > symlinks. A common Linux scenario would be to have Spark installed somewhere > (e.g. /opt) and have a symlink to it in /usr/bin. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2960) Spark executables fail to start via symlinks
[ https://issues.apache.org/jira/browse/SPARK-2960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shay Rojansky updated SPARK-2960: - Priority: Minor (was: Major) > Spark executables fail to start via symlinks > > > Key: SPARK-2960 > URL: https://issues.apache.org/jira/browse/SPARK-2960 > Project: Spark > Issue Type: Bug >Reporter: Shay Rojansky >Priority: Minor > Fix For: 1.0.2 > > > The current scripts (e.g. pyspark) fail to run when they are executed via > symlinks. A common Linux scenario would be to have Spark installed somewhere > (e.g. /opt) and have a symlink to it in /usr/bin. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2945) Allow specifying num of executors in the context configuration
[ https://issues.apache.org/jira/browse/SPARK-2945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092058#comment-14092058 ] Shay Rojansky commented on SPARK-2945: -- I just did a quick test on Spark 1.0.2, and spark.executor.instances does indeed appear to control the number of executors allocated (at least in YARN). Should I keep this open for you guys to take a look and update the docs? > Allow specifying num of executors in the context configuration > -- > > Key: SPARK-2945 > URL: https://issues.apache.org/jira/browse/SPARK-2945 > Project: Spark > Issue Type: Improvement > Components: Spark Core, YARN >Affects Versions: 1.0.0 > Environment: Ubuntu precise, on YARN (CDH 5.1.0) >Reporter: Shay Rojansky > > Running on YARN, the only way to specify the number of executors seems to be > on the command line of spark-submit, via the --num-executors switch. > In many cases this is too early. Our Spark app receives some cmdline > arguments which determine the amount of work that needs to be done - and that > affects the number of executors it ideally requires. Ideally, the Spark > context configuration would support specifying this like any other config > param. > Our current workaround is a wrapper script that determines how much work is > needed, and which itself launches spark-submit with the number passed to > --num-executors - it's a shame to have to do this. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
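For reference, the (then-undocumented) property confirmed in the comment above is set like any other Spark config. A hedged example of what a spark-defaults.conf entry might look like, with the value 8 chosen arbitrarily:

```properties
# Equivalent to `spark-submit --num-executors 8` on YARN (value is illustrative)
spark.executor.instances   8
```

The same property can be set programmatically on the SparkConf before the context is created, which is the use case the issue asks for.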
[jira] [Commented] (SPARK-2945) Allow specifying num of executors in the context configuration
[ https://issues.apache.org/jira/browse/SPARK-2945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14091702#comment-14091702 ] Shay Rojansky commented on SPARK-2945: -- That would be great news indeed, I didn't find any documentation for this... I'll test this tomorrow and confirm here whether it works as expected. Thanks! > Allow specifying num of executors in the context configuration > -- > > Key: SPARK-2945 > URL: https://issues.apache.org/jira/browse/SPARK-2945 > Project: Spark > Issue Type: Improvement > Components: Spark Core, YARN >Affects Versions: 1.0.0 > Environment: Ubuntu precise, on YARN (CDH 5.1.0) >Reporter: Shay Rojansky > > Running on YARN, the only way to specify the number of executors seems to be > on the command line of spark-submit, via the --num-executors switch. > In many cases this is too early. Our Spark app receives some cmdline > arguments which determine the amount of work that needs to be done - and that > affects the number of executors it ideally requires. Ideally, the Spark > context configuration would support specifying this like any other config > param. > Our current workaround is a wrapper script that determines how much work is > needed, and which itself launches spark-submit with the number passed to > --num-executors - it's a shame to have to do this. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-2946) Allow specifying * for --num-executors in YARN
Shay Rojansky created SPARK-2946: Summary: Allow specifying * for --num-executors in YARN Key: SPARK-2946 URL: https://issues.apache.org/jira/browse/SPARK-2946 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Environment: Ubuntu precise, on YARN (CDH 5.1.0) Reporter: Shay Rojansky Priority: Minor It would be useful to allow specifying --num-executors * when submitting jobs to YARN, and to have Spark automatically determine how many total cores are available in the cluster by querying YARN. Our scenario is multiple users running research batch jobs. We never want to have a situation where cluster resources aren't being used, so ideally users would specify * and let YARN scheduling and preemption ensure fairness. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-2945) Allow specifying num of executors in the context configuration
Shay Rojansky created SPARK-2945: Summary: Allow specifying num of executors in the context configuration Key: SPARK-2945 URL: https://issues.apache.org/jira/browse/SPARK-2945 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Environment: Ubuntu precise, on YARN (CDH 5.1.0) Reporter: Shay Rojansky Running on YARN, the only way to specify the number of executors seems to be on the command line of spark-submit, via the --num-executors switch. In many cases this is too early. Our Spark app receives some cmdline arguments which determine the amount of work that needs to be done - and that affects the number of executors it ideally requires. Ideally, the Spark context configuration would support specifying this like any other config param. Our current workaround is a wrapper script that determines how much work is needed, and which itself launches spark-submit with the number passed to --num-executors - it's a shame to have to do this. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2880) spark-submit processes app cmdline options
[ https://issues.apache.org/jira/browse/SPARK-2880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14090880#comment-14090880 ] Shay Rojansky commented on SPARK-2880: -- It's indeed a duplicate of that bug, great to see that it was fixed! Thanks Patrick. > spark-submit processes app cmdline options > -- > > Key: SPARK-2880 > URL: https://issues.apache.org/jira/browse/SPARK-2880 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0 > Environment: Cloudera 5.1 on Ubuntu precise >Reporter: Shay Rojansky >Priority: Minor > Labels: newbie > > The usage for spark-submit is: > Usage: spark-submit [options] [app options] > However, when running my Python app thus: > spark-submit test.py -v > The -v gets picked up by spark-submit, which enters verbose mode. The correct > behavior seems to be for test.py to receive this parameter. > First time using Spark and submitting, will be happy to contribute a patch if > this is validated as a bug. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-2880) spark-submit processes app cmdline options
Shay Rojansky created SPARK-2880: Summary: spark-submit processes app cmdline options Key: SPARK-2880 URL: https://issues.apache.org/jira/browse/SPARK-2880 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Environment: Cloudera 5.1 on Ubuntu precise Reporter: Shay Rojansky Priority: Minor The usage for spark-submit is: Usage: spark-submit [options] [app options] However, when running my Python app thus: spark-submit test.py -v The -v gets picked up by spark-submit, which enters verbose mode. The correct behavior seems to be for test.py to receive this parameter. First time using Spark and submitting, will be happy to contribute a patch if this is validated as a bug. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
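The bug above boils down to where the launcher's option parsing should stop: everything after the application file belongs to the application. A sketch of that split in Python; the names `launcher` and `split_args` are illustrative, not Spark's actual code (which is Scala):

```python
import argparse

# The launcher recognizes its own options only up to the application file;
# anything after that token is passed through to the app untouched.
launcher = argparse.ArgumentParser(prog="spark-submit")
launcher.add_argument("-v", "--verbose", action="store_true")
launcher.add_argument("app")

def split_args(argv):
    """Split argv at the first non-option token (the application file)."""
    for i, arg in enumerate(argv):
        if not arg.startswith("-"):
            return argv[: i + 1], argv[i + 1 :]
    return argv, []

launcher_argv, app_argv = split_args(["test.py", "-v"])
opts = launcher.parse_args(launcher_argv)  # -v is NOT consumed by the launcher
```

This naive split mishandles launcher options that take a value (e.g. `--num-executors 4`, where the value doesn't start with `-`); a real implementation has to know its own options' arities, which is roughly what the eventual fix did.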