[jira] [Commented] (SPARK-7736) Exception not failing Python applications (in yarn cluster mode)

2015-10-11 Thread Shay Rojansky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14952263#comment-14952263
 ] 

Shay Rojansky commented on SPARK-7736:
--

I've just tested this with Spark 1.5.1 on YARN 2.7.1, and the problem is still 
there: an exception thrown after the SparkContext has been created terminates 
the application, but YARN reports it as SUCCEEDED.

> Exception not failing Python applications (in yarn cluster mode)
> 
>
> Key: SPARK-7736
> URL: https://issues.apache.org/jira/browse/SPARK-7736
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
> Environment: Spark 1.3.1, Yarn 2.7.0, Ubuntu 14.04
>Reporter: Shay Rojansky
>Assignee: Marcelo Vanzin
> Fix For: 1.5.1, 1.6.0
>
>
> It seems that exceptions thrown in Python spark apps after the SparkContext 
> is instantiated don't cause the application to fail, at least in Yarn: the 
> application is marked as SUCCEEDED.
> Note that any exception right before the SparkContext correctly places the 
> application in FAILED state.






[jira] [Commented] (SPARK-8119) HeartbeatReceiver should not adjust application executor resources

2015-07-16 Thread Shay Rojansky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630799#comment-14630799
 ] 

Shay Rojansky commented on SPARK-8119:
--

Thanks Andrew!

 HeartbeatReceiver should not adjust application executor resources
 --

 Key: SPARK-8119
 URL: https://issues.apache.org/jira/browse/SPARK-8119
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.0
Reporter: SaintBacchus
Assignee: Andrew Or
Priority: Critical
 Fix For: 1.5.0


 Dynamic allocation sets the total executor count to a small number when it 
 wants to kill some executors.
 But even in the non-dynamic-allocation scenario, Spark still sets the total executor count.
 This causes the following problem: when an executor goes down, no replacement 
 executor is ever brought back up by Spark.
 === EDIT by andrewor14 ===
 The issue is that the AM forgets about the original number of executors it 
 wants after calling sc.killExecutor. Even if dynamic allocation is not 
 enabled, this is still possible because of heartbeat timeouts.
 I think the problem is that sc.killExecutor is used incorrectly in 
 HeartbeatReceiver. The intention of the method is to permanently adjust the 
 number of executors the application will get. In HeartbeatReceiver, however, 
 this is used as a best-effort mechanism to ensure that the timed out executor 
 is dead.






[jira] [Commented] (SPARK-8119) HeartbeatReceiver should not adjust application executor resources

2015-07-16 Thread Shay Rojansky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630380#comment-14630380
 ] 

Shay Rojansky commented on SPARK-8119:
--

Will this really not be fixed before 1.5? This issue makes Spark 1.4 unusable 
in a YARN environment where preemption may happen.

 HeartbeatReceiver should not adjust application executor resources
 --

 Key: SPARK-8119
 URL: https://issues.apache.org/jira/browse/SPARK-8119
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.0
Reporter: SaintBacchus
Assignee: Andrew Or
Priority: Critical

 Dynamic allocation sets the total executor count to a small number when it 
 wants to kill some executors.
 But even in the non-dynamic-allocation scenario, Spark still sets the total executor count.
 This causes the following problem: when an executor goes down, no replacement 
 executor is ever brought back up by Spark.
 === EDIT by andrewor14 ===
 The issue is that the AM forgets about the original number of executors it 
 wants after calling sc.killExecutor. Even if dynamic allocation is not 
 enabled, this is still possible because of heartbeat timeouts.
 I think the problem is that sc.killExecutor is used incorrectly in 
 HeartbeatReceiver. The intention of the method is to permanently adjust the 
 number of executors the application will get. In HeartbeatReceiver, however, 
 this is used as a best-effort mechanism to ensure that the timed out executor 
 is dead.






[jira] [Commented] (SPARK-7736) Exception not failing Python applications (in yarn cluster mode)

2015-07-10 Thread Shay Rojansky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14621832#comment-14621832
 ] 

Shay Rojansky commented on SPARK-7736:
--

Neelesh, I'm not sure I understood exactly what you're saying... I agree with Esben 
that, at the end of the day, if a Spark application fails (by throwing an 
exception) and does so on all YARN application attempts, then the YARN status 
of that application should definitely be FAILED...

 Exception not failing Python applications (in yarn cluster mode)
 

 Key: SPARK-7736
 URL: https://issues.apache.org/jira/browse/SPARK-7736
 Project: Spark
  Issue Type: Bug
  Components: YARN
 Environment: Spark 1.3.1, Yarn 2.7.0, Ubuntu 14.04
Reporter: Shay Rojansky

 It seems that exceptions thrown in Python spark apps after the SparkContext 
 is instantiated don't cause the application to fail, at least in Yarn: the 
 application is marked as SUCCEEDED.
 Note that any exception right before the SparkContext correctly places the 
 application in FAILED state.






[jira] [Commented] (SPARK-8374) Job frequently hangs after YARN preemption

2015-06-29 Thread Shay Rojansky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605425#comment-14605425
 ] 

Shay Rojansky commented on SPARK-8374:
--

Thanks for your comment and sure, I can help test. I may need a bit of 
hand-holding since I haven't built Spark yet.

 Job frequently hangs after YARN preemption
 --

 Key: SPARK-8374
 URL: https://issues.apache.org/jira/browse/SPARK-8374
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.4.0
 Environment: YARN 2.7.0, Spark 1.4.0, Ubuntu 14.04
Reporter: Shay Rojansky
Priority: Critical

 After upgrading to Spark 1.4.0, jobs that get preempted very frequently will 
 not reacquire executors and will therefore hang. To reproduce:
 1. I run Spark job A that acquires all grid resources
 2. I run Spark job B in a higher-priority queue that acquires all grid 
 resources. Job A is fully preempted.
 3. Kill job B, releasing all resources
 4. Job A should at this point reacquire all grid resources, but occasionally 
 doesn't. Repeating the preemption scenario makes the bad behavior occur 
 within a few attempts.
 (see logs at bottom).
 Note issue SPARK-7451, which was supposed to fix some Spark YARN preemption 
 issues; the work there may be related to these new issues.
 The 1.4.0 preemption situation is considerably worse than in 1.3.1 (we've 
 downgraded to 1.3.1 just because of this issue).
 Logs
 --
 When job B (the preemptor) first acquires an application master, the following 
 is logged by job A (the preemptee):
 {noformat}
 ERROR YarnScheduler: Lost executor 447 on g023.grid.eaglerd.local: remote Rpc 
 client disassociated
 INFO TaskSetManager: Re-queueing tasks for 447 from TaskSet 0.0
 WARN ReliableDeliverySupervisor: Association with remote system 
 [akka.tcp://sparkexecu...@g023.grid.eaglerd.local:54167] has failed, address 
 is now gated for [5000] ms. Reason is: [Disassociated].
 WARN TaskSetManager: Lost task 15.0 in stage 0.0 (TID 15, 
 g023.grid.eaglerd.local): ExecutorLostFailure (executor 447 lost)
 INFO DAGScheduler: Executor lost: 447 (epoch 0)
 INFO BlockManagerMasterEndpoint: Trying to remove executor 447 from 
 BlockManagerMaster.
 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(447, 
 g023.grid.eaglerd.local, 41406)
 INFO BlockManagerMaster: Removed 447 successfully in removeExecutor
 {noformat}
 (It's strange for errors/warnings to be logged for preemption)
 Later, when job B's AM starts requesting its resources, I get lots of the 
 following in job A:
 {noformat}
 ERROR YarnScheduler: Lost executor 415 on g033.grid.eaglerd.local: remote Rpc 
 client disassociated
 INFO TaskSetManager: Re-queueing tasks for 415 from TaskSet 0.0
 WARN TaskSetManager: Lost task 231.0 in stage 0.0 (TID 231, 
 g033.grid.eaglerd.local): ExecutorLostFailure (executor 415 lost)
 WARN ReliableDeliverySupervisor: Association with remote system 
 [akka.tcp://sparkexecu...@g023.grid.eaglerd.local:34357] has failed, address 
 is now gated for [5000] ms. Reason is: [Disassociated].
 {noformat}
 Finally, when I kill job B, job A emits lots of the following:
 {noformat}
 INFO YarnClientSchedulerBackend: Requesting to kill executor(s) 31
 WARN YarnClientSchedulerBackend: Executor to kill 31 does not exist!
 {noformat}
 And finally after some time:
 {noformat}
 WARN HeartbeatReceiver: Removing executor 466 with no recent heartbeats: 
 165964 ms exceeds timeout 120000 ms
 ERROR YarnScheduler: Lost an executor 466 (already removed): Executor 
 heartbeat timed out after 165964 ms
 {noformat}
 At this point the job never requests/acquires more resources and hangs.






[jira] [Commented] (SPARK-8374) Job frequently hangs after YARN preemption

2015-06-28 Thread Shay Rojansky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605117#comment-14605117
 ] 

Shay Rojansky commented on SPARK-8374:
--

Any chance someone can look at this bug, at least to confirm it? This is a 
pretty serious issue preventing Spark 1.4 use in YARN where preemption may 
happen...

 Job frequently hangs after YARN preemption
 --

 Key: SPARK-8374
 URL: https://issues.apache.org/jira/browse/SPARK-8374
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.4.0
 Environment: YARN 2.7.0, Spark 1.4.0, Ubuntu 14.04
Reporter: Shay Rojansky
Priority: Critical

 After upgrading to Spark 1.4.0, jobs that get preempted very frequently will 
 not reacquire executors and will therefore hang. To reproduce:
 1. I run Spark job A that acquires all grid resources
 2. I run Spark job B in a higher-priority queue that acquires all grid 
 resources. Job A is fully preempted.
 3. Kill job B, releasing all resources
 4. Job A should at this point reacquire all grid resources, but occasionally 
 doesn't. Repeating the preemption scenario makes the bad behavior occur 
 within a few attempts.
 (see logs at bottom).
 Note issue SPARK-7451, which was supposed to fix some Spark YARN preemption 
 issues; the work there may be related to these new issues.
 The 1.4.0 preemption situation is considerably worse than in 1.3.1 (we've 
 downgraded to 1.3.1 just because of this issue).
 Logs
 --
 When job B (the preemptor) first acquires an application master, the following 
 is logged by job A (the preemptee):
 {noformat}
 ERROR YarnScheduler: Lost executor 447 on g023.grid.eaglerd.local: remote Rpc 
 client disassociated
 INFO TaskSetManager: Re-queueing tasks for 447 from TaskSet 0.0
 WARN ReliableDeliverySupervisor: Association with remote system 
 [akka.tcp://sparkexecu...@g023.grid.eaglerd.local:54167] has failed, address 
 is now gated for [5000] ms. Reason is: [Disassociated].
 WARN TaskSetManager: Lost task 15.0 in stage 0.0 (TID 15, 
 g023.grid.eaglerd.local): ExecutorLostFailure (executor 447 lost)
 INFO DAGScheduler: Executor lost: 447 (epoch 0)
 INFO BlockManagerMasterEndpoint: Trying to remove executor 447 from 
 BlockManagerMaster.
 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(447, 
 g023.grid.eaglerd.local, 41406)
 INFO BlockManagerMaster: Removed 447 successfully in removeExecutor
 {noformat}
 (It's strange for errors/warnings to be logged for preemption)
 Later, when job B's AM starts requesting its resources, I get lots of the 
 following in job A:
 {noformat}
 ERROR YarnScheduler: Lost executor 415 on g033.grid.eaglerd.local: remote Rpc 
 client disassociated
 INFO TaskSetManager: Re-queueing tasks for 415 from TaskSet 0.0
 WARN TaskSetManager: Lost task 231.0 in stage 0.0 (TID 231, 
 g033.grid.eaglerd.local): ExecutorLostFailure (executor 415 lost)
 WARN ReliableDeliverySupervisor: Association with remote system 
 [akka.tcp://sparkexecu...@g023.grid.eaglerd.local:34357] has failed, address 
 is now gated for [5000] ms. Reason is: [Disassociated].
 {noformat}
 Finally, when I kill job B, job A emits lots of the following:
 {noformat}
 INFO YarnClientSchedulerBackend: Requesting to kill executor(s) 31
 WARN YarnClientSchedulerBackend: Executor to kill 31 does not exist!
 {noformat}
 And finally after some time:
 {noformat}
 WARN HeartbeatReceiver: Removing executor 466 with no recent heartbeats: 
 165964 ms exceeds timeout 120000 ms
 ERROR YarnScheduler: Lost an executor 466 (already removed): Executor 
 heartbeat timed out after 165964 ms
 {noformat}
 At this point the job never requests/acquires more resources and hangs.






[jira] [Commented] (SPARK-7736) Exception not failing Python applications (in yarn cluster mode)

2015-06-25 Thread Shay Rojansky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14601923#comment-14601923
 ] 

Shay Rojansky commented on SPARK-7736:
--

The problem is simply with the YARN status for the application. If a Spark 
application throws an exception after having instantiated the SparkContext, the 
application obviously terminates but YARN lists the job as SUCCEEDED. This 
makes it hard for users to see what happened to their jobs in the YARN UI.

Let me know if this is still unclear.

 Exception not failing Python applications (in yarn cluster mode)
 

 Key: SPARK-7736
 URL: https://issues.apache.org/jira/browse/SPARK-7736
 Project: Spark
  Issue Type: Bug
  Components: YARN
 Environment: Spark 1.3.1, Yarn 2.7.0, Ubuntu 14.04
Reporter: Shay Rojansky

 It seems that exceptions thrown in Python spark apps after the SparkContext 
 is instantiated don't cause the application to fail, at least in Yarn: the 
 application is marked as SUCCEEDED.
 Note that any exception right before the SparkContext correctly places the 
 application in FAILED state.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8374) Job frequently hangs after YARN preemption

2015-06-15 Thread Shay Rojansky (JIRA)
Shay Rojansky created SPARK-8374:


 Summary: Job frequently hangs after YARN preemption
 Key: SPARK-8374
 URL: https://issues.apache.org/jira/browse/SPARK-8374
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.4.0
 Environment: YARN 2.7.0, Spark 1.4.0, Ubuntu 14.04
Reporter: Shay Rojansky
Priority: Critical


After upgrading to Spark 1.4.0, jobs that get preempted very frequently will 
not reacquire executors and will therefore hang. To reproduce:

1. I run Spark job A that acquires all grid resources
2. I run Spark job B in a higher-priority queue that acquires all grid 
resources. Job A is fully preempted.
3. Kill job B, releasing all resources
4. Job A should at this point reacquire all grid resources, but occasionally 
doesn't. Repeating the preemption scenario makes the bad behavior occur within 
a few attempts.

(see logs at bottom).

Note issue SPARK-7451, which was supposed to fix some Spark YARN preemption 
issues; the work there may be related to these new issues.

The 1.4.0 preemption situation is considerably worse than in 1.3.1 (we've 
downgraded to 1.3.1 just because of this issue).

Logs
--
When job B (the preemptor) first acquires an application master, the following 
is logged by job A (the preemptee):

{noformat}
ERROR YarnScheduler: Lost executor 447 on g023.grid.eaglerd.local: remote Rpc 
client disassociated
INFO TaskSetManager: Re-queueing tasks for 447 from TaskSet 0.0
WARN ReliableDeliverySupervisor: Association with remote system 
[akka.tcp://sparkexecu...@g023.grid.eaglerd.local:54167] has failed, address is 
now gated for [5000] ms. Reason is: [Disassociated].
WARN TaskSetManager: Lost task 15.0 in stage 0.0 (TID 15, 
g023.grid.eaglerd.local): ExecutorLostFailure (executor 447 lost)
INFO DAGScheduler: Executor lost: 447 (epoch 0)
INFO BlockManagerMasterEndpoint: Trying to remove executor 447 from 
BlockManagerMaster.
INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(447, 
g023.grid.eaglerd.local, 41406)
INFO BlockManagerMaster: Removed 447 successfully in removeExecutor
{noformat}

(It's strange for errors/warnings to be logged for preemption)

Later, when job B's AM starts requesting its resources, I get lots of the 
following in job A:

{noformat}
ERROR YarnScheduler: Lost executor 415 on g033.grid.eaglerd.local: remote Rpc 
client disassociated
INFO TaskSetManager: Re-queueing tasks for 415 from TaskSet 0.0
WARN TaskSetManager: Lost task 231.0 in stage 0.0 (TID 231, 
g033.grid.eaglerd.local): ExecutorLostFailure (executor 415 lost)
WARN ReliableDeliverySupervisor: Association with remote system 
[akka.tcp://sparkexecu...@g023.grid.eaglerd.local:34357] has failed, address is 
now gated for [5000] ms. Reason is: [Disassociated].
{noformat}

Finally, when I kill job B, job A emits lots of the following:

{noformat}
INFO YarnClientSchedulerBackend: Requesting to kill executor(s) 31
WARN YarnClientSchedulerBackend: Executor to kill 31 does not exist!
{noformat}

And finally after some time:

{noformat}
WARN HeartbeatReceiver: Removing executor 466 with no recent heartbeats: 165964 
ms exceeds timeout 120000 ms
ERROR YarnScheduler: Lost an executor 466 (already removed): Executor heartbeat 
timed out after 165964 ms
{noformat}

At this point the job never requests/acquires more resources and hangs.






[jira] [Commented] (SPARK-7725) --py-files doesn't seem to work in YARN cluster mode

2015-05-29 Thread Shay Rojansky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14564421#comment-14564421
 ] 

Shay Rojansky commented on SPARK-7725:
--

Here you go: 
https://mail-archives.apache.org/mod_mbox/spark-user/201505.mbox/%3CCADT4RqDTwTmR_vRCz5THXbitfA%2BCOc-zBc3j6o7H-qQHKk--5w%40mail.gmail.com%3E

 --py-files doesn't seem to work in YARN cluster mode
 

 Key: SPARK-7725
 URL: https://issues.apache.org/jira/browse/SPARK-7725
 Project: Spark
  Issue Type: Bug
  Components: Deploy, YARN
Affects Versions: 1.3.1
 Environment: Ubuntu 14.04, YARN 2.7.0 on local filesystem
Reporter: Shay Rojansky

 I'm having issues with submitting a Spark Yarn job in cluster mode when the 
 cluster filesystem is file:///. It seems that additional resources 
 (--py-files) are simply being skipped and not being added into the 
 PYTHONPATH. The same issue may also exist for --jars, --files, etc. (I 
 haven't checked)
 (I sent a mail to the Spark users list and Marcelo Vanzin confirms it's a 
 bug, unrelated to the local filesystem)






[jira] [Created] (SPARK-7725) --py-files doesn't seem to work in YARN cluster mode

2015-05-19 Thread Shay Rojansky (JIRA)
Shay Rojansky created SPARK-7725:


 Summary: --py-files doesn't seem to work in YARN cluster mode
 Key: SPARK-7725
 URL: https://issues.apache.org/jira/browse/SPARK-7725
 Project: Spark
  Issue Type: Bug
  Components: Deploy, YARN
Affects Versions: 1.3.1
 Environment: Ubuntu 14.04, YARN 2.7.0 on local filesystem
Reporter: Shay Rojansky


I'm having issues with submitting a Spark Yarn job in cluster mode when the 
cluster filesystem is file:///. It seems that additional resources (--py-files) 
are simply being skipped and not being added into the PYTHONPATH. The same 
issue may also exist for --jars, --files, etc. (I haven't checked)

(I sent a mail to the Spark users list and Marcelo Vanzin confirms it's a bug, 
unrelated to the local filesystem)
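
For context, a minimal sketch of the pattern that breaks here; the module name (helpers) and its transform function are hypothetical stand-ins for whatever gets shipped via --py-files:

{code:python}
# Submitted (hypothetically) with:
#   spark-submit --master yarn-cluster --py-files helpers.py app.py
# When the bug hits, the import below fails on the cluster because helpers.py
# never makes it onto the PYTHONPATH of the driver/executors.
from pyspark import SparkContext

import helpers  # hypothetical module shipped via --py-files

sc = SparkContext(appName="PyFilesCheck")
print(sc.parallelize([1, 2, 3]).map(helpers.transform).collect())
sc.stop()
{code}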







[jira] [Closed] (SPARK-7709) spark-submit option to quit after submitting in cluster mode

2015-05-19 Thread Shay Rojansky (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shay Rojansky closed SPARK-7709.

Resolution: Duplicate

Oops, it seems this was already implemented...

 spark-submit option to quit after submitting in cluster mode
 

 Key: SPARK-7709
 URL: https://issues.apache.org/jira/browse/SPARK-7709
 Project: Spark
  Issue Type: New Feature
  Components: Deploy
Affects Versions: 1.3.1
Reporter: Shay Rojansky
Priority: Minor

 When deploying in cluster mode, spark-submit continues polling the 
 application every second. While this is a useful feature, there should be an 
 option to have spark-submit exit immediately after submission completes. This 
 would allow scripts to figure out that a job was successfully (or 
 unsuccessfully) submitted.






[jira] [Created] (SPARK-7736) Exception not failing Python applications (in yarn cluster mode)

2015-05-19 Thread Shay Rojansky (JIRA)
Shay Rojansky created SPARK-7736:


 Summary: Exception not failing Python applications (in yarn 
cluster mode)
 Key: SPARK-7736
 URL: https://issues.apache.org/jira/browse/SPARK-7736
 Project: Spark
  Issue Type: Bug
  Components: YARN
 Environment: Spark 1.3.1, Yarn 2.7.0, Ubuntu 14.04
Reporter: Shay Rojansky


It seems that exceptions thrown in Python spark apps after the SparkContext is 
instantiated don't cause the application to fail, at least in Yarn: the 
application is marked as SUCCEEDED.

Note that any exception right before the SparkContext correctly places the 
application in FAILED state.
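
A minimal sketch of the reported scenario, assuming a PySpark script submitted in yarn-cluster mode (the script and app names are illustrative):

{code:python}
# fail_after_context.py -- hypothetical repro script.
# Submit with: spark-submit --master yarn-cluster fail_after_context.py
from pyspark import SparkContext

sc = SparkContext(appName="ExceptionAfterContext")

# The driver dies here with an unhandled exception, yet YARN still records
# the application's final status as SUCCEEDED (the behaviour reported above).
raise RuntimeError("simulated failure after SparkContext creation")
{code}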






[jira] [Created] (SPARK-7709) spark-submit option to quit after submitting in cluster mode

2015-05-18 Thread Shay Rojansky (JIRA)
Shay Rojansky created SPARK-7709:


 Summary: spark-submit option to quit after submitting in cluster 
mode
 Key: SPARK-7709
 URL: https://issues.apache.org/jira/browse/SPARK-7709
 Project: Spark
  Issue Type: New Feature
  Components: Deploy
Affects Versions: 1.3.1
Reporter: Shay Rojansky
Priority: Minor


When deploying in cluster mode, spark-submit continues polling the application 
every second. While this is a useful feature, there should be an option to have 
spark-submit exit immediately after submission completes. This would allow 
scripts to figure out that a job was successfully (or unsuccessfully) submitted.






[jira] [Commented] (SPARK-3644) REST API for Spark application info (jobs / stages / tasks / storage info)

2015-05-17 Thread Shay Rojansky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547073#comment-14547073
 ] 

Shay Rojansky commented on SPARK-3644:
--

+1 on this; our main use would be to get progress information on a running 
Spark job.

SPARK-5925 exists for exposing Spark progress through the generic YARN 
progress mechanism, but as the commenter there points out, it isn't clear how 
to expose the complicated multi-stage Spark progress as a simple progress bar. 
Hence full REST access to the state of a job would be necessary.
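
To make that concrete, a sketch of what a client of such an API might look like; the endpoint path and JSON field names below are purely illustrative assumptions, not an agreed design:

{code:python}
# Hypothetical client sketch: poll a (not yet designed) REST endpoint for the
# progress of one job. Endpoint layout and field names are assumptions.
import json
import urllib.request

BASE = "http://localhost:4040/api/applications"  # hypothetical base URL

def job_progress(app_id, job_id):
    url = "%s/%s/jobs/%d" % (BASE, app_id, job_id)
    with urllib.request.urlopen(url) as resp:
        job = json.load(resp)
    # assumed fields: numTasks / numCompletedTasks
    return job["numCompletedTasks"] / float(job["numTasks"])
{code}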

 REST API for Spark application info (jobs / stages / tasks / storage info)
 --

 Key: SPARK-3644
 URL: https://issues.apache.org/jira/browse/SPARK-3644
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core, Web UI
Reporter: Josh Rosen

 This JIRA is a forum to draft a design proposal for a REST interface for 
 accessing information about Spark applications, such as job / stage / task / 
 storage status.
 There have been a number of proposals to serve JSON representations of the 
 information displayed in Spark's web UI.  Given that we might redesign the 
 pages of the web UI (and possibly re-implement the UI as a client of a REST 
 API), the API endpoints and their responses should be independent of what we 
 choose to display on particular web UI pages / layouts.
 Let's start a discussion of what a good REST API would look like from 
 first-principles.  We can discuss what urls / endpoints expose access to 
 data, how our JSON responses will be formatted, how fields will be named, how 
 the API will be documented and tested, etc.
 Some links for inspiration:
 https://developer.github.com/v3/
 http://developer.netflix.com/docs/REST_API_Reference
 https://helloreverb.com/developers/swagger






[jira] [Created] (SPARK-3470) Have JavaSparkContext implement Closeable/AutoCloseable

2014-09-10 Thread Shay Rojansky (JIRA)
Shay Rojansky created SPARK-3470:


 Summary: Have JavaSparkContext implement Closeable/AutoCloseable
 Key: SPARK-3470
 URL: https://issues.apache.org/jira/browse/SPARK-3470
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 1.0.2
Reporter: Shay Rojansky
Priority: Minor


After discussion in SPARK-2972, it seems like a good idea to allow Java 
developers to use Java 7 automatic resource management with JavaSparkContext, 
like so:

{code:java}
try (JavaSparkContext ctx = new JavaSparkContext(...)) {
   // use ctx here; it is stopped automatically when the block exits
}
{code}







[jira] [Created] (SPARK-3471) Automatic resource manager for SparkContext in Scala?

2014-09-10 Thread Shay Rojansky (JIRA)
Shay Rojansky created SPARK-3471:


 Summary: Automatic resource manager for SparkContext in Scala?
 Key: SPARK-3471
 URL: https://issues.apache.org/jira/browse/SPARK-3471
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 1.0.2
Reporter: Shay Rojansky
Priority: Minor


After discussion in SPARK-2972, it seems like a good idea to add automatic 
resource management semantics to SparkContext (i.e. 'with' in Python 
(SPARK-3458), Closeable/AutoCloseable in Java (SPARK-3470)).

I have no knowledge of Scala whatsoever, but a quick search seems to indicate 
that there isn't a standard mechanism for this - someone with real Scala 
knowledge should take a look and make a decision...






[jira] [Closed] (SPARK-2972) APPLICATION_COMPLETE not created in Python unless context explicitly stopped

2014-09-10 Thread Shay Rojansky (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shay Rojansky closed SPARK-2972.

Resolution: Won't Fix

 APPLICATION_COMPLETE not created in Python unless context explicitly stopped
 

 Key: SPARK-2972
 URL: https://issues.apache.org/jira/browse/SPARK-2972
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.0.2
 Environment: Cloudera 5.1, yarn master on ubuntu precise
Reporter: Shay Rojansky

 If you don't explicitly stop a SparkContext at the end of a Python 
 application with sc.stop(), an APPLICATION_COMPLETE file isn't created and 
 the job doesn't get picked up by the history server.
 This can be easily reproduced with pyspark (but affects scripts as well).
 The current workaround is to wrap the entire script with a try/finally and 
 stop manually.






[jira] [Commented] (SPARK-3470) Have JavaSparkContext implement Closeable/AutoCloseable

2014-09-10 Thread Shay Rojansky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14128285#comment-14128285
 ] 

Shay Rojansky commented on SPARK-3470:
--

Good point about AutoCloseable. Yes, the idea is for Closeable to call stop(). 
I'd submit a PR myself but I don't know any Scala whatsoever...

 Have JavaSparkContext implement Closeable/AutoCloseable
 ---

 Key: SPARK-3470
 URL: https://issues.apache.org/jira/browse/SPARK-3470
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 1.0.2
Reporter: Shay Rojansky
Priority: Minor

 After discussion in SPARK-2972, it seems like a good idea to allow Java 
 developers to use Java 7 automatic resource management with JavaSparkContext, 
 like so:
 {code:java}
 try (JavaSparkContext ctx = new JavaSparkContext(...)) {
     // use ctx here; it is stopped automatically when the block exits
 }
 {code}






[jira] [Commented] (SPARK-2972) APPLICATION_COMPLETE not created in Python unless context explicitly stopped

2014-09-09 Thread Shay Rojansky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14126686#comment-14126686
 ] 

Shay Rojansky commented on SPARK-2972:
--

 you're right! imho, this means your program is written better than the 
 examples. it would be good to enhance the examples w/ try/finally semantics. 
 however,

Then I can submit a pull request for that, no problem.

 getting the shutdown semantics right is difficult, and may not apply broadly 
 across applications. for instance, your application may want to catch a 
 failure in stop() and retry to make sure that a history record is written. 
 another application may be ok w/ best effort writing history events. still 
 another application may want to exit w/o stop() to avoid having a history 
 event written.

I don't think explicit stop() should be removed - of course users may choose to 
manually manage stop(), catch exceptions and retry, etc. For me it's just a 
question of what to do with a context that *didn't* get explicitly closed at 
the end of the application.

As to apps that need to exit without a history event - it's a requirement 
that's hard to imagine (for me). At least with YARN/Mesos you will be leaving 
traces anyway, and these traces will be partial and difficult to understand, 
since the corresponding Spark traces haven't been produced.

 asking the context creator to do context destruction shifts burden to the 
 application writer and maintains flexibility for applications.

I guess it's a question of how high-level a tool you want Spark to be. It seems 
a bit strange for Spark to handle so much of the troublesome low-level details, 
while forcing the user to boilerplate-wrap all their programs with try/finally.

But I do understand the points you're making, and it can be argued both ways. At 
a minimum, I suggest having the context implement the language-specific dispose 
patterns (try-with-resources in Java, 'with' in Python), so at least the code looks better?
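
For illustration, a minimal sketch of the 'with' pattern being suggested, written as a small user-side wrapper (assuming SparkContext itself does not yet support the context-manager protocol):

{code:python}
# Sketch of a user-side context manager so that sc.stop() always runs,
# even when the application body raises.
from contextlib import contextmanager
from pyspark import SparkContext

@contextmanager
def spark_context(*args, **kwargs):
    sc = SparkContext(*args, **kwargs)
    try:
        yield sc
    finally:
        sc.stop()

with spark_context(appName="MyApp") as sc:
    print(sc.parallelize(range(100)).count())
{code}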

 APPLICATION_COMPLETE not created in Python unless context explicitly stopped
 

 Key: SPARK-2972
 URL: https://issues.apache.org/jira/browse/SPARK-2972
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.0.2
 Environment: Cloudera 5.1, yarn master on ubuntu precise
Reporter: Shay Rojansky

 If you don't explicitly stop a SparkContext at the end of a Python 
 application with sc.stop(), an APPLICATION_COMPLETE file isn't created and 
 the job doesn't get picked up by the history server.
 This can be easily reproduced with pyspark (but affects scripts as well).
 The current workaround is to wrap the entire script with a try/finally and 
 stop manually.






[jira] [Created] (SPARK-3457) ConcurrentModificationException starting up pyspark

2014-09-09 Thread Shay Rojansky (JIRA)
Shay Rojansky created SPARK-3457:


 Summary: ConcurrentModificationException starting up pyspark
 Key: SPARK-3457
 URL: https://issues.apache.org/jira/browse/SPARK-3457
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
 Environment: Hadoop 2.3 (CDH 5.1) on Ubuntu precise
Reporter: Shay Rojansky


Just downloaded Spark 1.1.0-rc4. Launching pyspark for the very first time in 
yarn-client mode (no additional params or anything), I got the exception below. 
Rerunning pyspark 5 times afterwards did not reproduce the issue.

14/09/09 18:07:58 INFO YarnClientSchedulerBackend: Application report from ASM:
 appMasterRpcPort: 0
 appStartTime: 1410275267606
 yarnAppState: RUNNING

14/09/09 18:07:58 INFO YarnClientSchedulerBackend: Add WebUI Filter. 
org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter, PROXY_HOST=master.
grid.eaglerd.local,PROXY_URI_BASE=http://master.grid.eaglerd.local:8088/proxy/application_1410268447887_0011,
 /proxy/application_1410268447887_0011
Traceback (most recent call last):
  File "/opt/spark/python/pyspark/shell.py", line 44, in <module>
14/09/09 18:07:58 INFO JettyUtils: Adding filter: 
org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
    sc = SparkContext(appName="PySparkShell", pyFiles=add_files)
  File "/opt/spark/python/pyspark/context.py", line 107, in __init__
    conf)
  File "/opt/spark/python/pyspark/context.py", line 155, in _do_init
    self._jsc = self._initialize_context(self._conf._jconf)
  File "/opt/spark/python/pyspark/context.py", line 201, in _initialize_context
    return self._jvm.JavaSparkContext(jconf)
  File "/opt/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 701, in __call__
  File "/opt/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling 
None.org.apache.spark.api.java.JavaSparkContext.
: java.util.ConcurrentModificationException
at java.util.Hashtable$Enumerator.next(Hashtable.java:1167)
at 
scala.collection.convert.Wrappers$JPropertiesWrapper$$anon$3.next(Wrappers.scala:458)
at 
scala.collection.convert.Wrappers$JPropertiesWrapper$$anon$3.next(Wrappers.scala:454)
at scala.collection.Iterator$class.toStream(Iterator.scala:1143)
at scala.collection.AbstractIterator.toStream(Iterator.scala:1157)
at 
scala.collection.Iterator$$anonfun$toStream$1.apply(Iterator.scala:1143)
at 
scala.collection.Iterator$$anonfun$toStream$1.apply(Iterator.scala:1143)
at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1085)
at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1077)
at 
scala.collection.immutable.Stream$$anonfun$filteredTail$1.apply(Stream.scala:1149)
at 
scala.collection.immutable.Stream$$anonfun$filteredTail$1.apply(Stream.scala:1149)
at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1085)
at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1077)
at scala.collection.immutable.Stream.length(Stream.scala:284)
at scala.collection.SeqLike$class.sorted(SeqLike.scala:608)
at scala.collection.AbstractSeq.sorted(Seq.scala:40)
at org.apache.spark.SparkEnv$.environmentDetails(SparkEnv.scala:324)
at 
org.apache.spark.SparkContext.postEnvironmentUpdate(SparkContext.scala:1297)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:334)
at 
org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:53)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:214)
at 
py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)







[jira] [Updated] (SPARK-3457) ConcurrentModificationException starting up pyspark

2014-09-09 Thread Shay Rojansky (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shay Rojansky updated SPARK-3457:
-
Description: 
Just downloaded Spark 1.1.0-rc4. Launching pyspark for the very first time in 
yarn-client mode (no additional params or anything), I got the exception below. 
Rerunning pyspark 5 times afterwards did not reproduce the issue.

{code}
14/09/09 18:07:58 INFO YarnClientSchedulerBackend: Application report from ASM:
 appMasterRpcPort: 0
 appStartTime: 1410275267606
 yarnAppState: RUNNING

14/09/09 18:07:58 INFO YarnClientSchedulerBackend: Add WebUI Filter. 
org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter, PROXY_HOST=master.
grid.eaglerd.local,PROXY_URI_BASE=http://master.grid.eaglerd.local:8088/proxy/application_1410268447887_0011,
 /proxy/application_1410268447887_0011
Traceback (most recent call last):
  File "/opt/spark/python/pyspark/shell.py", line 44, in <module>
14/09/09 18:07:58 INFO JettyUtils: Adding filter: 
org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
    sc = SparkContext(appName="PySparkShell", pyFiles=add_files)
  File "/opt/spark/python/pyspark/context.py", line 107, in __init__
    conf)
  File "/opt/spark/python/pyspark/context.py", line 155, in _do_init
    self._jsc = self._initialize_context(self._conf._jconf)
  File "/opt/spark/python/pyspark/context.py", line 201, in _initialize_context
    return self._jvm.JavaSparkContext(jconf)
  File "/opt/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 701, in __call__
  File "/opt/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling 
None.org.apache.spark.api.java.JavaSparkContext.
: java.util.ConcurrentModificationException
at java.util.Hashtable$Enumerator.next(Hashtable.java:1167)
at 
scala.collection.convert.Wrappers$JPropertiesWrapper$$anon$3.next(Wrappers.scala:458)
at 
scala.collection.convert.Wrappers$JPropertiesWrapper$$anon$3.next(Wrappers.scala:454)
at scala.collection.Iterator$class.toStream(Iterator.scala:1143)
at scala.collection.AbstractIterator.toStream(Iterator.scala:1157)
at 
scala.collection.Iterator$$anonfun$toStream$1.apply(Iterator.scala:1143)
at 
scala.collection.Iterator$$anonfun$toStream$1.apply(Iterator.scala:1143)
at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1085)
at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1077)
at 
scala.collection.immutable.Stream$$anonfun$filteredTail$1.apply(Stream.scala:1149)
at 
scala.collection.immutable.Stream$$anonfun$filteredTail$1.apply(Stream.scala:1149)
at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1085)
at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1077)
at scala.collection.immutable.Stream.length(Stream.scala:284)
at scala.collection.SeqLike$class.sorted(SeqLike.scala:608)
at scala.collection.AbstractSeq.sorted(Seq.scala:40)
at org.apache.spark.SparkEnv$.environmentDetails(SparkEnv.scala:324)
at 
org.apache.spark.SparkContext.postEnvironmentUpdate(SparkContext.scala:1297)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:334)
at 
org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:53)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:214)
at 
py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)
{code}

  was:
Just downloaded Spark 1.1.0-rc4. Launching pyspark for the very first time in 
yarn-client mode (no additional params or anything), I got the exception below. 
Rerunning pyspark 5 times afterwards did not reproduce the issue.

14/09/09 18:07:58 INFO YarnClientSchedulerBackend: Application report from ASM:
 appMasterRpcPort: 0
 appStartTime: 1410275267606
 yarnAppState: RUNNING

14/09/09 18:07:58 INFO YarnClientSchedulerBackend: Add WebUI Filter. 
org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter, PROXY_HOST=master.
grid.eaglerd.local,PROXY_URI_BASE=http://master.grid.eaglerd.local:8088/proxy/application_1410268447887_0011,
 

[jira] [Commented] (SPARK-2972) APPLICATION_COMPLETE not created in Python unless context explicitly stopped

2014-09-09 Thread Shay Rojansky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14127171#comment-14127171
 ] 

Shay Rojansky commented on SPARK-2972:
--

I'd love to help on this, but I know 0 Scala (I could have helped with the 
Python though :)).

A quick search shows that Scala has no equivalent of Python's 'with' or Java's 
Closeable. There are several third-party implementations out there, 
but it doesn't seem right to bring in a non-core library for this kind of 
thing. I think someone with real Scala knowledge should take a look at this.

We can close this issue and open a separate one for the Scala closeability if 
you want.

 APPLICATION_COMPLETE not created in Python unless context explicitly stopped
 

 Key: SPARK-2972
 URL: https://issues.apache.org/jira/browse/SPARK-2972
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.0.2
 Environment: Cloudera 5.1, yarn master on ubuntu precise
Reporter: Shay Rojansky

 If you don't explicitly stop a SparkContext at the end of a Python 
 application with sc.stop(), an APPLICATION_COMPLETE file isn't created and 
 the job doesn't get picked up by the history server.
 This can be easily reproduced with pyspark (but affects scripts as well).
 The current workaround is to wrap the entire script with a try/finally and 
 stop manually.






[jira] [Commented] (SPARK-2972) APPLICATION_COMPLETE not created in Python unless context explicitly stopped

2014-09-07 Thread Shay Rojansky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14124873#comment-14124873
 ] 

Shay Rojansky commented on SPARK-2972:
--

Thanks for answering. I guess it's a debatable question. I admit I expected the 
context to shut itself down at application exit, a bit in the way that files 
and other resources get closed.

Note that the way the examples are currently written (pi.py), an exception 
anywhere in the code would bypass sc.stop(), and the Spark application would 
disappear without leaving a trace in the history server. For this reason, my 
scripts all wrap the application code in try/finally blocks, which seems 
like needless boilerplate that complicates life and can easily be forgotten.
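
A sketch of that boilerplate, assuming a typical standalone PySpark script:

{code:python}
# try/finally wrapper so sc.stop() runs even if the application code raises;
# without it, APPLICATION_COMPLETE is never written and the history server
# never picks up the run.
from pyspark import SparkContext

sc = SparkContext(appName="MyApp")
try:
    print(sc.parallelize(range(1000)).sum())
finally:
    sc.stop()
{code}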

Is there any specific reason not to use the application shutdown hooks 
available in python/java to close the context(s)?

 APPLICATION_COMPLETE not created in Python unless context explicitly stopped
 

 Key: SPARK-2972
 URL: https://issues.apache.org/jira/browse/SPARK-2972
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.0.2
 Environment: Cloudera 5.1, yarn master on ubuntu precise
Reporter: Shay Rojansky

 If you don't explicitly stop a SparkContext at the end of a Python 
 application with sc.stop(), an APPLICATION_COMPLETE file isn't created and 
 the job doesn't get picked up by the history server.
 This can be easily reproduced with pyspark (but affects scripts as well).
 The current workaround is to wrap the entire script with a try/finally and 
 stop manually.






[jira] [Commented] (SPARK-3183) Add option for requesting full YARN cluster

2014-08-28 Thread Shay Rojansky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14114368#comment-14114368
 ] 

Shay Rojansky commented on SPARK-3183:
--

+1.

As a current workaround for cores, we specify a number well beyond the YARN 
cluster capacity. This gets handled well by Spark/YARN, and we get the entire 
cluster.

 Add option for requesting full YARN cluster
 ---

 Key: SPARK-3183
 URL: https://issues.apache.org/jira/browse/SPARK-3183
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Reporter: Sandy Ryza

 This could possibly be in the form of --executor-cores ALL --executor-memory 
 ALL --num-executors ALL.






[jira] [Created] (SPARK-2971) Orphaned YARN ApplicationMaster lingers forever

2014-08-11 Thread Shay Rojansky (JIRA)
Shay Rojansky created SPARK-2971:


 Summary: Orphaned YARN ApplicationMaster lingers forever
 Key: SPARK-2971
 URL: https://issues.apache.org/jira/browse/SPARK-2971
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.0.2
 Environment: Python yarn client mode, Cloudera 5.1.0 on Ubuntu precise
Reporter: Shay Rojansky


We have cases where if CTRL-C is hit during a Spark job startup, a YARN 
ApplicationMaster is created but cannot connect to the driver (presumably 
because the driver has terminated). Once an AM enters this state it never exits 
it, and has to be manually killed in YARN.

Here's an excerpt from the AM logs:

{noformat}
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in 
[jar:file:/yarn/nm/usercache/roji/filecache/40/spark-assembly-1.0.2-hadoop2.2.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in 
[jar:file:/opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/lib/zookeeper/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
14/08/11 16:29:39 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
14/08/11 16:29:39 INFO SecurityManager: Changing view acls to: roji
14/08/11 16:29:39 INFO SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(roji)
14/08/11 16:29:40 INFO Slf4jLogger: Slf4jLogger started
14/08/11 16:29:40 INFO Remoting: Starting remoting
14/08/11 16:29:40 INFO Remoting: Remoting started; listening on addresses 
:[akka.tcp://sparkyar...@g024.grid.eaglerd.local:34075]
14/08/11 16:29:40 INFO Remoting: Remoting now listens on addresses: 
[akka.tcp://sparkyar...@g024.grid.eaglerd.local:34075]
14/08/11 16:29:40 INFO RMProxy: Connecting to ResourceManager at 
master.grid.eaglerd.local/192.168.41.100:8030
14/08/11 16:29:40 INFO ExecutorLauncher: ApplicationAttemptId: 
appattempt_1407759736957_0014_01
14/08/11 16:29:40 INFO ExecutorLauncher: Registering the ApplicationMaster
14/08/11 16:29:40 INFO ExecutorLauncher: Waiting for Spark driver to be 
reachable.
14/08/11 16:29:40 ERROR ExecutorLauncher: Failed to connect to driver at 
master.grid.eaglerd.local:44911, retrying ...
14/08/11 16:29:40 ERROR ExecutorLauncher: Failed to connect to driver at 
master.grid.eaglerd.local:44911, retrying ...
14/08/11 16:29:40 ERROR ExecutorLauncher: Failed to connect to driver at 
master.grid.eaglerd.local:44911, retrying ...
14/08/11 16:29:40 ERROR ExecutorLauncher: Failed to connect to driver at 
master.grid.eaglerd.local:44911, retrying ...
14/08/11 16:29:40 ERROR ExecutorLauncher: Failed to connect to driver at 
master.grid.eaglerd.local:44911, retrying ...
{noformat}






[jira] [Created] (SPARK-2972) APPLICATION_COMPLETE not created in Python unless context explicitly stopped

2014-08-11 Thread Shay Rojansky (JIRA)
Shay Rojansky created SPARK-2972:


 Summary: APPLICATION_COMPLETE not created in Python unless context 
explicitly stopped
 Key: SPARK-2972
 URL: https://issues.apache.org/jira/browse/SPARK-2972
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.0.2
 Environment: Cloudera 5.1, yarn master on ubuntu precise
Reporter: Shay Rojansky


If you don't explicitly stop a SparkContext at the end of a Python application 
with sc.stop(), an APPLICATION_COMPLETE file isn't created and the job doesn't 
get picked up by the history server.

This can be easily reproduced with pyspark (but affects scripts as well).

The current workaround is to wrap the entire script with a try/finally and stop 
manually.






[jira] [Commented] (SPARK-2945) Allow specifying num of executors in the context configuration

2014-08-10 Thread Shay Rojansky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092058#comment-14092058
 ] 

Shay Rojansky commented on SPARK-2945:
--

I just did a quick test on Spark 1.0.2, and spark.executor.instances does 
indeed appear to control the number of executors allocated (at least in YARN).

Should I keep this open for you guys to take a look and update the docs?
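
For reference, a minimal sketch of what that looks like from the context configuration (the value 8 is just an example):

{code:python}
# Setting the executor count via the Spark configuration instead of the
# spark-submit command line, as tested above.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("NumExecutorsFromConf")
        .set("spark.executor.instances", "8"))
sc = SparkContext(conf=conf)
{code}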

 Allow specifying num of executors in the context configuration
 --

 Key: SPARK-2945
 URL: https://issues.apache.org/jira/browse/SPARK-2945
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, YARN
Affects Versions: 1.0.0
 Environment: Ubuntu precise, on YARN (CDH 5.1.0)
Reporter: Shay Rojansky

 Running on YARN, the only way to specify the number of executors seems to be 
 on the command line of spark-submit, via the --num-executors switch.
 In many cases this is too early. Our Spark app receives some cmdline 
 arguments which determine the amount of work that needs to be done - and that 
 affects the number of executors it ideally requires. Ideally, the Spark 
 context configuration would support specifying this like any other config 
 param.
 Our current workaround is a wrapper script that determines how much work is 
 needed, and which itself launches spark-submit with the number passed to 
 --num-executors - it's a shame to have to do this.






[jira] [Updated] (SPARK-2960) Spark executables fail to start via symlinks

2014-08-10 Thread Shay Rojansky (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shay Rojansky updated SPARK-2960:
-

Priority: Minor  (was: Major)

 Spark executables fail to start via symlinks
 

 Key: SPARK-2960
 URL: https://issues.apache.org/jira/browse/SPARK-2960
 Project: Spark
  Issue Type: Bug
Reporter: Shay Rojansky
Priority: Minor
 Fix For: 1.0.2


 The current scripts (e.g. pyspark) fail to run when they are executed via 
 symlinks. A common Linux scenario would be to have Spark installed somewhere 
 (e.g. /opt) and have a symlink to it in /usr/bin.






[jira] [Updated] (SPARK-2960) Spark executables fail to start via symlinks

2014-08-10 Thread Shay Rojansky (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shay Rojansky updated SPARK-2960:
-

Summary: Spark executables fail to start via symlinks  (was: Spark 
executables failed to start via symlinks)

 Spark executables fail to start via symlinks
 

 Key: SPARK-2960
 URL: https://issues.apache.org/jira/browse/SPARK-2960
 Project: Spark
  Issue Type: Bug
Reporter: Shay Rojansky
 Fix For: 1.0.2


 The current scripts (e.g. pyspark) fail to run when they are executed via 
 symlinks. A common Linux scenario would be to have Spark installed somewhere 
 (e.g. /opt) and have a symlink to it in /usr/bin.






[jira] [Created] (SPARK-2960) Spark executables failed to start via symlinks

2014-08-10 Thread Shay Rojansky (JIRA)
Shay Rojansky created SPARK-2960:


 Summary: Spark executables failed to start via symlinks
 Key: SPARK-2960
 URL: https://issues.apache.org/jira/browse/SPARK-2960
 Project: Spark
  Issue Type: Bug
Reporter: Shay Rojansky
 Fix For: 1.0.2


The current scripts (e.g. pyspark) fail to run when they are executed via 
symlinks. A common Linux scenario would be to have Spark installed somewhere 
(e.g. /opt) and have a symlink to it in /usr/bin.






[jira] [Created] (SPARK-2945) Allow specifying num of executors in the context configuration

2014-08-09 Thread Shay Rojansky (JIRA)
Shay Rojansky created SPARK-2945:


 Summary: Allow specifying num of executors in the context 
configuration
 Key: SPARK-2945
 URL: https://issues.apache.org/jira/browse/SPARK-2945
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.0
 Environment: Ubuntu precise, on YARN (CDH 5.1.0)
Reporter: Shay Rojansky


Running on YARN, the only way to specify the number of executors seems to be on 
the command line of spark-submit, via the --num-executors switch.

In many cases this is too early. Our Spark app receives some cmdline arguments 
which determine the amount of work that needs to be done - and that affects the 
number of executors it ideally requires. Ideally, the Spark context 
configuration would support specifying this like any other config param.

Our current workaround is a wrapper script that determines how much work is 
needed, and which itself launches spark-submit with the number passed to 
--num-executors - it's a shame to have to do this.







[jira] [Created] (SPARK-2946) Allow specifying * for --num-executors in YARN

2014-08-09 Thread Shay Rojansky (JIRA)
Shay Rojansky created SPARK-2946:


 Summary: Allow specifying * for --num-executors in YARN
 Key: SPARK-2946
 URL: https://issues.apache.org/jira/browse/SPARK-2946
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.0
 Environment: Ubuntu precise, on YARN (CDH 5.1.0)
Reporter: Shay Rojansky
Priority: Minor


It would be useful to allow specifying --num-executors * when submitting jobs 
to YARN, and to have Spark automatically determine how many total cores are 
available in the cluster by querying YARN.

Our scenario is multiple users running research batch jobs. We never want to 
have a situation where cluster resources aren't being used, so ideally users 
would specify * and let YARN scheduling and preemption ensure fairness.






[jira] [Commented] (SPARK-2880) spark-submit processes app cmdline options

2014-08-08 Thread Shay Rojansky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14090880#comment-14090880
 ] 

Shay Rojansky commented on SPARK-2880:
--

It's indeed a duplicate of that bug, great to see that it was fixed!

Thanks Patrick.

 spark-submit processes app cmdline options
 --

 Key: SPARK-2880
 URL: https://issues.apache.org/jira/browse/SPARK-2880
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
 Environment: Cloudera 5.1 on Ubuntu precise
Reporter: Shay Rojansky
Priority: Minor
  Labels: newbie

 The usage for spark-submit is:
 Usage: spark-submit [options] <app jar | python file> [app options]
 However, when running my Python app thus:
 spark-submit test.py -v
 The -v gets picked up by spark-submit, which enters verbose mode. The correct 
 behavior seems to be for test.py to receive this parameter.
 First time using Spark and submitting, will be happy to contribute a patch if 
 this is validated as a bug.






[jira] [Created] (SPARK-2880) spark-submit processes app cmdline options

2014-08-06 Thread Shay Rojansky (JIRA)
Shay Rojansky created SPARK-2880:


 Summary: spark-submit processes app cmdline options
 Key: SPARK-2880
 URL: https://issues.apache.org/jira/browse/SPARK-2880
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
 Environment: Cloudera 5.1 on Ubuntu precise
Reporter: Shay Rojansky
Priority: Minor


The usage for spark-submit is:
Usage: spark-submit [options] <app jar | python file> [app options]

However, when running my Python app thus:
spark-submit test.py -v

The -v gets picked up by spark-submit, which enters verbose mode. The correct 
behavior seems to be for test.py to receive this parameter.

First time using Spark and submitting, will be happy to contribute a patch if 
this is validated as a bug.


