[jira] [Commented] (SPARK-32151) Kafka does not allow Partition Rebalance Handling

2020-07-16 Thread Ed Mitchell (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17159330#comment-17159330
 ] 

Ed Mitchell commented on SPARK-32151:
-

Yeah - I think you're right. It's going to be in the write ahead logs or the 
checkpoint.

> Kafka does not allow Partition Rebalance Handling
> -
>
> Key: SPARK-32151
> URL: https://issues.apache.org/jira/browse/SPARK-32151
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 2.4.5
>Reporter: Ed Mitchell
>Priority: Minor
>
> When a consumer group rebalance occurs when the Spark driver is using the 
> Subscribe or Subscribe Pattern ConsumerStrategy, driver's offsets are cleared 
> when partitions are revoked and then reassigned.
> While this doesn't happen in the normal rebalance scenario of more consumers 
> joining the group (though it could), it does happen when the partition leader 
> is reelected because of a Kafka node being stopped or decommissioned.
> This seems to only occur when users specify their own offsets and do not use 
> Kafka as the persistent store of offsets (they use their own database, and 
> possibly if using checkpointing).
> This could probably affect Structured Streaming.
> This presents itself as an "NoOffsetForPartitionException":
> {code:java}
> 20/05/13 01:37:00 ERROR JobScheduler: Error generating jobs for time 
> 158933382 ms
> org.apache.kafka.clients.consumer.NoOffsetForPartitionException: Undefined 
> offset with no reset policy for partitions: [production-ad-metrics-1, 
> production-ad-metrics-2, production-ad-metrics-0, production-ad-metrics-5, 
> production-ad-metrics-6, production-ad-metrics-3, production-ad-metrics-4, 
> production-ad-metrics-7]
>   at 
> org.apache.kafka.clients.consumer.internals.SubscriptionState.resetMissingPositions(SubscriptionState.java:391)
>   at 
> org.apache.kafka.clients.consumer.KafkaConsumer.updateFetchPositions(KafkaConsumer.java:2185)
>   at 
> org.apache.kafka.clients.consumer.KafkaConsumer.updateAssignmentMetadataIfNeeded(KafkaConsumer.java:1222)
>   at 
> org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1181)
>   at 
> org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1115)
>   at 
> org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.paranoidPoll(DirectKafkaInputDStream.scala:172)
>   at 
> org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.latestOffsets(DirectKafkaInputDStream.scala:191)
>   at 
> org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:228)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:342)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:342)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:341)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:341)
>   at 
> org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:416)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:336)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:334)
>   at scala.Option.orElse(Option.scala:289)
>   at 
> org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:331)
>   at 
> org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:48)
>   at 
> org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:122)
>   at 
> org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:121)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
>   at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
>   at 
> org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:121)
>   at 
> org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:249)
>   at 
> org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:247)
>   at scala.util.Try$.apply(Try.scala:192)
>   at 
> org.apache.spark.streaming.scheduler.JobGenerator.generateJobs(JobGener

[jira] [Commented] (SPARK-32151) Kafka does not allow Partition Rebalance Handling

2020-07-16 Thread Ed Mitchell (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17159294#comment-17159294
 ] 

Ed Mitchell commented on SPARK-32151:
-

I could do that if I wanted to start back at the beginning or the end of the 
topic, but in this case, I would like it to restart back at the offsets defined 
by my datastore.

> Kafka does not allow Partition Rebalance Handling
> -
>
> Key: SPARK-32151
> URL: https://issues.apache.org/jira/browse/SPARK-32151
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 2.4.5
>Reporter: Ed Mitchell
>Priority: Minor
>
> When a consumer group rebalance occurs when the Spark driver is using the 
> Subscribe or Subscribe Pattern ConsumerStrategy, driver's offsets are cleared 
> when partitions are revoked and then reassigned.
> While this doesn't happen in the normal rebalance scenario of more consumers 
> joining the group (though it could), it does happen when the partition leader 
> is reelected because of a Kafka node being stopped or decommissioned.
> This seems to only occur when users specify their own offsets and do not use 
> Kafka as the persistent store of offsets (they use their own database, and 
> possibly if using checkpointing).
> This could probably affect Structured Streaming.
> This presents itself as an "NoOffsetForPartitionException":
> {code:java}
> 20/05/13 01:37:00 ERROR JobScheduler: Error generating jobs for time 
> 158933382 ms
> org.apache.kafka.clients.consumer.NoOffsetForPartitionException: Undefined 
> offset with no reset policy for partitions: [production-ad-metrics-1, 
> production-ad-metrics-2, production-ad-metrics-0, production-ad-metrics-5, 
> production-ad-metrics-6, production-ad-metrics-3, production-ad-metrics-4, 
> production-ad-metrics-7]
>   at 
> org.apache.kafka.clients.consumer.internals.SubscriptionState.resetMissingPositions(SubscriptionState.java:391)
>   at 
> org.apache.kafka.clients.consumer.KafkaConsumer.updateFetchPositions(KafkaConsumer.java:2185)
>   at 
> org.apache.kafka.clients.consumer.KafkaConsumer.updateAssignmentMetadataIfNeeded(KafkaConsumer.java:1222)
>   at 
> org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1181)
>   at 
> org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1115)
>   at 
> org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.paranoidPoll(DirectKafkaInputDStream.scala:172)
>   at 
> org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.latestOffsets(DirectKafkaInputDStream.scala:191)
>   at 
> org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:228)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:342)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:342)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:341)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:341)
>   at 
> org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:416)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:336)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:334)
>   at scala.Option.orElse(Option.scala:289)
>   at 
> org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:331)
>   at 
> org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:48)
>   at 
> org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:122)
>   at 
> org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:121)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
>   at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
>   at 
> org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:121)
>   at 
> org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:249)
>   at 
> org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:247)
>   at scala.util.Try$.apply(Try.scala:

[jira] [Created] (SPARK-32151) Kafka does not allow Partition Rebalance Handling

2020-07-01 Thread Ed Mitchell (Jira)
Ed Mitchell created SPARK-32151:
---

 Summary: Kafka does not allow Partition Rebalance Handling
 Key: SPARK-32151
 URL: https://issues.apache.org/jira/browse/SPARK-32151
 Project: Spark
  Issue Type: Improvement
  Components: DStreams
Affects Versions: 2.4.5
Reporter: Ed Mitchell


When a consumer group rebalance occurs when the Spark driver is using the 
Subscribe or Subscribe Pattern ConsumerStrategy, driver's offsets are cleared 
when partitions are revoked and then reassigned.

While this doesn't happen in the normal rebalance scenario of more consumers 
joining the group (though it could), it does happen when the partition leader 
is reelected because of a Kafka node being stopped or decommissioned.

This seems to only occur when users specify their own offsets and do not use 
Kafka as the persistent store of offsets (they use their own database, and 
possibly if using checkpointing).

This could probably affect Structured Streaming.

This presents itself as an "NoOffsetForPartitionException":
{noformat}
20/05/13 01:37:00 ERROR JobScheduler: Error generating jobs for time 
158933382 
msorg.apache.kafka.clients.consumer.NoOffsetForPartitionException: Undefined 
offset with no reset policy for partitions: [production-ad-metrics-1, 
production-ad-metrics-2, production-ad-metrics-0, production-ad-metrics-5, 
production-ad-metrics-6, production-ad-metrics-3, production-ad-metrics-4, 
production-ad-metrics-7]  at 
org.apache.kafka.clients.consumer.internals.SubscriptionState.resetMissingPositions(SubscriptionState.java:391)
  at 
org.apache.kafka.clients.consumer.KafkaConsumer.updateFetchPositions(KafkaConsumer.java:2185)
  at 
org.apache.kafka.clients.consumer.KafkaConsumer.updateAssignmentMetadataIfNeeded(KafkaConsumer.java:1222)
  at 
org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1181)  
at 
org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1115)  
at 
org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.paranoidPoll(DirectKafkaInputDStream.scala:172)
  at 
org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.latestOffsets(DirectKafkaInputDStream.scala:191)
  at 
org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:228)
  at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:342)
  at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:342)
  at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)  at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:341)
  at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:341)
  at 
org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:416)
  at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:336)
  at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:334)
  at scala.Option.orElse(Option.scala:289)  at 
org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:331)  at 
org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:48)
  at 
org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:122)
  at 
org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:121)
  at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
  at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
  at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)  
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)  at 
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)  at 
scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)  at 
org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:121)  
at 
org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:249)
  at 
org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:247)
  at scala.util.Try$.apply(Try.scala:192)  at 
org.apache.spark.streaming.scheduler.JobGenerator.generateJobs(JobGenerator.scala:247)
  at 
org.apache.spark.streaming.scheduler.JobGenerator.org$apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:183)
  at 
org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:89)
  at 
org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:88)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49){noformat}
This can be fixed by allowing the user to specify an
{code:java}
org.

[jira] [Updated] (SPARK-32151) Kafka does not allow Partition Rebalance Handling

2020-07-01 Thread Ed Mitchell (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ed Mitchell updated SPARK-32151:

Description: 
When a consumer group rebalance occurs when the Spark driver is using the 
Subscribe or Subscribe Pattern ConsumerStrategy, driver's offsets are cleared 
when partitions are revoked and then reassigned.

While this doesn't happen in the normal rebalance scenario of more consumers 
joining the group (though it could), it does happen when the partition leader 
is reelected because of a Kafka node being stopped or decommissioned.

This seems to only occur when users specify their own offsets and do not use 
Kafka as the persistent store of offsets (they use their own database, and 
possibly if using checkpointing).

This could probably affect Structured Streaming.

This presents itself as an "NoOffsetForPartitionException":
{code:java}
20/05/13 01:37:00 ERROR JobScheduler: Error generating jobs for time 
158933382 ms
org.apache.kafka.clients.consumer.NoOffsetForPartitionException: Undefined 
offset with no reset policy for partitions: [production-ad-metrics-1, 
production-ad-metrics-2, production-ad-metrics-0, production-ad-metrics-5, 
production-ad-metrics-6, production-ad-metrics-3, production-ad-metrics-4, 
production-ad-metrics-7]
  at 
org.apache.kafka.clients.consumer.internals.SubscriptionState.resetMissingPositions(SubscriptionState.java:391)
  at 
org.apache.kafka.clients.consumer.KafkaConsumer.updateFetchPositions(KafkaConsumer.java:2185)
  at 
org.apache.kafka.clients.consumer.KafkaConsumer.updateAssignmentMetadataIfNeeded(KafkaConsumer.java:1222)
  at 
org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1181)
  at 
org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1115)
  at 
org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.paranoidPoll(DirectKafkaInputDStream.scala:172)
  at 
org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.latestOffsets(DirectKafkaInputDStream.scala:191)
  at 
org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:228)
  at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:342)
  at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:342)
  at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
  at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:341)
  at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:341)
  at 
org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:416)
  at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:336)
  at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:334)
  at scala.Option.orElse(Option.scala:289)
  at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:331)
  at 
org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:48)
  at 
org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:122)
  at 
org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:121)
  at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
  at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
  at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
  at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
  at 
org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:121)
  at 
org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:249)
  at 
org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:247)
  at scala.util.Try$.apply(Try.scala:192)
  at 
org.apache.spark.streaming.scheduler.JobGenerator.generateJobs(JobGenerator.scala:247)
  at 
org.apache.spark.streaming.scheduler.JobGenerator.org$apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:183)
  at 
org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:89)
  at 
org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:88)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
{code}
This can be fixed by allowing the user to specify an
{code:java}
org.apache.kafka.clients.consumer.ConsumerRebalanceListener{code}
in the KafkaConsumer#subscribe method.

The documentation for ConsumerRebalanceListener states that you can use 
KafkaConsumer

[jira] [Commented] (SPARK-30055) Allow configurable restart policy of driver and executor pods

2020-04-20 Thread Ed Mitchell (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17087857#comment-17087857
 ] 

Ed Mitchell commented on SPARK-30055:
-

I agree with this. Having Never defaulted limits the flexibility that allows 
Kubernetes to restart pods if they run out of memory or terminate in some 
undefined way.

You can also access logs of previously restarted containers by doing: 
{noformat}
kubectl -n  logs  --previous{noformat}
I understand not wanting to set "Always" to the Executor pod, to allow Spark to 
control graceful termination of executors, but shouldn't we at least set it to 
"OnFailure", to allow OOMKilled executors to come back up?

As far as the driver is concerned, our client mode setup has the driver pod 
living as a deployment, which means the restart policy is Always. No reason we 
can't allow Always or OnFailure in the driver restart policy imo.

> Allow configurable restart policy of driver and executor pods
> -
>
> Key: SPARK-30055
> URL: https://issues.apache.org/jira/browse/SPARK-30055
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: Kevin Hogeland
>Priority: Major
>
> The current Kubernetes scheduler hard-codes the restart policy for all pods 
> to be "Never". To restart a failed application, all pods have to be deleted 
> and rescheduled, which is very slow and clears any caches the processes may 
> have built. Spark should allow a configurable restart policy for both drivers 
> and executors for immediate restart of crashed/killed drivers/executors as 
> long as the pods are not evicted. (This is not about eviction resilience, 
> that's described in this issue: SPARK-23980)
> Also, as far as I can tell, there's no reason the executors should be set to 
> never restart. Should that be configurable or should it just be changed to 
> Always?
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31482) spark.kubernetes.driver.podTemplateFile Configuration not used by the job

2020-04-19 Thread Ed Mitchell (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17087195#comment-17087195
 ] 

Ed Mitchell commented on SPARK-31482:
-

Yeah. It's actually there but there's some unescaped HTML that was fixed in a 
later commit:

[https://github.com/apache/spark/commit/44e314edb4b86ca3a8622124539073397dbe68de#diff-b5527f236b253e0d9f5db5164bdb43e9]

 

> spark.kubernetes.driver.podTemplateFile Configuration not used by the job
> -
>
> Key: SPARK-31482
> URL: https://issues.apache.org/jira/browse/SPARK-31482
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Pradeep Misra
>Priority: Blocker
>
> Spark 3.0 - Running Spark Submit as below and point to a MinKube cluster
> {code:java}
> bin/spark-submit \
>  --master k8s://https://192.168.99.102:8443 \
>  --deploy-mode cluster \
>  --name spark-pi \
>  --class org.apache.spark.examples.SparkPi \
>  --conf spark.kubernetes.driver.podTemplateFile=../driver_1E.template \
>  --conf spark.kubernetes.executor.podTemplateFile=../executor.template \
>  --conf spark.kubernetes.container.image=spark:spark3 \
>  local:///opt/spark/examples/jars/spark-examples_2.12-3.0.0-preview2.jar 1
> {code}
>  
> Spark Binaries - spark-3.0.0-preview2-bin-hadoop2.7.tgz
> Driver Template - 
> {code:java}
> apiVersion: v1
> kind: Pod
> metadata:
>   labels:
>     spark-app-id: my-custom-id
>   annotations:
>     spark-driver-cpu: 1
>     spark-driver-mem: 1
>     spark-executor-cpu: 1
>     spark-executor-mem: 1
>     spark-executor-count: 1
> spec:
>   schedulerName: spark-scheduler{code}
>  Executor Template
>  
> {code:java}
> apiVersion: v1
> kind: Pod
> metadata:
>   labels:
>     spark-app-id: my-custom-id
> spec:
>   schedulerName: spark-scheduler{code}
> Kubernetes Pods Launched - Two Executor Pods were launched which was default
> {code:java}
> spark-pi-e608e7718f11cc69-driver   1/1     Running     0          10s
> spark-pi-e608e7718f11cc69-exec-1   1/1     Running     0          5s
> spark-pi-e608e7718f11cc69-exec-2   1/1     Running     0          5s{code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31482) spark.kubernetes.driver.podTemplateFile Configuration not used by the job

2020-04-19 Thread Ed Mitchell (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17087150#comment-17087150
 ] 

Ed Mitchell commented on SPARK-31482:
-

I'm not sure where the bug is here TBH. I don't see anything in the 
documentation or the Spark code that implies that setting annotations on the 
Driver Pod is a valid substitute for setting "spark.executor.memory", 
"spark.executor.instances", or "spark.executor.cores" on the Spark Submit 
command.

> spark.kubernetes.driver.podTemplateFile Configuration not used by the job
> -
>
> Key: SPARK-31482
> URL: https://issues.apache.org/jira/browse/SPARK-31482
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Pradeep Misra
>Priority: Blocker
>
> Spark 3.0 - Running Spark Submit as below and point to a MinKube cluster
> {code:java}
> bin/spark-submit \
>  --master k8s://https://192.168.99.102:8443 \
>  --deploy-mode cluster \
>  --name spark-pi \
>  --class org.apache.spark.examples.SparkPi \
>  --conf spark.kubernetes.driver.podTemplateFile=../driver_1E.template \
>  --conf spark.kubernetes.executor.podTemplateFile=../executor.template \
>  --conf spark.kubernetes.container.image=spark:spark3 \
>  local:///opt/spark/examples/jars/spark-examples_2.12-3.0.0-preview2.jar 1
> {code}
>  
> Spark Binaries - spark-3.0.0-preview2-bin-hadoop2.7.tgz
> Driver Template - 
> {code:java}
> apiVersion: v1
> kind: Pod
> metadata:
>   labels:
>     spark-app-id: my-custom-id
>   annotations:
>     spark-driver-cpu: 1
>     spark-driver-mem: 1
>     spark-executor-cpu: 1
>     spark-executor-mem: 1
>     spark-executor-count: 1
> spec:
>   schedulerName: spark-scheduler{code}
>  Executor Template
>  
> {code:java}
> apiVersion: v1
> kind: Pod
> metadata:
>   labels:
>     spark-app-id: my-custom-id
> spec:
>   schedulerName: spark-scheduler{code}
> Kubernetes Pods Launched - Two Executor Pods were launched which was default
> {code:java}
> spark-pi-e608e7718f11cc69-driver   1/1     Running     0          10s
> spark-pi-e608e7718f11cc69-exec-1   1/1     Running     0          5s
> spark-pi-e608e7718f11cc69-exec-2   1/1     Running     0          5s{code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org