Hi,

I've been playing around with Spark on Kubernetes over the past week, and 
I'm curious why Spark deploys as a driver pod that then creates further 
executor (worker) pods.
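
For reference, here's a minimal sketch of the kind of job I've been running, 
just to make sure I've understood the model (the master URL, image name and 
executor count below are placeholders, not my real setup):

    import org.apache.spark.sql.SparkSession

    // Minimal sketch: the driver connects straight to the Kubernetes API
    // server; the master URL, image and executor count are placeholders.
    val spark = SparkSession.builder()
      .appName("k8s-test")
      .master("k8s://https://kube-apiserver:6443")
      .config("spark.kubernetes.container.image", "spark:latest")
      .config("spark.executor.instances", "2")
      .getOrCreate()

As I understand it, the driver's scheduler backend then creates the two 
executor Pods itself via the API server, rather than a Deployment or 
ReplicaSet managing them.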

I've read that it's normal to use Kubernetes Deployments to create a 
distributed service, so I am wondering why Spark just creates bare Pods. I 
suppose the driver program is 'the odd one out', so it doesn't belong in a 
Deployment or ReplicaSet, but perhaps the workers could be managed by a 
Deployment? Is this something to do with data locality?

I haven't tried streaming pipelines on Kubernetes yet; are those also Pods 
that create Pods, rather than Deployments? It seems more important for a 
streaming pipeline to be 'durable'[1], as the Kubernetes documentation might 
put it.

I ask this question partly because Spark's Kubernetes support is still 
experimental, and I am wondering whether this aspect of the deployment might 
change.

I had a look at the Flink[2] documentation, and it does seem to use 
Deployments; however, these appear to be lightweight job/task managers that 
accept Flink jobs. It actually sounds like running a lightweight version of 
YARN inside containers on Kubernetes.


Thanks,


Frank

[1] https://kubernetes.io/docs/concepts/workloads/pods/pod/#durability-of-pods-or-lack-thereof
[2] https://ci.apache.org/projects/flink/flink-docs-stable/ops/deployment/kubernetes.html
