Re: Spark Kubernetes Architecture: Deployments vs Pods that create Pods

2019-01-30 Thread Li Gao
Hi Wilson,

As Yinan said, batch jobs with dynamic scaling requirements and
communication between the driver and executors do not fit into the
service-oriented Deployment paradigm of k8s. Hence we need to abstract
these Spark-specific differences into a k8s CRD and a CRD controller that
manage the lifecycle of Spark batch jobs on k8s:
https://github.com/GoogleCloudPlatform/spark-on-k8s-operator. The CRD makes
the Spark job more k8s-compliant and repeatable.
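
For reference, a SparkApplication object for the operator looks roughly like
the minimal sketch below. The API version, image name, file paths, and
resource sizes are illustrative placeholders and vary with the operator and
Spark versions you run:

apiVersion: "sparkoperator.k8s.io/v1beta1"
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: default
spec:
  type: Scala
  mode: cluster
  image: "my-registry/spark:2.4.0"        # placeholder Spark image
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples.jar"
  sparkVersion: "2.4.0"
  restartPolicy:
    type: Never                           # batch job: run once, no restarts
  driver:
    cores: 1
    memory: "512m"
    serviceAccount: spark                 # assumes a 'spark' service account exists
  executor:
    cores: 1
    instances: 2
    memory: "512m"

The operator's controller watches these objects and creates and manages the
driver and executor Pods for you, which is what makes the job repeatable.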

As you discovered, a Deployment is typically used for job-server-style
services.

-Li


On Tue, Jan 29, 2019 at 1:49 PM Yinan Li  wrote:

> Hi Wilson,
>
> The behavior of a Deployment doesn't fit the way Spark executor pods are
> run and managed. For example, executor pods are created and deleted
> dynamically at the request of the driver, and they normally run to
> completion. A Deployment assumes uniformity and statelessness of the set of
> Pods it manages, which is not necessarily the case for Spark executors: for
> example, executor Pods have unique executor IDs. Dynamic resource
> allocation also doesn't play well with a Deployment, because growing or
> shrinking the number of executor Pods requires a rolling update, which
> means restarting all the executor Pods. In Kubernetes mode, the driver is
> effectively a custom controller of executor Pods: it creates and deletes
> executor Pods on demand and watches their status.
>
> The way Flink on Kubernetes works, as you said, is basically running the
> Flink job/task managers using Deployments. An equivalent would be running a
> standalone Spark cluster on top of Kubernetes. If you want auto-restart for
> Spark streaming jobs, I would suggest you take a look at the K8S Spark
> Operator.
>
> On Tue, Jan 29, 2019 at 5:53 AM WILSON Frank <
> frank.wil...@uk.thalesgroup.com> wrote:
>
>> Hi,
>>
>>
>>
>> I’ve been playing around with Spark Kubernetes deployments over the past
>> week and I’m curious to know why Spark deploys as a driver pod that creates
>> more worker pods.
>>
>>
>>
>> I’ve read that it’s normal to use Kubernetes Deployments to create a
>> distributed service, so I am wondering why Spark just creates Pods. I
>> suppose the driver program is ‘the odd one out’ so it doesn’t belong in a
>> Deployment or ReplicaSet, but maybe the workers could be a Deployment? Is
>> this something to do with data locality?
>>
>>
>>
>> I haven’t tried streaming pipelines on Kubernetes yet. Are these also Pods
>> that create Pods rather than Deployments? It seems more important for a
>> streaming pipeline to be ‘durable’[1], as the Kubernetes documentation
>> might say.
>>
>>
>>
>> I ask this question partly because the Kubernetes deployment of Spark is
>> still experimental and I am wondering whether this aspect of the deployment
>> might change.
>>
>>
>>
>> I had a look at the Flink[2] documentation and it does seem to use
>> Deployments; however, these seem to be lightweight job/task managers that
>> accept Flink jobs. It actually sounds like running a lightweight version of
>> YARN inside containers on Kubernetes.
>>
>>
>>
>>
>>
>> Thanks,
>>
>>
>>
>>
>>
>> Frank
>>
>>
>>
>> [1]
>> https://kubernetes.io/docs/concepts/workloads/pods/pod/#durability-of-pods-or-lack-thereof
>>
>> [2]
>> https://ci.apache.org/projects/flink/flink-docs-stable/ops/deployment/kubernetes.html
>>
>


Re: Spark Kubernetes Architecture: Deployments vs Pods that create Pods

2019-01-29 Thread Yinan Li
Hi Wilson,

The behavior of a Deployment doesn't fit the way Spark executor pods are
run and managed. For example, executor pods are created and deleted
dynamically at the request of the driver, and they normally run to
completion. A Deployment assumes uniformity and statelessness of the set of
Pods it manages, which is not necessarily the case for Spark executors: for
example, executor Pods have unique executor IDs. Dynamic resource
allocation also doesn't play well with a Deployment, because growing or
shrinking the number of executor Pods requires a rolling update, which
means restarting all the executor Pods. In Kubernetes mode, the driver is
effectively a custom controller of executor Pods: it creates and deletes
executor Pods on demand and watches their status.
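
To make the controller analogy concrete, an executor Pod created by the
driver in Kubernetes mode looks roughly like the sketch below; the names,
labels, image, and resource numbers are illustrative and vary by Spark
version:

apiVersion: v1
kind: Pod
metadata:
  name: my-app-exec-1                 # each executor Pod gets a unique name
  labels:
    spark-role: executor
    spark-exec-id: "1"                # unique executor ID, so Pods are not interchangeable
  ownerReferences:                    # owned by the driver Pod, not by a Deployment/ReplicaSet
  - apiVersion: v1
    kind: Pod
    name: my-app-driver
    uid: <driver-pod-uid>             # placeholder; set by the driver when it creates the Pod
spec:
  restartPolicy: Never                # executors run to completion; no controller restarts them
  containers:
  - name: executor
    image: my-spark-image:2.4.0       # placeholder executor image
    resources:
      requests:
        cpu: "1"
        memory: 2Gi

Because the executors are owned by the driver Pod, deleting the driver
garbage-collects them, which fits the run-to-completion lifecycle described
above.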

The way Flink on Kubernetes works, as you said, is basically running the
Flink job/task managers using Deployments. An equivalent would be running a
standalone Spark cluster on top of Kubernetes. If you want auto-restart for
Spark streaming jobs, I would suggest you take a look at the K8S Spark
Operator.
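
For comparison, that Flink-style alternative with a standalone Spark cluster
on Kubernetes would put the workers behind a Deployment along the lines of
the sketch below. The image, master URL, and resource sizes are placeholders,
not a tested manifest:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: spark-worker
spec:
  replicas: 3                          # a fixed pool of interchangeable workers
  selector:
    matchLabels:
      app: spark-worker
  template:
    metadata:
      labels:
        app: spark-worker
    spec:
      containers:
      - name: worker
        image: my-spark-image:2.4.0    # placeholder standalone Spark image
        command: ["/opt/spark/bin/spark-class"]
        args: ["org.apache.spark.deploy.worker.Worker", "spark://spark-master:7077"]
        resources:
          requests:
            cpu: "1"
            memory: 2Gi

Here the workers really are uniform and long-running, so Deployment semantics
fit; the trade-off is that Kubernetes no longer schedules per-application
executors for you.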

On Tue, Jan 29, 2019 at 5:53 AM WILSON Frank <
frank.wil...@uk.thalesgroup.com> wrote:

> Hi,
>
>
>
> I’ve been playing around with Spark Kubernetes deployments over the past
> week and I’m curious to know why Spark deploys as a driver pod that creates
> more worker pods.
>
>
>
> I’ve read that it’s normal to use Kubernetes Deployments to create a
> distributed service, so I am wondering why Spark just creates Pods. I
> suppose the driver program is ‘the odd one out’ so it doesn’t belong in a
> Deployment or ReplicaSet, but maybe the workers could be a Deployment? Is
> this something to do with data locality?
>
>
>
> I haven’t tried streaming pipelines on Kubernetes yet. Are these also Pods
> that create Pods rather than Deployments? It seems more important for a
> streaming pipeline to be ‘durable’[1], as the Kubernetes documentation
> might say.
>
>
>
> I ask this question partly because the Kubernetes deployment of Spark is
> still experimental and I am wondering whether this aspect of the deployment
> might change.
>
>
>
> I had a look at the Flink[2] documentation and it does seem to use
> Deployments; however, these seem to be lightweight job/task managers that
> accept Flink jobs. It actually sounds like running a lightweight version of
> YARN inside containers on Kubernetes.
>
>
>
>
>
> Thanks,
>
>
>
>
>
> Frank
>
>
>
> [1]
> https://kubernetes.io/docs/concepts/workloads/pods/pod/#durability-of-pods-or-lack-thereof
>
> [2]
> https://ci.apache.org/projects/flink/flink-docs-stable/ops/deployment/kubernetes.html
>


Spark Kubernetes Architecture: Deployments vs Pods that create Pods

2019-01-29 Thread WILSON Frank
Hi,

I've been playing around with Spark Kubernetes deployments over the past week 
and I'm curious to know why Spark deploys as a driver pod that creates more 
worker pods.

I've read that it's normal to use Kubernetes Deployments to create a
distributed service, so I am wondering why Spark just creates Pods. I suppose
the driver program is 'the odd one out' so it doesn't belong in a Deployment
or ReplicaSet, but maybe the workers could be a Deployment? Is this something
to do with data locality?

I haven't tried streaming pipelines on Kubernetes yet. Are these also Pods
that create Pods rather than Deployments? It seems more important for a
streaming pipeline to be 'durable'[1], as the Kubernetes documentation might say.

I ask this question partly because the Kubernetes deployment of Spark is still 
experimental and I am wondering whether this aspect of the deployment might 
change.

I had a look at the Flink[2] documentation and it does seem to use Deployments;
however, these seem to be lightweight job/task managers that accept Flink jobs.
It actually sounds like running a lightweight version of YARN inside containers
on Kubernetes.


Thanks,


Frank

[1] 
https://kubernetes.io/docs/concepts/workloads/pods/pod/#durability-of-pods-or-lack-thereof
[2] 
https://ci.apache.org/projects/flink/flink-docs-stable/ops/deployment/kubernetes.html