Hi, sorry in case it appeared otherwise: Mich's takes are super interesting. It is just that applying solutions in commercial undertakings is quite different from research/development scenarios.
Regards,
Gourav Sengupta

On Mon, Feb 14, 2022 at 5:02 PM ashok34...@yahoo.com.INVALID <ashok34...@yahoo.com.invalid> wrote:

Thanks Mich. Very insightful.

AK

On Monday, 14 February 2022, 11:18:19 GMT, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

Good question. However, we ought to look at what options we have, so to speak. Let us consider Spark on Dataproc, Spark on Kubernetes and Spark on Dataflow.

Spark on Dataproc <https://cloud.google.com/dataproc> is proven and in use at many organizations; I have deployed it extensively. It is infrastructure-as-a-service, providing Spark, Hadoop and other artefacts. You have to manage the cluster yourself: automate cluster creation and tear-down, submit jobs, and so on. So it is another stack that needs to be managed. It now has an autoscaling policy <https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/autoscaling> as well, which enables cluster worker VM autoscaling (a cluster-creation sketch follows below).

Spark on GKE <https://cloud.google.com/kubernetes-engine/docs/concepts/autopilot-overview> is something newer. Worth adding that the Spark dev team are working hard to improve the performance of Spark on Kubernetes, for example through Support for Customized Kubernetes Scheduler <https://docs.google.com/document/d/1xgQGRpaHQX6-QH_J9YV2C2Dh6RpXefUpLM7KGkzL6Fg>. As I explained in the first thread, Spark on Kubernetes relies on containerisation. Containers make applications more portable, they simplify the packaging of dependencies (especially with PySpark), and they enable repeatable and reliable build workflows, which is cost-effective. They also reduce the overall devops load and allow one to iterate on the code faster. From a purely cost perspective it would be cheaper with Docker *as you can share resources* with your other services. You can create Spark Docker images with different versions of Spark, Scala, Java, the OS, etc. That Docker file is portable: it can be used on-prem, on AWS, GCP, etc. in container registries, and devops and data science people can share it as well. Built once, used by many (a build sketch follows below). Kubernetes with Autopilot <https://cloud.google.com/kubernetes-engine/docs/concepts/autopilot-overview> helps scale the nodes of the Kubernetes cluster depending on the load. *That is what I am currently looking into*.

With regard to Dataflow <https://cloud.google.com/dataflow/docs>, which I believe is similar to AWS Glue <https://aws.amazon.com/glue/>, it is a managed service for executing data processing patterns. Patterns, or pipelines, are built with the Apache Beam SDK <https://beam.apache.org/documentation/runners/spark/>, an open-source programming model that supports Java, Python and Go, and it enables both batch and streaming pipelines. You create your pipelines with an Apache Beam program and then run them on the Dataflow service. The Apache Spark Runner <https://beam.apache.org/documentation/runners/spark/> can be used to execute Beam pipelines using Spark.
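First, the Dataproc sketch promised above: creating an autoscaling cluster and submitting a job might look roughly like this; a minimal sketch, where the policy file, cluster name, job script and region are all made up for illustration:

# import an autoscaling policy from a YAML definition, then attach it at cluster creation
gcloud dataproc autoscaling-policies import my-policy \
    --source=my-policy.yaml --region=europe-west2

gcloud dataproc clusters create my-cluster \
    --region=europe-west2 \
    --autoscaling-policy=my-policy

gcloud dataproc jobs submit pyspark my_etl_job.py \
    --cluster=my-cluster --region=europe-west2

The point is that policy, cluster creation and job submission are all scriptable, which is what makes the create/tear-down automation mentioned above practical.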
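Next, the image-build sketch for the Spark-on-GKE option. The Spark distribution ships a docker-image-tool.sh script for this; a minimal sketch, where the registry and tag are hypothetical:

cd $SPARK_HOME

# build the PySpark image on top of the base Spark image, then push it to a registry
./bin/docker-image-tool.sh -r gcr.io/my-project -t v3.1.1 \
    -p kubernetes/dockerfiles/spark/bindings/python/Dockerfile build

./bin/docker-image-tool.sh -r gcr.io/my-project -t v3.1.1 push

The resulting image can be pushed to any registry and reused across environments, which is the "built once, used by many" point above.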
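And for the Beam/Dataflow option, a quick sketch of running the wordcount example that ships with the Beam Python SDK on Dataflow; project, region and output bucket are placeholders:

# run the bundled Beam example against the Dataflow managed service
python -m apache_beam.examples.wordcount \
    --runner DataflowRunner \
    --project my-project \
    --region europe-west2 \
    --input gs://dataflow-samples/shakespeare/kinglear.txt \
    --output gs://my-bucket/results/output \
    --temp_location gs://my-bucket/tmp/

The same pipeline can, in principle, be pointed at the Spark Runner instead (--runner SparkRunner), which is what the Spark Runner link above refers to.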
When you run a job on Dataflow, it spins up a cluster of virtual machines, distributes the tasks in the job to the VMs, and dynamically scales the cluster based on how the job is performing. As I understand it, iterative processing, notebooks, and machine learning with Spark ML are not currently supported by Dataflow.

So we have three choices here. If you are migrating from an on-prem Hadoop/Spark/YARN set-up, you may go for Dataproc, which will provide the same look and feel. If you want to use microservices and containers in your event-driven architecture, you can adopt Docker images that run on Kubernetes clusters, including multi-cloud Kubernetes clusters. Dataflow is probably best suited for green-field projects: less operational overhead, and a unified approach for batch and streaming pipelines.

*So as ever your mileage varies*. If you want to migrate from your existing Hadoop/Spark cluster to GCP, or take advantage of your existing workforce, choose Dataproc or GKE. In many cases, a big consideration is that one already has a codebase written against a particular framework, and one just wants to deploy it on GCP. So even if, say, the Beam programming model/Dataflow is superior to Hadoop, someone with a lot of Hadoop code might still choose Dataproc or GKE for the time being, rather than rewriting their code in Beam to run on Dataflow.

HTH

view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>

https://en.everybodywiki.com/Mich_Talebzadeh

*Disclaimer:* Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

On Mon, 14 Feb 2022 at 05:46, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:

Hi, maybe this is useful in case someone is testing Spark in containers for developing Spark.

*From a production-scale work point of view:* if I am in AWS, I will just use Glue if I want to use containers for Spark, without massively increasing my costs for operations unnecessarily.

Also, in case I am not wrong, GCP already has Spark running in serverless mode. Personally, I would never create the overhead of additional costs and issues for my clients by deploying Spark when those solutions are already available from cloud vendors. In fact, that is one of the precise reasons why people use the cloud: to reduce operational costs.

Sorry, just trying to understand what the scope of this work is.

Regards,
Gourav Sengupta

On Fri, Feb 11, 2022 at 8:35 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

The equivalent of Google GKE Autopilot <https://cloud.google.com/kubernetes-engine/docs/concepts/autopilot-overview> in AWS is AWS Fargate <https://aws.amazon.com/fargate/>.

I have not used AWS Fargate, so I can only mention Google's GKE Autopilot.

This is developed from the concepts of containerisation and microservices. In the standard mode of creating a GKE cluster, users can customise their configuration based on their requirements; GKE manages the control plane, and users manually provision and manage their node infrastructure (a sketch contrasting the two modes follows below).
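For illustration, the two modes are created roughly like this; a minimal sketch, where the cluster names, zone, region and machine type are made up:

# Standard mode: you pick the machine type and node count yourself
gcloud container clusters create my-standard-cluster \
    --zone europe-west2-a \
    --machine-type e2-standard-4 \
    --num-nodes 3

# Autopilot mode: no machine type or node count; GKE decides
gcloud container clusters create-auto my-autopilot-cluster \
    --region europe-west2

Note that create-auto takes no machine-type or node-count flags at all; that is precisely the trade-off described next.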
So you choose your hardware type and memory/CPU where your Spark containers will be running, and they will be shown as VM hosts in your account. In GKE Autopilot mode, GKE manages the nodes and pre-configures the cluster with add-ons for auto-scaling, auto-upgrades, maintenance, Day 2 operations and security hardening. So there is a lot there. You don't choose your nodes and their sizes; you are effectively paying for the pods you use.

Within spark-submit, you still need to specify the number of executors, the driver and executor memory, plus the cores for each driver and executor. The theory is that the k8s cluster will deploy suitable nodes and will create enough pods on those nodes. With the standard k8s cluster you choose your nodes, and you ensure that one core on each node is reserved for the OS itself. Otherwise, if you allocate all cores to Spark with --conf spark.executor.cores, you will receive this error:

kubectl describe pods -n spark
...
Events:
  Type     Reason            Age                From               Message
  ----     ------            ----               ----               -------
  Warning  FailedScheduling  9s (x17 over 15m)  default-scheduler  0/3 nodes are available: 3 Insufficient cpu.

So with the standard k8s you have a choice of selecting your core sizes. With Autopilot, node selection is left to Autopilot to deploy suitable nodes, and this will be trial and error at the start (to get the configuration right). You may be lucky if the history of executions is kept current and the same job can be repeated. However, in my experience, getting the driver pod into a "running" state is expensive time-wise, and without an executor in a running state, there is no chance of the Spark job doing anything:

NAME                                         READY   STATUS    RESTARTS   AGE
randomdatabigquery-cebab77eea6de971-exec-1   0/1     Pending   0          31s
randomdatabigquery-cebab77eea6de971-exec-2   0/1     Pending   0          31s
randomdatabigquery-cebab77eea6de971-exec-3   0/1     Pending   0          31s
randomdatabigquery-cebab77eea6de971-exec-4   0/1     Pending   0          31s
randomdatabigquery-cebab77eea6de971-exec-5   0/1     Pending   0          31s
randomdatabigquery-cebab77eea6de971-exec-6   0/1     Pending   0          31s
sparkbq-37405a7eea6b9468-driver              1/1     Running   0          3m4s

NAME                                         READY   STATUS              RESTARTS   AGE
randomdatabigquery-cebab77eea6de971-exec-6   0/1     ContainerCreating   0          112s
sparkbq-37405a7eea6b9468-driver              1/1     Running             0          4m25s

NAME                                         READY   STATUS    RESTARTS   AGE
randomdatabigquery-cebab77eea6de971-exec-6   1/1     Running   0          114s
sparkbq-37405a7eea6b9468-driver              1/1     Running   0          4m27s

Basically, I told Spark to have 6 executors, but it could bring only one executor into a running state after the driver pod had been spinning for 4 minutes.

22/02/11 20:16:18 INFO SparkKubernetesClientFactory: Auto-configuring K8S client using current context from users K8S config file
22/02/11 20:16:19 INFO Utils: Using initial executors = 6, max of spark.dynamicAllocation.initialExecutors, spark.dynamicAllocation.minExecutors and spark.executor.instances
22/02/11 20:16:19 INFO ExecutorPodsAllocator: Going to request 3 executors from Kubernetes for ResourceProfile Id: 0, target: 6 running: 0.
22/02/11 20:16:20 INFO BasicExecutorFeatureStep: Decommissioning not enabled, skipping shutdown script
22/02/11 20:16:20 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 7079.
22/02/11 20:16:20 INFO NettyBlockTransferService: Server created on sparkbq-37405a7eea6b9468-driver-svc.spark.svc:7079
22/02/11 20:16:20 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
22/02/11 20:16:20 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, sparkbq-37405a7eea6b9468-driver-svc.spark.svc, 7079, None)
22/02/11 20:16:20 INFO BlockManagerMasterEndpoint: Registering block manager sparkbq-37405a7eea6b9468-driver-svc.spark.svc:7079 with 366.3 MiB RAM, BlockManagerId(driver, sparkbq-37405a7eea6b9468-driver-svc.spark.svc, 7079, None)
22/02/11 20:16:20 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, sparkbq-37405a7eea6b9468-driver-svc.spark.svc, 7079, None)
22/02/11 20:16:20 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, sparkbq-37405a7eea6b9468-driver-svc.spark.svc, 7079, None)
22/02/11 20:16:20 INFO Utils: Using initial executors = 6, max of spark.dynamicAllocation.initialExecutors, spark.dynamicAllocation.minExecutors and spark.executor.instances
22/02/11 20:16:20 WARN ExecutorAllocationManager: Dynamic allocation without a shuffle service is an experimental feature.
22/02/11 20:16:20 INFO BasicExecutorFeatureStep: Decommissioning not enabled, skipping shutdown script
22/02/11 20:16:20 INFO BasicExecutorFeatureStep: Decommissioning not enabled, skipping shutdown script
22/02/11 20:16:20 INFO ExecutorPodsAllocator: Going to request 3 executors from Kubernetes for ResourceProfile Id: 0, target: 6 running: 3.
22/02/11 20:16:20 INFO BasicExecutorFeatureStep: Decommissioning not enabled, skipping shutdown script
22/02/11 20:16:20 INFO BasicExecutorFeatureStep: Decommissioning not enabled, skipping shutdown script
22/02/11 20:16:20 INFO BasicExecutorFeatureStep: Decommissioning not enabled, skipping shutdown script
22/02/11 20:16:49 INFO KubernetesClusterSchedulerBackend: SchedulerBackend is ready for scheduling beginning after waiting maxRegisteredResourcesWaitingTime: 30000000000(ns)
22/02/11 20:16:49 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/opt/spark/work-dir/spark-warehouse').
22/02/11 20:16:49 INFO SharedState: Warehouse path is 'file:/opt/spark/work-dir/spark-warehouse'.

OK, there is a lot to digest here, and I would appreciate feedback from other members who have experimented with GKE Autopilot or AWS Fargate, or who are familiar with k8s.

Thanks

view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
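For reference, the kind of spark-submit invocation discussed above might look roughly like this; a minimal sketch, where the API server address, namespace, image and application file are placeholders (only the conf keys visible in the logs are taken from the thread):

spark-submit \
    --master k8s://https://<k8s-api-server>:443 \
    --deploy-mode cluster \
    --name sparkbq \
    --conf spark.kubernetes.namespace=spark \
    --conf spark.kubernetes.container.image=gcr.io/my-project/spark-py:3.1.1 \
    --conf spark.executor.instances=6 \
    --conf spark.executor.cores=3 \
    --conf spark.executor.memory=1g \
    --conf spark.driver.memory=1g \
    --conf spark.dynamicAllocation.enabled=true \
    local:///opt/spark/work-dir/my_app.py

On a standard cluster with 4-vCPU nodes, spark.executor.cores=3 leaves a core for the OS and avoids the "Insufficient cpu" scheduling error shown earlier; on Autopilot, the same requests are instead what GKE uses to decide which nodes to provision.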