Hi,

I would still not build any custom solution and, if on GCP, would use serverless Dataproc. I also think it is always better to be hands-on with AWS Glue before commenting on it.
Regards,
Gourav Sengupta

On Mon, Feb 14, 2022 at 11:18 AM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> Good question. However, we ought to look at what options we have, so to speak.
>
> Let us consider Spark on Dataproc, Spark on Kubernetes and Spark on Dataflow.
>
> Spark on Dataproc <https://cloud.google.com/dataproc> is proven and in use at many organizations; I have deployed it extensively. It is infrastructure provided as a service, including Spark, Hadoop and other artefacts. You have to manage the cluster lifecycle yourself: automating cluster creation and tear-down, submitting jobs and so on. In other words, it is another stack that needs to be managed. It now has an autoscaling policy <https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/autoscaling> (which enables cluster worker VM autoscaling) as well.
>
> Spark on GKE <https://cloud.google.com/kubernetes-engine/docs/concepts/autopilot-overview> is something newer. Worth adding that the Spark DEV team are working hard to improve the performance of Spark on Kubernetes, for example through Support for Customized Kubernetes Scheduler <https://docs.google.com/document/d/1xgQGRpaHQX6-QH_J9YV2C2Dh6RpXefUpLM7KGkzL6Fg>. As I explained in the first thread, Spark on Kubernetes relies on containerisation. Containers make applications more portable. Moreover, they simplify the packaging of dependencies, especially with PySpark, and enable repeatable and reliable build workflows, which is cost-effective. They also reduce the overall devops load and allow one to iterate on the code faster. From a purely cost perspective it would be cheaper with Docker, *as you can share resources* with your other services. You can create a Spark Docker image with different versions of Spark, Scala, Java, the OS etc. That Dockerfile is portable: it can be used on-prem, on AWS, on GCP etc. in container registries, and devops and data science people can share it as well. Built once, used by many.
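[Editor's note] A portable Spark image of the kind described above can be sketched roughly as follows. The base image, versions and paths are illustrative assumptions only; pin whatever Spark/Java/Python combination your jobs need:

```dockerfile
# Illustrative sketch of a shareable Spark image (versions are assumptions)
FROM python:3.9-slim
ARG SPARK_VERSION=3.2.1
ARG HADOOP_VERSION=3.2
# Spark needs a JRE; curl fetches the Spark distribution
RUN apt-get update && \
    apt-get install -y --no-install-recommends openjdk-11-jre-headless curl && \
    rm -rf /var/lib/apt/lists/*
RUN curl -fsSL "https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz" \
    | tar -xz -C /opt && \
    ln -s "/opt/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}" /opt/spark
ENV SPARK_HOME=/opt/spark \
    PATH="${PATH}:/opt/spark/bin"
# Bake PySpark dependencies in once; every consumer of the image inherits them
COPY requirements.txt /tmp/requirements.txt
RUN pip install --no-cache-dir -r /tmp/requirements.txt
```

Note that the Spark distribution itself also ships `bin/docker-image-tool.sh`, which builds and pushes the reference JVM/PySpark/SparkR images, so a hand-rolled Dockerfile like the above is only needed when you want tighter control over the base image and dependencies.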
> Kubernetes with Autopilot <https://cloud.google.com/kubernetes-engine/docs/concepts/autopilot-overview#:~:text=Autopilot%20is%20a%20new%20mode,and%20yield%20higher%20workload%20availability.> helps scale the nodes of the Kubernetes cluster depending on the load. *That is what I am currently looking into.*
>
> With regard to Dataflow <https://cloud.google.com/dataflow/docs>, which I believe is similar to AWS Glue <https://aws.amazon.com/glue/?whats-new-cards.sort-by=item.additionalFields.postDateTime&whats-new-cards.sort-order=desc>, it is a managed service for executing data processing patterns. Patterns, or pipelines, are built with the Apache Beam SDK <https://beam.apache.org/documentation/runners/spark/>, an open-source programming model that supports Java, Python and Go and enables both batch and streaming pipelines. You create your pipelines with an Apache Beam program and then run them on the Dataflow service. The Apache Spark Runner <https://beam.apache.org/documentation/runners/spark/#:~:text=The%20Apache%20Spark%20Runner%20can,Beam%20pipelines%20using%20Apache%20Spark.&text=The%20Spark%20Runner%20executes%20Beam,same%20security%20features%20Spark%20provides.> can be used to execute Beam pipelines using Spark. When you run a job on Dataflow, it spins up a cluster of virtual machines, distributes the tasks in the job to the VMs, and dynamically scales the cluster based on how the job is performing. As I understand it, iterative processing, notebooks and machine learning with Spark ML are not currently supported by Dataflow.
>
> So we have three choices here. If you are migrating from an on-prem Hadoop/Spark/YARN set-up, you may go for Dataproc, which will provide the same look and feel. If you want to use microservices and containers in your event-driven architecture, you can adopt Docker images that run on Kubernetes clusters, including multi-cloud Kubernetes clusters. Dataflow is probably best suited to green-field projects: less operational overhead and a unified approach for batch and streaming pipelines.
>
> *So, as ever, your mileage varies.* If you want to migrate from your existing Hadoop/Spark cluster to GCP, or take advantage of your existing workforce, choose Dataproc or GKE. In many cases a big consideration is that one already has a codebase written against a particular framework and just wants to deploy it on GCP, so even if, say, the Beam programming model/Dataflow is superior to Hadoop, someone with a lot of Hadoop code might still choose Dataproc or GKE for the time being, rather than rewriting their code on Beam to run on Dataflow.
>
> HTH
>
> view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
> https://en.everybodywiki.com/Mich_Talebzadeh
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>
> On Mon, 14 Feb 2022 at 05:46, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
>
>> Hi,
>> Maybe this is useful in case someone is testing Spark in containers for developing Spark.
>>
>> *From a production-scale point of view:* if I am in AWS, I will just use Glue if I want to use containers for Spark, without massively increasing my operational costs unnecessarily.
>>
>> Also, in case I am not wrong, GCP already has Spark running in serverless mode. Personally, I would never burden my clients with the additional costs and issues of deploying Spark when those solutions are already available from the cloud vendors. In fact, that is one of the precise reasons why people use the cloud: to reduce operational costs.
>> Sorry, I am just trying to understand the scope of this work.
>>
>> Regards,
>> Gourav Sengupta
>>
>> On Fri, Feb 11, 2022 at 8:35 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>
>>> The equivalent of Google GKE Autopilot <https://cloud.google.com/kubernetes-engine/docs/concepts/autopilot-overview> in AWS is AWS Fargate <https://aws.amazon.com/fargate/>.
>>>
>>> I have not used AWS Fargate, so I can only mention Google's GKE Autopilot.
>>>
>>> This is developed from the concepts of containerisation and microservices. In the standard mode of creating a GKE cluster, users can customise their configuration based on their requirements: GKE manages the control plane, and users manually provision and manage their node infrastructure. So you choose the hardware type and memory/CPU where your Spark containers will run, and the nodes are shown as VM hosts in your account. In GKE Autopilot mode, GKE manages the nodes and pre-configures the cluster with add-ons for auto-scaling, auto-upgrades, maintenance, Day 2 operations and security hardening. So there is a lot there. You don't choose your nodes and their sizes; you are effectively paying for the pods you use.
>>>
>>> Within spark-submit, you still need to specify the number of executors, plus the memory and cores for the driver and each executor. The theory is that the k8s cluster will deploy suitable nodes and create enough pods on those nodes. With a standard k8s cluster you choose your nodes, and you ensure that one core on each node is reserved for the OS itself. Otherwise, if you allocate all cores to Spark with --conf spark.executor.cores, you will receive this error:
>>>
>>> kubectl describe pods -n spark
>>>
>>> ...
>>> Events:
>>>   Type     Reason            Age                From               Message
>>>   ----     ------            ----               ----               -------
>>>   Warning  FailedScheduling  9s (x17 over 15m)  default-scheduler  0/3 nodes are available: 3 Insufficient cpu.
>>>
>>> So with standard k8s you have a choice of selecting your core sizes. With Autopilot the node selection is left to Autopilot to deploy suitable nodes, and getting the configuration right will be trial and error at the start. You may be lucky if the history of executions is kept current and the same job can be repeated. However, in my experience, getting the driver pod into the "Running" state is expensive time-wise, and without an executor in the running state there is no chance of the Spark job doing anything:
>>>
>>> NAME                                         READY   STATUS    RESTARTS   AGE
>>> randomdatabigquery-cebab77eea6de971-exec-1   0/1     Pending   0          31s
>>> randomdatabigquery-cebab77eea6de971-exec-2   0/1     Pending   0          31s
>>> randomdatabigquery-cebab77eea6de971-exec-3   0/1     Pending   0          31s
>>> randomdatabigquery-cebab77eea6de971-exec-4   0/1     Pending   0          31s
>>> randomdatabigquery-cebab77eea6de971-exec-5   0/1     Pending   0          31s
>>> randomdatabigquery-cebab77eea6de971-exec-6   0/1     Pending   0          31s
>>> sparkbq-37405a7eea6b9468-driver              1/1     Running   0          3m4s
>>>
>>> NAME                                         READY   STATUS              RESTARTS   AGE
>>> randomdatabigquery-cebab77eea6de971-exec-6   0/1     ContainerCreating   0          112s
>>> sparkbq-37405a7eea6b9468-driver              1/1     Running             0          4m25s
>>>
>>> NAME                                         READY   STATUS    RESTARTS   AGE
>>> randomdatabigquery-cebab77eea6de971-exec-6   1/1     Running   0          114s
>>> sparkbq-37405a7eea6b9468-driver              1/1     Running   0          4m27s
>>>
>>> Basically, I told Spark to have 6 executors, but it could only bring one executor into the running state after the driver pod had been spinning for 4 minutes.
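[Editor's note] The sizing reasoning above (leave one core per node for the OS/kubelet, and remember that the executor pod's memory request is the heap plus Spark's default ~10% overhead, `spark.kubernetes.memoryOverheadFactor`) can be sketched as a back-of-the-envelope helper. This is a hypothetical illustration, not anything Spark ships:

```python
def executor_spec(node_cores: int, node_mem_gib: int) -> tuple[int, int]:
    """Rule of thumb: reserve one core and ~1 GiB per node for the OS and
    kubelet, then size the heap so that heap * 1.10 (Spark's default 10%
    non-heap pod overhead) still fits on the node."""
    usable_cores = node_cores - 1      # one core reserved for the OS
    usable_mem = node_mem_gib - 1      # ~1 GiB reserved for system daemons
    heap_gib = int(usable_mem / 1.10)  # pod memory request = heap * 1.10
    return usable_cores, heap_gib

cores, mem = executor_spec(node_cores=4, node_mem_gib=16)
# prints: --conf spark.executor.cores=3 --conf spark.executor.memory=13g
print(f"--conf spark.executor.cores={cores} --conf spark.executor.memory={mem}g")
```

With a 4-core, 16 GiB node this suggests 3 executor cores and a 13 GiB heap, which avoids the `Insufficient cpu` scheduling failure shown above.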
>>> 22/02/11 20:16:18 INFO SparkKubernetesClientFactory: Auto-configuring K8S client using current context from users K8S config file
>>> 22/02/11 20:16:19 INFO Utils: Using initial executors = 6, max of spark.dynamicAllocation.initialExecutors, spark.dynamicAllocation.minExecutors and spark.executor.instances
>>> 22/02/11 20:16:19 INFO ExecutorPodsAllocator: Going to request 3 executors from Kubernetes for ResourceProfile Id: 0, target: 6 running: 0.
>>> 22/02/11 20:16:20 INFO BasicExecutorFeatureStep: Decommissioning not enabled, skipping shutdown script
>>> 22/02/11 20:16:20 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 7079.
>>> 22/02/11 20:16:20 INFO NettyBlockTransferService: Server created on sparkbq-37405a7eea6b9468-driver-svc.spark.svc:7079
>>> 22/02/11 20:16:20 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
>>> 22/02/11 20:16:20 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, sparkbq-37405a7eea6b9468-driver-svc.spark.svc, 7079, None)
>>> 22/02/11 20:16:20 INFO BlockManagerMasterEndpoint: Registering block manager sparkbq-37405a7eea6b9468-driver-svc.spark.svc:7079 with 366.3 MiB RAM, BlockManagerId(driver, sparkbq-37405a7eea6b9468-driver-svc.spark.svc, 7079, None)
>>> 22/02/11 20:16:20 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, sparkbq-37405a7eea6b9468-driver-svc.spark.svc, 7079, None)
>>> 22/02/11 20:16:20 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, sparkbq-37405a7eea6b9468-driver-svc.spark.svc, 7079, None)
>>> 22/02/11 20:16:20 INFO Utils: Using initial executors = 6, max of spark.dynamicAllocation.initialExecutors, spark.dynamicAllocation.minExecutors and spark.executor.instances
>>> 22/02/11 20:16:20 WARN ExecutorAllocationManager: Dynamic allocation without a shuffle service is an experimental feature.
>>> 22/02/11 20:16:20 INFO BasicExecutorFeatureStep: Decommissioning not enabled, skipping shutdown script
>>> 22/02/11 20:16:20 INFO BasicExecutorFeatureStep: Decommissioning not enabled, skipping shutdown script
>>> 22/02/11 20:16:20 INFO ExecutorPodsAllocator: Going to request 3 executors from Kubernetes for ResourceProfile Id: 0, target: 6 running: 3.
>>> 22/02/11 20:16:20 INFO BasicExecutorFeatureStep: Decommissioning not enabled, skipping shutdown script
>>> 22/02/11 20:16:20 INFO BasicExecutorFeatureStep: Decommissioning not enabled, skipping shutdown script
>>> 22/02/11 20:16:20 INFO BasicExecutorFeatureStep: Decommissioning not enabled, skipping shutdown script
>>> 22/02/11 20:16:49 INFO KubernetesClusterSchedulerBackend: SchedulerBackend is ready for scheduling beginning after waiting maxRegisteredResourcesWaitingTime: 30000000000(ns)
>>> 22/02/11 20:16:49 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/opt/spark/work-dir/spark-warehouse').
>>> 22/02/11 20:16:49 INFO SharedState: Warehouse path is 'file:/opt/spark/work-dir/spark-warehouse'.
>>>
>>> OK, there is a lot to digest here, and I would appreciate feedback from other members who have experimented with GKE Autopilot or AWS Fargate, or who are familiar with k8s.
>>>
>>> Thanks
>>>
>>> view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>