Hi, sorry in case it appeared otherwise: Mich's takes are super interesting. It is just that applying solutions in commercial undertakings is quite different from research/development scenarios.
Regards,
Gourav Sengupta

On Mon, Feb 14, 2022 at 5:02 PM ashok34...@yahoo.com.INVALID <ashok34...@yahoo.com.invalid> wrote:

Thanks Mich. Very insightful.

AK

On Monday, 14 February 2022, 11:18:19 GMT, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

Good question. However, we ought to look at what options we have, so to speak. Let us consider Spark on Dataproc, Spark on Kubernetes and Spark on Dataflow.

Spark on Dataproc <https://cloud.google.com/dataproc> is proven and in use at many organizations; I have deployed it extensively. It is infrastructure-as-a-service, providing Spark, Hadoop and other artefacts. You have to manage the cluster yourself: automate cluster creation and tear-down, submit jobs, and so on. So it is another stack that needs to be managed. It now has an autoscaling policy <https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/autoscaling> as well, which enables cluster worker VM autoscaling (a cluster-creation sketch follows below).

Spark on GKE <https://cloud.google.com/kubernetes-engine/docs/concepts/autopilot-overview> is something newer. Worth adding that the Spark dev team are working hard to improve the performance of Spark on Kubernetes, for example through Support for Customized Kubernetes Scheduler <https://docs.google.com/document/d/1xgQGRpaHQX6-QH_J9YV2C2Dh6RpXefUpLM7KGkzL6Fg>. As I explained in the first thread, Spark on Kubernetes relies on containerisation. Containers make applications more portable, they simplify the packaging of dependencies (especially with PySpark), and they enable repeatable and reliable build workflows, which is cost-effective. They also reduce the overall devops load and allow one to iterate on the code faster. From a purely cost perspective it would be cheaper with Docker *as you can share resources* with your other services. You can create Spark Docker images with different versions of Spark, Scala, Java, the OS, etc. That Docker file is portable: it can be used on-prem, on AWS, GCP, etc. in container registries, and devops and data science people can share it as well. Built once, used by many (a build sketch follows below). Kubernetes with Autopilot <https://cloud.google.com/kubernetes-engine/docs/concepts/autopilot-overview> helps scale the nodes of the Kubernetes cluster depending on the load. *That is what I am currently looking into*.

With regard to Dataflow <https://cloud.google.com/dataflow/docs>, which I believe is similar to AWS Glue <https://aws.amazon.com/glue/>, it is a managed service for executing data processing patterns. Patterns, or pipelines, are built with the Apache Beam SDK <https://beam.apache.org/documentation/runners/spark/>, an open-source programming model that supports Java, Python and Go, and it enables both batch and streaming pipelines. You create your pipelines with an Apache Beam program and then run them on the Dataflow service. The Apache Spark Runner <https://beam.apache.org/documentation/runners/spark/> can be used to execute Beam pipelines using Spark.
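First, the Dataproc sketch promised above: creating an autoscaling cluster and submitting a job might look roughly like this; a minimal sketch, where the policy file, cluster name, job script and region are all made up for illustration:

# import an autoscaling policy from a YAML definition, then attach it at cluster creation
gcloud dataproc autoscaling-policies import my-policy \
    --source=my-policy.yaml --region=europe-west2

gcloud dataproc clusters create my-cluster \
    --region=europe-west2 \
    --autoscaling-policy=my-policy

gcloud dataproc jobs submit pyspark my_etl_job.py \
    --cluster=my-cluster --region=europe-west2

The point is that policy, cluster creation and job submission are all scriptable, which is what makes the create/tear-down automation mentioned above practical.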
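Next, the image-build sketch for the Spark-on-GKE option. The Spark distribution ships a docker-image-tool.sh script for this; a minimal sketch, where the registry and tag are hypothetical:

cd $SPARK_HOME

# build the PySpark image on top of the base Spark image, then push it to a registry
./bin/docker-image-tool.sh -r gcr.io/my-project -t v3.1.1 \
    -p kubernetes/dockerfiles/spark/bindings/python/Dockerfile build

./bin/docker-image-tool.sh -r gcr.io/my-project -t v3.1.1 push

The resulting image can be pushed to any registry and reused across environments, which is the "built once, used by many" point above.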
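And for the Beam/Dataflow option, a quick sketch of running the wordcount example that ships with the Beam Python SDK on Dataflow; project, region and output bucket are placeholders:

# run the bundled Beam example against the Dataflow managed service
python -m apache_beam.examples.wordcount \
    --runner DataflowRunner \
    --project my-project \
    --region europe-west2 \
    --input gs://dataflow-samples/shakespeare/kinglear.txt \
    --output gs://my-bucket/results/output \
    --temp_location gs://my-bucket/tmp/

The same pipeline can, in principle, be pointed at the Spark Runner instead (--runner SparkRunner), which is what the Spark Runner link above refers to.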
When you run a job on Dataflow, it spins up a cluster of virtual machines, distributes the tasks in the job to the VMs, and dynamically scales the cluster based on how the job is performing. As I understand it, iterative processing, notebooks, and machine learning with Spark ML are not currently supported by Dataflow.

So we have three choices here. If you are migrating from an on-prem Hadoop/Spark/YARN set-up, you may go for Dataproc, which will provide the same look and feel. If you want to use microservices and containers in your event-driven architecture, you can adopt Docker images that run on Kubernetes clusters, including multi-cloud Kubernetes clusters. Dataflow is probably best suited for green-field projects: less operational overhead, and a unified approach for batch and streaming pipelines.

*So as ever your mileage varies*. If you want to migrate from your existing Hadoop/Spark cluster to GCP, or take advantage of your existing workforce, choose Dataproc or GKE. In many cases, a big consideration is that one already has a codebase written against a particular framework, and one just wants to deploy it on GCP. So even if, say, the Beam programming model/Dataflow is superior to Hadoop, someone with a lot of Hadoop code might still choose Dataproc or GKE for the time being, rather than rewriting their code in Beam to run on Dataflow.

HTH

view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>

https://en.everybodywiki.com/Mich_Talebzadeh

*Disclaimer:* Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

On Mon, 14 Feb 2022 at 05:46, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:

Hi, maybe this is useful in case someone is testing Spark in containers for developing Spark.

*From a production-scale work point of view:* if I am in AWS, I will just use Glue if I want to use containers for Spark, without massively increasing my costs for operations unnecessarily.

Also, in case I am not wrong, GCP already has Spark running in serverless mode. Personally, I would never create the overhead of additional costs and issues for my clients by deploying Spark when those solutions are already available from cloud vendors. In fact, that is one of the precise reasons why people use the cloud: to reduce operational costs.

Sorry, just trying to understand what the scope of this work is.

Regards,
Gourav Sengupta

On Fri, Feb 11, 2022 at 8:35 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

The equivalent of Google GKE Autopilot <https://cloud.google.com/kubernetes-engine/docs/concepts/autopilot-overview> in AWS is AWS Fargate <https://aws.amazon.com/fargate/>.

I have not used AWS Fargate, so I can only mention Google's GKE Autopilot.

This is developed from the concepts of containerisation and microservices. In the standard mode of creating a GKE cluster, users can customise their configuration based on their requirements; GKE manages the control plane, and users manually provision and manage their node infrastructure (a sketch contrasting the two modes follows below).
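For illustration, the two modes are created roughly like this; a minimal sketch, where the cluster names, zone, region and machine type are made up:

# Standard mode: you pick the machine type and node count yourself
gcloud container clusters create my-standard-cluster \
    --zone europe-west2-a \
    --machine-type e2-standard-4 \
    --num-nodes 3

# Autopilot mode: no machine type or node count; GKE decides
gcloud container clusters create-auto my-autopilot-cluster \
    --region europe-west2

Note that create-auto takes no machine-type or node-count flags at all; that is precisely the trade-off described next.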
So you choose your hardware type and memory/CPU where your Spark containers will be running, and they will be shown as VM hosts in your account. In GKE Autopilot mode, GKE manages the nodes and pre-configures the cluster with add-ons for auto-scaling, auto-upgrades, maintenance, Day 2 operations and security hardening. So there is a lot there. You don't choose your nodes and their sizes; you are effectively paying for the pods you use.

Within spark-submit, you still need to specify the number of executors, the driver and executor memory, plus the cores for each driver and executor. The theory is that the k8s cluster will deploy suitable nodes and will create enough pods on those nodes. With the standard k8s cluster you choose your nodes, and you ensure that one core on each node is reserved for the OS itself. Otherwise, if you allocate all cores to Spark with --conf spark.executor.cores, you will receive this error:

kubectl describe pods -n spark
...
Events:
  Type     Reason            Age                From               Message
  ----     ------            ----               ----               -------
  Warning  FailedScheduling  9s (x17 over 15m)  default-scheduler  0/3 nodes are available: 3 Insufficient cpu.

So with the standard k8s you have a choice of selecting your core sizes. With Autopilot, node selection is left to Autopilot to deploy suitable nodes, and this will be trial and error at the start (to get the configuration right). You may be lucky if the history of executions is kept current and the same job can be repeated. However, in my experience, getting the driver pod into a "running" state is expensive time-wise, and without an executor in a running state, there is no chance of the Spark job doing anything:

NAME                                         READY   STATUS    RESTARTS   AGE
randomdatabigquery-cebab77eea6de971-exec-1   0/1     Pending   0          31s
randomdatabigquery-cebab77eea6de971-exec-2   0/1     Pending   0          31s
randomdatabigquery-cebab77eea6de971-exec-3   0/1     Pending   0          31s
randomdatabigquery-cebab77eea6de971-exec-4   0/1     Pending   0          31s
randomdatabigquery-cebab77eea6de971-exec-5   0/1     Pending   0          31s
randomdatabigquery-cebab77eea6de971-exec-6   0/1     Pending   0          31s
sparkbq-37405a7eea6b9468-driver              1/1     Running   0          3m4s

NAME                                         READY   STATUS              RESTARTS   AGE
randomdatabigquery-cebab77eea6de971-exec-6   0/1     ContainerCreating   0          112s
sparkbq-37405a7eea6b9468-driver              1/1     Running             0          4m25s

NAME                                         READY   STATUS    RESTARTS   AGE
randomdatabigquery-cebab77eea6de971-exec-6   1/1     Running   0          114s
sparkbq-37405a7eea6b9468-driver              1/1     Running   0          4m27s

Basically, I told Spark to have 6 executors, but it could bring only one executor into a running state after the driver pod had been spinning for 4 minutes.

22/02/11 20:16:18 INFO SparkKubernetesClientFactory: Auto-configuring K8S client using current context from users K8S config file
22/02/11 20:16:19 INFO Utils: Using initial executors = 6, max of spark.dynamicAllocation.initialExecutors, spark.dynamicAllocation.minExecutors and spark.executor.instances
22/02/11 20:16:19 INFO ExecutorPodsAllocator: Going to request 3 executors from Kubernetes for ResourceProfile Id: 0, target: 6 running: 0.
22/02/11 20:16:20 INFO BasicExecutorFeatureStep: Decommissioning not enabled, skipping shutdown script
22/02/11 20:16:20 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 7079.
22/02/11 20:16:20 INFO NettyBlockTransferService: Server created on sparkbq-37405a7eea6b9468-driver-svc.spark.svc:7079
22/02/11 20:16:20 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
22/02/11 20:16:20 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, sparkbq-37405a7eea6b9468-driver-svc.spark.svc, 7079, None)
22/02/11 20:16:20 INFO BlockManagerMasterEndpoint: Registering block manager sparkbq-37405a7eea6b9468-driver-svc.spark.svc:7079 with 366.3 MiB RAM, BlockManagerId(driver, sparkbq-37405a7eea6b9468-driver-svc.spark.svc, 7079, None)
22/02/11 20:16:20 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, sparkbq-37405a7eea6b9468-driver-svc.spark.svc, 7079, None)
22/02/11 20:16:20 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, sparkbq-37405a7eea6b9468-driver-svc.spark.svc, 7079, None)
22/02/11 20:16:20 INFO Utils: Using initial executors = 6, max of spark.dynamicAllocation.initialExecutors, spark.dynamicAllocation.minExecutors and spark.executor.instances
22/02/11 20:16:20 WARN ExecutorAllocationManager: Dynamic allocation without a shuffle service is an experimental feature.
22/02/11 20:16:20 INFO BasicExecutorFeatureStep: Decommissioning not enabled, skipping shutdown script
22/02/11 20:16:20 INFO BasicExecutorFeatureStep: Decommissioning not enabled, skipping shutdown script
22/02/11 20:16:20 INFO ExecutorPodsAllocator: Going to request 3 executors from Kubernetes for ResourceProfile Id: 0, target: 6 running: 3.
22/02/11 20:16:20 INFO BasicExecutorFeatureStep: Decommissioning not enabled, skipping shutdown script
22/02/11 20:16:20 INFO BasicExecutorFeatureStep: Decommissioning not enabled, skipping shutdown script
22/02/11 20:16:20 INFO BasicExecutorFeatureStep: Decommissioning not enabled, skipping shutdown script
22/02/11 20:16:49 INFO KubernetesClusterSchedulerBackend: SchedulerBackend is ready for scheduling beginning after waiting maxRegisteredResourcesWaitingTime: 30000000000(ns)
22/02/11 20:16:49 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/opt/spark/work-dir/spark-warehouse').
22/02/11 20:16:49 INFO SharedState: Warehouse path is 'file:/opt/spark/work-dir/spark-warehouse'.

OK, there is a lot to digest here, and I would appreciate feedback from other members who have experimented with GKE Autopilot or AWS Fargate, or who are familiar with k8s.

Thanks

view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
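For reference, the kind of spark-submit invocation discussed above might look roughly like this; a minimal sketch, where the API server address, namespace, image and application file are placeholders (only the conf keys visible in the logs are taken from the thread):

spark-submit \
    --master k8s://https://<k8s-api-server>:443 \
    --deploy-mode cluster \
    --name sparkbq \
    --conf spark.kubernetes.namespace=spark \
    --conf spark.kubernetes.container.image=gcr.io/my-project/spark-py:3.1.1 \
    --conf spark.executor.instances=6 \
    --conf spark.executor.cores=3 \
    --conf spark.executor.memory=1g \
    --conf spark.driver.memory=1g \
    --conf spark.dynamicAllocation.enabled=true \
    local:///opt/spark/work-dir/my_app.py

On a standard cluster with 4-vCPU nodes, spark.executor.cores=3 leaves a core for the OS and avoids the "Insufficient cpu" scheduling error shown earlier; on Autopilot, the same requests are instead what GKE uses to decide which nodes to provision.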