Hi Dinakar,
Hi Mich, posting you my comments.

Right, you seem to have an on-premise Hadoop cluster of 9 physical boxes and you want to deploy Spark on it. *My comment: Yes.*

What spec do you have for each physical host: memory, CPU and disk space? *My comment: I am not sure of the exact numbers, but all I can say is that there is enough space to deploy a few more tools across the 9 nodes.* --> Well, you can get it from the infrastructure guys. In all probability you have 9 data nodes, some DL380 or better, with >= 64 GB of RAM and quad core or better. Space does not matter for now.

You can take advantage of what is known as data affinity by putting your compute layer (Spark) on the same Hadoop nodes. *My comment: I am not aware of this, need to check on these lines.* --> A better term is data locality, and it is still useful in Hadoop clusters where Spark is installed. In a nutshell, "Spark is a data parallel processing framework, which means it will execute tasks as close to where the data lives as possible (i.e. minimize data transfer <https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/performance_optimization/data_locality.html>)."

Having Hadoop implies that you have a YARN resource manager already, plus HDFS. YARN is the most widely used resource manager on-premise for Spark. *My comment: we haven't installed Spark in our cluster.* --> When you install Hadoop you install Hadoop Core, which has these artefacts:

1. Hadoop Distributed File System (HDFS)
2. MapReduce
3. YARN resource management

OK, what does Spark offer? Generally, parallel architecture comes into play when the data size is significantly large and cannot be handled on a single machine; hence the use of Spark becomes meaningful. In cases where the (generated) data size is going to be very large (which is often the norm rather than the exception these days), the data cannot be processed and stored in Pandas data frames, as these data frames store data in RAM.
Then the whole dataset cannot be collected from a storage layer like HDFS or cloud storage, because it would take significant time and space and probably won't fit in a single machine's RAM. So you are replacing MapReduce on disk with its equivalent in memory. Think of Spark as MapReduce on steroids, sort of :).

Additional information:
==================
Agenda:

1. Implementation of Apache Mesos or Apache Hadoop YARN, including a Spark service in cluster mode, so that if I submit PySpark or Spark Scala jobs with "deploy-mode = cluster", they should work. --> Either way, with YARN it will work in cluster or client deployment mode. This is from the article <https://www.linkedin.com/pulse/real-time-processing-trade-data-kafka-flume-spark-talebzadeh-ph-d-/>:

The term *deployment mode of Spark* simply means "where the driver program will run". There are two modes, namely *Spark client mode* <https://spark.apache.org/docs/latest/running-on-yarn.html> and *Spark cluster mode* <https://spark.apache.org/docs/latest/cluster-overview.html>. These are described below:

In client mode, *the driver daemon runs on the node through which you submit the Spark job to your cluster.* This is often done through the edge node. This mode is valuable when you want to use Spark interactively, like in our case where we would like to display high-value prices in the dashboard. In client mode you do not need to reserve any resources from your cluster for the driver daemon.

In cluster mode, *you submit the Spark job to your cluster and the driver daemon runs inside your cluster in the application master.* In this mode you do not get to use the Spark job interactively, as the client through which you submit the job is gone as soon as it has successfully submitted the job to the cluster. You will have to reserve some resources for the driver daemon process, as it will be running in your cluster.

2. This implementation should be in docker containers. --> Why?
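As an aside, the two deployment modes under item 1 map directly onto the --deploy-mode flag of spark-submit. A minimal sketch, assuming Spark is installed alongside YARN; my_app.py and the executor sizing are placeholders, not recommendations:

```shell
# Client mode: the driver runs on the submitting (edge) node,
# so output comes back to your terminal -- good for interactive work.
spark-submit --master yarn --deploy-mode client my_app.py

# Cluster mode: the driver runs inside the cluster in the application
# master; the shell returns once YARN accepts the job. The executor
# count and memory below are illustrative only.
spark-submit --master yarn --deploy-mode cluster \
  --num-executors 4 --executor-memory 4G my_app.py
```

These are command-line sketches rather than something to copy verbatim, since they need a live YARN cluster with Spark on it.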
Are you going to deploy Spark on Kubernetes? Whoever is insisting on this is presumably thinking of portability.

3. I have to write Dockerfiles with Apache Hadoop with YARN and Spark (open source only). How can I do this? --> I know it is fashionable to deploy Spark on Kubernetes (Docker inside pods) for resilience and scalability, but you need to ask whoever is requesting this to justify having Spark inside Docker as opposed to Spark running alongside Hadoop on-premise.

4. To implement Apache Mesos with Spark and deployment mode = cluster: if you have any kind of documentation or web links or knowledge of your own, could you give it to me? It will really help me a lot.

5. We have services like MinIO, Trino, Superset, Jupyter, and so on.

--> OK, we will cross the bridge for 4 and 5 when there is justification for 2. The alternative is for Spark to be installed on each, or a few designated, Hadoop nodes on your physical hosts on-premise.

Kindly help me to accomplish this. Let me know what else you need.

Thanks,
Dinakar

On Sun, Jul 25, 2021 at 10:35 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> Hi,
>
> Right, you seem to have an on-premise Hadoop cluster of 9 physical boxes
> and you want to deploy Spark on it.
>
> What spec do you have for each physical host: memory, CPU and disk space?
>
> You can take what is known as data affinity by putting your compute layers
> (Spark) on the same Hadoop nodes.
>
> Having Hadoop implies that you have a YARN resource manager already plus
> HDFS. YARN is the most widely used resource manager on-premise for Spark.
>
> Provide some additional info and we go from there.
>
> HTH
>
> view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
> On Sat, 24 Jul 2021 at 13:46, Dinakar Chennubotla <chennu.bigd...@gmail.com> wrote:
>
>> Hi All,
>>
>> I am Dinakar, Hadoop admin.
>> Could someone help me here?
>>
>> 1. I have a DEV-POC task to do.
>> 2. I need to install a distributed Apache Spark cluster, in cluster mode, on Docker containers,
>> 3. with scalable Spark worker containers.
>> 4. We have a 9-node cluster with some other services or tools.
>>
>> Thanks,
>> Dinakar
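Coming back to item 3: only if the container requirement is justified, a very rough starting point might look like the sketch below. The base image, version numbers and download URLs are my assumptions and must be verified against current Apache releases; this is not a production-ready image.

```dockerfile
# Sketch only: a single image bundling the Hadoop and Spark open source
# tarballs. Versions are placeholders -- check archive.apache.org first.
FROM eclipse-temurin:8-jdk

ENV HADOOP_VERSION=3.2.2 \
    SPARK_VERSION=3.1.2

RUN apt-get update && apt-get install -y --no-install-recommends curl && \
    curl -fSL "https://archive.apache.org/dist/hadoop/common/hadoop-${HADOOP_VERSION}/hadoop-${HADOOP_VERSION}.tar.gz" \
      | tar -xz -C /opt && \
    curl -fSL "https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop3.2.tgz" \
      | tar -xz -C /opt && \
    rm -rf /var/lib/apt/lists/*

ENV HADOOP_HOME=/opt/hadoop-${HADOOP_VERSION} \
    SPARK_HOME=/opt/spark-${SPARK_VERSION}-bin-hadoop3.2
# Spark on YARN finds the cluster through the Hadoop config directory.
ENV PATH="${PATH}:${HADOOP_HOME}/bin:${SPARK_HOME}/bin" \
    HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop \
    YARN_CONF_DIR=${HADOOP_HOME}/etc/hadoop
```

Your cluster's real core-site.xml and yarn-site.xml would still have to be mounted or copied into HADOOP_CONF_DIR at run time so that spark-submit inside the container can reach your YARN resource manager.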