Hi Mich,

Before going to my team and conveying to them the things you said, I would like to have a call with you. Can I have a call with you for a few minutes to discuss the same?
Thanks,
Dinakar

On Mon, Jul 26, 2021 at 1:43 AM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> Hi Dinakar,
>
> Hi Mich,
> Posting my comments below.
>
> Right, you seem to have an on-premise Hadoop cluster of 9 physical boxes and you want to deploy Spark on it.
> *My comment: Yes.*
>
> What spec do you have for each physical host: memory, CPU and disk space?
> *My comment: I am not sure of the exact numbers, but all I can say is that there is enough space to deploy a few more tools across the 9 nodes.*
>
> --> Well, you can get it from the infrastructure guys. In all probability you have 9 data nodes, some DL380 or better, with >= 64 GB of RAM and quad core or better. Space does not matter for now.
>
> You can take advantage of what is known as data affinity by putting your compute layer (Spark) on the same Hadoop nodes.
> *My comment: I am not aware of this; I need to check along these lines.*
>
> --> A better term is data locality, and it is still useful in Hadoop clusters where Spark is installed. In a nutshell, "Spark is a data parallel processing framework, which means it will execute tasks as close to where the data lives as possible (i.e. minimize data transfer <https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/performance_optimization/data_locality.html>)."
>
> Having Hadoop implies that you have a YARN resource manager already, plus HDFS. YARN is the most widely used resource manager on-premise for Spark.
> *My comment: we haven't installed Spark in our cluster.*
>
> --> When you install Hadoop you install Hadoop Core, which has these artefacts:
>
> 1. Hadoop Distributed File System (HDFS)
> 2. MapReduce
> 3. YARN resource management
>
> OK, so what does Spark offer?
>
> Generally, a parallel architecture comes into play when the data size is significantly large and cannot be handled on a single machine; hence, the use of Spark becomes meaningful. In cases where the (generated) data size is going to be very large (which is often the norm rather than the exception these days), the data cannot be processed and stored in Pandas data frames, as these data frames store data in RAM. The whole dataset cannot simply be collected from a storage layer like HDFS or cloud storage, because it would take significant time and space and probably would not fit in a single machine's RAM.
>
> So you are replacing MapReduce on disk with its equivalent in memory. Think of Spark as MapReduce on steroids, sort of :).
>
> Additional information:
> ==================
> Agenda:
> 1. Implementation of Apache Mesos or Apache Hadoop YARN, including a Spark service with cluster mode, so that if I submit PySpark or Spark Scala jobs with "deploy-mode = cluster", they should work.
>
> --> Either way, with YARN it will work in cluster or client deployment mode. This is from the article <https://www.linkedin.com/pulse/real-time-processing-trade-data-kafka-flume-spark-talebzadeh-ph-d-/>: The term *deployment mode of Spark* simply means "where the driver program will run". There are two ways, namely *Spark Client Mode* <https://spark.apache.org/docs/latest/running-on-yarn.html> and *Spark Cluster Mode* <https://spark.apache.org/docs/latest/cluster-overview.html>. These are described below.
>
> In Client mode, *the driver daemon runs in the node through which you submit the Spark job to your cluster.* This is often done through the Edge Node. This mode is valuable when you want to use Spark interactively, like in our case where we would like to display high-value prices in the dashboard. In Client mode you do not want to reserve any resources from your cluster for the driver daemon.
>
> In Cluster mode, *you submit the Spark job to your cluster and the driver daemon runs inside your cluster, in the application master.* In this mode you do not get to use the Spark job interactively, as the client through which you submit the job is gone as soon as it successfully submits the job to the cluster. You will have to reserve some resources for the driver daemon process, as it will be running in your cluster.
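To make the cluster-mode path concrete, here is a minimal sketch (not taken from the thread): a trivial PySpark job plus the spark-submit invocation that runs its driver inside YARN as the application master. The file name, executor count and memory figures are illustrative assumptions, not values from Dinakar's cluster.

    # pi_on_yarn.py -- minimal PySpark job for a YARN cluster-mode submission (illustrative sketch).
    #
    # Submit from an edge node where HADOOP_CONF_DIR points at the cluster configuration, e.g.:
    #   spark-submit --master yarn --deploy-mode cluster \
    #     --num-executors 4 --executor-memory 4g --executor-cores 2 \
    #     pi_on_yarn.py
    # (executor count and sizes are placeholders; tune them to the real node specs)

    from operator import add
    from random import random

    from pyspark.sql import SparkSession

    def inside(_):
        # Sample a random point in the unit square; count it if it falls inside the quarter circle.
        x, y = random(), random()
        return 1 if x * x + y * y <= 1.0 else 0

    if __name__ == "__main__":
        spark = SparkSession.builder.appName("pi-on-yarn").getOrCreate()
        n = 1_000_000
        count = spark.sparkContext.parallelize(range(n), 10).map(inside).reduce(add)
        print("Pi is roughly %f" % (4.0 * count / n))
        spark.stop()

In cluster mode the print output ends up in the YARN driver logs (retrievable with "yarn logs -applicationId <application id>") rather than on the submitting terminal, which is exactly the interactivity trade-off described above.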
> 2. This implementation should be in Docker containers.
>
> --> Why? Are you going to deploy Spark on Kubernetes? Is whoever is insisting on this thinking of portability?
>
> 3. I have to write Dockerfiles with Apache Hadoop with YARN and Spark (open source only). How can I do this?
>
> --> I know it is fashionable to deploy Spark on Kubernetes (Docker inside pods) for resilience and scalability, but you need to ask whoever is requesting this to justify having Spark inside Docker as opposed to Spark running alongside Hadoop on-premise.
>
> 4. To implement Apache Mesos with Spark and deploy mode = cluster: if you have any kind of documentation, web links or knowledge, could you share it with me? It will really help me a lot.
>
> 5. We have services like MinIO, Trino, Superset, Jupyter, and so on.
>
> --> OK, we will cross the bridge for 4 and 5 when there is justification for 2. The alternative is for Spark to be installed on each node, or a few designated nodes, of Hadoop on your physical hosts on-premise.
>
> Kindly help me to accomplish this. Let me know what else you need.
>
> Thanks,
> Dinakar
>
> On Sun, Jul 25, 2021 at 10:35 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>
>> Hi,
>>
>> Right, you seem to have an on-premise Hadoop cluster of 9 physical boxes and you want to deploy Spark on it.
>>
>> What spec do you have for each physical host: memory, CPU and disk space?
>>
>> You can take advantage of what is known as data affinity by putting your compute layer (Spark) on the same Hadoop nodes.
>>
>> Having Hadoop implies that you have a YARN resource manager already, plus HDFS. YARN is the most widely used resource manager on-premise for Spark.
>>
>> Provide some additional info and we will go from there.
>>
>> HTH
>>
>> View my LinkedIn profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>>
>> On Sat, 24 Jul 2021 at 13:46, Dinakar Chennubotla <chennu.bigd...@gmail.com> wrote:
>>
>>> Hi All,
>>>
>>> I am Dinakar, a Hadoop admin. Could someone help me here?
>>>
>>> 1. I have a DEV POC task to do:
>>> 2. install a distributed Apache Spark cluster in cluster mode on Docker containers,
>>> 3. with scalable spark-worker containers.
>>> 4. We have a 9-node cluster with some other services and tools.
>>>
>>> Thanks,
>>> Dinakar
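As a footnote to the question above about per-host memory and CPU: if the cluster already runs YARN, the ResourceManager REST API can report per-node capacity directly. A small hedged sketch follows; the ResourceManager hostname is a placeholder, and the JSON field names (taken from the Hadoop 2.x/3.x ResourceManager REST documentation) can vary slightly between versions.

    # list_yarn_nodes.py -- query the YARN ResourceManager REST API for per-node capacity.
    # The RM host below is a placeholder; 8088 is the default ResourceManager web port.
    import json
    from urllib.request import urlopen

    RM_URL = "http://resourcemanager.example.com:8088/ws/v1/cluster/nodes"  # placeholder host

    def main():
        with urlopen(RM_URL) as resp:
            payload = json.load(resp)

        # Field names (availMemoryMB, availableVirtualCores, ...) follow the documented
        # ResourceManager nodes API; they may differ slightly by Hadoop version.
        for node in payload.get("nodes", {}).get("node", []):
            print(
                "{id}: state={state}, memory used/avail = {used}/{avail} MB, "
                "vcores used/avail = {vused}/{vavail}".format(
                    id=node.get("id"),
                    state=node.get("state"),
                    used=node.get("usedMemoryMB"),
                    avail=node.get("availMemoryMB"),
                    vused=node.get("usedVirtualCores"),
                    vavail=node.get("availableVirtualCores"),
                )
            )

    if __name__ == "__main__":
        main()

Note this reports what YARN has been configured to offer (yarn.nodemanager.resource.memory-mb and yarn.nodemanager.resource.cpu-vcores), not the raw hardware, but it is usually enough for a first pass at sizing executors.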