Hi Dinakar,

Hi Mich,

Posting my comments below,

Right, you seem to have an on-premise Hadoop cluster of 9 physical boxes
and you want to deploy Spark on it.
*My comment: Yes.*

What spec do you have for each physical host memory and CPU and disk space?
*My comment: I am not sure of the exact numbers, but all I can say is that
there is enough space to deploy a few more tools across the 9 nodes.*

--> Well, you can get this from the infrastructure guys. In all probability
you have 9 data nodes, some DL380s or better, with >= 64 GB of RAM and
quad-core CPUs or similar. Disk space does not matter for now.

You can take advantage of what is known as data affinity by putting your
compute layer (Spark) on the same Hadoop nodes.
*My comment: I am not aware of this; I need to check along these lines.*

--> A better term is data locality, and it is still very useful in Hadoop
clusters where Spark is installed. In a nutshell, "Spark is a data parallel
processing framework, which means it will execute tasks as close to where
the data lives as possible (i.e. minimize data transfer
<https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/performance_optimization/data_locality.html>
)."



Having Hadoop implies that you have a YARN resource manager already plus
HDFS. YARN is the most widely used resource manager on-premise for Spark.
*My comment: we haven't installed Spark in our cluster yet.*

--> When you install Hadoop, you install Hadoop Core, which comes with
these artefacts:

   1. Hadoop Distributed File System (HDFS)
   2. MapReduce
   3. YARN resource management
OK, so what does Spark offer?

Generally, a parallel architecture comes into play when the data size is
significantly large and cannot be handled on a single machine; this is
where the use of Spark becomes meaningful.
In cases where the (generated) data size is going to be very large (which
is often the norm rather than the exception these days), the data cannot be
processed and stored in Pandas data frames, as these data frames store data
in RAM. The whole dataset then cannot be collected from storage such as
HDFS or cloud storage, because it will take significant time and space and
probably won't fit in a single machine's RAM.

So you are replacing MapReduce on disk with its equivalent in memory. Think
of Spark as MapReduce on steroids, sort of :).
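
As a hedged illustration of the difference (the file path and column name
below are hypothetical), the pandas equivalent pd.read_csv would pull the
whole file into one machine's RAM, whereas the Spark version keeps the data
partitioned across the executors and only the small aggregated result comes
back to the driver:

from pyspark.sql import SparkSession

# Sketch only: assumes a hypothetical CSV on HDFS with a "ticker" column.
spark = SparkSession.builder.appName("spark_vs_pandas").getOrCreate()

# Unlike pandas.read_csv, this does not load the file into the driver's
# RAM; each executor processes only its own partitions of the file.
df = spark.read.csv("hdfs:///data/trades.csv", header=True, inferSchema=True)

# Only the per-ticker counts (a tiny result) are returned to the driver.
df.groupBy("ticker").count().show(20)

spark.stop()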



Additional information:
==================
Agenda:
1. Implementation of Apache Mesos or Apache Hadoop YARN, including a Spark
service in cluster mode, so that if I submit PySpark or Spark Scala jobs
with "deploy-mode = cluster", they work.

--> Either way, with YARN it will work in either cluster or client deployment mode.
This is from this article
<https://www.linkedin.com/pulse/real-time-processing-trade-data-kafka-flume-spark-talebzadeh-ph-d-/>:
The term *deployment mode of Spark* simply means "where the driver program
will be run". There are two modes, namely *Spark Client Mode*
<https://spark.apache.org/docs/latest/running-on-yarn.html> and *Spark
Cluster Mode* <https://spark.apache.org/docs/latest/cluster-overview.html>.
These are described below:

In Client mode, *the driver daemon runs on the node through which you
submit the Spark job to your cluster.* This is often done through the Edge
Node. This mode is valuable when you want to use Spark interactively, like
in our case where we would like to display high value prices in the
dashboard. In Client mode you do not need to reserve any resources from
your cluster for the driver daemon.

In Cluster mode, *you submit the Spark job to your cluster and the driver
daemon runs inside your cluster, in the Application Master*. In this mode
you do not get to use the Spark job interactively, as the client through
which you submit the job is gone as soon as it successfully submits the job
to the cluster. You will have to reserve some resources for the driver
daemon process, as it will be running in your cluster.
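
As a minimal sketch (the script name my_job.py and its contents are just
placeholders; it assumes Spark installed on an edge node with
HADOOP_CONF_DIR set), the only difference at submission time is the
--deploy-mode flag, but where the driver and its output live changes:

# Client mode: the driver runs on the edge node you submit from, so the
# print below appears in your terminal:
#   spark-submit --master yarn --deploy-mode client my_job.py
#
# Cluster mode: the driver runs inside the YARN Application Master, so the
# print below ends up in the driver container's YARN log instead:
#   spark-submit --master yarn --deploy-mode cluster my_job.py

# my_job.py
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("deploy_mode_demo").getOrCreate()
df = spark.range(1_000_000)        # a trivially generated DataFrame
print("row count:", df.count())
spark.stop()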

2. This implementation should be in Docker containers.

--> Why? Are you going to deploy Spark on Kubernetes? Whoever is insisting
on this is presumably thinking of portability.
3. I have to write Dockerfiles with Apache Hadoop with YARN and Spark (open
source only). How can I do this?
--> I know it is fashionable to deploy Spark on Kubernetes (Docker inside
pods) for resilience and scalability, but you need to ask whoever is
requesting this to justify having Spark inside Docker as opposed to Spark
running alongside Hadoop on premise.
4. To implement Apache Mesos with Spark and deployment mode = cluster:
if you have any kind of documentation, web links, or knowledge of your own,
could you share it with me?
It would really help me a lot.
5. We have services like MinIO, Trino, Superset, Jupyter, and so on.

--> OK, we can cross the bridge for 4 and 5 when there is justification for
2. The alternative is for Spark to be installed on each node, or on a few
designated Hadoop nodes, on your physical hosts on premise.


Kindly help me to accomplish this.
Let me know what else you need.

Thanks,
Dinakar



On Sun, Jul 25, 2021 at 10:35 PM Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> Hi,
>
> Right you seem to have an on-premise hadoop cluster of 9 physical boxes
> and you want to deploy spark on it.
>
> What spec do you have for each physical host memory and CPU and disk space?
>
> You can take what is known as data affinity by putting your compute layers
> (spark) on the same hadoop nodes.
>
> Having hadoop implies that you have a YARN resource manager already  plus
> HDFS. YARN is the most widely used resource manager on-premise for Spark.
>
> Provide some additional info and we go from there. .
>
> HTH
>
>
>
>
>
>    view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Sat, 24 Jul 2021 at 13:46, Dinakar Chennubotla <
> chennu.bigd...@gmail.com> wrote:
>
>> Hi All,
>>
>> I am Dinakar, Hadoop admin,
>> could someone help me here,
>>
>> 1. I have a DEV-POC task to do,
>> 2. Need to Installing Distributed apache-spark cluster with Cluster mode
>> on Docker containers.
>> 3. with Scalable spark-worker containers.
>> 4. we have a 9 node cluster with some other services or tools.
>>
>> Thanks,
>> Dinakar
>>
>
