Hi Mich,

Before I go to my team and convey to them the things you said,
I would like to have a quick word with you.
Can I have a call with you for a few minutes to discuss this?

Thanks,
Dinakar


On Mon, Jul 26, 2021 at 1:43 AM Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> Hi Dinakar,
>
>
> Hi Mich,
>
> Posting my comments inline:
>
> Right, you seem to have an on-premise Hadoop cluster of 9 physical boxes
> and you want to deploy spark on it.
> *My comment: Yes.*
>
> What spec do you have for each physical host memory and CPU and disk space?
> *My comment: I am not sure of the exact numbers, but all I can say is that
> there is enough capacity to deploy a few more tools across the 9 nodes.*
>
> --> Well, you can get that from your infrastructure team. In all probability
> you have 9 data nodes, something like DL380s or better with >= 64 GB of RAM
> and quad-core CPUs. Disk space does not matter for now.
>
> You can take advantage of what is known as data affinity by putting your
> compute layer (Spark) on the same Hadoop nodes.
> *My comment: I am not aware of this; I need to check along these lines.*
>
> --> A better term is data locality, and it is still very useful in Hadoop
> clusters where Spark is installed. In a nutshell, "Spark is a data parallel
> processing framework, which means it will execute tasks as close to where
> the data lives as possible (i.e. minimize data transfer
> <https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/performance_optimization/data_locality.html>)."
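>
> As a minimal PySpark sketch of what that buys you (the HDFS path below is
> just an illustration, not a real path on your cluster):
>
>     from pyspark.sql import SparkSession
>
>     spark = SparkSession.builder.appName("locality-demo").getOrCreate()
>
>     # With the input stored in HDFS on the same nodes that run the executors,
>     # YARN can place each task on a node that holds that block (NODE_LOCAL),
>     # so most reads come from local disk instead of the network.
>     df = spark.read.text("hdfs:///data/sample.txt")  # hypothetical path
>     print(df.count())
>
>     spark.stop()
>
> The Spark UI shows the locality level achieved per task, which is a quick
> way to confirm this on your own cluster.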
>
>
>
> Having Hadoop implies that you have a YARN resource manager already plus
> HDFS. YARN is the most widely used resource manager on-premise for Spark.
> *My comment: we haven't installed Spark in our cluster yet.*
>
> *--> *When you install Hadoop you install Hadoop Core, which ships with
> these artefacts (a quick sketch of how Spark plugs into them follows the list):
>
>
>    1. Hadoop Distributed File System (HDFS)
>    2. MapReduce
>    3. YARN resource management
>
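> As a rough sketch (the HADOOP_CONF_DIR path is an assumption, use whatever
> your install has), a Spark installed on the same nodes simply points at the
> existing Hadoop configuration and reuses HDFS and YARN:
>
>     import os
>     from pyspark.sql import SparkSession
>
>     # Spark picks up core-site.xml / yarn-site.xml from HADOOP_CONF_DIR,
>     # so it reuses the HDFS and YARN you already have.
>     os.environ.setdefault("HADOOP_CONF_DIR", "/etc/hadoop/conf")  # assumed path
>
>     spark = (SparkSession.builder
>              .master("yarn")               # run on the existing YARN cluster
>              .appName("hadoop-core-check")
>              .getOrCreate())
>
>     print(spark.sparkContext.master)       # prints "yarn"
>     spark.stop()
>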
> OK, so what does Spark offer?
>
> Generally, a parallel architecture comes into play when the data is
> significantly large and cannot be handled on a single machine; that is when
> the use of Spark becomes meaningful.
> In cases where the (generated) data size is going to be very large (which
> is often the norm rather than the exception these days), the data cannot be
> processed and stored in Pandas data frames, as these data frames hold their
> data in RAM. The whole dataset cannot simply be collected from storage such
> as HDFS or cloud storage, because that would take significant time and space
> and probably would not fit in a single machine's RAM.
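>
> A small sketch of the contrast (both file paths are made up):
>
>     import pandas as pd
>     from pyspark.sql import SparkSession
>
>     # Pandas loads the whole file into the driver's RAM -- fine for a small
>     # extract, impossible once the data outgrows a single machine.
>     small = pd.read_csv("/tmp/small_extract.csv")
>
>     # Spark splits the equivalent read into partitions that are processed in
>     # parallel across the executors, so nothing has to fit on one machine.
>     spark = SparkSession.builder.appName("pandas-vs-spark").getOrCreate()
>     big = spark.read.csv("hdfs:///data/big_dataset.csv", header=True)
>     print(big.count())
>
>     spark.stop()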
>
> So you are replacing MapReduce on disk with its equivalent in memory. Think
> of Spark as MapReduce on steroids, sort of :).
>
>
>
> Additional information:
> ==================
> Agenda:
> 1. Implementation of Apache Mesos or Apache Hadoop YARN, including a Spark
> service with cluster mode,
> so that if I submit PySpark or Spark Scala jobs with "deploy-mode =
> cluster", they should work.
>
> --> Either way, with YARN it will work with cluster or client deployment
> mode. This is from this article
> <https://www.linkedin.com/pulse/real-time-processing-trade-data-kafka-flume-spark-talebzadeh-ph-d-/>:
> The term *Deployment mode of Spark* simply means “where the driver
> program will be run”. There are two ways, namely *Spark Client Mode*
> <https://spark.apache.org/docs/latest/running-on-yarn.html> and *Spark
> Cluster Mode* <https://spark.apache.org/docs/latest/cluster-overview.html>.
> These are described below:
>
> In Client mode, *the driver daemon runs in the node through which you
> submit the spark job to your cluster.* This is often done through the Edge
> Node. This mode is valuable when you want to use Spark interactively, as in
> our case where we would like to display high value prices in the dashboard.
> In Client mode you do not need to reserve any resources from your cluster
> for the driver daemon.
>
> In Cluster mode, *you submit the spark job to your cluster and the driver
> daemon runs inside your cluster, in the application master*. In this mode
> you do not get to use the spark job interactively, as the client through
> which you submit the job is gone as soon as it successfully submits the job
> to the cluster. You will have to reserve some resources for the driver
> daemon process as it will be running in your cluster.
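>
> A minimal sketch of how that looks from the edge node (my_job.py is a
> made-up file name; the only difference between the two modes is the
> spark-submit line):
>
>     # my_job.py -- a trivial PySpark application
>     from pyspark.sql import SparkSession
>
>     spark = SparkSession.builder.appName("deploy-mode-demo").getOrCreate()
>     print(spark.range(1000).count())   # in cluster mode this ends up in the YARN logs
>     spark.stop()
>
>     # Client mode: the driver runs on the edge node you submit from
>     #   spark-submit --master yarn --deploy-mode client my_job.py
>     # Cluster mode: the driver runs inside the YARN application master
>     #   spark-submit --master yarn --deploy-mode cluster my_job.py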
>
> 2. This implementation should be in Docker containers.
>
> --> Why? Are you going to deploy Spark on Kubernetes? Is whoever is
> insisting on this thinking of portability?
>
> 3. I have to write Dockerfiles with Apache Hadoop with YARN and Spark
> (open source only).
> How can I do this?
>
> --> I know it is fashionable to deploy Spark on Kubernetes (Docker inside
> pods) for resilience and scalability, but you need to ask whoever is
> requesting this to justify running Spark inside Docker as opposed to Spark
> running alongside Hadoop on premise.
>
> 4. To implement Apache Mesos with Spark and deploy-mode = cluster:
> if you have any kind of documentation, weblinks or knowledge, could you
> share that with me? It will really help me a lot.
>
> 5. We have services like MinIO, Trino, Superset, Jupyter, and so on.
>
> OK, we will cross the bridge for 4 and 5 when there is justification for 2.
> The alternative is for Spark to be installed on each, or a few designated,
> Hadoop nodes on your physical hosts on premise.
>
>
> Kindly help me to accomplish this.
> Let me know what else you need.
>
> Thanks,
> Dinakar
>
>
>
> On Sun, Jul 25, 2021 at 10:35 PM Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Hi,
>>
>> Right, you seem to have an on-premise Hadoop cluster of 9 physical boxes
>> and you want to deploy Spark on it.
>>
>> What spec do you have for each physical host memory and CPU and disk
>> space?
>>
>> You can take advantage of what is known as data affinity by putting your
>> compute layer (Spark) on the same Hadoop nodes.
>>
>> Having Hadoop implies that you have a YARN resource manager already, plus
>> HDFS. YARN is the most widely used resource manager on-premise for Spark.
>>
>> Provide some additional info and we will go from there.
>>
>> HTH
>>
>>
>>
>>
>>
>>    view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Sat, 24 Jul 2021 at 13:46, Dinakar Chennubotla <
>> chennu.bigd...@gmail.com> wrote:
>>
>>> Hi All,
>>>
>>> I am Dinakar, a Hadoop admin.
>>> Could someone help me here?
>>>
>>> 1. I have a DEV-POC task to do.
>>> 2. I need to install a distributed Apache Spark cluster, in cluster mode,
>>> on Docker containers,
>>> 3. with scalable Spark worker containers.
>>> 4. We have a 9-node cluster with some other services or tools.
>>>
>>> Thanks,
>>> Dinakar
>>>
>>
