Hi Dinakar

If your aim is to run Spark in “distributed mode”, then all of these cluster 
modes (excluding local) run the cluster in distributed mode anyway.
As I said before, “deploy mode = cluster” only affects the driver application; 
the executors run on the worker nodes in parallel (distributed) either way. 
This is how Spark works. So you can only choose where to run the “driver” 
application, which defines what to do and waits for the application to finish, 
while the actual/most of the work is done on the worker nodes.
With that in mind, you can start (submit) your Python code locally and target 
a cluster started in standalone mode (using Docker, for example) and still get 
distributed execution.

# Run a Python application on a Spark standalone cluster
./bin/spark-submit \
  --master spark://207.184.161.138:7077 \
  examples/src/main/python/pi.py \
  1000
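
In case it helps with the “cluster started in standalone mode using Docker” part, 
below is a rough sketch of how a standalone master and one worker could be run as 
containers so that the spark-submit above has something to target. The image name 
my-spark-image, the spark-net network and the /opt/spark path are only placeholders 
for whatever image you build, so adjust them to your setup:

# Rough sketch: standalone master + one worker in Docker containers
# (my-spark-image and /opt/spark are placeholders -- adjust to your own image/paths)
docker network create spark-net

docker run -d --name spark-master --network spark-net \
  -p 7077:7077 -p 8080:8080 my-spark-image \
  /opt/spark/bin/spark-class org.apache.spark.deploy.master.Master

docker run -d --name spark-worker-1 --network spark-net my-spark-image \
  /opt/spark/bin/spark-class org.apache.spark.deploy.worker.Worker spark://spark-master:7077

Starting more containers with the same Worker command is how you scale the cluster 
out; the spark-submit above would then point at spark://<master-host>:7077.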


Take a look at the link below and the snippet from there:


https://spark.apache.org/docs/latest/submitting-applications.html



† A common deployment strategy is to submit your application from a gateway 
machine that is physically co-located with your worker machines (e.g. Master 
node in a standalone EC2 cluster). In this setup, client mode is appropriate. 
In client mode, the driver is launched directly within the spark-submit process 
which acts as a client to the cluster. The input and output of the application 
is attached to the console. Thus, this mode is especially suitable for 
applications that involve the REPL (e.g. Spark shell).

Alternatively, if your application is submitted from a machine far from the 
worker machines (e.g. locally on your laptop), it is common to use cluster mode 
to minimize network latency between the drivers and the executors. Currently, 
the standalone mode does not support cluster mode for Python applications.
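
For clarity, here is roughly how the two deploy modes look on the command line. 
This is only a sketch: the master address is the one from the example above, and 
the examples jar name depends on your Spark version. As the docs say, 
--deploy-mode cluster is rejected for Python applications on standalone, so the 
cluster-mode example uses a jar application instead:

# Client mode (the default): the driver runs inside the spark-submit process
./bin/spark-submit \
  --master spark://207.184.161.138:7077 \
  --deploy-mode client \
  examples/src/main/python/pi.py 1000

# Cluster mode: the driver is launched inside the cluster
# (not allowed for Python on a standalone cluster; jar applications only)
./bin/spark-submit \
  --master spark://207.184.161.138:7077 \
  --deploy-mode cluster \
  --class org.apache.spark.examples.SparkPi \
  examples/jars/spark-examples_2.12-3.1.2.jar 1000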


Regarding Mesos and YARN, I can’t comment on those as I don’t have experience 
with those modes, but I found this, which may be relevant for you: 
https://stackoverflow.com/questions/36461054/i-cant-seem-to-get-py-files-on-spark-to-work

Another suggestion is to keep CCing the Spark user group email, so if I can’t 
answer then someone else may be able to. I am CCing it now and you can reply all.

Hope this all helps.

Regards
Khalid

Sent from my iPad

> On 25 Jul 2021, at 10:50, Dinakar Chennubotla <chennu.bigd...@gmail.com> 
> wrote:
> 
> Hi Khalid Mammadov,
> 
> I am now reworking from scratch, i.e. how to build a distributed 
> Apache Spark cluster using YARN or Apache Mesos.
> 
> Sending you my initial sketch, a pictorial representation of the same.
> 
> Could you help me with the below:
> ==========================
> As per the Diagram,
> 1. I have to write Dockerfiles with Apache Hadoop with YARN and Spark 
> (open source only).
> How can I do this?
> Your comments :
> 
> 2. To implement Apache Mesos with Spark and deployment mode = cluster:
> if you have any kind of documentation, web links or knowledge, could you 
> share that with me?
> It would really help me a lot.
> your comments:
> 
> Thanks,
> Dinakar
> 
> On Sun, Jul 25, 2021 at 12:56 PM Khalid Mammadov <khalidmammad...@gmail.com> 
> wrote:
>> Sorry Dinakar, unfortunately I don’t have much availability, but you can drop 
>> me your questions and I would be happy to help as much as I can.
>> 
>> On Sun, 25 Jul 2021, 04:17 Dinakar Chennubotla, <chennu.bigd...@gmail.com> 
>> wrote:
>>> Agenda:
>>> 1. How to implement Apache Mesos or Apache Hadoop YARN, including the 
>>> Spark service with cluster mode.
>>> 2. Exploration of dockerizing the above tools.
>>> 
>>> 
>>> Thanks,
>>> Dinakar
>>> 
>>> On Sun, 25 Jul, 2021, 08:43 Dinakar Chennubotla, <chennu.bigd...@gmail.com> 
>>> wrote:
>>>> Hi Khalid Mammadov,
>>>> 
>>>> With all the mail discussion that we have had till now, you have got a 
>>>> brief picture of my issue.
>>>> 
>>>> I would like to request that we plan a Zoom meeting and complete this in 
>>>> no more than one or two sessions.
>>>> 
>>>> Kindly, let me know your availability and comments.
>>>> 
>>>> If not, we will continue our mail discussion.
>>>> 
>>>> Thanks,
>>>> Dinakar
>>>> 
>>>> On Sun, 25 Jul, 2021, 01:12 Khalid Mammadov, <khalidmammad...@gmail.com> 
>>>> wrote:
>>>>> Had another look at your screenshot. It's also about Python: as PySpark is 
>>>>> a wrapper for Java and the cluster runs on Java (the JVM), it can't run a 
>>>>> Python driver inside. That means you can only run .jar files in cluster mode.
>>>>> 
>>>>> Hope all this makes sense.
>>>>> 
>>>>> On Sat, 24 Jul 2021, 19:58 Khalid Mammadov, <khalidmammad...@gmail.com> 
>>>>> wrote:
>>>>>> From that link:
>>>>>> Deploy mode
>>>>>> Distinguishes where the driver process runs. In "cluster" mode, the 
>>>>>> framework launches the driver inside of the cluster. In "client" mode, 
>>>>>> the submitter launches the driver outside of the cluster.
>>>>>> 
>>>>>> On Sat, 24 Jul 2021, 19:54 Khalid Mammadov, <khalidmammad...@gmail.com> 
>>>>>> wrote:
>>>>>>> OK, now I see what the problem is. You get that error on spark-submit.
>>>>>>> 
>>>>>>> The error actually says that you can't run a driver on a standalone 
>>>>>>> cluster in cluster mode, as that behaviour is only supported by 
>>>>>>> Mesos and YARN, I think. 
>>>>>>> You need to read a bit about how Spark executes a job on a cluster. 
>>>>>>> Essentially there are two modes: in the first, the job is submitted 
>>>>>>> locally to the cluster, so your driver (orchestration) runs on your 
>>>>>>> local machine while the execution (the executors) happens on the 
>>>>>>> cluster. In the second, both your driver and your executors are 
>>>>>>> running on the cluster. 
>>>>>>> 
>>>>>>> Check this: https://spark.apache.org/docs/latest/cluster-overview.html
>>>>>>> 
>>>>>>> On Sat, 24 Jul 2021, 19:32 Dinakar Chennubotla, 
>>>>>>> <chennu.bigd...@gmail.com> wrote:
>>>>>>>> Sharing the screenshot.
>>>>>>>> 
>>>>>>>> Modified the pyspark command with yarn instead of the master IP and ran it.
>>>>>>>> Got the same error.
>>>>>>>> 
>>>>>>>> On Sat, 24 Jul, 2021, 23:30 Dinakar Chennubotla, 
>>>>>>>> <chennu.bigd...@gmail.com> wrote:
>>>>>>>>> Hi Khalid Mammadov,
>>>>>>>>> 
>>>>>>>>> Yes, sure, I will compare with yours and let you know.
>>>>>>>>> 
>>>>>>>>> Meanwhile you can also have a look; here are my Docker files:
>>>>>>>>> https://hub.docker.com/repository/docker/chennu1986/spark_redeploy
>>>>>>>>> 
>>>>>>>>> Thanks,
>>>>>>>>> Dinakar
>>>>>>>>> 
>>>>>>>>> On Sat, Jul 24, 2021 at 8:59 PM Khalid Mammadov 
>>>>>>>>> <khalidmammad...@gmail.com> wrote:
>>>>>>>>>> Can you share your Dockerfile (not all of it, just the gist), 
>>>>>>>>>> instructions on how you do it, and what you actually run to get that 
>>>>>>>>>> message?
>>>>>>>>>> 
>>>>>>>>>> I have just pushed to GitHub my local repo where I created an 
>>>>>>>>>> example of Spark on Docker some time ago.
>>>>>>>>>> Please take a look and compare with what you are doing. 
>>>>>>>>>> 
>>>>>>>>>> https://github.com/khalidmammadov/spark_docker
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Sat, Jul 24, 2021 at 4:07 PM Dinakar Chennubotla 
>>>>>>>>>> <chennu.bigd...@gmail.com> wrote:
>>>>>>>>>>> Hi Khalid Mammadov,
>>>>>>>>>>> 
>>>>>>>>>>> I tried the one which describes a distributed-mode Spark installation. 
>>>>>>>>>>> But when I run the below command, it says "deployment mode = cluster 
>>>>>>>>>>> is not allowed in standalone cluster".
>>>>>>>>>>> 
>>>>>>>>>>> Source Url I used is:
>>>>>>>>>>> https://towardsdatascience.com/diy-apache-spark-docker-bb4f11c10d24?gi=fa52ac767c0b
>>>>>>>>>>> 
>>>>>>>>>>> Kindly refer to this section in the URL I mentioned:
>>>>>>>>>>> "Docker & Spark — Multiple Machines"
>>>>>>>>>>> 
>>>>>>>>>>> I removed the third-party things and dockerized it my way.
>>>>>>>>>>> 
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Dinakar
>>>>>>>>>>> 
>>>>>>>>>>> On Sat, 24 Jul, 2021, 20:28 Khalid Mammadov, 
>>>>>>>>>>> <khalidmammad...@gmail.com> wrote:
>>>>>>>>>>>> Standalone mode already implies you are running in cluster 
>>>>>>>>>>>> (distributed) mode, i.e. it's one of the 4 available cluster manager 
>>>>>>>>>>>> options. The difference is that Standalone uses its own resource 
>>>>>>>>>>>> manager rather than using YARN, for example.
>>>>>>>>>>>> If you are running Docker on a single machine then you are limited 
>>>>>>>>>>>> to that, but if you run Docker on a cluster and deploy your 
>>>>>>>>>>>> Spark containers on it then you will get your distribution and 
>>>>>>>>>>>> cluster mode.
>>>>>>>>>>>> Also, if you are referring to scalability, then you need to 
>>>>>>>>>>>> register worker nodes when you need to scale. 
>>>>>>>>>>>> You do this by registering a VM/container as a worker node, as per 
>>>>>>>>>>>> the docs, using:
>>>>>>>>>>>> ./sbin/start-worker.sh <master-spark-URL>
>>>>>>>>>>>> 
>>>>>>>>>>>> You can create a new Docker container from your base image and run 
>>>>>>>>>>>> the above command at bootstrap, and that would register a 
>>>>>>>>>>>> worker node and scale your cluster up when you want. 
>>>>>>>>>>>> And if you kill them then you would scale down (I think this is 
>>>>>>>>>>>> how Databricks autoscaling works). I am not sure about k8s TBH; 
>>>>>>>>>>>> perhaps it handles this more gracefully.
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> On Sat, Jul 24, 2021 at 3:38 PM Dinakar Chennubotla 
>>>>>>>>>>>> <chennu.bigd...@gmail.com> wrote:
>>>>>>>>>>>>> Hi Khalid Mammadov,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Thank you for your response,
>>>>>>>>>>>>> Yes, I did; I built a standalone Apache Spark cluster on Docker 
>>>>>>>>>>>>> containers.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> But I am looking for a distributed Spark cluster,
>>>>>>>>>>>>> where the Spark workers are scalable and Spark "deployment mode = 
>>>>>>>>>>>>> cluster".
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Source URL I used to build the standalone Apache Spark cluster:
>>>>>>>>>>>>> https://www.kdnuggets.com/2020/07/apache-spark-cluster-docker.html
>>>>>>>>>>>>> 
>>>>>>>>>>>>> If you have documentation on distributed Spark, which is what I am 
>>>>>>>>>>>>> looking for, could you please send it to me?
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Dinakar
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Sat, 24 Jul, 2021, 19:32 Khalid Mammadov, 
>>>>>>>>>>>>> <khalidmammad...@gmail.com> wrote:
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Have you checked out docs?
>>>>>>>>>>>>>> https://spark.apache.org/docs/latest/spark-standalone.html
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Khalid
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Sat, Jul 24, 2021 at 1:45 PM Dinakar Chennubotla 
>>>>>>>>>>>>>> <chennu.bigd...@gmail.com> wrote:
>>>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> I am Dinakar, a Hadoop admin. 
>>>>>>>>>>>>>>> Could someone help me here?
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 1. I have a DEV-POC task to do.
>>>>>>>>>>>>>>> 2. I need to install a distributed Apache Spark cluster with 
>>>>>>>>>>>>>>> cluster mode on Docker containers,
>>>>>>>>>>>>>>> 3. with scalable spark-worker containers.
>>>>>>>>>>>>>>> 4. We have a 9-node cluster with some other services or tools.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> Dinakar
> 
> <IMG_20210725_151230.jpg>
