Hi Mich! The shell script indeed looks more robust now :D
Yes, the current setup works fine. I am wondering whether it is the right
way to set things up, though. That is, should I run the program which
accepts requests from the queue independently and have it invoke the
spark-submit CLI, or something else? (Sketches of both options appear at
the end of this thread.) Thanks again.

Regards

On Thu, Jul 1, 2021 at 4:44 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> Hi Kartik,
>
> I parameterized your shell script and tested it on a stub python file. It
> looks OK and makes the script more robust:
>
> #!/bin/bash
> set -e
>
> #cd "$(dirname "${BASH_SOURCE[0]}")/../"
>
> pyspark_venv="pyspark_venv"
> source_zip_file="DSBQ.zip"
>
> # Remove artifacts left over from a previous run
> [ -d ${pyspark_venv} ] && rm -rf ${pyspark_venv}
> [ -f ${pyspark_venv}.tar.gz ] && rm -f ${pyspark_venv}.tar.gz
> [ -f ${source_zip_file} ] && rm -f ${source_zip_file}
>
> # Build a fresh virtual environment, install the dependencies and pack
> # it into a relocatable tar.gz with venv-pack
> python3 -m venv ${pyspark_venv}
> source ${pyspark_venv}/bin/activate
> pip install -r requirements_spark.txt
> pip install venv-pack
> venv-pack -o ${pyspark_venv}.tar.gz
>
> # The driver uses the local interpreter; the executors use the python
> # inside the archive, which is unpacked into the directory named after '#'
> export PYSPARK_DRIVER_PYTHON=python
> export PYSPARK_PYTHON=./${pyspark_venv}/bin/python
> spark-submit \
>     --master local[4] \
>     --conf "spark.yarn.dist.archives=${pyspark_venv}.tar.gz#${pyspark_venv}" \
>     /home/hduser/dba/bin/python/dynamic_ARRAY_generator_parquet.py
>
> HTH
>
> Mich
>
> On Wed, 30 Jun 2021 at 19:21, Kartik Ohri <kartikohr...@gmail.com> wrote:
>
>> Hi Mich!
>>
>> We use this in production, but indeed there is much scope for
>> improvement, configuration being one area :).
>>
>> Yes, we have a private on-premises cluster. We run Spark on YARN (no
>> Airflow etc.), which controls the scheduling, and use HDFS as a
>> datastore.
>>
>> Regards
>>
>> On Wed, Jun 30, 2021 at 11:41 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>
>>> Thanks for the details Kartik.
>>>
>>> Let me go through these. The code itself and the indentation look good.
>>>
>>> One minor thing I noticed is that you are not using a yaml file
>>> (config.yml) for your variables; you seem to embed them in your
>>> config.py code. That is what I used to do before :) until a friend
>>> advised me to initialise variables in yaml and read them in the python
>>> file. However, I guess that is a matter of personal style.
>>>
>>> Overall it looks neat. I believe you are running all of this
>>> on-premises and not using Airflow or Composer for your scheduling.
>>>
>>> Cheers
>>>
>>> Mich
>>>
>>> On Wed, 30 Jun 2021 at 18:39, Kartik Ohri <kartikohr...@gmail.com> wrote:
>>>
>>>> Hi Mich!
>>>>
>>>> Thanks for the reply.
>>>> The zip file contains all of the Spark-related code, particularly the
>>>> contents of this folder
>>>> <https://github.com/metabrainz/listenbrainz-server/tree/master/listenbrainz_spark>.
>>>> The requirements_spark.txt
>>>> <https://github.com/metabrainz/listenbrainz-server/blob/master/requirements_spark.txt>
>>>> file lives in the project and lists the non-Spark dependencies of the
>>>> python code. The tar.gz file is created according to the PySpark docs
>>>> <https://spark.apache.org/docs/latest/api/python/user_guide/python_packaging.html#using-virtualenv>
>>>> on dependency management; the spark.yarn.dist.archives setting also
>>>> comes from there.
>>>>
>>>> This is the python file
>>>> <https://github.com/metabrainz/listenbrainz-server/blob/master/spark_manage.py>
>>>> invoked by spark-submit to start the "RequestConsumer".
>>>>
>>>> Regards,
>>>> Kartik
>>>>
>>>> On Wed, Jun 30, 2021 at 9:02 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>
>>>>> Hi Kartik,
>>>>>
>>>>> Can you explain how you create your zip file? Does it include
>>>>> everything in your top project directory, as per PyCharm etc.?
>>>>>
>>>>> The rest looks OK, as you are creating a Python virtual environment:
>>>>>
>>>>> python3 -m venv pyspark_venv
>>>>> source pyspark_venv/bin/activate
>>>>>
>>>>> How do you create that requirements_spark.txt file?
>>>>>
>>>>> pip install -r requirements_spark.txt
>>>>> pip install venv-pack
>>>>>
>>>>> Where is this gz file used?
>>>>>
>>>>> venv-pack -o pyspark_venv.tar.gz
>>>>>
>>>>> Because I am not clear about the line below:
>>>>>
>>>>> --conf "spark.yarn.dist.archives"=pyspark_venv.tar.gz#environment \
>>>>>
>>>>> It would help if you walked us through the shell script itself for
>>>>> clarification.
>>>>>
>>>>> HTH,
>>>>>
>>>>> Mich
>>>>>
>>>>> On Wed, 30 Jun 2021 at 15:47, Kartik Ohri <kartikohr...@gmail.com> wrote:
>>>>>
>>>>>> Hi all!
>>>>>>
>>>>>> I am working on a Pyspark application and would like suggestions on
>>>>>> how it should be structured.
>>>>>>
>>>>>> We have a number of possible jobs, organized in modules. There is
>>>>>> also a "RequestConsumer
>>>>>> <https://github.com/metabrainz/listenbrainz-server/blob/master/listenbrainz_spark/request_consumer/request_consumer.py>"
>>>>>> class which consumes from a messaging queue. Each message contains
>>>>>> the name of the job to invoke and the arguments to pass to it.
>>>>>> Messages are put into the queue by cronjobs, manually, etc.
>>>>>>
>>>>>> We submit a zip file containing all python files to a Spark cluster
>>>>>> running on YARN and ask it to run the RequestConsumer. This
>>>>>> <https://github.com/metabrainz/listenbrainz-server/blob/master/docker/start-spark-request-consumer.sh#L23-L34>
>>>>>> is the exact spark-submit command, for the interested.
>>>>>> The results of the jobs are collected
>>>>>> <https://github.com/metabrainz/listenbrainz-server/blob/master/listenbrainz_spark/request_consumer/request_consumer.py#L120-L122>
>>>>>> by the request consumer and pushed into another queue.
>>>>>>
>>>>>> My question is whether this type of structure makes sense. Should
>>>>>> the RequestConsumer instead run independently of Spark and invoke
>>>>>> spark-submit when it needs to trigger a job? Or is there another
>>>>>> recommendation?
>>>>>>
>>>>>> Thank you all in advance for taking the time to read this email and
>>>>>> for helping.
>>>>>>
>>>>>> Regards,
>>>>>> Kartik.
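
A note on the packaging step discussed above: the PySpark dependency-management
guide linked in the thread also shows how to configure the same venv-pack
archive from inside the program instead of on the spark-submit command line.
Here is a minimal sketch along those lines, assuming the pyspark_venv.tar.gz
produced by the script; the app name and the unpack alias after '#' are
illustrative:

import os
from pyspark.sql import SparkSession

# Executors resolve this relative path inside the unpacked archive, so it
# must be set before the session is created.
os.environ["PYSPARK_PYTHON"] = "./pyspark_venv/bin/python"

spark = (
    SparkSession.builder
    .appName("venv_packed_job")
    # tar.gz produced by venv-pack; YARN unpacks it on each node into a
    # directory named by the part after '#'
    .config("spark.yarn.dist.archives", "pyspark_venv.tar.gz#pyspark_venv")
    .getOrCreate()
)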
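
On the architecture question itself: the setup described in the thread (one
long-lived PySpark driver, the RequestConsumer, which maps each queue message
to a job and pushes results to a second queue) boils down to a dispatch table
over a shared SparkSession. A minimal sketch of that pattern follows; the job
name, message shape and handler are invented for illustration, and the real
implementation is in the request_consumer.py linked above:

import json
from pyspark.sql import SparkSession

# One SparkSession for the lifetime of the consumer: jobs start with no
# submission overhead, but they all share the driver's resources and fate.
spark = SparkSession.builder.appName("request_consumer").getOrCreate()

def user_listen_counts(params):
    # Hypothetical job: count listens per user from a parquet dump.
    df = spark.read.parquet(params["path"])
    return [row.asDict() for row in df.groupBy("user_id").count().collect()]

# Each queue message names one of these jobs.
HANDLERS = {
    "stats.user.listen_counts": user_listen_counts,
}

def process_message(body: bytes) -> str:
    """Run the job named in a queue message; the returned payload is what
    the consumer would push onto the result queue."""
    message = json.loads(body)
    handler = HANDLERS[message["query"]]
    return json.dumps(handler(message.get("params", {})))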
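
The alternative Kartik asks about at the top of the thread, running the
consumer outside Spark and launching one spark-submit per message, could look
roughly like the sketch below. The entry-point script, the message shape and
the archive name are assumptions for illustration, not the project's actual
code:

import json
import subprocess

def handle_message(body: bytes) -> None:
    # One short-lived Spark application per queue message; a crashing job
    # cannot take the consumer down with it.
    request = json.loads(body)  # e.g. {"query": "...", "params": {...}}
    subprocess.run(
        [
            "spark-submit",
            "--master", "yarn",
            "--archives", "pyspark_venv.tar.gz#pyspark_venv",
            "run_one_job.py",  # hypothetical single-job entry point
            request["query"],
            json.dumps(request.get("params", {})),
        ],
        check=True,  # raise if YARN reports the job failed
    )

The trade-off is per-job startup latency (a fresh YARN application every time)
in exchange for that isolation. Which option fits better depends mostly on job
frequency: a long-lived driver amortizes startup cost across many small jobs,
while per-job submission suits infrequent, heavy ones.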