Hi Mich! The shell script indeed looks more robust now :D
Yes, the current setup works fine. I am wondering whether it is the right
way to set things up, though. That is, should I run the program which
accepts requests from the queue independently and have it invoke the
spark-submit CLI, or something else? (Sketches of both options appear at
the end of this thread.) Thanks again.

Regards

On Thu, Jul 1, 2021 at 4:44 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> Hi Kartik,
>
> I parameterized your shell script and tested it on a stub python file. It
> looks OK and makes the script more robust:
>
> #!/bin/bash
> set -e
>
> #cd "$(dirname "${BASH_SOURCE[0]}")/../"
>
> pyspark_venv="pyspark_venv"
> source_zip_file="DSBQ.zip"
>
> # Remove artifacts left over from a previous run
> [ -d ${pyspark_venv} ] && rm -rf ${pyspark_venv}
> [ -f ${pyspark_venv}.tar.gz ] && rm -f ${pyspark_venv}.tar.gz
> [ -f ${source_zip_file} ] && rm -f ${source_zip_file}
>
> # Build a fresh virtual environment, install the dependencies and pack
> # it into a relocatable tar.gz with venv-pack
> python3 -m venv ${pyspark_venv}
> source ${pyspark_venv}/bin/activate
> pip install -r requirements_spark.txt
> pip install venv-pack
> venv-pack -o ${pyspark_venv}.tar.gz
>
> # The driver uses the local interpreter; the executors use the python
> # inside the archive, which is unpacked into the directory named after '#'
> export PYSPARK_DRIVER_PYTHON=python
> export PYSPARK_PYTHON=./${pyspark_venv}/bin/python
> spark-submit \
>     --master local[4] \
>     --conf "spark.yarn.dist.archives=${pyspark_venv}.tar.gz#${pyspark_venv}" \
>     /home/hduser/dba/bin/python/dynamic_ARRAY_generator_parquet.py
>
> HTH
>
> Mich
>
> On Wed, 30 Jun 2021 at 19:21, Kartik Ohri <kartikohr...@gmail.com> wrote:
>
>> Hi Mich!
>>
>> We use this in production, but indeed there is much scope for
>> improvement, configuration being one area :).
>>
>> Yes, we have a private on-premises cluster. We run Spark on YARN (no
>> Airflow etc.), which controls the scheduling, and use HDFS as a
>> datastore.
>>
>> Regards
>>
>> On Wed, Jun 30, 2021 at 11:41 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>
>>> Thanks for the details Kartik.
>>>
>>> Let me go through these. The code itself and the indentation look good.
>>>
>>> One minor thing I noticed is that you are not using a yaml file
>>> (config.yml) for your variables; you seem to embed them in your
>>> config.py code. That is what I used to do before :) until a friend
>>> advised me to initialise variables in yaml and read them in the python
>>> file. However, I guess that is a matter of personal style.
>>>
>>> Overall it looks neat. I believe you are running all of this
>>> on-premises and not using Airflow or Composer for your scheduling.
>>>
>>> Cheers
>>>
>>> Mich
>>>
>>> On Wed, 30 Jun 2021 at 18:39, Kartik Ohri <kartikohr...@gmail.com> wrote:
>>>
>>>> Hi Mich!
>>>>
>>>> Thanks for the reply.
>>>> The zip file contains all of the Spark-related code, particularly the
>>>> contents of this folder
>>>> <https://github.com/metabrainz/listenbrainz-server/tree/master/listenbrainz_spark>.
>>>> The requirements_spark.txt
>>>> <https://github.com/metabrainz/listenbrainz-server/blob/master/requirements_spark.txt>
>>>> file lives in the project and lists the non-Spark dependencies of the
>>>> python code. The tar.gz file is created according to the PySpark docs
>>>> <https://spark.apache.org/docs/latest/api/python/user_guide/python_packaging.html#using-virtualenv>
>>>> on dependency management; the spark.yarn.dist.archives setting also
>>>> comes from there.
>>>>
>>>> This is the python file
>>>> <https://github.com/metabrainz/listenbrainz-server/blob/master/spark_manage.py>
>>>> invoked by spark-submit to start the "RequestConsumer".
>>>>
>>>> Regards,
>>>> Kartik
>>>>
>>>> On Wed, Jun 30, 2021 at 9:02 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>
>>>>> Hi Kartik,
>>>>>
>>>>> Can you explain how you create your zip file? Does it include
>>>>> everything in your top project directory, as per PyCharm etc.?
>>>>>
>>>>> The rest looks OK, as you are creating a Python virtual environment:
>>>>>
>>>>> python3 -m venv pyspark_venv
>>>>> source pyspark_venv/bin/activate
>>>>>
>>>>> How do you create that requirements_spark.txt file?
>>>>>
>>>>> pip install -r requirements_spark.txt
>>>>> pip install venv-pack
>>>>>
>>>>> Where is this gz file used?
>>>>>
>>>>> venv-pack -o pyspark_venv.tar.gz
>>>>>
>>>>> Because I am not clear about the line below:
>>>>>
>>>>> --conf "spark.yarn.dist.archives"=pyspark_venv.tar.gz#environment \
>>>>>
>>>>> It would help if you walked us through the shell script itself for
>>>>> clarification.
>>>>>
>>>>> HTH,
>>>>>
>>>>> Mich
>>>>>
>>>>> On Wed, 30 Jun 2021 at 15:47, Kartik Ohri <kartikohr...@gmail.com> wrote:
>>>>>
>>>>>> Hi all!
>>>>>>
>>>>>> I am working on a Pyspark application and would like suggestions on
>>>>>> how it should be structured.
>>>>>>
>>>>>> We have a number of possible jobs, organized in modules. There is
>>>>>> also a "RequestConsumer
>>>>>> <https://github.com/metabrainz/listenbrainz-server/blob/master/listenbrainz_spark/request_consumer/request_consumer.py>"
>>>>>> class which consumes from a messaging queue. Each message contains
>>>>>> the name of the job to invoke and the arguments to pass to it.
>>>>>> Messages are put into the queue by cronjobs, manually, etc.
>>>>>>
>>>>>> We submit a zip file containing all python files to a Spark cluster
>>>>>> running on YARN and ask it to run the RequestConsumer. This
>>>>>> <https://github.com/metabrainz/listenbrainz-server/blob/master/docker/start-spark-request-consumer.sh#L23-L34>
>>>>>> is the exact spark-submit command, for the interested.
>>>>>> The results of the jobs are collected
>>>>>> <https://github.com/metabrainz/listenbrainz-server/blob/master/listenbrainz_spark/request_consumer/request_consumer.py#L120-L122>
>>>>>> by the request consumer and pushed into another queue.
>>>>>>
>>>>>> My question is whether this type of structure makes sense. Should
>>>>>> the RequestConsumer instead run independently of Spark and invoke
>>>>>> spark-submit when it needs to trigger a job? Or is there another
>>>>>> recommendation?
>>>>>>
>>>>>> Thank you all in advance for taking the time to read this email and
>>>>>> for helping.
>>>>>>
>>>>>> Regards,
>>>>>> Kartik.
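
A note on the packaging step discussed above: the PySpark dependency-management
guide linked in the thread also shows how to configure the same venv-pack
archive from inside the program instead of on the spark-submit command line.
Here is a minimal sketch along those lines, assuming the pyspark_venv.tar.gz
produced by the script; the app name and the unpack alias after '#' are
illustrative:

import os
from pyspark.sql import SparkSession

# Executors resolve this relative path inside the unpacked archive, so it
# must be set before the session is created.
os.environ["PYSPARK_PYTHON"] = "./pyspark_venv/bin/python"

spark = (
    SparkSession.builder
    .appName("venv_packed_job")
    # tar.gz produced by venv-pack; YARN unpacks it on each node into a
    # directory named by the part after '#'
    .config("spark.yarn.dist.archives", "pyspark_venv.tar.gz#pyspark_venv")
    .getOrCreate()
)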
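
On the architecture question itself: the setup described in the thread (one
long-lived PySpark driver, the RequestConsumer, which maps each queue message
to a job and pushes results to a second queue) boils down to a dispatch table
over a shared SparkSession. A minimal sketch of that pattern follows; the job
name, message shape and handler are invented for illustration, and the real
implementation is in the request_consumer.py linked above:

import json
from pyspark.sql import SparkSession

# One SparkSession for the lifetime of the consumer: jobs start with no
# submission overhead, but they all share the driver's resources and fate.
spark = SparkSession.builder.appName("request_consumer").getOrCreate()

def user_listen_counts(params):
    # Hypothetical job: count listens per user from a parquet dump.
    df = spark.read.parquet(params["path"])
    return [row.asDict() for row in df.groupBy("user_id").count().collect()]

# Each queue message names one of these jobs.
HANDLERS = {
    "stats.user.listen_counts": user_listen_counts,
}

def process_message(body: bytes) -> str:
    """Run the job named in a queue message; the returned payload is what
    the consumer would push onto the result queue."""
    message = json.loads(body)
    handler = HANDLERS[message["query"]]
    return json.dumps(handler(message.get("params", {})))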
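
The alternative Kartik asks about at the top of the thread, running the
consumer outside Spark and launching one spark-submit per message, could look
roughly like the sketch below. The entry-point script, the message shape and
the archive name are assumptions for illustration, not the project's actual
code:

import json
import subprocess

def handle_message(body: bytes) -> None:
    # One short-lived Spark application per queue message; a crashing job
    # cannot take the consumer down with it.
    request = json.loads(body)  # e.g. {"query": "...", "params": {...}}
    subprocess.run(
        [
            "spark-submit",
            "--master", "yarn",
            "--archives", "pyspark_venv.tar.gz#pyspark_venv",
            "run_one_job.py",  # hypothetical single-job entry point
            request["query"],
            json.dumps(request.get("params", {})),
        ],
        check=True,  # raise if YARN reports the job failed
    )

The trade-off is per-job startup latency (a fresh YARN application every time)
in exchange for that isolation. Which option fits better depends mostly on job
frequency: a long-lived driver amortizes startup cost across many small jobs,
while per-job submission suits infrequent, heavy ones.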