Hi, I think that reading Matei Zaharia's book "Spark: The Definitive Guide" would be a good starting point.
Regards,
Gourav Sengupta

On Wed, Jun 30, 2021 at 3:47 PM Kartik Ohri <kartikohr...@gmail.com> wrote:

> Hi all!
>
> I am working on a PySpark application and would like suggestions on how it
> should be structured.
>
> We have a number of possible jobs, organized in modules. There is also a
> "RequestConsumer
> <https://github.com/metabrainz/listenbrainz-server/blob/master/listenbrainz_spark/request_consumer/request_consumer.py>"
> class which consumes from a messaging queue. Each message contains the name
> of the job to invoke and the arguments to be passed to it. Messages are put
> into the message queue by cronjobs, manually, etc.
>
> We submit a zip file containing all the Python files to a Spark cluster
> running on YARN and ask it to run the RequestConsumer. This
> <https://github.com/metabrainz/listenbrainz-server/blob/master/docker/start-spark-request-consumer.sh#L23-L34>
> is the exact spark-submit command, for the interested. The results of the
> jobs are collected
> <https://github.com/metabrainz/listenbrainz-server/blob/master/listenbrainz_spark/request_consumer/request_consumer.py#L120-L122>
> by the request consumer and pushed into another queue.
>
> My question is whether this type of structure makes sense. Should the
> RequestConsumer instead run independently of Spark and invoke spark-submit
> when it needs to trigger a job? Or is there another recommendation?
>
> Thank you all in advance for taking the time to read this email and for
> helping.
>
> Regards,
> Kartik.
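For reference, a minimal sketch of the dispatch pattern Kartik describes: a long-lived consumer that shares one SparkSession across jobs, looks each message up in a job registry, and publishes the result to another queue. This assumes a RabbitMQ-style broker consumed via pika and JSON messages of the form {"query": "<job name>", "params": {...}}; the queue names, JOB_REGISTRY, and compute_daily_activity below are hypothetical illustrations, not taken from the linked code.

import json

import pika
from pyspark.sql import SparkSession

# One SparkSession shared by every job the consumer dispatches.
spark = SparkSession.builder.appName("request_consumer").getOrCreate()

def compute_daily_activity(params):
    # Placeholder job: each job runs some Spark work and returns a payload.
    df = spark.read.parquet(params["input_path"])
    return {"count": df.count()}

# Maps the job name carried in each message to the callable that runs it.
JOB_REGISTRY = {
    "daily_activity": compute_daily_activity,
}

def on_message(channel, method, properties, body):
    message = json.loads(body)
    job = JOB_REGISTRY[message["query"]]
    result = job(message.get("params", {}))
    # Push the job's result onto a (hypothetical) results queue.
    channel.basic_publish(exchange="",
                          routing_key="results",
                          body=json.dumps(result))
    channel.basic_ack(delivery_tag=method.delivery_tag)

def run_consumer():
    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    channel.queue_declare(queue="requests")
    channel.queue_declare(queue="results")
    channel.basic_consume(queue="requests", on_message_callback=on_message)
    channel.start_consuming()

if __name__ == "__main__":
    run_consumer()

The alternative raised in the question would keep this consumer loop outside Spark entirely and replace the direct job call with something like subprocess.run(["spark-submit", ...]) per message, trading the shared long-lived SparkSession for per-job isolation and normal cluster scheduling of each job.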