Hi,

Have you put this question to the Databricks forum?
Data Engineering - Databricks <https://community.databricks.com/t5/data-engineering/bd-p/data-engineering>

Mich Talebzadeh,
Technologist | Solutions Architect | Data Engineer | Generative AI
London, United Kingdom

View my LinkedIn profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
https://en.everybodywiki.com/Mich_Talebzadeh

*Disclaimer:* The information provided is correct to the best of my knowledge but of course cannot be guaranteed. It is essential to note that, as with any advice, "one test result is worth one-thousand expert opinions" (Wernher von Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>).

On Mon, 1 Apr 2024 at 07:22, Varun Shah <varunshah100...@gmail.com> wrote:

> Hi Community,
>
> I am currently exploring the best use of "Scheduler Pools" for executing
> jobs in parallel, and I need clarification and suggestions on a few points.
>
> The implementation consists of executing "Structured Streaming" jobs on
> Databricks using AutoLoader. Each stream is executed with trigger =
> 'AvailableNow', ensuring that the streams do not keep running against the
> source. (We have ~4000 such streams, with no continuous stream from the
> source, hence we do not keep the streams running indefinitely with other
> triggers.)
>
> One way to achieve parallelism in the jobs is to use "MultiThreading", all
> using the same SparkContext, as quoted from the official docs: "Inside a
> given Spark application (SparkContext instance), multiple parallel jobs can
> run simultaneously if they were submitted from separate threads."
>
> There is also the "FAIR Scheduler", which, instead of the default FIFO
> scheduler, assigns executors in a round-robin fashion, ensuring that smaller
> jobs submitted later do not starve because bigger jobs submitted earlier
> consume all resources.
>
> Here are my questions:
> 1. The round-robin distribution of executors only works when there are idle
> executors (achievable by enabling dynamic allocation). If a job (part of the
> same pool) requires all executors, the second job will still have to wait.
> 2. If we create dynamic pools when submitting each stream (by setting the
> Spark property "spark.scheduler.pool" to a dynamic value, as in
> spark.sparkContext.setLocalProperty("spark.scheduler.pool", "<random unique
> string>")), how does executor allocation happen? Since all the pools are
> created dynamically, they share equal weight. Does this work the same way as
> submitting all streams to a single pool under the FAIR scheduler?
> 3. The official docs state that "inside each pool, jobs run in FIFO order."
> Is this true for the FAIR scheduler as well? By definition it does not seem
> right, but it is confusing. The docs say "by default", so does that refer to
> the FIFO scheduler only, or to the default behaviour of both scheduling
> modes?
> 4. Is there any overhead on the Spark driver when creating / using
> dynamically created scheduler pools versus pre-defined pools?
>
> Apart from these, do you have any suggestions or ways you have implemented
> auto-scaling for such loads? We are currently trying to auto-scale the
> resources based on requests, but scaling down is an issue (a known problem
> for which a SPIP is already in discussion, but it does not cater to
> submitting multiple streams in a single cluster).
>
> Thanks for reading!! Looking forward to your suggestions.
>
> Regards,
> Varun Shah
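For reference, here is a minimal sketch of the thread-plus-pool mechanism the docs describe, not a recommendation for the 4000-stream case above. It assumes FAIR mode is already enabled before the SparkContext starts (spark.scheduler.mode=FAIR, e.g. in the cluster's Spark config) and that spark is the active SparkSession, as Databricks provides; the pool names and the toy job are purely illustrative.

from concurrent.futures import ThreadPoolExecutor

def run_job(pool_name, n):
    # Local properties are per-thread, so each worker thread tags its own
    # jobs with a scheduler pool before triggering any action.
    spark.sparkContext.setLocalProperty("spark.scheduler.pool", pool_name)
    # A toy action standing in for real work; jobs submitted from separate
    # threads can run concurrently, and FAIR mode shares resources across pools.
    return spark.range(n).selectExpr("sum(id)").collect()[0][0]

with ThreadPoolExecutor(max_workers=4) as executor:
    futures = [executor.submit(run_job, f"pool_{i}", 10_000_000) for i in range(4)]
    results = [f.result() for f in futures]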
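Closer to the workload described above, a sketch with one Auto Loader stream per thread, each using trigger availableNow and its own pool. The source paths, file format, target tables and the max_workers cap are placeholders, and each stream must have its own checkpoint location.

from concurrent.futures import ThreadPoolExecutor

def run_stream(source_path, target_table, pool_name):
    # Tag every job spawned by this stream with its own FAIR pool.
    spark.sparkContext.setLocalProperty("spark.scheduler.pool", pool_name)
    df = (spark.readStream
          .format("cloudFiles")                 # Databricks Auto Loader
          .option("cloudFiles.format", "json")  # placeholder source format
          .load(source_path))
    query = (df.writeStream
             .option("checkpointLocation", f"/checkpoints/{target_table}")  # unique per stream
             .trigger(availableNow=True)        # drain the backlog, then stop
             .toTable(target_table))
    query.awaitTermination()                    # returns once the backlog is processed

# Placeholder source/target list; the executor throttles how many streams
# run at the same time on one cluster.
sources = [(f"/landing/src_{i}", f"bronze.src_{i}") for i in range(10)]
with ThreadPoolExecutor(max_workers=8) as executor:
    for f in [executor.submit(run_stream, s, t, f"pool_{t}") for s, t in sources]:
        f.result()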