Hi,

Have you put this question to the Databricks forum?

Data Engineering - Databricks
<https://community.databricks.com/t5/data-engineering/bd-p/data-engineering>


Mich Talebzadeh,
Technologist | Solutions Architect | Data Engineer  | Generative AI
London
United Kingdom


   view my LinkedIn profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed. It is essential to note
that, as with any advice: "one test result is worth one-thousand
expert opinions" (Wernher von Braun
<https://en.wikipedia.org/wiki/Wernher_von_Braun>).


On Mon, 1 Apr 2024 at 07:22, Varun Shah <varunshah100...@gmail.com> wrote:

> Hi Community,
>
> I am currently exploring the best use of "Scheduler Pools" for executing
> jobs in parallel and would appreciate clarification and suggestions on a
> few points.
>
> The implementation consists of executing "Structured Streaming" jobs on
> Databricks using AutoLoader. Each stream is executed with trigger =
> 'AvailableNow', so that the stream processes whatever data is available
> and then stops instead of running continuously (we have ~4000 such
> streams with no continuous source data, hence we do not keep the streams
> running indefinitely with other triggers).
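>
> A minimal sketch of one such stream (the source path, checkpoint path,
> target table name and file format below are hypothetical; "cloudFiles" is
> the Databricks AutoLoader source and availableNow is the standard
> Structured Streaming trigger):
>
>     from pyspark.sql import SparkSession
>
>     spark = SparkSession.builder.getOrCreate()
>
>     def run_stream(source_path, checkpoint_path, target_table):
>         # Read new files incrementally with AutoLoader (Databricks only).
>         df = (spark.readStream
>               .format("cloudFiles")
>               .option("cloudFiles.format", "json")   # assumed file format
>               .option("cloudFiles.schemaLocation", checkpoint_path)
>               .load(source_path))
>         # AvailableNow: process the current backlog, then stop the query.
>         query = (df.writeStream
>                  .option("checkpointLocation", checkpoint_path)
>                  .trigger(availableNow=True)
>                  .toTable(target_table))
>         query.awaitTermination()   # returns once the backlog is drained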
>
> One way to achieve parallelism across the jobs is multithreading, with
> all threads sharing the same SparkContext, as quoted from the official
> docs: "Inside a given Spark application (SparkContext instance), multiple
> parallel jobs can run simultaneously if they were submitted from separate
> threads."
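>
> As a sketch of that pattern (the stream_configs list is hypothetical and
> run_stream is the helper from the sketch above), each stream is submitted
> from its own Python thread on the shared SparkContext:
>
>     from concurrent.futures import ThreadPoolExecutor
>
>     # One entry per stream; all values are placeholders.
>     stream_configs = [
>         ("s3://bucket/src1", "s3://bucket/chk1", "tgt.table1"),
>         ("s3://bucket/src2", "s3://bucket/chk2", "tgt.table2"),
>     ]
>
>     # Threads share the single SparkContext; each call submits its own
>     # Spark jobs, so they can be scheduled concurrently.
>     with ThreadPoolExecutor(max_workers=8) as executor:
>         futures = [executor.submit(run_stream, src, chk, tbl)
>                    for src, chk, tbl in stream_configs]
>         for f in futures:
>             f.result()   # re-raise any exception from inside a stream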
>
> There is also the FAIR scheduler which, instead of the default FIFO
> scheduler, allocates resources across jobs in a round-robin fashion, so
> that smaller jobs submitted later do not starve because bigger jobs
> submitted earlier are consuming all the resources.
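>
> Enabling that looks roughly like this (a sketch; the pool name
> "streaming_pool" and the fairscheduler.xml path are assumptions, and the
> scheduler mode has to be set before the SparkContext is created):
>
>     from pyspark.sql import SparkSession
>
>     spark = (SparkSession.builder
>              .config("spark.scheduler.mode", "FAIR")   # default is FIFO
>              # optional: pool weights/minShare defined in an allocation file
>              .config("spark.scheduler.allocation.file",
>                      "/path/to/fairscheduler.xml")
>              .getOrCreate())
>
>     # Inside the thread running a given stream, choose the pool its jobs
>     # will be submitted to (setLocalProperty is thread-local).
>     spark.sparkContext.setLocalProperty("spark.scheduler.pool",
>                                         "streaming_pool")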
>
> Here are my questions:
> 1. The round-robin distribution of executors only works when there are
> free executors (achievable by enabling dynamic allocation). If the jobs
> in the same pool require all executors, later jobs will still need to
> wait.
> 2. If we create dynamic pools when submitting each stream (by setting the
> Spark property "spark.scheduler.pool" to a dynamic value, e.g.
> spark.sparkContext.setLocalProperty("spark.scheduler.pool", "<random unique
> string>")), how does executor allocation happen? Since all pools are
> created dynamically, they share equal weight. Does this work the same way
> as submitting all streams to a single pool under the FAIR scheduler? (See
> the sketch after this list.)
> 3. The official docs state that "inside each pool, jobs run in FIFO
> order". Is this true for the FAIR scheduler as well? By definition it
> does not seem right, but it is confusing: the docs say "by default", so
> does that apply only to the FIFO scheduler, or by default to both
> scheduling modes?
> 4. Is there any overhead on the Spark driver when creating/using
> dynamically created scheduler pools vs pre-defined pools?
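>
> As a sketch of the dynamic-pool variant from question 2 (the pool name is
> just the hypothetical stream name; pools not declared in fairscheduler.xml
> get the default settings, i.e. weight 1, minShare 0, FIFO within the pool):
>
>     def run_stream_in_pool(stream_name, source_path, checkpoint_path,
>                            target_table):
>         # setLocalProperty is per-thread, so call it inside the worker
>         # thread before submitting that stream's jobs.
>         spark.sparkContext.setLocalProperty("spark.scheduler.pool",
>                                             stream_name)
>         run_stream(source_path, checkpoint_path, target_table)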
>
> Apart from these, do you have any suggestions, or ways you have
> implemented auto-scaling for such loads? We are currently trying to
> auto-scale the resources based on demand, but scaling down is an issue (a
> known limitation for which a SPIP is already under discussion, but it
> does not cater to submitting multiple streams in a single cluster).
>
> Thanks for reading! Looking forward to your suggestions.
>
> Regards,
> Varun Shah
>
