Hi Mich,

I did not post in the Databricks community, as most of the questions are
related to Spark itself.

But let me also post the question on the Databricks community.

Thanks,
Varun Shah

On Mon, Apr 1, 2024, 16:28 Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> Hi,
>
> Have you put this question to the Databricks forum?
>
> Data Engineering - Databricks
> <https://community.databricks.com/t5/data-engineering/bd-p/data-engineering>
>
>
> Mich Talebzadeh,
> Technologist | Solutions Architect | Data Engineer  | Generative AI
> London
> United Kingdom
>
>
>    View my LinkedIn profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed. It is essential to note
> that, as with any advice, "one test result is worth one thousand expert
> opinions" (Wernher von Braun
> <https://en.wikipedia.org/wiki/Wernher_von_Braun>).
>
>
> On Mon, 1 Apr 2024 at 07:22, Varun Shah <varunshah100...@gmail.com> wrote:
>
>> Hi Community,
>>
>> I am currently exploring the best use of "Scheduler Pools" for executing
>> jobs in parallel, and I would like clarification and suggestions on a few
>> points.
>>
>> The implementation consists of executing "Structured Streaming" jobs on
>> Databricks using Auto Loader. Each stream runs with trigger =
>> 'AvailableNow', so the streams do not keep running against the source.
>> (We have ~4000 such streams and no continuous feed from the sources,
>> hence we avoid keeping the streams running indefinitely with other
>> triggers.)
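>>
>> For illustration, each stream roughly follows the sketch below (the
>> format, paths and table names are placeholders):
>>
>>     # One Auto Loader stream per source; AvailableNow processes the
>>     # current backlog and then stops instead of running continuously.
>>     def run_stream(source_path, target_table, checkpoint):
>>         (spark.readStream
>>               .format("cloudFiles")
>>               .option("cloudFiles.format", "json")
>>               .option("cloudFiles.schemaLocation", checkpoint)
>>               .load(source_path)
>>               .writeStream
>>               .option("checkpointLocation", checkpoint)
>>               .trigger(availableNow=True)
>>               .toTable(target_table)
>>               .awaitTermination())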
>>
>> One way to achieve parallelism across the jobs is to use multithreading,
>> with all threads sharing the same SparkContext, as the official docs
>> state: "Inside a given Spark application (SparkContext instance), multiple
>> parallel jobs can run simultaneously if they were submitted from separate
>> threads."
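>>
>> As a rough sketch (assuming the run_stream helper above and a
>> stream_configs list of (source, table, checkpoint) tuples), the streams
>> are submitted from separate threads like this:
>>
>>     from concurrent.futures import ThreadPoolExecutor
>>
>>     # All threads share the single SparkContext of the session.
>>     with ThreadPoolExecutor(max_workers=32) as executor:
>>         futures = [executor.submit(run_stream, src, tbl, ckpt)
>>                    for src, tbl, ckpt in stream_configs]
>>         for f in futures:
>>             f.result()  # re-raises any exception from the stream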
>>
>> There is also the FAIR scheduler, which, instead of the default FIFO
>> scheduler, assigns tasks across jobs in a round-robin fashion, ensuring
>> that smaller jobs submitted later do not starve because bigger jobs
>> submitted earlier are consuming all the resources.
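>>
>> For reference, this is roughly how we understand the FAIR scheduler is
>> enabled (the pool name, weight and minShare below are only examples):
>>
>>     Cluster Spark config (set before the SparkContext starts):
>>         spark.scheduler.mode FAIR
>>         spark.scheduler.allocation.file /path/to/fairscheduler.xml
>>
>>     Example fairscheduler.xml with one pre-defined pool:
>>         <allocations>
>>           <pool name="streams">
>>             <schedulingMode>FAIR</schedulingMode>
>>             <weight>1</weight>
>>             <minShare>0</minShare>
>>           </pool>
>>         </allocations>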
>>
>> Here are my questions:
>> 1. The round-robin distribution of tasks only seems to work when there
>> are free executors (achievable by enabling dynamic allocation). If the
>> jobs in the same pool require all the executors, will a second job still
>> have to wait?
>> 2. If we create a dynamic pool per stream (by setting the Spark property
>> "spark.scheduler.pool" to a unique value via
>> spark.sparkContext.setLocalProperty("spark.scheduler.pool", "<random unique
>> string>")), how does executor allocation happen? Since all the pools are
>> created dynamically, they share equal weight. Does this behave the same
>> as submitting all streams to a single pool under the FAIR scheduler?
>> (A sketch of this pattern follows the list of questions.)
>> 3. The official docs state that "inside each pool, jobs run in FIFO
>> order". Is this true for the FAIR scheduler as well? By definition it
>> does not seem right, but it is confusing: the docs say "by default", so
>> does that apply only to the FIFO scheduler, or by default to both
>> scheduling modes?
>> 4. Is there any overhead on the Spark driver when creating and using
>> dynamically created scheduler pools vs pre-defined pools?
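>>
>> For question 2, the pattern we have in mind is roughly the following
>> (the pool naming is a placeholder; run_stream is the earlier sketch):
>>
>>     import uuid
>>
>>     def run_stream_in_own_pool(source_path, target_table, checkpoint):
>>         # setLocalProperty is thread-local, so each submitting thread
>>         # lands in its own dynamically created pool (default weight 1).
>>         pool_name = f"pool-{uuid.uuid4()}"
>>         spark.sparkContext.setLocalProperty("spark.scheduler.pool", pool_name)
>>         run_stream(source_path, target_table, checkpoint)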
>>
>> Apart from these, do you have any suggestions, or ways in which you have
>> implemented auto-scaling for such loads? We are currently trying to
>> auto-scale the resources based on demand, but scaling down is an issue
>> (a known limitation for which a SPIP is already under discussion, but it
>> does not cater to submitting multiple streams in a single cluster).
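>>
>> For context, the kind of Spark-side settings involved are roughly the
>> following (values are placeholders; Databricks cluster autoscaling adds
>> its own layer on top):
>>
>>     spark.dynamicAllocation.enabled true
>>     spark.dynamicAllocation.shuffleTracking.enabled true
>>     spark.dynamicAllocation.minExecutors 2
>>     spark.dynamicAllocation.maxExecutors 40
>>     spark.dynamicAllocation.executorIdleTimeout 120s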
>>
>> Thanks for reading! Looking forward to your suggestions.
>>
>> Regards,
>> Varun Shah
>>
>>
>>
>>
>>
