In a nutshell, the culprit for the OOM issue in your Spark driver appears
to be a memory leak or inefficient memory usage within your application.
This could be caused by factors such as:
1. Accumulation of data or objects in driver memory over time without
proper cleanup (see the sketch below).
2. Inefficient data processing or transformations leading to excessive
memory usage.
3. Long-running tasks or stages that accumulate memory over time.
4. Suboptimal Spark configuration, such as insufficient memory allocation
for the driver or executors.
As a first step, check the Stages and Executors tabs in the Spark UI.
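To illustrate points 1 and 2, here is a minimal Scala sketch of a pattern
that commonly exhausts driver memory, plus a safer alternative; the
DataFrame name and column are hypothetical, not taken from your job:

import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder().appName("oom-illustration").getOrCreate()
import spark.implicits._

def summarise(events: DataFrame): Unit = {
  // Risky: collect() materialises every row on the driver; on a large
  // micro-batch or table this alone can OOM the driver.
  // val all = events.collect()

  // Safer: keep the heavy lifting on the executors and only bring a small,
  // bounded aggregate back to the driver.
  events.groupBy($"status").count().show(20, truncate = false)
}

For point 4, keep in mind that driver and executor memory must be set before
the driver JVM starts (for example via spark-submit --driver-memory or the
cluster's Spark configuration on Databricks), not from inside the running
application.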
HTH
Mich Talebzadeh,
Technologist | Architect | Data Engineer | Generative AI | FinCrime
PhD Imperial College London
London, United Kingdom
view my Linkedin profile
https://en.everybodywiki.com/Mich_Talebzadeh
Disclaimer: The information provided is correct to the best of my knowledge
but of course cannot be guaranteed. It is essential to note that, as with
any advice, to quote Wernher von Braun, "one test result is worth
one-thousand expert opinions".
On Tue, 11 Jun 2024 at 06:50, Lingzhe Sun wrote:
> Hi Kathick,
>
> That suggests you're not performing stateful operations, and therefore
> there are no state-related metrics. You should consider other aspects that
> may cause OOM.
> Checking the logs is always a good start, and it would help if a colleague
> of yours is familiar with the JVM and OOM-related issues.
>
> BS
> Lingzhe Sun
>
>
> *From:* Karthick Nk
> *Date:* 2024-06-11 13:28
> *To:* Lingzhe Sun
> *CC:* Andrzej Zera ; User
> *Subject:* Re: Re: OOM issue in Spark Driver
> Hi Lingzhe,
>
> I am able to get the stats below (i.e. input rate, process rate, input rows,
> etc.), but I am not able to find the exact metric that Andrzej is asking for
> (i.e. Aggregated Number Of Total State Rows). Could you guide me on how to
> get those state details under Structured Streaming?
> [image: image.png]
>
> Details:
> I am using Databricks runtime version: 13.3 LTS (includes Apache Spark
> 3.4.1, Scala 2.12)
> Driver and worker type:
> [image: image.png]
>
>
> Thanks
>
>
> On Tue, Jun 11, 2024 at 7:34 AM Lingzhe Sun
> wrote:
>
>> Hi Kathick,
>>
>> I believe what Andrzej means is that you should check the
>> Aggregated Number Of Total State Rows
>> metric, which you can find in the Structured Streaming UI tab. It indicates
>> the total number of state rows, but only if you perform stateful
>> operations. If that number increases indefinitely, you should probably
>> check your code logic.
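>> If the UI tab is hard to find, here is a minimal Scala sketch (the class
>> name and log format are illustrative, not an existing helper) that logs the
>> same metric from the query progress after every micro-batch:
>>
>> import org.apache.spark.sql.streaming.StreamingQueryListener
>> import org.apache.spark.sql.streaming.StreamingQueryListener._
>>
>> // Sums numRowsTotal across all stateful operators and prints it per batch,
>> // so unbounded state growth shows up directly in the driver log.
>> class StateSizeLogger extends StreamingQueryListener {
>>   override def onQueryStarted(event: QueryStartedEvent): Unit = ()
>>   override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
>>   override def onQueryProgress(event: QueryProgressEvent): Unit = {
>>     val totalStateRows = event.progress.stateOperators.map(_.numRowsTotal).sum
>>     println(s"batch=${event.progress.batchId} totalStateRows=$totalStateRows")
>>   }
>> }
>>
>> // Register it once on the session, e.g.:
>> // spark.streams.addListener(new StateSizeLogger())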
>>
>> BS
>> Lingzhe Sun
>>
>>
>> *From:* Karthick Nk
>> *Date:* 2024-06-09 14:45
>> *To:* Andrzej Zera
>> *CC:* user
>> *Subject:* Re: OOM issue in Spark Driver
>> Hi Andrzej,
>>
>> We are using both a driver and workers.
>> Details are as follows:
>> Driver size: 128 GB memory, 64 cores.
>> Executor size: 64 GB memory, 32 cores (executors 1 to 10, autoscaled).
>>
>> Workers memory usage:
>> One of the worker memory usage screenshot:
>> [image: image.png]
>>
>>
>> State metrics details below:
>> [image: image.png]
>> [image: image.png]
>>
>> I am not getting memory-related info from the Structured Streaming tab.
>> Could you help me here?
>>
>> Please let me know if you need more details.
>>
>> If possible, could we connect at a time that suits you and look into the
>> issue together? That would be very helpful to me.
>>
>> Thanks
>>
>> On Sat, Jun 8, 2024 at 2:41 PM Andrzej Zera
>> wrote:
>>
>>> Hey, do you perform stateful operations? Maybe your state is growing
>>> indefinitely - a screenshot with state metrics would help (you can find it
>>> in Spark UI -> Structured Streaming -> your query). Do you have a
>>> driver-only cluster or do you have workers too? What's the memory usage
>>> profile at workers?
>>>