Hey, do you perform stateful operations? Maybe your state is growing
indefinitely - a screenshot with the state metrics would help (you can find
them in the Spark UI -> Structured Streaming -> your query). Do you have a
driver-only cluster, or do you have workers too? What does the memory usage
profile look like on the workers?
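
If you want the numbers without a screenshot, you can also read the state
metrics from the query handle on the driver. A minimal sketch, assuming
query is the handle returned by writeStream.start():

    # Inspect the last micro-batch: per-operator state rows and memory.
    progress = query.lastProgress  # dict with the latest progress, or None
    if progress:
        for op in progress.get("stateOperators", []):
            print(op.get("operatorName"),
                  "rows:", op.get("numRowsTotal"),
                  "stateMemory:", op.get("memoryUsedBytes"))

If numRowsTotal keeps climbing batch after batch, unbounded state is the
likely culprit (e.g. an aggregation without a watermark).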

Regards,
Andrzej


Sat, 8 Jun 2024 at 10:39 Karthick Nk <kcekarth...@gmail.com> wrote:

> Hi All,
>
> I am using PySpark Structured Streaming with Azure Databricks for a data
> load process.
>
> In the pipeline I am using a job cluster and running only one pipeline,
> and I am getting an OUT OF MEMORY error after it runs for a long time.
> When I inspect the cluster metrics, I can see that memory usage keeps
> increasing over time, even though there is no huge volume of data.
>
> [images: cluster memory usage metrics]
>
> After 4 hours of running the pipeline continuously, I get an out-of-memory
> error: the used memory on the driver grows from 47 GB to 111 GB, more than
> double. I cannot understand why this much memory is occupied on the driver.
> Am I missing anything here? Could you guide me to figure out the root
> cause?
>
> Note:
> 1. I confirmed that the persist and unpersist calls I use in the code are
> handled properly for every batch execution (see the sketch below).
> 2. The data volume is not increasing over time (the stream receives almost
> the same amount of data in every batch).
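>
> For reference, a minimal sketch of the per-batch pattern (the function,
> table name, and checkpoint path here are placeholders):
>
>     # Persist the micro-batch so repeated actions reuse it, then release it.
>     def process_batch(batch_df, batch_id):
>         batch_df.persist()
>         try:
>             batch_df.write.mode("append").saveAsTable("target_table")
>         finally:
>             batch_df.unpersist()  # runs even if the write fails
>
>     (stream_df.writeStream
>         .foreachBatch(process_batch)
>         .option("checkpointLocation", "/checkpoints/pipeline")
>         .start())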
>
>
> Thanks,
>
>
>
>
