Hey, do you perform any stateful operations? Your state may be growing indefinitely - a screenshot of the state metrics would help (you can find them in the Spark UI -> Structured Streaming -> your query). Is it a driver-only cluster, or do you have workers too? What does the memory usage profile on the workers look like?
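
If the query is stateful (a streaming aggregation, dropDuplicates, or a stream-stream join), a watermark is what lets Spark evict old state; without one, state can grow without bound. A minimal sketch of what I mean - the source, paths, and column names below are placeholders, not your pipeline:

    from pyspark.sql import functions as F

    # `spark` is the active SparkSession (predefined on Databricks).
    # Placeholder streaming source; substitute your own format/path.
    events = spark.readStream.format("delta").load("/tmp/events")

    # Without withWatermark, a streaming aggregation keeps state for
    # every window it has ever seen; with it, state older than the
    # threshold is dropped, so memory stays bounded.
    counts = (
        events
        .withWatermark("event_time", "30 minutes")
        .groupBy(F.window("event_time", "5 minutes"))
        .count()
    )

    query = (
        counts.writeStream
        .outputMode("append")
        .option("checkpointLocation", "/tmp/checkpoint")
        .format("delta")
        .start("/tmp/output")
    )

    # Between batches you can also watch the state size directly:
    # query.lastProgress["stateOperators"] exposes numRowsTotal and
    # memoryUsedBytes per stateful operator.
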
Regards,
Andrzej

Sat, 8 Jun 2024 at 10:39, Karthick Nk <[email protected]> wrote:

> Hi All,
>
> I am using PySpark Structured Streaming with Azure Databricks for a
> data load process.
>
> In the pipeline I am using a job cluster and running only one
> pipeline, and I am hitting an OUT OF MEMORY issue after running for a
> long time. When I inspect the cluster metrics, I see that memory usage
> keeps increasing over time, even though there is no huge volume of
> data.
>
> [image: image.png]
>
> [image: image.png]
>
> After 4 hours of running the pipeline continuously, I get an out of
> memory error: used memory on the driver grows from 47 GB to 111 GB,
> which is more than double. I am unable to understand why this much
> memory is occupied on the driver. Am I missing anything here? Could
> you guide me to figure out the root cause?
>
> Note:
> 1. I confirmed that the persist and unpersist calls I use in the code
> are handled properly for every batch execution.
> 2. The data volume is not increasing as time passes (the stream
> receives almost the same amount of data in every batch).
>
> Thanks,
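
For comparison, the persist/unpersist-per-batch pattern you describe usually looks like the sketch below (the stream, sink paths, and function name here are made up, not taken from your code). If yours already matches this and driver memory still grows, the cached batches are probably not the leak; I would look at state size, or at anything collected back to the driver, instead:

    # Placeholder source, as above.
    events = spark.readStream.format("delta").load("/tmp/events")

    def process_batch(batch_df, batch_id):
        # Cache once, reuse for several writes, then release the blocks
        # before the next micro-batch starts.
        batch_df.persist()
        try:
            batch_df.write.format("delta").mode("append").save("/tmp/sink_a")
            batch_df.write.format("delta").mode("append").save("/tmp/sink_b")
        finally:
            # blocking=True waits until the memory is actually freed.
            batch_df.unpersist(blocking=True)

    (events.writeStream
        .foreachBatch(process_batch)
        .option("checkpointLocation", "/tmp/checkpoint_fb")
        .start())
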
