Vladimir Yankov created SPARK-30462:
---------------------------------------

             Summary: Structured Streaming _spark_metadata fills up Spark Driver memory when having lots of objects
                 Key: SPARK-30462
                 URL: https://issues.apache.org/jira/browse/SPARK-30462
             Project: Spark
          Issue Type: Bug
          Components: Structured Streaming
    Affects Versions: 2.4.4, 2.4.3
            Reporter: Vladimir Yankov


Hi,

With the current implementation of Spark Structured Streaming, it does not seem 
possible to keep a stream running continuously and writing millions of files 
without increasing the Spark driver's memory to dozens of GBs.

In our scenario we use Spark Structured Streaming to consume messages from a 
Kafka cluster, transform them, and write them as compressed Parquet files to an 
S3 object store.
Every 30 seconds a new micro-batch writes hundreds of objects, which over time 
adds up to millions of objects in S3.
Because every written object is recorded in _spark_metadata, the compact files 
there grow to GBs that eventually fill up the Spark driver's memory and lead to 
OOM errors.
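
For context, a minimal sketch of the kind of job we run (broker address, topic, 
bucket paths and the trivial transformation are illustrative assumptions, not 
our exact code). The point is only that the Parquet file sink records every 
written file in <output path>/_spark_metadata, so a 30-second trigger writing 
hundreds of files per batch accumulates millions of log entries:

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

object KafkaToS3Parquet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("kafka-to-s3-parquet").getOrCreate()

    // Source: Kafka topic (broker and topic names are placeholders)
    val kafka = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "kafka:9092")
      .option("subscribe", "events")
      .load()

    // The real job reshapes the messages; kept trivial here
    val transformed = kafka.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

    // File sink: every micro-batch appends the files it wrote to
    // s3a://bucket/events/_spark_metadata, which is later compacted
    val query = transformed.writeStream
      .format("parquet")
      .option("compression", "snappy")
      .option("path", "s3a://bucket/events/")
      .option("checkpointLocation", "s3a://bucket/checkpoints/events/")
      .trigger(Trigger.ProcessingTime("30 seconds"))
      .start()

    query.awaitTermination()
  }
}
{code}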

We need a way to configure Spark Structured Streaming so that it can run 
without loading all of the historically accumulated metadata into memory. 
Regularly resetting the _spark_metadata and checkpoint folders is not an option 
in our use case, because we rely on _spark_metadata as a register of the 
written objects for faster querying and search.
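
As far as we can tell, the existing file-sink log settings do not address this. 
A sketch of what we have looked at (values are illustrative): they control how 
often the per-batch log files are compacted and when old ones are deleted, but 
each new .compact file still carries the full history of files ever written by 
the sink, so the driver still has to hold all of it in memory:

{code:scala}
// Illustrative values only; these reduce the number of small log files,
// but the resulting .compact file still lists every file ever written,
// so driver memory usage keeps growing with the number of objects.
spark.conf.set("spark.sql.streaming.fileSink.log.deletion", "true")
spark.conf.set("spark.sql.streaming.fileSink.log.compactInterval", "10")
spark.conf.set("spark.sql.streaming.fileSink.log.cleanupDelay", "10m")
{code}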


