[ 
https://issues.apache.org/jira/browse/SPARK-30462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim updated SPARK-30462:
---------------------------------
    Affects Version/s: 3.0.0

> Structured Streaming _spark_metadata fills up Spark Driver memory when having 
> lots of objects
> ---------------------------------------------------------------------------------------------
>
>                 Key: SPARK-30462
>                 URL: https://issues.apache.org/jira/browse/SPARK-30462
>             Project: Spark
>          Issue Type: Bug
>          Components: Structured Streaming
>    Affects Versions: 2.4.3, 2.4.4, 3.0.0
>            Reporter: Vladimir Yankov
>            Priority: Critical
>
> Hi,
> With the current implementation of Spark Structured Streaming, it does not 
> seem possible to keep a stream running continuously and writing millions of 
> files without growing the Spark driver's memory to dozens of GBs.
> In our scenario we use Structured Streaming to consume messages from a 
> Kafka cluster, transform them, and write them as compressed Parquet files 
> to an S3 object store (a minimal sketch of this pipeline follows at the end 
> of this description).
> Every 30 seconds a new micro-batch writes hundreds of objects, which over 
> time adds up to millions of objects in S3.
> Because every written object is recorded in the _spark_metadata log, the 
> compact files there grow to gigabytes, eventually filling up the Spark 
> driver's memory and leading to OOM errors (see the configuration note 
> below).
> We need a way to configure Structured Streaming to run without loading all 
> of the historically accumulated metadata into memory. Regularly resetting 
> the _spark_metadata and the checkpoint folders is not an option in our use 
> case, as we rely on the information in _spark_metadata as a register of the 
> written objects for faster querying and search.
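> For reference, here is a minimal sketch of the pipeline described above, 
> assuming the spark-sql-kafka connector is on the classpath; the broker, 
> topic, and paths are placeholders, not our actual production values:
> {code:scala}
> // Minimal sketch of the pipeline described above. Broker, topic, and
> // paths below are placeholders, not the actual production values.
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.streaming.Trigger
> 
> val spark = SparkSession.builder()
>   .appName("metadata-growth-repro")
>   .getOrCreate()
> 
> val kafkaStream = spark.readStream
>   .format("kafka")
>   .option("kafka.bootstrap.servers", "kafka:9092") // placeholder broker
>   .option("subscribe", "events")                   // placeholder topic
>   .load()
> 
> // Any transformation will do; the growth happens in the file sink.
> val transformed = kafkaStream.selectExpr("CAST(value AS STRING) AS payload")
> 
> // Every 30-second micro-batch writes hundreds of Parquet objects to S3.
> // Each written file is appended to the _spark_metadata log under the
> // output path, and the compact files there are never pruned.
> val query = transformed.writeStream
>   .format("parquet")
>   .option("compression", "snappy")
>   .option("path", "s3a://bucket/output")             // placeholder path
>   .option("checkpointLocation", "s3a://bucket/ckpt") // placeholder path
>   .trigger(Trigger.ProcessingTime("30 seconds"))
>   .start()
> 
> query.awaitTermination()
> {code}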
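> For completeness, the file sink log already exposes a few settings, but as 
> far as we can tell none of them bounds the compact files themselves; every 
> Nth compact file still carries the complete list of written objects. The 
> values shown are the defaults:
> {code:scala}
> // Compaction interval and deletion of superseded batch log files.
> // To our knowledge these do not cap compact file size: each compact
> // file keeps the full history of objects written by the sink.
> spark.conf.set("spark.sql.streaming.fileSink.log.compactInterval", "10")
> spark.conf.set("spark.sql.streaming.fileSink.log.deletion", "true")
> spark.conf.set("spark.sql.streaming.fileSink.log.cleanupDelay", "10m")
> {code}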



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
