[jira] [Updated] (SPARK-30462) Structured Streaming _spark_metadata fills up Spark Driver memory when having lots of objects

2020-01-22 Thread Hyukjin Kwon (Jira)


[ https://issues.apache.org/jira/browse/SPARK-30462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-30462:
-
Priority: Major  (was: Critical)

> Structured Streaming _spark_metadata fills up Spark Driver memory when having 
> lots of objects
> -
>
> Key: SPARK-30462
> URL: https://issues.apache.org/jira/browse/SPARK-30462
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.3, 2.4.4, 3.0.0
>Reporter: Vladimir Yankov
>Priority: Major
>
> Hi,
> With the current implementation of Spark Structured Streaming, it does not 
> seem possible to run a constantly operating stream that writes millions of 
> files without growing the Spark driver's memory to dozens of GBs.
> In our scenario we use Spark Structured Streaming to consume messages from a 
> Kafka cluster, transform them, and write them as compressed Parquet files to 
> an S3 object store service.
> Every 30 seconds a new micro-batch writes hundreds of objects, which over 
> time adds up to millions of objects in S3.
> Since every written object is recorded in _spark_metadata, the compact files 
> there grow to gigabytes that eventually fill up the Spark driver's memory 
> and lead to OOM errors.
> We need a way to configure Structured Streaming to run without loading all 
> of the historically accumulated metadata into memory. Regularly resetting 
> the _spark_metadata and checkpoint folders is not an option in our use case, 
> as we rely on _spark_metadata as a register of the written objects for 
> faster querying and search.
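For illustration, here is a minimal sketch (not from the ticket) of a streaming query with the shape the reporter describes; the broker address, topic name, and S3 paths are placeholders:

    // Kafka -> transform -> compressed Parquet on S3, triggered every 30 seconds.
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.streaming.Trigger

    val spark = SparkSession.builder().appName("kafka-to-parquet").getOrCreate()

    val kafka = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092") // placeholder brokers
      .option("subscribe", "events")                     // placeholder topic
      .load()

    // Stand-in transformation: cast the binary Kafka payload to strings.
    val records = kafka.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

    // The file sink records every file it writes in <path>/_spark_metadata,
    // which is the log that grows without bound in this report.
    val query = records.writeStream
      .format("parquet")
      .option("compression", "snappy")
      .option("path", "s3a://bucket/output")             // placeholder bucket
      .option("checkpointLocation", "s3a://bucket/chk")  // placeholder path
      .trigger(Trigger.ProcessingTime("30 seconds"))
      .start()

    query.awaitTermination()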



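For context, the file sink's metadata log already exposes compaction and cleanup settings, but they only control how often sink log files are compacted and when superseded files are deleted; none of them bounds the size of the compact file itself, which is what grows here. A sketch of those settings as they exist in Spark 2.4/3.0, with their default values:

    // These prune superseded sink log files but do not cap the compact file,
    // which still accumulates one entry per file ever written and is read
    // back into driver memory.
    spark.conf.set("spark.sql.streaming.fileSink.log.compactInterval", "10") // compact every 10 batches
    spark.conf.set("spark.sql.streaming.fileSink.log.deletion", "true")      // delete expired log files
    spark.conf.set("spark.sql.streaming.fileSink.log.cleanupDelay", "10m")   // wait before deleting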
--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30462) Structured Streaming _spark_metadata fills up Spark Driver memory when having lots of objects

2020-01-08 Thread Jungtaek Lim (Jira)


[ https://issues.apache.org/jira/browse/SPARK-30462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jungtaek Lim updated SPARK-30462:
-
Affects Version/s: 3.0.0

> Structured Streaming _spark_metadata fills up Spark Driver memory when having 
> lots of objects
> -
>
> Key: SPARK-30462
> URL: https://issues.apache.org/jira/browse/SPARK-30462
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.3, 2.4.4, 3.0.0
>Reporter: Vladimir Yankov
>Priority: Critical