[jira] [Updated] (SPARK-30462) Structured Streaming _spark_metadata fills up Spark Driver memory when having lots of objects
    [ https://issues.apache.org/jira/browse/SPARK-30462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-30462:
---------------------------------
    Priority: Major  (was: Critical)

> Structured Streaming _spark_metadata fills up Spark Driver memory when having lots of objects
> ----------------------------------------------------------------------------------------------
>
>                 Key: SPARK-30462
>                 URL: https://issues.apache.org/jira/browse/SPARK-30462
>             Project: Spark
>          Issue Type: Bug
>          Components: Structured Streaming
>    Affects Versions: 2.4.3, 2.4.4, 3.0.0
>            Reporter: Vladimir Yankov
>            Priority: Major
>
> Hi,
> With the current implementation of Spark Structured Streaming, it does not seem possible to keep a stream running continuously and writing millions of files without growing the Spark driver's memory to dozens of GBs.
> In our scenario we use Spark Structured Streaming to consume messages from a Kafka cluster, transform them, and write them as compressed Parquet files to an S3 object store.
> Every 30 seconds a new micro-batch writes hundreds of objects, which over time adds up to millions of objects in S3.
> Because every written object is recorded in _spark_metadata, the compact files there grow to GBs, eventually filling up the Spark driver's memory and causing OOM errors.
> We need a way to configure Structured Streaming to run without loading all of the historically accumulated metadata into memory.
> Regularly resetting the _spark_metadata and checkpoint folders is not an option in our use case, as we rely on _spark_metadata as a register of the written objects for faster querying and search.
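To make the reported setup concrete, here is a minimal sketch of such a job. The broker address, topic, S3 paths, and the pass-through transformation are hypothetical placeholders, not details taken from the report; the relevant behavior is that the Parquet file sink records every file it writes under <path>/_spark_metadata.

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

object KafkaToParquetOnS3 {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("kafka-to-parquet-s3")
      .getOrCreate()

    // Consume from Kafka; broker and topic names are hypothetical.
    val messages = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "kafka-1:9092")
      .option("subscribe", "events")
      .load()
      .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")

    // Every 30-second micro-batch appends hundreds of compressed Parquet
    // files, and the file sink records each of them in
    // s3a://bucket/events/_spark_metadata -- the log that grows unbounded.
    val query = messages.writeStream
      .format("parquet")
      .option("compression", "snappy")
      .option("path", "s3a://bucket/events")                    // hypothetical
      .option("checkpointLocation", "s3a://bucket/checkpoints") // hypothetical
      .trigger(Trigger.ProcessingTime("30 seconds"))
      .start()

    query.awaitTermination()
  }
}
{code}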
[jira] [Updated] (SPARK-30462) Structured Streaming _spark_metadata fills up Spark Driver memory when having lots of objects
    [ https://issues.apache.org/jira/browse/SPARK-30462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jungtaek Lim updated SPARK-30462:
---------------------------------
    Affects Version/s: 3.0.0

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
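For context, a sketch of the file sink log settings that exist in this area (assuming Spark 2.4/3.0 behavior; the values shown are the defaults, not a recommendation). They tune how often _spark_metadata is compacted and when superseded per-batch log files are deleted, but none of them bounds the compact file itself, which is the growth described in this issue.

{code:scala}
import org.apache.spark.sql.SparkSession

// Sketch assuming Spark 2.4/3.0 semantics: these settings control compaction
// cadence and cleanup of superseded batch log files. The compact file still
// accumulates an entry for every file ever written, so loading it into the
// driver keeps getting more expensive over time.
val spark = SparkSession.builder()
  .appName("file-sink-log-tuning")
  // Number of batches between compactions of _spark_metadata (default 10).
  .config("spark.sql.streaming.fileSink.log.compactInterval", "10")
  // Delete per-batch log files once they are folded into a compact file.
  .config("spark.sql.streaming.fileSink.log.deletion", "true")
  // How long superseded log files are retained before deletion.
  .config("spark.sql.streaming.fileSink.log.cleanupDelay", "10m")
  .getOrCreate()
{code}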