[ https://issues.apache.org/jira/browse/SPARK-30462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17017782#comment-17017782 ]

Sivakumar commented on SPARK-30462:
-----------------------------------

Hi All,

I have two structured streaming jobs which should write data to the same base 
directory.

The _spark_metadata directory is created by default under the base path by the 
first job, so the second job cannot use the same directory as its base path: it 
finds the _spark_metadata directory already created by the other job and throws 
an exception.

Is there any workaround for this, other than creating separate base paths for 
the two jobs?

Is it possible to create the _spark_metadata directory elsewhere, or to disable 
it without any data loss?

If I had to change the base path for both jobs, my whole framework would be 
impacted, so I don't want to do that.
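
To illustrate the conflict, here is a minimal sketch (the paths, checkpoint 
locations, and the "rate" source standing in for real input are all 
hypothetical) of two independent queries pointed at the same base path. Each 
file-sink query keeps its metadata log under <basePath>/_spark_metadata, so the 
second query runs into the log already created by the first instead of simply 
appending its own files:

    // Minimal sketch (hypothetical paths): two streaming queries writing
    // Parquet to the same base path; each file sink keeps its metadata log
    // under <basePath>/_spark_metadata.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("two-sinks-one-path").getOrCreate()

    // Two independent streams; the "rate" source just stands in for real input.
    val streamA = spark.readStream.format("rate").load()
    val streamB = spark.readStream.format("rate").load()

    val basePath = "s3a://my-bucket/events"  // hypothetical shared base path

    // The first query creates and owns <basePath>/_spark_metadata.
    val queryA = streamA.writeStream
      .format("parquet")
      .option("path", basePath)
      .option("checkpointLocation", "s3a://my-bucket/checkpoints/jobA")
      .start()

    // The second query points at the same path and trips over the existing
    // metadata log, failing as described above.
    val queryB = streamB.writeStream
      .format("parquet")
      .option("path", basePath)
      .option("checkpointLocation", "s3a://my-bucket/checkpoints/jobB")
      .start()

The sketch only reproduces the clash; it is not a workaround.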

I have created a separate ticket for this: SPARK-30542.

> Structured Streaming _spark_metadata fills up Spark Driver memory when having 
> lots of objects
> ---------------------------------------------------------------------------------------------
>
>                 Key: SPARK-30462
>                 URL: https://issues.apache.org/jira/browse/SPARK-30462
>             Project: Spark
>          Issue Type: Bug
>          Components: Structured Streaming
>    Affects Versions: 2.4.3, 2.4.4, 3.0.0
>            Reporter: Vladimir Yankov
>            Priority: Critical
>
> Hi,
> With the current implementation of Spark Structured Streaming it does not 
> seem to be possible to have a constantly running stream that writes millions 
> of files without increasing the Spark driver's memory to dozens of GBs.
> In our scenario we use Spark Structured Streaming to consume messages from a 
> Kafka cluster, transform them, and write them as compressed Parquet files to 
> an S3 object store service.
> Every 30 seconds a new micro-batch of the streaming query writes hundreds of 
> objects, which over time adds up to millions of objects in S3.
> As all written objects are recorded in _spark_metadata, the compact files 
> there grow to GBs that eventually fill up the Spark driver's memory and lead 
> to OOM errors.
> We need a way to configure Spark Structured Streaming to run without loading 
> all the historically accumulated metadata into memory. Regularly resetting 
> the _spark_metadata and checkpoint folders is not an option in our use case, 
> as we use the information in _spark_metadata as a register of the written 
> objects for faster querying and search.
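
For reference, a minimal sketch of the kind of pipeline the description above 
refers to (broker address, topic name, trigger interval, and S3 paths are 
hypothetical): a Kafka source feeding a Parquet file sink, where every 
micro-batch adds the written objects to <path>/_spark_metadata and the periodic 
compact files there keep growing:

    // Minimal sketch of the pipeline above (broker, topic, and paths are
    // hypothetical): Kafka in, compressed Parquet out to S3 every 30 seconds.
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.streaming.Trigger

    val spark = SparkSession.builder().appName("kafka-to-s3-parquet").getOrCreate()

    val messages = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "events")
      .load()
      .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value", "timestamp")

    // Each 30-second micro-batch writes a batch of objects; the file sink
    // records every written object in <path>/_spark_metadata, whose compact
    // files grow as the total number of objects climbs into the millions.
    val query = messages.writeStream
      .format("parquet")
      .option("path", "s3a://my-bucket/events")
      .option("checkpointLocation", "s3a://my-bucket/checkpoints/events")
      .option("compression", "snappy")
      .trigger(Trigger.ProcessingTime("30 seconds"))
      .start()

    query.awaitTermination()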


