[ https://issues.apache.org/jira/browse/SPARK-18156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen resolved SPARK-18156.
-------------------------------
    Resolution: Invalid

> CLONE - StreamExecution should discard unneeded metadata
> --------------------------------------------------------
>
>                 Key: SPARK-18156
>                 URL: https://issues.apache.org/jira/browse/SPARK-18156
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Streaming
>            Reporter: Sunil Kumar
>            Assignee: Frederick Reiss
>             Fix For: 2.1.0, 2.0.1
>
>
> The StreamExecution maintains a write-ahead log of batch metadata in order to
> allow repeating previously in-flight batches if the driver is restarted.
> StreamExecution does not garbage-collect or compact this log in any way.
> Since the log is implemented with HDFSMetadataLog, these files will consume
> memory on the HDFS NameNode. Specifically, each log file will consume about
> 300 bytes of NameNode memory (150 bytes for the inode and 150 bytes for the
> block of file contents; see
> [https://www.cloudera.com/documentation/enterprise/latest/topics/admin_nn_memory_config.html]).
> An application with a 100 msec batch interval will increase the NameNode's
> heap usage by about 250MB per day.
> There is also the matter of recovery. StreamExecution reads its entire log
> when restarting. This read operation will be very expensive if the log
> contains millions of entries spread over millions of files.
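
The "about 250MB per day" figure in the description can be checked with a quick calculation. The following is a minimal sketch in Python, not anything taken from Spark itself. It assumes one log file is written per batch (the issue does not say this explicitly) and uses the 150-byte inode and 150-byte block costs quoted in the description. The function and constant names are illustrative only.

```python
# Back-of-the-envelope estimate of NameNode heap growth caused by
# per-batch metadata log files that are never garbage-collected.

BYTES_PER_INODE = 150  # per-file inode cost quoted in the issue
BYTES_PER_BLOCK = 150  # per-file block cost quoted in the issue
MS_PER_DAY = 24 * 60 * 60 * 1000


def namenode_bytes_per_day(batch_interval_ms, files_per_batch=1):
    """Estimate the NameNode heap consumed per day, in bytes.

    files_per_batch=1 is an assumption, not something the issue states.
    """
    batches_per_day = MS_PER_DAY // batch_interval_ms
    bytes_per_file = BYTES_PER_INODE + BYTES_PER_BLOCK
    return batches_per_day * files_per_batch * bytes_per_file


if __name__ == "__main__":
    daily = namenode_bytes_per_day(100)  # 100 msec batch interval
    print(f"{daily} bytes/day, {daily / 1e6:.1f} MB, {daily / 2**20:.1f} MiB")
```

With a 100 msec interval this gives 864,000 files and 259,200,000 bytes per day, roughly 259 MB or 247 MiB, which is consistent with the estimate in the description under the stated assumption.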