[ https://issues.apache.org/jira/browse/SPARK-22783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-22783:
----------------------------------
        Parent: SPARK-28594
    Issue Type: Sub-task  (was: Bug)

> event log directory(spark-history) filled by large .inprogress files for spark streaming applications
> -----------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-22783
>                 URL: https://issues.apache.org/jira/browse/SPARK-22783
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Spark Core
>    Affects Versions: 1.6.0, 2.1.0
>         Environment: Linux(Generic)
>            Reporter: omkar kankalapati
>            Priority: Major
>
> When running long-running streaming applications, HDFS storage fills up
> with large *.inprogress files in the hdfs://spark-history/ directory.
> For example:
>   hadoop fs -du -h /spark-history
>   234     /spark-history/<Application_1_ID>.inprogress
>   46.6 G  /spark-history/<Application_2_ID>.inprogress
> Instead of continuing to write to a very large (multi-GB) .inprogress file,
> Spark should rotate the current log file when it reaches a given size (for
> example, 100 MB) or after a given interval, and perhaps expose a
> configuration parameter for the size/interval.
> This is also mentioned in SPARK-12140 as a concern.
> Support for rotating the log files is important because users may have a
> limited HDFS quota, and these large files consume it.
> Also, users do not have a viable workaround:
> 1) The files cannot be moved to another location, because moving the file
> causes event logging to stop.
> 2) Copying the .inprogress file to another location and then truncating the
> .inprogress file fails, because the file is still open for writing by
> EventLoggingListener:
>   hdfs dfs -truncate -w 0 /spark-history/<application_id>.inprogress
>   truncate: Failed to TRUNCATE_FILE /spark-history/<application_id>.inprogress
>   for DFSClient_NONMAPREDUCE_<#ID> on <IP> because this file lease is currently
>   owned by DFSClient_NONMAPREDUCE_<#ID> on <IP>
> The only workaround available is to disable event logging for streaming
> applications by setting "spark.eventLog.enabled" to false.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
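
The size-based rotation requested in the description can be illustrated with a minimal Python sketch. It is not Spark's EventLoggingListener and does not use HDFS; the class RollingEventLog, its max_bytes parameter, and the <app_id>_<index> file naming are all hypothetical, chosen only to show the idea of closing the current file once it would exceed a size limit and continuing in a new one.

```python
import os
import tempfile


class RollingEventLog:
    """Hypothetical size-based rolling writer (illustration only, not Spark code)."""

    def __init__(self, directory, app_id, max_bytes):
        self.directory = directory
        self.app_id = app_id
        self.max_bytes = max_bytes  # assumed knob, like the size limit proposed above
        self.index = 0
        self.written = 0
        self._file = None
        self._open_next()

    def _path(self, index, in_progress):
        # The file being written carries the .inprogress suffix; finished files do not.
        suffix = ".inprogress" if in_progress else ""
        return os.path.join(self.directory, f"{self.app_id}_{index}{suffix}")

    def _open_next(self):
        self.index += 1
        self.written = 0
        self._file = open(self._path(self.index, True), "wb")

    def _finish_current(self):
        # Close the file and drop the suffix so the writer no longer holds it open.
        self._file.close()
        os.rename(self._path(self.index, True), self._path(self.index, False))

    def write_event(self, line):
        data = line.encode("utf-8") + b"\n"
        # Roll over before a write that would push a non-empty file past the limit.
        if self.written > 0 and self.written + len(data) > self.max_bytes:
            self._finish_current()
            self._open_next()
        self._file.write(data)
        self.written += len(data)

    def close(self):
        self._finish_current()


def demo():
    """Write ten 40-byte events with a 100-byte limit and list the resulting files."""
    with tempfile.TemporaryDirectory() as d:
        log = RollingEventLog(d, "app_1", max_bytes=100)
        for _ in range(10):
            log.write_event("x" * 39)  # 39 characters plus a newline = 40 bytes
        log.close()
        return sorted(
            (name, os.path.getsize(os.path.join(d, name))) for name in os.listdir(d)
        )


if __name__ == "__main__":
    for name, size in demo():
        print(name, size)
```

In this sketch each finished file is closed and renamed before the next one is opened, so only the newest file is held by the writer. Under that assumption, older files could be moved or deleted to stay within a quota, which the single ever-growing .inprogress file described above does not allow.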