[ 
https://issues.apache.org/jira/browse/SPARK-22783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-22783:
----------------------------------
        Parent: SPARK-28594
    Issue Type: Sub-task  (was: Bug)

> event log directory(spark-history) filled by large .inprogress files for 
> spark streaming applications
> -----------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-22783
>                 URL: https://issues.apache.org/jira/browse/SPARK-22783
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Spark Core
>    Affects Versions: 1.6.0, 2.1.0
>         Environment: Linux(Generic)
>            Reporter: omkar kankalapati
>            Priority: Major
>
> When running long-running streaming applications, HDFS storage fills up 
> with large *.inprogress files in the hdfs://spark-history/ directory.
> For example:
>  hadoop fs -du -h /spark-history
> 234     /spark-history/<Application_1_ID>.inprogress
> 46.6 G  /spark-history/<Application_2_ID>.inprogress
> Instead of continuing to write to a very large (multi-GB) .inprogress file, 
> Spark should rotate the current log file when it reaches a size threshold 
> (for example, 100 MB) or a time interval, and perhaps expose a configuration 
> parameter for the size/interval.
> This is also mentioned in SPARK-12140 as a concern.
> Support for rotating the log files is important because users may have a 
> limited HDFS quota, and these large files consume it.
> Users also have no viable workaround:
> 1) The files cannot be moved to another location, because moving the file 
> causes event logging to stop.
> 2) Copying the .inprogress file to another location and then truncating the 
> original fails, because the file is still open for writing by 
> EventLoggingListener:
> hdfs dfs -truncate -w 0 /spark-history/<application_id>.inprogress
> truncate: Failed to TRUNCATE_FILE /spark-history/<application_id>.inprogress 
> for DFSClient_NONMAPREDUCE_<#ID>on <IP> because this file lease is currently 
> owned by DFSClient_NONMAPREDUCE_<#ID> on <IP>
> The only available workaround is to disable event logging for streaming 
> applications by setting "spark.eventLog.enabled" to false.
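
The size-based rotation requested in the description can be illustrated with a minimal sketch. This is not Spark's EventLoggingListener or any Spark API: the class name RollingEventWriter, the segment naming scheme (<app_id>_<index>), and the max_bytes parameter are all hypothetical, and the sketch writes to a local directory rather than HDFS. It only shows the idea that once a segment is closed and renamed, no writer holds it open, so it could be moved or deleted without interrupting logging.

```python
import os


class RollingEventWriter:
    """Hypothetical size-based rolling log writer (illustration only).

    Events are appended to "<app_id>_<index>.inprogress". When the next
    event would push the segment past max_bytes, the segment is closed,
    renamed without the ".inprogress" suffix, and a new segment is opened.
    """

    def __init__(self, directory, app_id, max_bytes=100 * 1024 * 1024):
        self.directory = directory
        self.app_id = app_id
        self.max_bytes = max_bytes  # assumed threshold, e.g. 100 MB
        self.index = 0
        self.written = 0
        self.handle = None
        self._open_next()

    def _path(self, index, in_progress):
        suffix = ".inprogress" if in_progress else ""
        return os.path.join(self.directory, f"{self.app_id}_{index}{suffix}")

    def _open_next(self):
        # Start a fresh in-progress segment with the next index.
        self.index += 1
        self.written = 0
        self.handle = open(self._path(self.index, True), "ab")

    def _close_current(self):
        # Close the segment and drop the ".inprogress" suffix.
        self.handle.close()
        os.rename(self._path(self.index, True), self._path(self.index, False))

    def write(self, event_json):
        data = (event_json + "\n").encode("utf-8")
        # Roll over before writing if this event would exceed the limit.
        # A single event larger than max_bytes still gets its own segment.
        if self.written > 0 and self.written + len(data) > self.max_bytes:
            self._close_current()
            self._open_next()
        self.handle.write(data)
        self.written += len(data)

    def close(self):
        self._close_current()
```

With a scheme like this, only the newest segment is ever open for writing, so older segments stay bounded in size and can be cleaned up independently to stay within a quota.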



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
