[jira] [Commented] (SPARK-1860) Standalone Worker cleanup should not clean up running applications
[ https://issues.apache.org/jira/browse/SPARK-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14076403#comment-14076403 ]

Aaron Davidson commented on SPARK-1860:
----------------------------------------

There's no easy way to tell whether an application is still running. However, the Worker does have state about which executors are still running. This is really what I intended originally -- we must not clean up an executor's own state from underneath it. I will change the title to reflect this intention.

> Standalone Worker cleanup should not clean up running applications
> -------------------------------------------------------------------
>
>                 Key: SPARK-1860
>                 URL: https://issues.apache.org/jira/browse/SPARK-1860
>             Project: Spark
>          Issue Type: Bug
>          Components: Deploy
>    Affects Versions: 1.0.0
>            Reporter: Aaron Davidson
>            Priority: Critical
>             Fix For: 1.1.0
>
> The default values of the standalone worker cleanup code clean up all
> application data every 7 days. This includes jars that were added to any
> applications that happen to be running for longer than 7 days, hitting
> streaming jobs especially hard.
> Applications should not be cleaned up if they're still running. Until then,
> this behavior should not be enabled by default.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
[ https://issues.apache.org/jira/browse/SPARK-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14075911#comment-14075911 ]

Mingyu Kim commented on SPARK-1860:
------------------------------------

Friendly ping, [~pwendell]. Can you let me know whether there is an easy way to tell from the worker node whether an app is active? Or should we just go with the rather fragile design I proposed right above?
[ https://issues.apache.org/jira/browse/SPARK-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14047074#comment-14047074 ]

Mingyu Kim commented on SPARK-1860:
------------------------------------

[~pwendell], would there be an easy way to tell from the worker node whether an app directory is active or not? In other words, can a worker node get the list of active application IDs from the master? I thought this was not doable, so I was just going to wipe out all app directories that haven't been used (i.e., where no jobs have run, even if the application is still alive) based on the last-modified date of the log files. What do you think?
[ https://issues.apache.org/jira/browse/SPARK-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14005596#comment-14005596 ]

Andrew Ash commented on SPARK-1860:
------------------------------------

The Spark master web UI shows the running applications, so the master at least knows what's running. Since this cleanup runs on a worker, the worker may need to be told by the master which applications are active. I don't know the internals of Spark very well, but there has to be a way to determine this.
[ https://issues.apache.org/jira/browse/SPARK-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14004747#comment-14004747 ]

Mingyu Kim commented on SPARK-1860:
------------------------------------

[~aash], is there a reliable way to check whether "folder is owned by a running application"? I thought that was not possible, so I was just going to use only the second if statement, which means that folders for running applications that simply haven't been active within the TTL will also get wiped out, assuming the executor writes something to either stdout or stderr when it runs a computation. This also means that a long-running inactive application would have to send a "heartbeat" by running a trivial computation every once in a while. Any suggestions?
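The "heartbeat" workaround described above could be sketched as a small background thread that periodically runs a trivial action to keep the executor's log files fresh. This is purely an illustration: `run_noop` stands in for whatever tiny computation the application would submit (e.g. counting a one-element RDD), and none of this is an existing Spark API.

```python
import threading


def start_heartbeat(run_noop, interval_seconds):
    """Periodically invoke run_noop so the executor's stdout/stderr
    keep a fresh mtime and the app directory is not considered stale.

    run_noop: a zero-argument callable standing in for a trivial job
    (hypothetical; an application would supply its own).
    Returns a threading.Event; call .set() on it to stop the heartbeat.
    """
    stop = threading.Event()

    def loop():
        # Event.wait returns False on timeout, True once stop is set.
        while not stop.wait(interval_seconds):
            run_noop()

    t = threading.Thread(target=loop, daemon=True)
    t.start()
    return stop
```

The daemon thread means the heartbeat dies with the application; a real implementation would also want to log or swallow exceptions from the no-op job so a transient failure doesn't kill the loop.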
[ https://issues.apache.org/jira/browse/SPARK-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14001151#comment-14001151 ]

Andrew Ash commented on SPARK-1860:
------------------------------------

[~mkim] is going to take a look at this after discussion at https://issues.apache.org/jira/browse/SPARK-1154

I think the correct fix, as Patrick outlines, would be:

{code}
// pseudocode
for folder in onDiskFolders:
    if folder is owned by a running application:
        continue
    if folder contains any folder/file (recursively) that was
       touched (mtime) more recently than the TTL cutoff:
        continue
    cleanUp(folder)
{code}

Schedule that to run periodically (with the interval configured by a setting) and this should be all fixed up. Is that right?

An alternative approach could be to have the executor clean up the application's work directory when the application terminates, but an unclean executor shutdown could still leave work directories around, so a TTL-based approach still needs to be included as well.
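A runnable version of that pseudocode might look like the Python sketch below. Everything in it is illustrative rather than Spark's actual implementation: the flat layout of per-application folders under a work directory, the `is_running` predicate (which, per the discussion above, would have to come from the Worker's own executor state or from the master), and all helper names are assumptions.

```python
import os
import shutil
import time


def newest_mtime(path):
    """Most recent modification time of path or anything under it."""
    latest = os.path.getmtime(path)
    for root, dirs, files in os.walk(path):
        for name in dirs + files:
            latest = max(latest, os.path.getmtime(os.path.join(root, name)))
    return latest


def clean_work_dirs(work_dir, ttl_seconds, is_running):
    """Delete app folders that are not running and untouched for ttl_seconds.

    is_running: predicate from folder name to bool (hypothetical; the
    Worker would supply this from its executor state). Returns the
    names of the folders removed.
    """
    cutoff = time.time() - ttl_seconds
    removed = []
    for entry in sorted(os.listdir(work_dir)):
        folder = os.path.join(work_dir, entry)
        if not os.path.isdir(folder):
            continue
        if is_running(entry):
            continue  # never touch a live application's folder
        if newest_mtime(folder) > cutoff:
            continue  # something inside was touched recently; keep it
        shutil.rmtree(folder)
        removed.append(entry)
    return removed
```

Scheduling this on a timer (with the interval and TTL taken from configuration) matches the proposal above; the key property is that the `is_running` check runs before any mtime comparison, so live applications are skipped unconditionally.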
[ https://issues.apache.org/jira/browse/SPARK-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13999602#comment-13999602 ]

Patrick Wendell commented on SPARK-1860:
-----------------------------------------

I think it would be better to start the TTL only once an executor has finished, and to delete only the specific folder used by that executor.