GitHub user vanzin commented on the issue:

https://github.com/apache/spark/pull/16142

> I just add a new clean-up mode, but not add the cleaner itself.

But that's kinda the point. How many different ways of cleaning need to be added? Will this one be enough? Will people ask for archiving next? I'm wary of going down that path.

> I think you may not get what I mean.

I get what you mean. I just disagree with you that it's an important feature to have.

> So, I do not think get the size of each log will hurt NameNode greatly.

The current scan code does not make one request to the NameNode per log file in the directory. Your code does. That should be avoided.

> Besides, the unit test has proved that the older file will be cleaned first.

Your code doesn't do that, so if the unit test shows that, it's not by design. Your code scans the list of apps in the order they're kept in memory (descending end time). I don't remember whether in-progress apps come first or last. But if they come first, an old attempt of an in-progress app will take precedence over newer attempts of apps that have already finished. If they come last, then you're first accounting for the log sizes of apps that have already finished, and you might end up trying to delete logs from apps that are still running (!!!).

The way the current cleaner code works for time does not work if you're doing the `shouldClean` check solely based on space used. So this feature is not as trivial as your code makes it seem.
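
To make the ordering concern concrete, here is a minimal sketch of what a size-based selection would have to do. It is not the Spark history server code: `LogInfo`, `select_logs_to_delete`, and `max_bytes` are hypothetical names, and the sketch assumes that sizes and end times were already collected from a single directory listing rather than from one NameNode request per file.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class LogInfo:
    # Hypothetical record for one application attempt's event log.
    path: str
    size: int          # bytes, assumed to come from one directory listing
    end_time: int      # attempt end time in ms; larger means newer
    in_progress: bool  # True while the attempt is still running


def select_logs_to_delete(logs: List[LogInfo], max_bytes: int) -> List[LogInfo]:
    """Pick the oldest finished logs until total usage fits in max_bytes."""
    # Running logs count toward the space used but are never deleted.
    total = sum(log.size for log in logs)
    # Sort explicitly by age instead of relying on the in-memory order.
    candidates = sorted(
        (log for log in logs if not log.in_progress),
        key=lambda log: log.end_time,
    )
    to_delete = []
    for log in candidates:
        if total <= max_bytes:
            break
        to_delete.append(log)
        total -= log.size
    return to_delete


if __name__ == "__main__":
    logs = [
        LogInfo("app-3.inprogress", 50, 0, True),
        LogInfo("app-2", 40, 200, False),
        LogInfo("app-1", 30, 100, False),
    ]
    for log in select_logs_to_delete(logs, max_bytes=100):
        print(log.path)
```

Filtering out running attempts and sorting by end time are the two steps that iterating over the in-memory list does not provide on its own. Whether in-progress logs should count toward the limit is a design choice this sketch makes arbitrarily.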