HeartSaVioR commented on a change in pull request #26416: [SPARK-29779][CORE] Compact old event log files and cleanup
URL: https://github.com/apache/spark/pull/26416#discussion_r358260674
##########
File path: core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala
##########
@@ -663,13 +670,49 @@ private[history] class FsHistoryProvider(conf: SparkConf, clock: Clock)
     }
   }
 
+  private[spark] def getOrUpdateCompactible(reader: EventLogFileReader): Option[Boolean] = {
+    try {
+      val info = listing.read(classOf[LogInfo], reader.rootPath.toString)
+      val compactible = checkEligibilityForCompaction(info, reader)
+      if (info.compactible != compactible) {
+        listing.write(info.copy(compactible = compactible))
+      }
+      compactible
+    } catch {
+      case _: NoSuchElementException => None
+    }
+  }
+
+  protected def checkEligibilityForCompaction(
+      info: LogInfo,
+      reader: EventLogFileReader): Option[Boolean] = {
+    info.compactible.orElse {
+      // This is not applied to single event log file.
+      if (reader.lastIndex.isEmpty) {
+        Some(false)
+      } else {
+        if (reader.listEventLogFiles.length > 1) {
+          // We have at least one 'complete' file to check whether the event log is eligible to
+          // compact further.
+          val rate = eventFilterRateCalculator.calculate(

Review comment:
   I'll consolidate the filter rate calculator and the compactor into one component, so that the event filters built from the first-phase read can be applied to both scoring and compacting.

   I deliberately took the simplest calculation for the "score": the calculation is a heuristic whose whole purpose is to avoid reading the event logs again when compaction doesn't look worthwhile (that's what I've been struggling with). Improving it would require experience with a wide variety of event logs, which I honestly don't have, and it would take time if we require it in the scope of this PR. I expect the simple calculation to work for the streaming scenario (and probably also for a long-running "shell" session running interactive queries), so it seems reasonable to start here and leave the improvement out of the scope of this PR.

   How many files/lines/bytes should we read to decide that an app doesn't need further analysis (so we can skip even the first-phase read)? That seems to require yet another heuristic, which I'd also like to keep out of the scope of this PR.

   Currently the logic recalculates whenever a new event log file index appears. We could also improve building the event filters with an incremental read per event log file, which is another possible improvement. I'll look into how to handle it, but ideally we can exclude that from the scope of this PR as well.

   Does this proposal work for you, or which parts do you consider mandatory?
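   To make the "score" idea concrete, below is a minimal, self-contained sketch of the kind of rate-based heuristic I have in mind. It is illustrative only and not the code in this PR: the `FilterStats` class, the helper names, and the 0.7 threshold are assumptions for the sake of the example.

```scala
// Hypothetical sketch of a filter-rate heuristic; not the actual Spark code.
object FilterRateSketch {

  // Stand-in for the per-file statistics an event filter could report after
  // the first-phase read of the "complete" event log files.
  case class FilterStats(totalEvents: Long, acceptedEvents: Long)

  // Fraction of events that would survive filtering. A value close to 1.0
  // means compaction would barely shrink the log; a low value means
  // compaction should pay off.
  def filterRate(statsPerFile: Seq[FilterStats]): Option[Double] = {
    val total = statsPerFile.map(_.totalEvents).sum
    if (total == 0) None
    else Some(statsPerFile.map(_.acceptedEvents).sum.toDouble / total)
  }

  // Decide eligibility: compact only when enough events would be dropped.
  // The 0.7 threshold is an arbitrary example value.
  def shouldCompact(
      statsPerFile: Seq[FilterStats],
      maxAllowedRate: Double = 0.7): Option[Boolean] =
    filterRate(statsPerFile).map(_ < maxAllowedRate)

  def main(args: Array[String]): Unit = {
    // e.g. a streaming app where most job/stage/task events belong to
    // finished queries and can be dropped: only 20% of events survive.
    val stats = Seq(FilterStats(totalEvents = 10000, acceptedEvents = 2000))
    println(shouldCompact(stats)) // Some(true) -> worth compacting
  }
}
```

   The point of keeping the heuristic this simple is that it only needs counts already available from the first-phase read, so no extra pass over the event logs is required to decide whether compaction is worthwhile.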