HeartSaVioR commented on a change in pull request #26416: [SPARK-29779][CORE] Compact old event log files and cleanup
URL: https://github.com/apache/spark/pull/26416#discussion_r358260674
 
 

 ##########
 File path: core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala
 ##########
 @@ -663,13 +670,49 @@ private[history] class FsHistoryProvider(conf: SparkConf, clock: Clock)
     }
   }
 
+  private[spark] def getOrUpdateCompactible(reader: EventLogFileReader): Option[Boolean] = {
+    try {
+      val info = listing.read(classOf[LogInfo], reader.rootPath.toString)
+      val compactible = checkEligibilityForCompaction(info, reader)
+      if (info.compactible != compactible) {
+        listing.write(info.copy(compactible = compactible))
+      }
+      compactible
+    } catch {
+      case _: NoSuchElementException => None
+    }
+  }
+
+  protected def checkEligibilityForCompaction(
+      info: LogInfo,
+      reader: EventLogFileReader): Option[Boolean] = {
+    info.compactible.orElse {
+      // This does not apply to a single event log file.
+      if (reader.lastIndex.isEmpty) {
+        Some(false)
+      } else {
+        if (reader.listEventLogFiles.length > 1) {
+          // We have at least one 'complete' file, so we can check whether the
+          // event log is eligible for further compaction.
+          val rate = eventFilterRateCalculator.calculate(
 
 Review comment:
   I'll just consolidate the filter rate calculator and the compactor into one component, so that the event filters built from the first-phase read can be applied to both scoring and compacting.
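
   To make that concrete, here is a rough sketch of the consolidated component. Every name in it (EventLogCompactor, EventFilter, CompactionResult, minScore) is a placeholder for illustration, not the final API:

       // Sketch only: build the event filters once during the first-phase read,
       // then reuse them for both scoring and compacting.
       sealed trait CompactionResult
       object CompactionResult {
         case object NotEnoughGain extends CompactionResult
         case class Compacted(score: Double) extends CompactionResult
       }

       trait EventFilter {
         def accept(eventJson: String): Boolean
       }

       class EventLogCompactor(minScore: Double) {
         def compact(eventLogPaths: Seq[String]): CompactionResult = {
           val filters = buildFilters(eventLogPaths)          // first-phase read
           val score = calculateScore(filters, eventLogPaths)
           if (score < minScore) {
             CompactionResult.NotEnoughGain                   // skip the second read
           } else {
             rewrite(eventLogPaths, filters)                  // second-phase read/write
             CompactionResult.Compacted(score)
           }
         }

         private def buildFilters(paths: Seq[String]): Seq[EventFilter] = ???
         private def calculateScore(filters: Seq[EventFilter], paths: Seq[String]): Double = ???
         private def rewrite(paths: Seq[String], filters: Seq[EventFilter]): Unit = ???
       }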
   
   I took the simplest calculation for the "score": since the calculation is heuristic anyway (the point is to avoid reading event logs again when compaction doesn't seem to help, which is what I've been struggling with), improving it would require experience with a broad range of real-world event logs, which I honestly don't have, and would take time if we require it in the scope of this PR. I expect the simple calculation to work for the streaming scenario (and probably also for long-running "shell" sessions running interactive queries), so it seems OK to start from here and exclude the improvement from the scope of this PR.
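
   For reference, the "simplest calculation" is roughly the fraction of events the filters would drop; the code and numbers below are illustrative only:

       // Sketch only: score = fraction of events the filters would drop.
       // A higher score means compaction is expected to shrink the log more.
       def calculateScore(acceptedEvents: Long, totalEvents: Long): Double = {
         if (totalEvents == 0L) 0.0
         else (totalEvents - acceptedEvents).toDouble / totalEvents
       }

       // e.g. if 10,000 of 100,000 events survive the filters, the score is 0.9
       // and compacting should help a lot; if 95,000 survive, the score is 0.05
       // and re-reading the logs just to compact is probably not worth it.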
   
   And how many files/lines/bytes should we read before deciding that the app doesn't need further analysis (so that we can skip even the first-phase read)? That seems to require another heuristic, which I'd also like to exclude from the scope of this PR. For now I've updated the logic to recalculate each time a new event log file index shows up.
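
   In other words, something along these lines (a sketch with made-up names; only reader.lastIndex comes from the actual code):

       // Sketch only: re-run the eligibility check only when the reader reports
       // a new last index, i.e. when a new event log file has been rolled.
       private var lastEvaluatedIndex: Option[Long] = None

       private def maybeRecheckEligibility(reader: EventLogFileReader): Unit = {
         if (reader.lastIndex != lastEvaluatedIndex) {
           lastEvaluatedIndex = reader.lastIndex
           // recalculate whether the event log is eligible for compaction
         }
       }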
   
   We may also be able to improve building the event filters via an incremental read per event log file, which might be another point of improvement. I'll see how to deal with it, but it would be ideal if we can exclude that from the scope of this PR as well.
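
   Roughly, the incremental idea would be to keep the accumulated filter state around and fold in only the files we haven't read yet (again just a sketch; every name here is hypothetical):

       // Sketch only: read each completed event log file at most once and merge
       // its events into the accumulated filter state.
       class IncrementalFilterBuilder(readLines: String => Iterator[String]) {
         private val processed = scala.collection.mutable.Set.empty[String]
         private var liveEntityIds = Set.empty[Long]

         def update(eventLogFiles: Seq[String]): Set[Long] = {
           eventLogFiles.filterNot(processed).foreach { path =>
             readLines(path).foreach { line =>
               liveEntityIds = merge(liveEntityIds, line)
             }
             processed += path
           }
           liveEntityIds
         }

         // Placeholder for the real per-event handling.
         private def merge(ids: Set[Long], eventJson: String): Set[Long] = ids
       }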
   
   Does this proposal work for you, or are there things you consider mandatory?
