Github user jianjianjiao commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22444#discussion_r218292773

    --- Diff: core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala ---
    @@ -465,20 +475,31 @@ private[history] class FsHistoryProvider(conf: SparkConf, clock: Clock)
               }
             } catch {
               case _: NoSuchElementException =>
    -            // If the file is currently not being tracked by the SHS, add an entry for it and try
    -            // to parse it. This will allow the cleaner code to detect the file as stale later on
    -            // if it was not possible to parse it.
    -            listing.write(LogInfo(entry.getPath().toString(), newLastScanTime, None, None,
    -              entry.getLen()))
    --- End diff --

    Hi @squito, thanks for looking into this PR.

    When the Spark history server starts, it scans the event-log folder and processes the logs with multiple threads, and it will not begin the next scan before the first one finishes. That is the problem: in our cluster there are about 20K event-log files (often bigger than 1G), including about 1K .inprogress files, so the first scan takes about two and a half hours. During those 2.5 hours, if a user submits a Spark application and it finishes, the user cannot find it in the Spark history UI and has to wait for the next scan.

    That is why I added a limit on how many files to scan each time, e.g. 3K. No matter how many log files are in the event-log folder, the first scan handles only the first 3K of them before the second scan starts. Suppose that during the first scan 5 new applications appear and another 10 applications are updated; the second scan then handles those 15 applications plus another 2885 files (from 3001 to 5885) in the event folder. checkForLogs scans the event-log folder and only handles files that were updated or not yet handled.
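    The per-pass cap described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the actual FsHistoryProvider code: the `maxFilesToScan` parameter, the `LogFile` case class, and `selectBatch` are all hypothetical names, standing in for the PR's limit on how many new or updated files one invocation of checkForLogs processes.

    ```scala
    // Sketch: cap how many event-log files a single scan pass handles, so the
    // first pass over a huge folder does not block new applications for hours.
    // `maxFilesToScan`, `LogFile`, and `selectBatch` are hypothetical names,
    // not the real FsHistoryProvider implementation.
    object ScanBatchSketch {
      case class LogFile(path: String, mtime: Long, alreadyHandled: Boolean)

      /** Pick at most `maxFilesToScan` files that are new or updated since the last pass. */
      def selectBatch(files: Seq[LogFile], maxFilesToScan: Int): Seq[LogFile] =
        files.filterNot(_.alreadyHandled) // checkForLogs skips files it already processed
          .sortBy(_.mtime)                // oldest first, so no file starves forever
          .take(maxFilesToScan)           // cap the work done in this one pass

      def main(args: Array[String]): Unit = {
        // 10 files; the first 2 were handled in a previous pass.
        val files = (1 to 10).map(i => LogFile(s"app-$i", i.toLong, alreadyHandled = i <= 2))
        val batch = selectBatch(files, maxFilesToScan = 3)
        println(batch.map(_.path).mkString(",")) // prints "app-3,app-4,app-5"
      }
    }
    ```

    Files left out of one pass are simply picked up by the next, because each pass re-selects from whatever is still unhandled or updated.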