[ https://issues.apache.org/jira/browse/SPARK-33133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17213629#comment-17213629 ]
Hyukjin Kwon commented on SPARK-33133:
--------------------------------------

cc [~kabhwan] FYI

History server fails when loading invalid rolling event logs
-------------------------------------------------------------

                 Key: SPARK-33133
                 URL: https://issues.apache.org/jira/browse/SPARK-33133
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 3.0.1
            Reporter: Adam Binford
            Priority: Major

We have run into an issue where our history server fails to load new applications and, when restarted, fails to load any applications at all. This happens when it encounters invalid rolling event log files, which we see with long-running streaming applications. There seem to be two issues that lead to the problem:
 * It looks like our long-running streaming application's event log directory is being cleaned up. The next time the application logs event data, it recreates the event log directory but does not recreate the "appstatus" file. I don't know the full extent of this behavior or whether something "wrong" is happening here.
 * The history server then reads this new folder and throws an exception because the "appstatus" file doesn't exist in the rolling event log folder. This exception breaks the entire listing process, so no new applications will be read, and if the history server is restarted, no applications at all will be read.

There seem to be a couple of ways to fix this, and I'm curious to hear thoughts from anyone who knows more about how the history server works, specifically with rolling event logs:
 * Don't completely fail the check for new applications when one bad rolling event log folder is encountered (a sketch of this idea follows this list). This seems like the simplest fix and makes sense to me; the listing code already checks for a few other errors and ignores them. It doesn't fix the underlying issue that leads to this happening, though.
 * Figure out why the in-progress event log folder is being deleted and make sure that doesn't happen. Maybe this is supposed to happen? Or maybe we don't want to delete the top-level folder and should only delete event log files within it? Again, I don't know the exact current behavior here.
 * When writing new event log data, make sure the folder and appstatus file exist every time, creating them again if they don't (see the sketch after the stack trace below).
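To make the first option concrete, here is a minimal, self-contained Scala sketch of "skip a malformed rolling log directory instead of aborting the scan". It is not the actual FsHistoryProvider code: LogDirReader, its fields, and the directory names are hypothetical stand-ins for Spark's RollingEventLogFilesFileReader, which throws an IllegalArgumentException from a require() when the appstatus file is missing (as in the stack trace below).

{code:scala}
import scala.util.{Failure, Success, Try}

// Hypothetical stand-in for Spark's RollingEventLogFilesFileReader: the real
// reader throws IllegalArgumentException from a require() when the
// "appstatus" marker file is missing from the rolling event log directory.
final case class LogDirReader(path: String, hasAppStatusFile: Boolean) {
  def fileSizeForLastIndex: Long = {
    require(hasAppStatusFile, "Log directory must contain an appstatus file!")
    42L // placeholder size
  }
}

object SkipInvalidLogDirs {
  def main(args: Array[String]): Unit = {
    val candidates = Seq(
      LogDirReader("eventlog_v2_app-1", hasAppStatusFile = true),
      LogDirReader("eventlog_v2_app-2", hasAppStatusFile = false), // malformed directory
      LogDirReader("eventlog_v2_app-3", hasAppStatusFile = true)
    )

    // Fix option 1: wrap the per-directory check so a single malformed rolling
    // log directory is logged and skipped instead of failing the whole scan.
    val readable = candidates.filter { reader =>
      Try(reader.fileSizeForLastIndex) match {
        case Success(_) => true
        case Failure(e: IllegalArgumentException) =>
          println(s"Skipping malformed event log directory ${reader.path}: ${e.getMessage}")
          false
        case Failure(other) => throw other
      }
    }

    println(s"Readable applications: ${readable.map(_.path).mkString(", ")}")
  }
}
{code}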
Here's the stack trace we encounter when this happens, from 3.0.1 with a couple of extra MRs backported that I hoped would fix the issue:

{{2020-10-13 12:10:31,751 ERROR history.FsHistoryProvider: Exception in checking for event log updates
java.lang.IllegalArgumentException: requirement failed: Log directory must contain an appstatus file!
	at scala.Predef$.require(Predef.scala:281)
	at org.apache.spark.deploy.history.RollingEventLogFilesFileReader.files$lzycompute(EventLogFileReaders.scala:214)
	at org.apache.spark.deploy.history.RollingEventLogFilesFileReader.files(EventLogFileReaders.scala:211)
	at org.apache.spark.deploy.history.RollingEventLogFilesFileReader.eventLogFiles$lzycompute(EventLogFileReaders.scala:221)
	at org.apache.spark.deploy.history.RollingEventLogFilesFileReader.eventLogFiles(EventLogFileReaders.scala:220)
	at org.apache.spark.deploy.history.RollingEventLogFilesFileReader.lastEventLogFile(EventLogFileReaders.scala:272)
	at org.apache.spark.deploy.history.RollingEventLogFilesFileReader.fileSizeForLastIndex(EventLogFileReaders.scala:240)
	at org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$checkForLogs$7(FsHistoryProvider.scala:524)
	at org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$checkForLogs$7$adapted(FsHistoryProvider.scala:466)
	at scala.collection.TraversableLike.$anonfun$filterImpl$1(TraversableLike.scala:256)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at scala.collection.TraversableLike.filterImpl(TraversableLike.scala:255)
	at scala.collection.TraversableLike.filterImpl$(TraversableLike.scala:249)
	at scala.collection.AbstractTraversable.filterImpl(Traversable.scala:108)
	at scala.collection.TraversableLike.filter(TraversableLike.scala:347)
	at scala.collection.TraversableLike.filter$(TraversableLike.scala:347)
	at scala.collection.AbstractTraversable.filter(Traversable.scala:108)
	at org.apache.spark.deploy.history.FsHistoryProvider.checkForLogs(FsHistoryProvider.scala:466)
	at org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$startPolling$3(FsHistoryProvider.scala:287)
	at org.apache.spark.util.Utils$.tryOrExit(Utils.scala:1302)
	at org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$getRunner$1(FsHistoryProvider.scala:210)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)}}
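For the third option above (recreating the folder and appstatus file on every write), here is a hedged local-filesystem sketch using java.nio.file rather than the Hadoop FileSystem API that Spark's event log writer actually uses. The ensureLogDirReady helper, the directory path, and the marker file name are illustrative assumptions, not Spark's writer code.

{code:scala}
import java.nio.file.{Files, Path, Paths}

object EnsureAppStatusOnWrite {
  // Hypothetical helper: before every write, make sure the rolling event log
  // directory and its "appstatus" marker exist, recreating them if a cleaner
  // removed the directory while the application was still running.
  def ensureLogDirReady(logDir: Path, appStatusFileName: String): Unit = {
    if (!Files.exists(logDir)) {
      Files.createDirectories(logDir)
    }
    val appStatus = logDir.resolve(appStatusFileName)
    if (!Files.exists(appStatus)) {
      // Empty marker file; the failing require() in the stack trace above
      // only checks that such a file is present in the directory listing.
      Files.createFile(appStatus)
    }
  }

  def main(args: Array[String]): Unit = {
    // Illustrative paths only; the real layout is managed by Spark's writer.
    val logDir = Paths.get("/tmp/spark-events/eventlog_v2_app-example")
    ensureLogDirReady(logDir, "appstatus_app-example.inprogress")
    // ... append the next rolled event log file here ...
    println(s"Ready to write events under $logDir")
  }
}
{code}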