[ https://issues.apache.org/jira/browse/SPARK-33133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17213629#comment-17213629 ]

Hyukjin Kwon commented on SPARK-33133:
--------------------------------------

cc [~kabhwan] FYI

> History server fails when loading invalid rolling event logs
> ------------------------------------------------------------
>
>                 Key: SPARK-33133
>                 URL: https://issues.apache.org/jira/browse/SPARK-33133
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.0.1
>            Reporter: Adam Binford
>            Priority: Major
>
> We have run into an issue where our history server fails to load new 
> applications and, when restarted, fails to load any applications at all. This 
> happens when it encounters invalid rolling event log files, which we see with 
> long-running streaming applications. There seem to be two issues here that 
> lead to problems:
>  * It looks like the event log directory of our long-running streaming 
> application is being cleaned up. The next time the application logs event 
> data, it recreates the event log directory but does not recreate the 
> "appstatus" file. I don't know the full extent of this behavior or whether 
> something "wrong" is happening here.
>  * The history server then reads this new folder and throws an exception 
> because the "appstatus" file doesn't exist in the rolling event log folder. 
> This exception breaks the entire listing process, so no new applications are 
> read, and after a restart no applications are read at all (a sketch of the 
> failing check is shown right after this list).
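>
> To make the failure concrete, here is a minimal sketch of the kind of check 
> that blows up when the marker file is missing. This is a paraphrase, not the 
> actual Spark source; the "appstatus_" file-name prefix and the method name 
> are assumptions based on the error message and the directory contents:
> {code:scala}
> import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}
>
> // Hypothetical, simplified version of the validation done when a rolling
> // event log directory is listed (the real code is in
> // RollingEventLogFilesFileReader.files, per the stack trace below).
> def listRollingEventLogDir(fs: FileSystem, logDir: Path): Seq[FileStatus] = {
>   val statuses = fs.listStatus(logDir).toSeq
>   // The appstatus_* marker records whether the application is in progress or
>   // completed. If the directory was recreated without it, this require throws
>   // an IllegalArgumentException that aborts the whole listing pass.
>   require(statuses.exists(_.getPath.getName.startsWith("appstatus_")),
>     "Log directory must contain an appstatus file!")
>   statuses
> }
> {code}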
> There seem to be a couple of ways to go about fixing this, and I'm curious to 
> hear thoughts from anyone who knows more about how the history server works, 
> specifically with rolling event logs:
>  * Don't fail the entire check for new applications if one bad rolling event 
> log folder is encountered. This seems like the simplest fix and makes sense 
> to me, since the scan already checks for a few other errors and ignores them 
> (see the first sketch after this list). It doesn't fix the underlying issue 
> that leads to this happening, though.
>  * Figure out why the in-progress event log folder is being deleted and make 
> sure that doesn't happen. Maybe this is supposed to happen? Or maybe we 
> should not delete the top-level folder and only delete event log files within 
> it? Again, I don't know the exact current behavior here.
>  * When writing new event log data, make sure the folder and the appstatus 
> file exist every time, recreating them if not (see the second sketch after 
> this list).
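>
> To make the first option concrete, here is a rough sketch of skipping a bad 
> directory during the scan (hypothetical helper and parameter names; not the 
> actual FsHistoryProvider code):
> {code:scala}
> import org.apache.hadoop.fs.FileStatus
> import scala.util.control.NonFatal
>
> // Hypothetical shape of the per-entry filter used while scanning the event
> // log root directory. Instead of letting one invalid rolling event log
> // directory abort the whole pass, catch the error and skip that entry.
> def shouldLoad(entry: FileStatus,
>     fileSizeForLastIndex: FileStatus => Long): Boolean = {
>   try {
>     // For rolling event logs this is where the "must contain an appstatus
>     // file" requirement is triggered.
>     fileSizeForLastIndex(entry) > 0
>   } catch {
>     case _: IllegalArgumentException =>
>       // Invalid rolling event log directory (e.g. missing appstatus); skip it.
>       false
>     case NonFatal(_) =>
>       false
>   }
> }
> {code}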
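>
> And a sketch of the third option, applied on the write path (again 
> hypothetical; the marker file name and the ".inprogress" suffix are 
> assumptions):
> {code:scala}
> import org.apache.hadoop.fs.{FileSystem, Path}
>
> // Hypothetical guard run before appending event data: if an external cleaner
> // removed the rolling log directory or its appstatus marker, recreate them so
> // the history server can still list the application.
> def ensureAppStatusFile(fs: FileSystem, logDir: Path, appId: String): Unit = {
>   if (!fs.exists(logDir)) {
>     fs.mkdirs(logDir)
>   }
>   val marker = new Path(logDir, s"appstatus_${appId}.inprogress")
>   if (!fs.exists(marker)) {
>     // A zero-length marker file is enough; create it and close the stream.
>     fs.create(marker, true).close()
>   }
> }
> {code}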
> Here's the stack trace we encounter when this happens, from 3.0.1 with a 
> couple of extra MRs backported that I had hoped would fix the issue:
> {code}
> 2020-10-13 12:10:31,751 ERROR history.FsHistoryProvider: Exception in checking for event log updates
> java.lang.IllegalArgumentException: requirement failed: Log directory must contain an appstatus file!
>   at scala.Predef$.require(Predef.scala:281)
>   at org.apache.spark.deploy.history.RollingEventLogFilesFileReader.files$lzycompute(EventLogFileReaders.scala:214)
>   at org.apache.spark.deploy.history.RollingEventLogFilesFileReader.files(EventLogFileReaders.scala:211)
>   at org.apache.spark.deploy.history.RollingEventLogFilesFileReader.eventLogFiles$lzycompute(EventLogFileReaders.scala:221)
>   at org.apache.spark.deploy.history.RollingEventLogFilesFileReader.eventLogFiles(EventLogFileReaders.scala:220)
>   at org.apache.spark.deploy.history.RollingEventLogFilesFileReader.lastEventLogFile(EventLogFileReaders.scala:272)
>   at org.apache.spark.deploy.history.RollingEventLogFilesFileReader.fileSizeForLastIndex(EventLogFileReaders.scala:240)
>   at org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$checkForLogs$7(FsHistoryProvider.scala:524)
>   at org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$checkForLogs$7$adapted(FsHistoryProvider.scala:466)
>   at scala.collection.TraversableLike.$anonfun$filterImpl$1(TraversableLike.scala:256)
>   at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>   at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>   at scala.collection.TraversableLike.filterImpl(TraversableLike.scala:255)
>   at scala.collection.TraversableLike.filterImpl$(TraversableLike.scala:249)
>   at scala.collection.AbstractTraversable.filterImpl(Traversable.scala:108)
>   at scala.collection.TraversableLike.filter(TraversableLike.scala:347)
>   at scala.collection.TraversableLike.filter$(TraversableLike.scala:347)
>   at scala.collection.AbstractTraversable.filter(Traversable.scala:108)
>   at org.apache.spark.deploy.history.FsHistoryProvider.checkForLogs(FsHistoryProvider.scala:466)
>   at org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$startPolling$3(FsHistoryProvider.scala:287)
>   at org.apache.spark.util.Utils$.tryOrExit(Utils.scala:1302)
>   at org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$getRunner$1(FsHistoryProvider.scala:210)
>   at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
>   at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
>   at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
