Event Logging to HDFS on Standalone Cluster In Progress
Hi, In Spark 1.3.0 I've enabled event logging to write to an existing HDFS folder on a Standalone cluster. This is generally working, all the logs are being written. However, from the Master Web UI, the vast majority of completed applications are labeled as not having a history: http://xxx.xxx.xxx.xxx:8080/history/not-found/?msg=Application+App+is+still+in+progress.title=Application%20history%20not%20found%20(app-20150601160846-1914) The log does exists though: # hdfs dfs -ls -R /eventLogs/app-20150601160846-1914 -rw-rw 3 user group1027848 2015-06-01 16:09 /eventLogs/app-20150601160846-1914 and `cat` the file ends with: {Event:SparkListenerApplicationEnd,Timestamp:1433174955077} This seems to indicate it saw and logged the application end. Is there a known issue here or a workaround? Looking at the source code I might have expected these files to end in `.inprogress` given the UI error message, but they don't. Thanks, Richard
Re: Event Logging to HDFS on Standalone Cluster In Progress
Ah, apologies, I found an existing issue and fix has already gone out for this in 1.3.1 and up: https://issues.apache.org/jira/browse/SPARK-6036. On Mon, Jun 1, 2015 at 3:39 PM, Richard Marscher rmarsc...@localytics.com wrote: It looks like it is possibly a race condition between removing the IN_PROGRESS and building the history UI for the application. `AppClient` sends an `UnregisterApplication(appId)` message to the `Master` actor, which triggers the process to look for the app's eventLogs. If they are suffixed with `.inprogress` then it will not build out the history UI and instead build the error page I've seen. Tying this together, calling SparkContext.stop() has the following block: if (_dagScheduler != null) { _dagScheduler.stop() _dagScheduler = null } if (_listenerBusStarted) { listenerBus.stop() _listenerBusStarted = false } _eventLogger.foreach(_.stop()) Dag Scheduler has a TaskScheduler which has a SparkDeploySchedulerBackend which has an AppClient. AppClient sends itself a message to stop itself, and like mentioned above, it then sends a message to the Master where it tries to build the history UI. Meanwhile, EventLoggingListener.stop() is where the `.inprogress` suffix is removed in the file-system. It seems like the race condition of the Akka message passing to trigger the Master's building of the history UI may be the only reason the history UI ever gets properly setup in the first place. Because if the ordering of calls were all strict in the SparkContext.stop method then you would expect the Master to always see the event logs as in in progress. Maybe I have missed something in tracing through the code? Is there a reason that the eventLogger cannot be closed before the dagScheduler? Thanks, Richard On Mon, Jun 1, 2015 at 12:23 PM, Richard Marscher rmarsc...@localytics.com wrote: Hi, In Spark 1.3.0 I've enabled event logging to write to an existing HDFS folder on a Standalone cluster. This is generally working, all the logs are being written. However, from the Master Web UI, the vast majority of completed applications are labeled as not having a history: http://xxx.xxx.xxx.xxx:8080/history/not-found/?msg=Application+App+is+still+in+progress.title=Application%20history%20not%20found%20(app-20150601160846-1914) The log does exists though: # hdfs dfs -ls -R /eventLogs/app-20150601160846-1914 -rw-rw 3 user group1027848 2015-06-01 16:09 /eventLogs/app-20150601160846-1914 and `cat` the file ends with: {Event:SparkListenerApplicationEnd,Timestamp:1433174955077} This seems to indicate it saw and logged the application end. Is there a known issue here or a workaround? Looking at the source code I might have expected these files to end in `.inprogress` given the UI error message, but they don't. Thanks, Richard
Re: Event Logging to HDFS on Standalone Cluster In Progress
It looks like it is possibly a race condition between removing the IN_PROGRESS and building the history UI for the application. `AppClient` sends an `UnregisterApplication(appId)` message to the `Master` actor, which triggers the process to look for the app's eventLogs. If they are suffixed with `.inprogress` then it will not build out the history UI and instead build the error page I've seen. Tying this together, calling SparkContext.stop() has the following block: if (_dagScheduler != null) { _dagScheduler.stop() _dagScheduler = null } if (_listenerBusStarted) { listenerBus.stop() _listenerBusStarted = false } _eventLogger.foreach(_.stop()) Dag Scheduler has a TaskScheduler which has a SparkDeploySchedulerBackend which has an AppClient. AppClient sends itself a message to stop itself, and like mentioned above, it then sends a message to the Master where it tries to build the history UI. Meanwhile, EventLoggingListener.stop() is where the `.inprogress` suffix is removed in the file-system. It seems like the race condition of the Akka message passing to trigger the Master's building of the history UI may be the only reason the history UI ever gets properly setup in the first place. Because if the ordering of calls were all strict in the SparkContext.stop method then you would expect the Master to always see the event logs as in in progress. Maybe I have missed something in tracing through the code? Is there a reason that the eventLogger cannot be closed before the dagScheduler? Thanks, Richard On Mon, Jun 1, 2015 at 12:23 PM, Richard Marscher rmarsc...@localytics.com wrote: Hi, In Spark 1.3.0 I've enabled event logging to write to an existing HDFS folder on a Standalone cluster. This is generally working, all the logs are being written. However, from the Master Web UI, the vast majority of completed applications are labeled as not having a history: http://xxx.xxx.xxx.xxx:8080/history/not-found/?msg=Application+App+is+still+in+progress.title=Application%20history%20not%20found%20(app-20150601160846-1914) The log does exists though: # hdfs dfs -ls -R /eventLogs/app-20150601160846-1914 -rw-rw 3 user group1027848 2015-06-01 16:09 /eventLogs/app-20150601160846-1914 and `cat` the file ends with: {Event:SparkListenerApplicationEnd,Timestamp:1433174955077} This seems to indicate it saw and logged the application end. Is there a known issue here or a workaround? Looking at the source code I might have expected these files to end in `.inprogress` given the UI error message, but they don't. Thanks, Richard