Event Logging to HDFS on Standalone Cluster In Progress

2015-06-01 Thread Richard Marscher
Hi,

In Spark 1.3.0 I've enabled event logging to write to an existing HDFS
folder on a Standalone cluster. This is generally working, all the logs are
being written. However, from the Master Web UI, the vast majority of
completed applications are labeled as not having a history:
http://xxx.xxx.xxx.xxx:8080/history/not-found/?msg=Application+App+is+still+in+progress.title=Application%20history%20not%20found%20(app-20150601160846-1914)

The log does exists though:

# hdfs dfs -ls -R /eventLogs/app-20150601160846-1914

-rw-rw   3 user group1027848 2015-06-01 16:09
/eventLogs/app-20150601160846-1914

and `cat` the file ends with:

{Event:SparkListenerApplicationEnd,Timestamp:1433174955077}

This seems to indicate it saw and logged the application end.

Is there a known issue here or a workaround? Looking at the source code I
might have expected these files to end in `.inprogress` given the UI error
message, but they don't.

Thanks,
Richard


Re: Event Logging to HDFS on Standalone Cluster In Progress

2015-06-01 Thread Richard Marscher
Ah, apologies, I found an existing issue and fix has already gone out for
this in 1.3.1 and up: https://issues.apache.org/jira/browse/SPARK-6036.

On Mon, Jun 1, 2015 at 3:39 PM, Richard Marscher rmarsc...@localytics.com
wrote:

 It looks like it is possibly a race condition between removing the
 IN_PROGRESS and building the history UI for the application.

 `AppClient` sends an `UnregisterApplication(appId)` message to the
 `Master` actor, which triggers the process to look for the app's eventLogs.
 If they are suffixed with `.inprogress` then it will not build out the
 history UI and instead build the error page I've seen.

 Tying this together, calling SparkContext.stop() has the following block:


 if (_dagScheduler != null) { _dagScheduler.stop() _dagScheduler = null }
 if (_listenerBusStarted) { listenerBus.stop() _listenerBusStarted = false
 } _eventLogger.foreach(_.stop())
 Dag Scheduler has a TaskScheduler which has a SparkDeploySchedulerBackend
 which has an AppClient. AppClient sends itself a message to stop itself,
 and like mentioned above, it then sends a message to the Master where it
 tries to build the history UI.

 Meanwhile, EventLoggingListener.stop() is where the `.inprogress` suffix
 is removed in the file-system. It seems like the race condition of the Akka
 message passing to trigger the Master's building of the history UI may be
 the only reason the history UI ever gets properly setup in the first place.
 Because if the ordering of calls were all strict in the SparkContext.stop
 method then you would expect the Master to always see the event logs as in
 in progress.

 Maybe I have missed something in tracing through the code? Is there a
 reason that the eventLogger cannot be closed before the dagScheduler?

 Thanks,
 Richard

 On Mon, Jun 1, 2015 at 12:23 PM, Richard Marscher 
 rmarsc...@localytics.com wrote:

 Hi,

 In Spark 1.3.0 I've enabled event logging to write to an existing HDFS
 folder on a Standalone cluster. This is generally working, all the logs are
 being written. However, from the Master Web UI, the vast majority of
 completed applications are labeled as not having a history:
 http://xxx.xxx.xxx.xxx:8080/history/not-found/?msg=Application+App+is+still+in+progress.title=Application%20history%20not%20found%20(app-20150601160846-1914)

 The log does exists though:

 # hdfs dfs -ls -R /eventLogs/app-20150601160846-1914

 -rw-rw   3 user group1027848 2015-06-01 16:09
 /eventLogs/app-20150601160846-1914

 and `cat` the file ends with:

 {Event:SparkListenerApplicationEnd,Timestamp:1433174955077}

 This seems to indicate it saw and logged the application end.

 Is there a known issue here or a workaround? Looking at the source code I
 might have expected these files to end in `.inprogress` given the UI error
 message, but they don't.

 Thanks,
 Richard





Re: Event Logging to HDFS on Standalone Cluster In Progress

2015-06-01 Thread Richard Marscher
It looks like it is possibly a race condition between removing the
IN_PROGRESS and building the history UI for the application.

`AppClient` sends an `UnregisterApplication(appId)` message to the `Master`
actor, which triggers the process to look for the app's eventLogs. If they
are suffixed with `.inprogress` then it will not build out the history UI
and instead build the error page I've seen.

Tying this together, calling SparkContext.stop() has the following block:


if (_dagScheduler != null) { _dagScheduler.stop() _dagScheduler = null } if
(_listenerBusStarted) { listenerBus.stop() _listenerBusStarted = false }
_eventLogger.foreach(_.stop())
Dag Scheduler has a TaskScheduler which has a SparkDeploySchedulerBackend
which has an AppClient. AppClient sends itself a message to stop itself,
and like mentioned above, it then sends a message to the Master where it
tries to build the history UI.

Meanwhile, EventLoggingListener.stop() is where the `.inprogress` suffix is
removed in the file-system. It seems like the race condition of the Akka
message passing to trigger the Master's building of the history UI may be
the only reason the history UI ever gets properly setup in the first place.
Because if the ordering of calls were all strict in the SparkContext.stop
method then you would expect the Master to always see the event logs as in
in progress.

Maybe I have missed something in tracing through the code? Is there a
reason that the eventLogger cannot be closed before the dagScheduler?

Thanks,
Richard

On Mon, Jun 1, 2015 at 12:23 PM, Richard Marscher rmarsc...@localytics.com
wrote:

 Hi,

 In Spark 1.3.0 I've enabled event logging to write to an existing HDFS
 folder on a Standalone cluster. This is generally working, all the logs are
 being written. However, from the Master Web UI, the vast majority of
 completed applications are labeled as not having a history:
 http://xxx.xxx.xxx.xxx:8080/history/not-found/?msg=Application+App+is+still+in+progress.title=Application%20history%20not%20found%20(app-20150601160846-1914)

 The log does exists though:

 # hdfs dfs -ls -R /eventLogs/app-20150601160846-1914

 -rw-rw   3 user group1027848 2015-06-01 16:09
 /eventLogs/app-20150601160846-1914

 and `cat` the file ends with:

 {Event:SparkListenerApplicationEnd,Timestamp:1433174955077}

 This seems to indicate it saw and logged the application end.

 Is there a known issue here or a workaround? Looking at the source code I
 might have expected these files to end in `.inprogress` given the UI error
 message, but they don't.

 Thanks,
 Richard