It looks like there may be a race condition between removing the
IN_PROGRESS suffix and building the history UI for the application.

`AppClient` sends an `UnregisterApplication(appId)` message to the `Master`
actor, which triggers a lookup of the app's event logs. If they are
suffixed with `.inprogress`, it does not build the history UI and instead
builds the error page I've been seeing.
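
As a rough sketch of the check described above (this is not the actual
Master code; `EventLogCheck`, `historyPage`, and the `Page` types are
hypothetical names used only for illustration), the decision comes down to
testing the suffix of the log name:

```scala
// Hypothetical model of the Master-side decision. These names are
// illustrative assumptions, not Spark's actual API.
object EventLogCheck {
  val InProgressSuffix = ".inprogress"

  sealed trait Page
  case object HistoryUI extends Page
  final case class NotFound(msg: String) extends Page

  // Decide which page to build from the event log name found for the app.
  def historyPage(logName: String): Page =
    if (logName.endsWith(InProgressSuffix)) NotFound("still in progress")
    else HistoryUI
}
```

Under this model, a log still named `app-20150601160846-1914.inprogress` at
the moment of the check yields the error page, even if it is renamed a
moment later.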

Tying this together, SparkContext.stop() contains the following block:

    if (_dagScheduler != null) {
      _dagScheduler.stop()
      _dagScheduler = null
    }
    if (_listenerBusStarted) {
      listenerBus.stop()
      _listenerBusStarted = false
    }
    _eventLogger.foreach(_.stop())

The DAGScheduler has a TaskScheduler, which has a
SparkDeploySchedulerBackend, which has an AppClient. The AppClient sends
itself a message to stop and, as mentioned above, then sends a message to
the Master, which tries to build the history UI.

Meanwhile, EventLoggingListener.stop() is where the `.inprogress` suffix is
removed in the file system. It seems that the delay in the Akka message
passing that triggers the Master's building of the history UI may be the
only reason the history UI ever gets set up properly in the first place: if
the calls in SparkContext.stop() were strictly ordered, you would expect
the Master to always see the event logs as still in progress.
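
To make the two orderings concrete, here is a toy model with no Spark or
Akka involved. Everything in it (`StopOrdering`, `LogDir`,
`masterSeesCompleted`, and the two scenario methods) is a hypothetical
stand-in for the behavior described above, not real Spark code:

```scala
// Toy model of the ordering problem. All names are illustrative assumptions.
object StopOrdering {
  val InProgress = ".inprogress"

  // Stand-in for the app's event log entry on the file system.
  final class LogDir(appId: String) {
    private var name: String = appId + InProgress
    def currentName: String = name
    // Models EventLoggingListener.stop() dropping the suffix.
    def finish(): Unit = { name = name.stripSuffix(InProgress) }
  }

  // Models the Master's check when it handles UnregisterApplication.
  def masterSeesCompleted(dir: LogDir): Boolean =
    !dir.currentName.endsWith(InProgress)

  // Strict ordering: the unregister is handled before the rename happens.
  def unregisterThenRename(appId: String): Boolean = {
    val dir = new LogDir(appId)
    val completed = masterSeesCompleted(dir)
    dir.finish()
    completed
  }

  // Message delivery lags: the rename lands before the Master checks.
  def renameThenUnregister(appId: String): Boolean = {
    val dir = new LogDir(appId)
    dir.finish()
    masterSeesCompleted(dir)
  }
}
```

In this model the Master only sees a completed log when the rename wins,
which is the ordering that stop() as written would not guarantee.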

Maybe I have missed something in tracing through the code? Is there a
reason that the eventLogger cannot be closed before the dagScheduler?

Thanks,
Richard

On Mon, Jun 1, 2015 at 12:23 PM, Richard Marscher <rmarsc...@localytics.com>
wrote:

> Hi,
>
> In Spark 1.3.0 I've enabled event logging to write to an existing HDFS
> folder on a Standalone cluster. This is generally working, all the logs are
> being written. However, from the Master Web UI, the vast majority of
> completed applications are labeled as not having a history:
> http://xxx.xxx.xxx.xxx:8080/history/not-found/?msg=Application+App+is+still+in+progress.&title=Application%20history%20not%20found%20(app-20150601160846-1914)
>
> The log does exist though:
>
> # hdfs dfs -ls -R /eventLogs/app-20150601160846-1914
>
> -rw-rw----   3 user group    1027848 2015-06-01 16:09
> /eventLogs/app-20150601160846-1914
>
> and `cat` of the file ends with:
>
> {"Event":"SparkListenerApplicationEnd","Timestamp":1433174955077}
>
> This seems to indicate it saw and logged the application end.
>
> Is there a known issue here or a workaround? Looking at the source code I
> might have expected these files to end in `.inprogress` given the UI error
> message, but they don't.
>
> Thanks,
> Richard
>
