Re: Best practices of maintaining a long running SparkContext

Zhong Wang Tue, 08 Mar 2016 12:42:06 -0800

+spark-users

We are using Zeppelin (http://zeppelin.incubator.apache.org) as our UI to
run Spark jobs. Zeppelin maintains a long-running SparkContext, and we have
run into several issues:
--
1. Dynamic resource allocation keeps removing and registering executors,
even though no jobs are running
2. Event logging doesn't work due to an HDFS lease issue, similar to this:
https://mail-archives.apache.org/mod_mbox/spark-user/201507.mbox/%3ccae6kwsp_c00gksmnx0obu5aouxphdjs-syqywt-jfi3psvc...@mail.gmail.com%3E
3. The Spark UI gets slower as the number of historical jobs grows
4. Cached data disappears mysteriously (shown in the Storage page, but not in
the Executors page)
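
For reference, here is a minimal sketch of the Spark properties that relate to
items 1, 3, and 4. The property names are standard Spark configuration keys,
but every value is an illustrative assumption rather than a tested
recommendation, and the helper names (LONG_RUNNING_CONF, to_submit_args,
to_defaults_file) are hypothetical. It only renders the settings as text; it
does not talk to a cluster.

```python
# Illustrative values only; tune them for your own cluster and workload.
LONG_RUNNING_CONF = {
    # Item 1: keep a floor of executors and lengthen the idle timeout so that
    # executors are not released and re-requested between notebook runs.
    "spark.dynamicAllocation.enabled": "true",
    # Dynamic allocation needs the external shuffle service in Spark 1.x.
    "spark.shuffle.service.enabled": "true",
    "spark.dynamicAllocation.minExecutors": "2",
    "spark.dynamicAllocation.executorIdleTimeout": "600s",
    # Item 4: idle executors that hold cached blocks use a separate timeout.
    "spark.dynamicAllocation.cachedExecutorIdleTimeout": "3600s",
    # Item 3: retain fewer finished jobs and stages in the UI.
    "spark.ui.retainedJobs": "200",
    "spark.ui.retainedStages": "200",
}


def to_submit_args(conf):
    """Render a property dict as a flat list of --conf arguments."""
    args = []
    for key in sorted(conf):
        args.extend(["--conf", "{}={}".format(key, conf[key])])
    return args


def to_defaults_file(conf):
    """Render a property dict as spark-defaults.conf lines."""
    return "\n".join("{} {}".format(key, conf[key]) for key in sorted(conf))


if __name__ == "__main__":
    print(to_defaults_file(LONG_RUNNING_CONF))
```

The same key/value pairs could equally be entered in the Zeppelin Spark
interpreter settings; whether they help with the issues above is exactly what
I would like to hear about.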


The aim of this thread is not to resolve specific issues (though any ideas on
the listed issues are welcome), but to hear suggestions about best practices
for maintaining a long-running SparkContext from both the Spark and Zeppelin
communities.

Thanks,
Zhong

On Tue, Mar 8, 2016 at 11:13 AM, Zhong Wang <wangzhong....@gmail.com> wrote:

> Thanks for your insights, Deenar. I think this is really helpful to users
> who want to run Zeppelin as a service.
>
> The caching issue we experienced seems to be a Spark bug, because I see
> some inconsistent states through the SparkUI, but thanks for pointing out
> the potential reasons.
>
> I am still interested in hearing from people who run Zeppelin as a service:
> have you experienced bugs or memory leaks, and how did you deal with
> them?
>
> Thanks!
>
> Zhong
>
> On Tue, Mar 8, 2016 at 8:17 AM, Deenar Toraskar <deenar.toras...@gmail.com
> > wrote:
>
>> 1) You should turn dynamic allocation on (see
>> http://spark.apache.org/docs/latest/configuration.html#dynamic-allocation)
>> to maximise utilisation of your cluster resources. This might be a reason
>> you are seeing cached data disappearing.
>> 2) If other processes cache data and the amount of data cached is larger
>> than your cluster memory, Spark will evict some cached data from memory.
>> 3) If you are using Kerberos authentication, you need a process that
>> renews tickets.
>>
>> Deenar
>>
>> On 8 March 2016 at 01:35, Zhong Wang <wangzhong....@gmail.com> wrote:
>>
>>> Hi zeppelin-users,
>>>
>>> Because Zeppelin relies on a long-running SparkContext, it is quite
>>> important to keep it stable to improve availability. In my experience, I
>>> run into several issues if I run a SparkContext for several days,
>>> including:
>>> --
>>> 1. Event logging doesn't work due to an HDFS lease issue, similar to this:
>>> https://mail-archives.apache.org/mod_mbox/spark-user/201507.mbox/%3ccae6kwsp_c00gksmnx0obu5aouxphdjs-syqywt-jfi3psvc...@mail.gmail.com%3E
>>> 2. The Spark UI gets slower as the number of historical jobs grows
>>> 3. Cached data disappears mysteriously
>>>
>>> They may not be Zeppelin issues, but I would like to hear the problems
>>> you run into, and your experience of how to deal with maintaining a long
>>> running SparkContext.
>>>
>>> I know that we can do some cleanup periodically by restarting the Spark
>>> interpreter, but I am wondering whether there are better ways.
>>>
>>> Thanks!
>>>
>>> Zhong
>>>
>>
>>
>
