We are having the same problem. We're running Spark 0.9.1 in standalone
mode, and on some heavy jobs the workers become unresponsive and get marked
as dead by the master, even though the worker processes are still running.
They then never rejoin the cluster, and the cluster becomes essentially
unusable until we restart every worker.

We'd like to know:
1. Why can a worker become unresponsive? Are there any well-known config or
usage pitfalls that we could have fallen into? We're still investigating
the issue, but maybe there are some hints?
2. Is there an option to auto-recover a worker, e.g. automatically start a
new one if the old one failed, or at least some hooks to implement
functionality like that? (A rough sketch of what we have in mind follows
below.)
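
On (1), one thing we're checking ourselves is whether long GC pauses on the
workers can exceed spark.worker.timeout (60 seconds by default), since the
master marks a worker dead once it misses heartbeats for that long.

On (2), the best we've come up with so far is an external watchdog along the
lines of the sketch below. It's untested and makes several assumptions: that
the standalone master web UI exposes its state as JSON at /json on port 8080,
that the "workers"/"host"/"state" field names match what your master actually
returns, and that the host names, SPARK_HOME and master URL placeholders are
replaced with real values. Not claiming this is the right way -- it's only to
illustrate the kind of hook we're after:

#!/usr/bin/env python
# Rough watchdog sketch (untested): poll the standalone master's /json
# endpoint and restart the local worker whenever the master no longer
# lists this host as an ALIVE worker. The URLs, paths and JSON field
# names below are assumptions -- check them against your own setup.
import json
import socket
import subprocess
import time
import urllib2

MASTER_UI  = "http://spark-master:8080/json"  # placeholder master web UI
MASTER_URL = "spark://spark-master:7077"      # placeholder master URL
SPARK_HOME = "/opt/spark"                     # placeholder install path
POLL_SECS  = 60

def worker_is_alive(hostname):
    # True if the master currently reports an ALIVE worker on this host.
    # Note: the registered "host" may be an IP or FQDN, so the match
    # below may need adjusting for your environment.
    state = json.load(urllib2.urlopen(MASTER_UI, timeout=10))
    return any(w.get("host") == hostname and w.get("state") == "ALIVE"
               for w in state.get("workers", []))

def restart_worker():
    # Kill any stale Worker JVM on this box and start a fresh one.
    subprocess.call(["pkill", "-f", "org.apache.spark.deploy.worker.Worker"])
    time.sleep(5)
    subprocess.Popen([SPARK_HOME + "/bin/spark-class",
                      "org.apache.spark.deploy.worker.Worker", MASTER_URL])

if __name__ == "__main__":
    hostname = socket.gethostname()
    while True:
        try:
            if not worker_is_alive(hostname):
                restart_worker()
        except Exception as e:
            print "watchdog check failed: %s" % e
        time.sleep(POLL_SECS)

We'd run one copy of that per worker host (under cron or a process
supervisor). But if Spark already has a built-in option or hook for this,
we'd much rather use that -- hence the question.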

Thanks,
Piotr


2014-06-13 22:58 GMT+02:00 Gino Bustelo <g...@bustelos.com>:

> I get the same problem, but I'm running in a dev environment based on
> Docker scripts. The additional issue is that the worker processes do not
> die, so the Docker containers do not exit. I end up with worker containers
> that are no longer participating in the cluster.
>
>
> On Fri, Jun 13, 2014 at 9:44 AM, Mayur Rustagi <mayur.rust...@gmail.com>
> wrote:
>
>> I have also had trouble with workers rejoining the working set. I have
>> typically moved to a Mesos-based setup. Frankly, for high availability you
>> are better off using a cluster manager.
>>
>> Mayur Rustagi
>> Ph: +1 (760) 203 3257
>> http://www.sigmoidanalytics.com
>> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>>
>>
>>
>> On Fri, Jun 13, 2014 at 8:57 AM, Yana Kadiyska <yana.kadiy...@gmail.com>
>> wrote:
>>
>>> Hi, I see this has been asked before but has not gotten any satisfactory
>>> answer, so I'll try again:
>>>
>>> (here is the original thread I found:
>>> http://mail-archives.apache.org/mod_mbox/spark-user/201403.mbox/%3c1394044078706-2312.p...@n3.nabble.com%3E
>>> )
>>>
>>> I have a set of workers dying and coming back again. The master prints
>>> the following warning:
>>>
>>> "Got heartbeat from unregistered worker ...."
>>>
>>> What is the solution to this? Restarting the master is very undesirable
>>> for me, as I have a Shark context sitting on top of it (it's meant to be
>>> highly available).
>>>
>>> Insights appreciated -- an executor going down isn't very unexpected, but
>>> it does seem odd that it can't rejoin the working set.
>>>
>>> I'm running Spark 0.9.1 on CDH
>>>
>>>
>>>
>>
>


-- 
Piotr Kolaczkowski, Lead Software Engineer
pkola...@datastax.com

http://www.datastax.com/
777 Mariners Island Blvd., Suite 510
San Mateo, CA 94404
