You may need to configure your cluster to give workers more time to start up.
Additionally, given how long it can take to load the Stanford NLP models,
make sure you load them only once per worker JVM (e.g. in a static
initializer or with double-checked locking) and share them between all your
bolt instances.

supervisor.worker.start.timeout.secs 120
supervisor.worker.timeout.secs 60

I'd try tuning your worker start timeout (supervisor.worker.start.timeout.secs)
here. Try raising it to 300s and (again) ensuring your prepare method only
initializes expensive resources once, then shares them between instances in
the JVM.
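
To illustrate the "initialize once, share within the JVM" advice, here is a
minimal sketch using a static holder class. Everything in it is hypothetical
and only for illustration: ExpensiveModel stands in for the Stanford NLP
pipeline, and FakeBolt stands in for a real Storm bolt, so the snippet runs
without Storm or CoreNLP on the classpath.

```java
import java.util.concurrent.atomic.AtomicInteger;

class SharedModelDemo {

    /** Hypothetical stand-in for a slow-to-load resource such as an NLP pipeline. */
    static final class ExpensiveModel {
        // Counts how many times the "expensive" constructor actually runs.
        static final AtomicInteger LOADS = new AtomicInteger();

        ExpensiveModel() {
            LOADS.incrementAndGet();
        }

        String annotate(String text) {
            return text.toUpperCase();
        }
    }

    /** Holder idiom: the JVM initializes Lazy.MODEL once, on first access, thread-safely. */
    static final class ModelHolder {
        private ModelHolder() {}

        private static final class Lazy {
            static final ExpensiveModel MODEL = new ExpensiveModel();
        }

        static ExpensiveModel get() {
            return Lazy.MODEL;
        }
    }

    /** Hypothetical bolt-like class; a real bolt would do this inside prepare(). */
    static final class FakeBolt {
        private ExpensiveModel model;

        void prepare() {
            // Every instance receives the same shared model.
            model = ModelHolder.get();
        }

        String execute(String input) {
            return model.annotate(input);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Simulate several bolt instances being prepared concurrently in one JVM.
        Thread[] threads = new Thread[4];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(() -> new FakeBolt().prepare());
            threads[i].start();
        }
        for (Thread t : threads) {
            t.join();
        }
        System.out.println("model loaded " + ExpensiveModel.LOADS.get() + " time(s)");
    }
}
```

With this shape, only the first prepare() call in a worker pays the load cost;
later instances in the same JVM reuse the loaded model. The first call still
takes as long as the load itself, so the start timeout may still need raising.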

Michael Rose (@Xorlev <https://twitter.com/xorlev>)
Senior Platform Engineer, FullContact <http://www.fullcontact.com/>
mich...@fullcontact.com


On Tue, Feb 18, 2014 at 1:45 PM, Eddie Santos <easan...@ualberta.ca> wrote:

> Hi all,
>
> How do you get bolts that take a ludicrously long time to load (we're
> talking minutes here) to cooperate with Zookeeper?
>
> I may not be understanding my problem properly, but on my test cluster
> (**not** in local mode!) my bolt keeps getting restarted in the middle of
> its prepare() method -- which may take up to two minutes to return.
>
> The problem seems to be the "Client session timed out", but I'm not
> knowledgeable enough with Zookeeper to really know how to fix this.
>
> Here's a portion of logs from the supervisor affected. The STDIO messages
> come from a poorly-coded third party library that I have to use.
>
>     2014-01-17 23:19:28 o.a.z.ClientCnxn [INFO] Client session timed out,
> have not heard from server in 2747ms for sessionid 0x143a22eb4060078,
> closing socket connection and attempting reconnect
>     2014-01-17 23:19:28 b.s.d.worker [DEBUG] Doing heartbeat
> #backtype.storm.daemon.common.WorkerHeartbeat{:time-secs 1390000768,
> :storm-id "nlptools-test-1-1390000740", :executors #{[3 3] [6 6] [-1 -1]},
> :port 6702}
>     2014-01-17 23:19:28 b.s.d.worker [DEBUG] Doing heartbeat
> #backtype.storm.daemon.common.WorkerHeartbeat{:time-secs 1390000768,
> :storm-id "nlptools-test-1-1390000740", :executors #{[3 3] [6 6] [-1 -1]},
> :port 6702}
>     2014-01-17 23:19:28 c.n.c.f.s.ConnectionStateManager [INFO] State
> change: SUSPENDED
>     2014-01-17 23:19:28 c.n.c.f.s.ConnectionStateManager [WARN] There are
> no ConnectionStateListeners registered.
>     2014-01-17 23:19:28 b.s.cluster [WARN] Received event
> :disconnected::none: with disconnected Zookeeper.
>     2014-01-17 23:19:28 b.s.cluster [WARN] Received event
> :disconnected::none: with disconnected Zookeeper.
>     2014-01-17 23:19:28 STDIO [ERROR] done [7.2 sec].
>     2014-01-17 23:19:28 STDIO [ERROR] Adding annotator lemma
>     2014-01-17 23:19:28 STDIO [ERROR] Adding annotator ner
>     2014-01-17 23:19:28 STDIO [ERROR] Loading classifier from
> edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz
>     2014-01-17 23:19:28 STDIO [ERROR] ...
>     2014-01-17 23:19:29 b.s.d.worker [DEBUG] Doing heartbeat
> #backtype.storm.daemon.common.WorkerHeartbeat{:time-secs 1390000769,
> :storm-id "nlptools-test-1-1390000740", :executors #{[3 3] [6 6] [-1 -1]},
> :port 6702}
>     2014-01-17 23:19:29 b.s.d.worker [DEBUG] Doing heartbeat
> #backtype.storm.daemon.common.WorkerHeartbeat{:time-secs 1390000769,
> :storm-id "nlptools-test-1-1390000740", :executors #{[3 3] [6 6] [-1 -1]},
> :port 6702}
>     2014-01-17 23:19:30 o.a.z.ClientCnxn [INFO] Opening socket connection
> to server zookeeper/192.168.50.3:2181
>
>   ^-- This is where the bolt gets restarted in its initialization.
>
> Thanks,
> Eddie
>
