Re: Weirdness running topology on multiple nodes

2014-05-16 Thread P. Taylor Goetz
Hi Justin,

Can you share your storm.yaml config file?

Do you have any firewall software running on any of the machines in your 
cluster?
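For reference, the messaging-related settings in storm.yaml are the ones most likely involved here. A config for a cluster shaped like yours might look roughly like this (host names and numeric values are illustrative placeholders, not recommendations):

```yaml
# Hypothetical storm.yaml for a 1-nimbus / 5-supervisor cluster.
# Host names and numbers below are placeholders.
storm.zookeeper.servers:
  - "nimbus-host"
  - "supervisor-1"
  - "supervisor-2"
nimbus.host: "nimbus-host"

# Worker ports the supervisors listen on; these must be reachable
# between all supervisor nodes for worker-to-worker traffic.
supervisor.slots.ports:
  - 6700
  - 6701
  - 6702
  - 6703

# Netty transport settings (Storm 0.9.x); max_retries is the value
# that was reduced from 30 to 10 in the report below.
storm.messaging.transport: "backtype.storm.messaging.netty.Context"
storm.messaging.netty.max_retries: 10
storm.messaging.netty.min_wait_ms: 100
storm.messaging.netty.max_wait_ms: 1000
```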

- Taylor

On May 7, 2014, at 11:11 AM, Justin Workman justinjwork...@gmail.com wrote:

 We have spent the better part of 2 weeks now trying to get a pretty basic 
 topology running across multiple nodes. I am sure I am missing something 
 simple, but for the life of me I cannot figure it out.
 
 Here is the situation: I have 1 nimbus server and 5 supervisor servers, with 
 Zookeeper running on the nimbus server and on 2 of the supervisor nodes. These 
 hosts are all virtual machines (4 CPUs, 8 GB RAM) running in an OpenStack 
 deployment. If all of the guests are running on the same physical hypervisor, 
 the topology starts up just fine and runs without any issues. However, if we 
 spread the guests out over multiple hypervisors (in the same OpenStack 
 cluster), the topology never completely starts up. Things start to run, and 
 some messages are pulled off the spout, but nothing ever makes it all the way 
 through the topology and nothing is ever ack'd.
 
 In the worker logs we get messages about reconnecting, and eventually a 
 "Remote host unreachable" error and "Async loop died". This used to result in 
 a NumberFormat exception; reducing the netty retries from 30 to 10 resolved 
 the NumberFormat error, and now we get the following:
 
 2014-05-07 09:00:51 b.s.m.n.Client [INFO] Reconnect ... [9]
 2014-05-07 09:00:52 b.s.m.n.Client [INFO] Reconnect ... [10]
 2014-05-07 09:00:52 b.s.m.n.Client [INFO] Reconnect ... [9]
 2014-05-07 09:00:52 b.s.m.n.Client [WARN] Remote address is not reachable. We will close this client.
 2014-05-07 09:00:52 b.s.m.n.Client [INFO] Reconnect ... [9]
 2014-05-07 09:00:52 b.s.m.n.Client [INFO] Reconnect ... [10]
 2014-05-07 09:00:52 b.s.m.n.Client [WARN] Remote address is not reachable. We will close this client.
 2014-05-07 09:00:52 b.s.m.n.Client [INFO] Reconnect ... [10]
 2014-05-07 09:00:52 b.s.m.n.Client [WARN] Remote address is not reachable. We will close this client.
 2014-05-07 09:00:52 b.s.m.n.Client [INFO] Reconnect ... [10]
 2014-05-07 09:00:52 b.s.m.n.Client [WARN] Remote address is not reachable. We will close this client.
 2014-05-07 09:00:53 b.s.util [ERROR] Async loop died!
 java.lang.RuntimeException: java.lang.RuntimeException: Client is being closed, and does not take requests any more
     at backtype.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:107) ~[storm-core-0.9.1-incubating.jar:0.9.1-incubating]
     at backtype.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:78) ~[storm-core-0.9.1-incubating.jar:0.9.1-incubating]
     at backtype.storm.disruptor$consume_batch_when_available.invoke(disruptor.clj:77) ~[storm-core-0.9.1-incubating.jar:0.9.1-incubating]
     at backtype.storm.disruptor$consume_loop_STAR_$fn__1577.invoke(disruptor.clj:89) ~[na:na]
     at backtype.storm.util$async_loop$fn__384.invoke(util.clj:433) ~[na:na]
     at clojure.lang.AFn.run(AFn.java:24) [clojure-1.4.0.jar:na]
     at java.lang.Thread.run(Thread.java:662) [na:1.6.0_26]
 Caused by: java.lang.RuntimeException: Client is being closed, and does not take requests any more
     at backtype.storm.messaging.netty.Client.send(Client.java:125) ~[storm-core-0.9.1-incubating.jar:0.9.1-incubating]
     at backtype.storm.daemon.worker$mk_transfer_tuples_handler$fn__4398$fn__4399.invoke(worker.clj:319) ~[na:na]
     at backtype.storm.daemon.worker$mk_transfer_tuples_handler$fn__4398.invoke(worker.clj:308) ~[na:na]
     at backtype.storm.disruptor$clojure_handler$reify__1560.onEvent(disruptor.clj:58) ~[na:na]
 
 And in the supervisor logs we see errors about the workers timing out and not 
 starting up all the way, and we also see executor timeouts in the nimbus logs. 
 But we do not see any errors in the Zookeeper logs, and the Zookeeper stats 
 look fine.
 
 There do not appear to be any real network issues: I can run a continuous 
 flood ping between the hosts, with varying packet sizes, with minimal latency 
 and no dropped packets. I have also attempted to add all hosts to the local 
 hosts file on each machine, without any difference.
 
 I have also played with adjusting the different heartbeat timeouts and 
 intervals without any luck, and I have also deployed this same setup to a 5 
 node cluster on physical hardware (24 cores, 64 GB RAM, and a lot of local 
 disks), and we had the same issue: the topology would start, but no data ever 
 made it through the topology.
 
 The only way I have ever been able to get the topology to work is under 
 OpenStack when all guests are on the same physical hypervisor. I think I am 
 just missing something very obvious, but I am going in circles at this point 
 and could use some additional suggestions.
 
 Thanks
 Justin




Re: Weirdness running topology on multiple nodes

2014-05-16 Thread Derek Dagit

That is odd.  I have seen things like this happen when there are DNS 
configuration issues, but you have even updated /etc/hosts.


* What does /etc/nsswitch.conf have for the hosts entry?

This is what mine has:
hosts:  files dns

I think the Java resolver code honors this setting, which causes it to look at 
/etc/hosts first for name resolution.
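A quick way to see what the JVM resolver actually returns on each node is a one-off lookup like the sketch below (nothing Storm-specific; the host name argument is whatever name your supervisors use for each other):

```java
import java.net.InetAddress;

// Prints every address the JVM resolves for a host name, using the same
// resolver path (nsswitch.conf -> /etc/hosts -> DNS) a Storm worker would.
public class ResolveCheck {
    public static void main(String[] args) throws Exception {
        String host = args.length > 0 ? args[0] : "localhost";
        for (InetAddress addr : InetAddress.getAllByName(host)) {
            System.out.println(host + " -> " + addr.getHostAddress());
        }
    }
}
```

Run it on every node for every other node's name; if any node resolves a peer to an address that is unreachable across hypervisors, that would explain the reconnect loop.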


* Firewall settings could also cause this.  (Pings would work while 
worker-worker communications might not.)
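Since ICMP can get through while TCP is blocked, it is worth attempting an actual TCP connect to the worker ports (6700 is a common first slot port; adjust to whatever supervisor.slots.ports lists in your config). A minimal sketch:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Attempts a plain TCP connect to a supervisor's worker port -- the same
// kind of connection the netty client makes. A ping can succeed while
// this fails if a firewall filters TCP.
public class PortCheck {
    public static void main(String[] args) {
        String host = args.length > 0 ? args[0] : "localhost";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 6700;
        try (Socket s = new Socket()) {
            s.connect(new InetSocketAddress(host, port), 3000); // 3 s timeout
            System.out.println("OPEN   " + host + ":" + port);
        } catch (IOException e) {
            System.out.println("CLOSED " + host + ":" + port + " (" + e.getMessage() + ")");
        }
    }
}
```

Running this from each supervisor against every other supervisor's worker ports, while a topology is deployed, should quickly show whether cross-hypervisor worker traffic is being dropped.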


* Failing that, maybe watch network packets to discover which addresses the 
workers are really trying to reach?
--
Derek

On 5/7/14, 10:11, Justin Workman wrote:


Weirdness running topology on multiple nodes

2014-05-15 Thread Justin Workman