Re: Weirdness running topology on multiple nodes
Hi Justin,

Can you share your storm.yaml config file? Do you have any firewall software running on any of the machines in your cluster?

- Taylor

On May 7, 2014, at 11:11 AM, Justin Workman justinjwork...@gmail.com wrote:

[...]
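For reference, a multi-node storm.yaml for 0.9.1-incubating with the Netty transport typically looks something like the sketch below. The hostnames, paths, and values here are placeholders for illustration, not Justin's actual configuration.

# ZooKeeper ensemble (placeholder hostnames)
storm.zookeeper.servers:
  - "zk-host-1"
  - "zk-host-2"
  - "zk-host-3"
storm.zookeeper.port: 2181

# Nimbus and local state
nimbus.host: "nimbus-host"
storm.local.dir: "/var/storm"

# Worker slots on each supervisor; these ports must be reachable
# from every other node in the cluster
supervisor.slots.ports:
  - 6700
  - 6701
  - 6702
  - 6703

# Netty transport (the b.s.m.n.Client log lines above indicate Netty is in use)
storm.messaging.transport: "backtype.storm.messaging.netty.Context"
storm.messaging.netty.buffer_size: 5242880

# If nodes resolve each other's names inconsistently, pinning the
# advertised name per host can help (value differs on each machine)
# storm.local.hostname: "worker-host-1"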
Re: Weirdness running topology on multiple nodes
That is odd. I have seen things like this happen when there are DNS configuration issues, but you have even updated /etc/hosts.

* What does /etc/nsswitch.conf have for the hosts entry? This is what mine has:

    hosts: files dns

  I think that the Java resolver code honors this setting, and this will cause it to look at /etc/hosts first for resolution.

* Firewall settings could also cause this. (Pings would work while worker-to-worker communications might not.)

* Failing that, maybe watch network packets to discover what the workers are really trying to communicate with?

-- Derek

On 5/7/14, 10:11, Justin Workman wrote:

[...]
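A rough, self-contained sketch of the resolver check Derek describes: run something like this on every node (the hostnames below are placeholders) and compare the output. Each machine should resolve its own name and every other node's name to the same, reachable addresses; if /etc/hosts and DNS disagree, the discrepancy shows up here.

import java.net.InetAddress;
import java.net.UnknownHostException;

public class ResolveCheck {
    public static void main(String[] args) throws UnknownHostException {
        // Placeholder hostnames -- substitute the actual nimbus/supervisor names
        String[] hosts = args.length > 0 ? args
                : new String[] {"nimbus-host", "supervisor-1", "supervisor-2"};

        // What this machine thinks its own name and address are
        InetAddress self = InetAddress.getLocalHost();
        System.out.println("local host: " + self.getHostName() + " -> " + self.getHostAddress());

        // What this machine resolves each cluster host to; every node should agree
        for (String h : hosts) {
            try {
                InetAddress addr = InetAddress.getByName(h);
                System.out.println(h + " -> " + addr.getHostAddress()
                        + " (canonical: " + addr.getCanonicalHostName() + ")");
            } catch (UnknownHostException e) {
                System.out.println(h + " -> UNRESOLVED: " + e.getMessage());
            }
        }
    }
}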
Weirdness running topology on multiple nodes
We have spent the better part of two weeks now trying to get a pretty basic topology running across multiple nodes. I am sure I am missing something simple, but for the life of me I cannot figure it out.

Here is the situation: I have 1 nimbus server and 5 supervisor servers, with Zookeeper running on the nimbus server and 2 of the supervisor nodes. These hosts are all virtual machines (4 CPUs, 8GB RAM) running in an OpenStack deployment. If all of the guests are running on the same physical hypervisor, then the topology starts up just fine and runs without any issues. However, if we spread the guests out over multiple hypervisors (in the same OpenStack cluster), the topology never completely starts up. Things start to run, some messages are pulled off the spout, but nothing ever makes it all the way through the topology and nothing is ever ack'd.

In the worker logs we get messages about reconnecting, and eventually a "Remote host unreachable" error and "Async loop died!". This used to result in a NumberFormat exception; reducing the netty retries from 30 to 10 resolved the NumberFormat error, and now we get the following:

2014-05-07 09:00:51 b.s.m.n.Client [INFO] Reconnect ... [9]
2014-05-07 09:00:52 b.s.m.n.Client [INFO] Reconnect ... [10]
2014-05-07 09:00:52 b.s.m.n.Client [INFO] Reconnect ... [9]
2014-05-07 09:00:52 b.s.m.n.Client [WARN] Remote address is not reachable. We will close this client.
2014-05-07 09:00:52 b.s.m.n.Client [INFO] Reconnect ... [9]
2014-05-07 09:00:52 b.s.m.n.Client [INFO] Reconnect ... [10]
2014-05-07 09:00:52 b.s.m.n.Client [WARN] Remote address is not reachable. We will close this client.
2014-05-07 09:00:52 b.s.m.n.Client [INFO] Reconnect ... [10]
2014-05-07 09:00:52 b.s.m.n.Client [WARN] Remote address is not reachable. We will close this client.
2014-05-07 09:00:52 b.s.m.n.Client [INFO] Reconnect ... [10]
2014-05-07 09:00:52 b.s.m.n.Client [WARN] Remote address is not reachable. We will close this client.
2014-05-07 09:00:53 b.s.util [ERROR] Async loop died!
java.lang.RuntimeException: java.lang.RuntimeException: Client is being closed, and does not take requests any more
        at backtype.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:107) ~[storm-core-0.9.1-incubating.jar:0.9.1-incubating]
        at backtype.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:78) ~[storm-core-0.9.1-incubating.jar:0.9.1-incubating]
        at backtype.storm.disruptor$consume_batch_when_available.invoke(disruptor.clj:77) ~[storm-core-0.9.1-incubating.jar:0.9.1-incubating]
        at backtype.storm.disruptor$consume_loop_STAR_$fn__1577.invoke(disruptor.clj:89) ~[na:na]
        at backtype.storm.util$async_loop$fn__384.invoke(util.clj:433) ~[na:na]
        at clojure.lang.AFn.run(AFn.java:24) [clojure-1.4.0.jar:na]
        at java.lang.Thread.run(Thread.java:662) [na:1.6.0_26]
Caused by: java.lang.RuntimeException: Client is being closed, and does not take requests any more
        at backtype.storm.messaging.netty.Client.send(Client.java:125) ~[storm-core-0.9.1-incubating.jar:0.9.1-incubating]
        at backtype.storm.daemon.worker$mk_transfer_tuples_handler$fn__4398$fn__4399.invoke(worker.clj:319) ~[na:na]
        at backtype.storm.daemon.worker$mk_transfer_tuples_handler$fn__4398.invoke(worker.clj:308) ~[na:na]
        at backtype.storm.disruptor$clojure_handler$reify__1560.onEvent(disruptor.clj:58) ~[na:na]

In the supervisor logs we see errors about the workers timing out and not starting up all the way, and we also see executor timeouts in the nimbus logs.
But we do not see any errors in the Zookeeper logs, and the Zookeeper stats look fine. There do not appear to be any real network issues: I can run a continuous flood ping between the hosts, with varying packet sizes, with minimal latency and no dropped packets. I have also attempted to add all hosts to the local hosts file on each machine, without any difference. I have also played with adjusting the different heartbeat timeouts and intervals without any luck, and I have also deployed this same setup to a 5-node cluster on physical hardware (24 cores, 64GB RAM, and a lot of local disks), and we had the same issue. The topology would start, but no data ever made it through the topology. The only way I have ever been able to get the topology to work is under OpenStack when all guests are on the same physical hypervisor.

I think I am just missing something very obvious, but I am going in circles at this point and could use some additional suggestions.

Thanks
Justin
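For context, the knobs mentioned above (the Netty retry count that was reduced from 30 to 10, and the heartbeat/worker timeouts) are set in storm.yaml. A sketch with illustrative values only, not a recommendation:

# Netty reconnect behaviour (the retry count referred to above)
storm.messaging.netty.max_retries: 10
storm.messaging.netty.min_wait_ms: 100
storm.messaging.netty.max_wait_ms: 1000

# Worker/executor liveness timeouts seen in the supervisor and nimbus logs
supervisor.worker.start.timeout.secs: 120
supervisor.worker.timeout.secs: 30
nimbus.task.launch.secs: 120
nimbus.task.timeout.secs: 30
nimbus.supervisor.timeout.secs: 60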