We ran 0.9.2 previously and didn't see the issue. We also saw the spout seem to stop and the entire topology stall for 5-10 minutes at a time.
On Sun, Sep 13, 2015 at 2:57 PM, Kashyap Mhaisekar <[email protected]> wrote:

> Thanks Steve, Enno, Martin. The only common thing between the workers was
> the GC logs that I configured. I don't find anything else. After I made the
> changes there, what I also see is that the spout stops consuming and there
> are no worker crashes either. It just stops and nothing happens.
>
> I think it has to do with the number of messages being sent into the
> system. If I keep the message level low (adjust max spout pending), then
> the topology is up for 90 mins and counting. Otherwise, the system crashed
> in 15 mins. What I was expecting was that the topology crashes and then
> restarts, but that is exactly what was not happening.
>
> I tried it in 0.10.0-beta1 too and I found the same behavior. The last
> prod version I had was 0.9.0-wip16 and there 0mq was used. I did not
> find issues there though.
>
> Thanks
> kashyap
>
> On Sep 13, 2015 15:39, "Stephen Powis" <[email protected]> wrote:
>
>> Kashyap - I see this same issue on 0.9.5
>>
>> On Sun, Sep 13, 2015 at 9:58 AM, Enno Shioji <[email protected]> wrote:
>>
>>> There was a change in that area in 0.9.6 (
>>> https://issues.apache.org/jira/browse/STORM-763), although I'm not sure
>>> if it will help your issue.
>>>
>>> On Sun, Sep 13, 2015 at 2:35 PM, Kashyap Mhaisekar <[email protected]>
>>> wrote:
>>>
>>>> Hmm. Thanks for the lead. On Storm UI, the uptime for each executor
>>>> except the spout shows pretty much consistent values. The spout has
>>>> crashed for sure, but then never comes up. Will check this again.
>>>>
>>>> But the other question is - Is the Netty reconnects issue solved in
>>>> 0.9.5? What is your Storm version?
>>>>
>>>> Thanks
>>>> Kashyap
>>>> On Sep 13, 2015 08:04, "Martin Burian" <[email protected]>
>>>> wrote:
>>>>
>>>>> They do restart after a while, yes. But if you don't see any error in
>>>>> the log, it's weird. I encountered a case of workers not starting
>>>>> because I configured the worker JVM to expose a JMX interface for
>>>>> remote monitoring on a given port. Other workers on the same machine
>>>>> however could not start as they failed to bind to the already used
>>>>> port. No error messages whatsoever.
>>>>> Might any such thing be your case?
>>>>>
>>>>> Otherwise the cause should be logged somewhere. A worker is definitely
>>>>> not running, or at least not talking to the supervisor. You could try
>>>>> using fewer workers to find out when/where the error occurs.
>>>>>
>>>>> Martin
>>>>>
>>>>> On Sun, Sep 13, 2015 at 13:43, Kashyap Mhaisekar <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> All worker logs have the same log. Workers are up. I am using only
>>>>>> one box with multiple workers to test.
>>>>>> Workers should be restarted if they fail, right? So ideally, this
>>>>>> error should be gone in a while.
>>>>>>
>>>>>> Thanks
>>>>>> Kashyap
>>>>>> On Sep 13, 2015 05:10, "Martin Burian" <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> When this appears in the worker log, it means that the worker is
>>>>>>> trying to connect to another worker, but the other is not running.
>>>>>>> What do you see in worker-6707.log? Is the other worker running?
>>>>>>> Martin
>>>>>>>
>>>>>>> On Sun, Sep 13, 2015 at 6:06, Kashyap Mhaisekar <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Also,
>>>>>>>> Is there a way to switch back to 0mq from Netty? If so, what needs
>>>>>>>> to be done?
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> kashyap
>>>>>>>>
>>>>>>>> On Sat, Sep 12, 2015 at 10:49 PM, Kashyap Mhaisekar <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> I am having Netty-related issues in my Storm cluster where the
>>>>>>>>> spout stops consuming after a while. The corresponding worker logs
>>>>>>>>> show -
>>>>>>>>> 2015-09-12T23:28:23.391-0400 b.s.m.n.Client [ERROR] connection attempt 26 to Netty-Client-trsttel2pascapp01.vm.itg.corp.us.shldcorp.com/10.2.70.18:6707 failed: java.lang.RuntimeException: Returned channel was actually not established
>>>>>>>>> 2015-09-12T23:28:23.391-0400 b.s.m.n.Client [INFO] connection attempt 27 to Netty-Client-serverstorm1.myorg.com/10.2.70.18:6707 scheduled to run in 392 ms
>>>>>>>>> 2015-09-12T23:28:23.784-0400 b.s.m.n.Client [ERROR] connection attempt 27 to Netty-Client-serverstorm1.myorg.com/10.2.70.18:6707 failed: java.lang.RuntimeException: Returned channel was actually not established
>>>>>>>>>
>>>>>>>>> The corresponding supervisor logs had
>>>>>>>>> 2015-09-12T23:28:23.018-0400 b.s.d.supervisor [INFO] 32e3f906-3869-4f0c-ac1c-4916615daf99 still hasn't started
>>>>>>>>> 2015-09-12T23:28:23.518-0400 b.s.d.supervisor [INFO] 32e3f906-3869-4f0c-ac1c-4916615daf99 still hasn't started
>>>>>>>>> 2015-09-12T23:28:24.019-0400 b.s.d.supervisor [INFO] 32e3f906-3869-4f0c-ac1c-4916615daf99 still hasn't started
>>>>>>>>>
>>>>>>>>> I had Storm version 0.9.3 when this issue occurred and upgraded
>>>>>>>>> to 0.9.4 and 0.9.5 to seek relief, but the issue still persists.
>>>>>>>>> I am not sure what else to do. I am not even sure why this issue
>>>>>>>>> occurs and what triggers it. Any help would be great and
>>>>>>>>> appreciated.
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> Kashyap
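
One quick way to rule out the case Martin describes in the thread (a worker silently failing to start because a port it needs is already taken) is to check, on the supervisor box, whether the ports in question can be bound at all. The sketch below is only an illustration and not part of Storm: `port_is_free` and `find_conflicts` are hypothetical helper names, and the port numbers in the usage note are placeholders to replace with your own worker slot and JMX ports. It is a point-in-time check, so a port that looks free can still be taken a moment later.

```python
import socket


def port_is_free(port, host="0.0.0.0"):
    """Return True if a TCP socket can bind to (host, port) right now.

    Hypothetical helper for illustration; it does not talk to Storm.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            # No SO_REUSEADDR on purpose: we want the bind to fail
            # if another process already holds the port.
            s.bind((host, port))
        except OSError:
            return False
        return True


def find_conflicts(ports, host="0.0.0.0"):
    """Return the subset of `ports` that cannot be bound on `host`."""
    return [p for p in ports if not port_is_free(p, host)]
```

For example, `find_conflicts([6707, 9010])` run while the topology is stopped would list any of those (placeholder) ports that some other process is already holding. If the same fixed port is configured for every worker JVM on one machine, only the first worker to start can bind it.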
