Re: Netty Reconnect issues on 0.9.3, 0.9.4, 0.9.5

Stephen Powis Sun, 13 Sep 2015 18:39:07 -0700

We ran 0.9.2 previously and didn't see the issue.  We also saw that the
spout would seem to stop, and the entire topology would stall for 5-10 mins
at a time.


On Sun, Sep 13, 2015 at 2:57 PM, Kashyap Mhaisekar <[email protected]>
wrote:

> Thanks Steve, Enno, Martin. The only common thing between the workers was
> the gc logs that I configured. I don't find anything else. After I made the
> changes there, what I also see is that the spout stops consuming and there
> are no worker crashes either. It just stops and nothing happens.
>
> I think it has to do with the number of messages being sent into the
> system. If I keep the message level low (by adjusting max spout pending),
> then the topology is up for 90 mins and counting. Otherwise, the system
> crashed in 15 mins. What I was expecting was that the topology crashes and
> then restarts, but that is exactly what was not happening.
>
> I tried it in 0.10.0-beta1 too and I found the same behavior. The last
> prod version I had was 0.9.0-wip16, and 0mq was used there. I did not
> find issues there though.
>
> Thanks
> kashyap
>
> On Sep 13, 2015 15:39, "Stephen Powis" <[email protected]> wrote:
>
>> Kashyap -  I see this same issue on 0.9.5
>>
>> On Sun, Sep 13, 2015 at 9:58 AM, Enno Shioji <[email protected]> wrote:
>>
>>> There was a change in that area in 0.9.6 (
>>> https://issues.apache.org/jira/browse/STORM-763), although I'm not sure
>>> if it will help your issue.
>>>
>>>
>>> On Sun, Sep 13, 2015 at 2:35 PM, Kashyap Mhaisekar <[email protected]>
>>> wrote:
>>>
>>>> Hmm. Thanks for the lead. On the Storm UI, the uptime for each executor
>>>> except the spout shows pretty much consistent values. The spout has
>>>> crashed for sure, but then it never comes back up. I will check this again.
>>>>
>>>> But the other question is - Is the Netty reconnects issue solved in
>>>> 0.9.5? What is your storm version?
>>>>
>>>> Thanks
>>>> Kashyap
>>>> On Sep 13, 2015 08:04, "Martin Burian" <[email protected]>
>>>> wrote:
>>>>
>>>>> They do restart after a while, yes. But if you don't see any error in
>>>>> the log, it's weird. I encountered a case of workers not starting
>>>>> because I configured the worker JVM to expose a JMX interface for remote
>>>>> monitoring on a given port. The other workers on the same machine could
>>>>> not start, however, as they failed to bind to the already used port.
>>>>> There were no error messages whatsoever.
>>>>> Might any such thing be your case?
>>>>>
>>>>> Otherwise the cause should be logged somewhere. A worker is definitely
>>>>> not running, or at least not talking to the supervisor. You could try
>>>>> using fewer workers to find out when/where the error occurs.
>>>>>
>>>>> Martin
>>>>>
>>>>> On Sun, Sep 13, 2015 at 13:43, Kashyap Mhaisekar <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> All worker logs contain the same log entries. Workers are up. I am
>>>>>> using only one box with multiple workers to test.
>>>>>> Workers should be restarted if they fail, right? So ideally, this
>>>>>> error should be gone in a while.
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>>
>>>>>> Kashyap
>>>>>> On Sep 13, 2015 05:10, "Martin Burian" <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> When this appears in the worker log, it means that the worker is trying
>>>>>>> to connect to another worker, but the other one is not running. What do
>>>>>>> you see in worker-6707.log? Is the other worker running?
>>>>>>> Martin
>>>>>>>
>>>>>>> On Sun, Sep 13, 2015 at 6:06, Kashyap Mhaisekar <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> Also,
>>>>>>>> Is there a way to switch back to 0mq from Netty? If so, what needs
>>>>>>>> to be done?
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> kashyap
>>>>>>>>
>>>>>>>> On Sat, Sep 12, 2015 at 10:49 PM, Kashyap Mhaisekar <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> I am having Netty-related issues in my Storm cluster where the
>>>>>>>>> spout stops consuming after a while. The corresponding worker logs
>>>>>>>>> show:
>>>>>>>>> *2015-09-12T23:28:23.391-0400 b.s.m.n.Client [ERROR] connection attempt 26 to Netty-Client-trsttel2pascapp01.vm.itg.corp.us.shldcorp.com/10.2.70.18:6707 failed: java.lang.RuntimeException: Returned channel was actually not established*
>>>>>>>>> *2015-09-12T23:28:23.391-0400 b.s.m.n.Client [INFO] connection attempt 27 to Netty-Client-serverstorm1.myorg.com/10.2.70.18:6707 scheduled to run in 392 ms*
>>>>>>>>> *2015-09-12T23:28:23.784-0400 b.s.m.n.Client [ERROR] connection attempt 27 to Netty-Client-serverstorm1.myorg.com/10.2.70.18:6707 failed: java.lang.RuntimeException: Returned channel was actually not established*
>>>>>>>>>
>>>>>>>>> The corresponding supervisor logs had
>>>>>>>>> *2015-09-12T23:28:23.018-0400 b.s.d.supervisor [INFO] 32e3f906-3869-4f0c-ac1c-4916615daf99 still hasn't started*
>>>>>>>>> *2015-09-12T23:28:23.518-0400 b.s.d.supervisor [INFO] 32e3f906-3869-4f0c-ac1c-4916615daf99 still hasn't started*
>>>>>>>>> *2015-09-12T23:28:24.019-0400 b.s.d.supervisor [INFO] 32e3f906-3869-4f0c-ac1c-4916615daf99 still hasn't started*
>>>>>>>>>
>>>>>>>>> I was on Storm version 0.9.3 when this issue occurred and have
>>>>>>>>> since upgraded to 0.9.4 and 0.9.5 to seek relief, but the issue
>>>>>>>>> still persists. I am not sure what else to do. I am not even sure
>>>>>>>>> why this issue occurs and what triggers it. Any help would be great
>>>>>>>>> and appreciated.
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> Kashyap
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>
>>
