Your description (runs for 15 minutes and then stalls, runs fine when throttled)
sounds a lot like the messages are filling up your memory because the rest of the
topology cannot keep up with them, and then GC eats up all your performance. Do
the GC logs show any excessive GC activity? The default worker heap size is
768M; you might need to raise it if GC turns out to be the problem.
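
If GC does turn out to be the problem, here is a minimal sketch of raising the
heap and enabling GC logging for the workers of one topology (the size and the
flags are only examples, tune them to your setup):

    import backtype.storm.Config;

    Config conf = new Config();
    // Larger heap plus GC logging flags for this topology's worker JVMs
    conf.put(Config.TOPOLOGY_WORKER_CHILDOPTS,
             "-Xmx2048m -XX:+PrintGCDetails -XX:+PrintGCTimeStamps");
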
Martin

On Mon, Sep 14, 2015 at 3:38 AM, Stephen Powis <[email protected]>
wrote:

> We ran 0.9.2 previously and didn't see the issue.  We also saw that the
> spout would seem to stop, and the entire topology stall out for 5-10 mins
> at a time.
>
> On Sun, Sep 13, 2015 at 2:57 PM, Kashyap Mhaisekar <[email protected]>
> wrote:
>
>> Thanks Steve, Enno, Martin. The only thing the workers have in common is the
>> GC logging that I configured; I don't find anything else. After I made the
>> changes there, what I also see is that the spout stops consuming and there
>> are no worker crashes either. It just stops and nothing happens.
>>
>> I think it has to do with the number of messages being sent into the
>> system. If I keep the message volume low (by adjusting max spout pending),
>> the topology stays up for 90 mins and counting; otherwise the system
>> crashes within 15 mins. What I was expecting was that the topology would
>> crash and then restart, but that is exactly what was not happening.
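>>
>> For reference, the knob I am adjusting, as a quick sketch (the number is
>> only what I am experimenting with; it only takes effect when acking is on):
>>
>>     import backtype.storm.Config;
>>
>>     Config conf = new Config();
>>     // Limit the number of un-acked tuples in flight per spout task
>>     conf.setMaxSpoutPending(500);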
>>
>> I tried it on 0.10.0-beta1 too and found the same behavior. The last prod
>> version I ran was 0.9.0-wip16, which still used 0mq, and I did not see the
>> issue there.
>>
>> Thanks
>> Kashyap
>>
>> On Sep 13, 2015 15:39, "Stephen Powis" <[email protected]> wrote:
>>
>>> Kashyap -  I see this same issue on 0.9.5
>>>
>>> On Sun, Sep 13, 2015 at 9:58 AM, Enno Shioji <[email protected]> wrote:
>>>
>>>> There was a change in that area in 0.9.6 (
>>>> https://issues.apache.org/jira/browse/STORM-763), although I'm not
>>>> sure if it will help your issue.
>>>>
>>>>
>>>> On Sun, Sep 13, 2015 at 2:35 PM, Kashyap Mhaisekar <[email protected]
>>>> > wrote:
>>>>
>>>>> Hmm. Thanks for the lead. On the Storm UI, the uptime for each executor
>>>>> except the spout shows pretty much consistent values. The spout has
>>>>> crashed for sure, but then it never comes back up. Will check this again.
>>>>>
>>>>> But the other question is: is the Netty reconnect issue solved in 0.9.5?
>>>>> What is your Storm version?
>>>>>
>>>>> Thanks
>>>>> Kashyap
>>>>> On Sep 13, 2015 08:04, "Martin Burian" <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> They do restart after a while, yes. But if you don't see any error in
>>>>>> the log, that's weird. I encountered a case of workers not starting
>>>>>> because I had configured the worker JVM to expose a JMX interface for
>>>>>> remote monitoring on a fixed port. The other workers on the same machine
>>>>>> could not start because they failed to bind to the already-used port,
>>>>>> with no error messages whatsoever. Might something like that be your
>>>>>> case?
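>>>>>>
>>>>>> Roughly what the problematic setup amounted to (the flags are only
>>>>>> illustrative; worker.childopts normally lives in storm.yaml on the
>>>>>> supervisors, the Java constant below just names that key):
>>>>>>
>>>>>>     import backtype.storm.Config;
>>>>>>
>>>>>>     Config conf = new Config();
>>>>>>     // A fixed JMX port shared by every worker JVM on the machine --
>>>>>>     // only the first worker can bind it, the others never start.
>>>>>>     conf.put(Config.WORKER_CHILDOPTS,
>>>>>>              "-Dcom.sun.management.jmxremote"
>>>>>>              + " -Dcom.sun.management.jmxremote.port=9999");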
>>>>>>
>>>>>> Otherwise the cause should be logged somewhere. A worker is definitely
>>>>>> not running, or at least not talking to the supervisor. You could try
>>>>>> using fewer workers to narrow down when/where the error occurs.
>>>>>>
>>>>>> Martin
>>>>>>
>>>>>> On Sun, Sep 13, 2015 at 1:43 PM, Kashyap Mhaisekar <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> All the worker logs show the same entries. The workers are up. I am
>>>>>>> using only one box with multiple workers to test.
>>>>>>> Workers should be restarted if they fail, right? So ideally this
>>>>>>> error should be gone after a while...
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>>
>>>>>>> Kashyap
>>>>>>> On Sep 13, 2015 05:10, "Martin Burian" <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> When this appears in a worker log, it means that the worker is trying
>>>>>>>> to connect to another worker, but the other one is not running. What
>>>>>>>> do you see in worker-6707.log? Is the other worker running?
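>>>>>>>>
>>>>>>>> By the way, the retry cadence in those messages ("scheduled to run in
>>>>>>>> 392 ms") comes from the Netty client retry settings. A rough sketch of
>>>>>>>> the keys involved (normally set in storm.yaml; the values here are
>>>>>>>> only illustrative, check the defaults of your Storm version):
>>>>>>>>
>>>>>>>>     import backtype.storm.Config;
>>>>>>>>
>>>>>>>>     Config conf = new Config();
>>>>>>>>     // How many times, and with what backoff, the client retries a peer
>>>>>>>>     conf.put(Config.STORM_MESSAGING_NETTY_MAX_RETRIES, 30);
>>>>>>>>     conf.put(Config.STORM_MESSAGING_NETTY_MIN_SLEEP_MS, 100);
>>>>>>>>     conf.put(Config.STORM_MESSAGING_NETTY_MAX_SLEEP_MS, 1000);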
>>>>>>>> Martin
>>>>>>>>
>>>>>>>> On Sun, Sep 13, 2015 at 6:06 AM, Kashyap Mhaisekar <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> Also, is there a way to switch back to 0mq from Netty? If so, what
>>>>>>>>> needs to be done?
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> kashyap
>>>>>>>>>
>>>>>>>>> On Sat, Sep 12, 2015 at 10:49 PM, Kashyap Mhaisekar <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> I am having a Netty-related issue in my Storm cluster where the
>>>>>>>>>> spout stops consuming after a while. The corresponding worker logs
>>>>>>>>>> show:
>>>>>>>>>> 2015-09-12T23:28:23.391-0400 b.s.m.n.Client [ERROR] connection attempt 26 to
>>>>>>>>>> Netty-Client-trsttel2pascapp01.vm.itg.corp.us.shldcorp.com/10.2.70.18:6707
>>>>>>>>>> failed: java.lang.RuntimeException: Returned channel was actually not established
>>>>>>>>>> 2015-09-12T23:28:23.391-0400 b.s.m.n.Client [INFO] connection attempt 27 to
>>>>>>>>>> Netty-Client-serverstorm1.myorg.com/10.2.70.18:6707 scheduled to run in 392 ms
>>>>>>>>>> 2015-09-12T23:28:23.784-0400 b.s.m.n.Client [ERROR] connection attempt 27 to
>>>>>>>>>> Netty-Client-serverstorm1.myorg.com/10.2.70.18:6707
>>>>>>>>>> failed: java.lang.RuntimeException: Returned channel was actually not established
>>>>>>>>>>
>>>>>>>>>> The corresponding supervisor logs had
>>>>>>>>>> 2015-09-12T23:28:23.018-0400 b.s.d.supervisor [INFO]
>>>>>>>>>> 32e3f906-3869-4f0c-ac1c-4916615daf99 still hasn't started
>>>>>>>>>> 2015-09-12T23:28:23.518-0400 b.s.d.supervisor [INFO]
>>>>>>>>>> 32e3f906-3869-4f0c-ac1c-4916615daf99 still hasn't started
>>>>>>>>>> 2015-09-12T23:28:24.019-0400 b.s.d.supervisor [INFO]
>>>>>>>>>> 32e3f906-3869-4f0c-ac1c-4916615daf99 still hasn't started
>>>>>>>>>>
>>>>>>>>>> I was on Storm 0.9.3 when this issue first occurred and have since
>>>>>>>>>> upgraded to 0.9.4 and 0.9.5 hoping for relief, but the issue still
>>>>>>>>>> persists. I'm not sure what else to do, and I'm not even sure why
>>>>>>>>>> this issue occurs or what triggers it. Any help would be greatly
>>>>>>>>>> appreciated.
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>> Kashyap
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>
>>>
>
