Tried it, no effect.

Thanks,
Fang
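(For reference, the adjustment being suggested below amounts to a storm.yaml fragment roughly like the following. This is a sketch only: the 1G heap and 512M new-generation sizes are illustrative guesses aimed at making young GC fire every 2-3 seconds, not values taken from this thread, and the GC-logging flags from the original config are kept so the effect can be checked in gc.log.)

    # Sketch only -- heap/new-gen sizes are illustrative, not from this thread.
    # -XX:+CMSScavengeBeforeRemark is removed and the heap is shrunk so that
    # young GC should fire every 2-3 seconds; GC logging is kept to verify that.
    supervisor.childopts: "-Xms1G -Xmx1G -XX:NewSize=512M -XX:+UseParNewGC
        -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled
        -XX:CMSInitiatingOccupancyFraction=60 -XX:+UseCMSInitiatingOccupancyOnly
        -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution
        -Xloggc:/usr/local/storm/logs/gc.log -XX:+UseGCLogFileRotation
        -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=100M"

Whether or not it helps, the resulting gc.log should make the young-GC frequency easy to confirm.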
On Mon, Jun 15, 2015 at 9:33 PM, Binh Nguyen Van <binhn...@gmail.com> wrote:

> Can you try this? Remove the -XX:+CMSScavengeBeforeRemark flag and reduce your heap size so that YGC happens once every 2-3 seconds.
> If that fixes the issue, then I think GC is the cause of your problem.
>
> On Mon, Jun 15, 2015 at 11:56 AM, Fang Chen <fc2...@gmail.com> wrote:
>
>> We use Storm bare bones, not Trident, as it's too expensive for our use cases. The JVM options for the supervisor are listed below, but they might not be optimal in any sense.
>>
>> supervisor.childopts: "-Xms2G -Xmx2G -XX:NewSize=1G -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=6 -XX:CMSInitiatingOccupancyFraction=60 -XX:+UseCMSInitiatingOccupancyOnly -XX:+UseTLAB -XX:+UseCondCardMark -XX:CMSWaitDuration=5000 -XX:+CMSScavengeBeforeRemark -XX:+UnlockDiagnosticVMOptions -XX:ParGCCardsPerStrideChunk=4096 -XX:+ExplicitGCInvokesConcurrent -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution -XX:PrintFLSStatistics=1 -XX:+PrintPromotionFailure -XX:+PrintGCApplicationStoppedTime -XX:+PrintHeapAtGC -XX:+PrintSafepointStatistics -Xloggc:/usr/local/storm/logs/gc.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=100M -Dcom.sun.management.jmxremote.port=9998 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false"
>>
>> On Mon, Jun 15, 2015 at 11:48 AM, Binh Nguyen Van <binhn...@gmail.com> wrote:
>>
>>> Just to be sure, are you using Storm or Storm Trident?
>>> Also, can you share the current setting of your supervisor.childopts?
>>>
>>> On Mon, Jun 15, 2015 at 11:39 AM, Fang Chen <fc2...@gmail.com> wrote:
>>>
>>>> I did enable GC logging for both the worker and the supervisor and found nothing abnormal (pauses are minimal and the frequency is normal too). I tried a max spout pending of both 1000 and 500.
>>>>
>>>> Fang
>>>>
>>>> On Mon, Jun 15, 2015 at 11:36 AM, Binh Nguyen Van <binhn...@gmail.com> wrote:
>>>>
>>>>> Hi Fang,
>>>>>
>>>>> Did you check your GC log? Do you see anything abnormal?
>>>>> What is your current max spout pending setting?
>>>>>
>>>>> On Mon, Jun 15, 2015 at 11:28 AM, Fang Chen <fc2...@gmail.com> wrote:
>>>>>
>>>>>> I also did this and had no success.
>>>>>>
>>>>>> Thanks,
>>>>>> Fang
>>>>>>
>>>>>> On Fri, Jun 12, 2015 at 6:04 AM, Nathan Leung <ncle...@gmail.com> wrote:
>>>>>>
>>>>>>> After I wrote that, I realized you tried an empty topology anyway. This should reduce any GC- or worker-initialization-related failures, though they are still possible. As Erik mentioned, check ZK. Also, I'm not sure if this is still required, but it used to be helpful to make sure your Storm nodes have each other listed in /etc/hosts.
>>>>>>>
>>>>>>> On Jun 12, 2015 8:59 AM, "Nathan Leung" <ncle...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Make sure your topology is starting up in the allotted time, and if not, try increasing the startup timeout.
>>>>>>>>
>>>>>>>> On Jun 12, 2015 2:46 AM, "Fang Chen" <fc2...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Erik,
>>>>>>>>>
>>>>>>>>> Thanks for your reply! It's great to hear about real production usages. For our use case, we are really puzzled by the outcome so far. The initial investigation seems to indicate that workers don't die by themselves (I actually tried killing the supervisor, and the worker would continue running beyond 30 minutes).
>>>>>>>>>
>>>>>>>>> The sequence of events is like this: the supervisor immediately complains that the worker "still has not started" for a few seconds right after launching the worker process, then goes silent --> after 26 minutes, nimbus complains that the executors (related to that worker) are "not alive" and starts to reassign the topology --> after another ~500 milliseconds, the supervisor shuts down its worker --> other peer workers complain about Netty issues, and the loop goes on.
>>>>>>>>>
>>>>>>>>> Could you kindly tell me what version of ZooKeeper is used with 0.9.4, and how many nodes are in the ZooKeeper cluster?
>>>>>>>>>
>>>>>>>>> I wonder if this is due to ZooKeeper issues.
>>>>>>>>>
>>>>>>>>> Thanks a lot,
>>>>>>>>> Fang
>>>>>>>>>
>>>>>>>>> On Thu, Jun 11, 2015 at 10:02 PM, Erik Weathers <eweath...@groupon.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hey Fang,
>>>>>>>>>>
>>>>>>>>>> Yes, Groupon runs Storm 0.9.3 (with zeromq instead of Netty) and Storm 0.9.4 (with Netty) at scale, in clusters on the order of 30+ nodes.
>>>>>>>>>>
>>>>>>>>>> One of the challenges with Storm is figuring out what the root cause is when things go haywire. You'll wanna examine why the nimbus decided to restart your worker processes. It would happen when workers die and the nimbus notices that Storm executors aren't alive. (There are logs in nimbus for this.) Then you'll wanna dig into why the workers died by looking at the logs on the worker hosts.
>>>>>>>>>>
>>>>>>>>>> - Erik
>>>>>>>>>>
>>>>>>>>>> On Thursday, June 11, 2015, Fang Chen <fc2...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> We have been testing Storm from 0.9.0.1 through 0.9.4 (I have not tried 0.9.5 yet, but I don't see any significant differences there), and unfortunately we could not even get a clean run of over 30 minutes on a cluster of 5 high-end nodes. ZooKeeper is also set up on these nodes, but on different disks.
>>>>>>>>>>>
>>>>>>>>>>> I have had huge trouble getting my data analytics topology to run stably, so I tried the simplest topology I can think of: just an empty bolt, with no I/O except for reading from the Kafka queue.
>>>>>>>>>>>
>>>>>>>>>>> Just to report my latest testing on 0.9.4 with this empty bolt (Kafka topic partitions=1, spout task #=1, bolt #=20 with fields grouping, msg size=1 KB): after 26 minutes, nimbus orders the topology to be killed because it believes the topology is dead; then after another 2 minutes, another kill; then another after 4 more minutes, and on and on.
>>>>>>>>>>>
>>>>>>>>>>> I can understand that there might be issues in the coordination among nimbus, worker, and executor (e.g., heartbeats). But are there any doable workarounds? I hope there are, as so many of you are using it in production :-)
>>>>>>>>>>>
>>>>>>>>>>> I deeply appreciate any suggestions that could even get my toy topology working!
>>>>>>>>>>>
>>>>>>>>>>> Fang
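(A note on the timeline described above: the "still has not started" window, the point at which nimbus marks executors "not alive" and reassigns, and the ZooKeeper session each have their own knob. The fragment below is a sketch that assumes the usual 0.9.x config key names and the stock defaults as I recall them; it is worth verifying against the defaults.yaml shipped with the release in use before changing anything. Raising the two launch timeouts is roughly what "increasing the startup timeout" above refers to.)

    # Sketch -- key names/values assume Storm 0.9.x defaults.yaml; verify locally.
    supervisor.worker.start.timeout.secs: 120  # "still has not started" window after launch
    nimbus.task.launch.secs: 120               # grace period before a new executor must heartbeat
    nimbus.task.timeout.secs: 30               # after this, nimbus treats executors as "not alive"
    supervisor.worker.timeout.secs: 30         # supervisor kills workers whose heartbeats lapse
    storm.zookeeper.session.timeout: 20000     # ms; executor heartbeats go through ZooKeeper in 0.9.x
    topology.max.spout.pending: 500            # value reported in the thread (1000 also tried)

Since executor heartbeats in 0.9.x are written to ZooKeeper, a struggling ZK ensemble (for example, one sharing hosts with busy workers) can make healthy workers look dead to nimbus, which would match the "not alive" --> reassign --> Netty-error loop described in the thread.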