Tried it, no effect.

Thanks,
Fang
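(For reference, the adjustment being suggested below amounts to a storm.yaml fragment roughly like the following. This is a sketch only: the 1G heap and 512M new-generation sizes are illustrative guesses aimed at making young GC fire every 2-3 seconds, not values taken from this thread, and the GC-logging flags from the original config are kept so the effect can be checked in gc.log.)

    # Sketch only -- heap/new-gen sizes are illustrative, not from this thread.
    # -XX:+CMSScavengeBeforeRemark is removed and the heap is shrunk so that
    # young GC should fire every 2-3 seconds; GC logging is kept to verify that.
    supervisor.childopts: "-Xms1G -Xmx1G -XX:NewSize=512M -XX:+UseParNewGC
        -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled
        -XX:CMSInitiatingOccupancyFraction=60 -XX:+UseCMSInitiatingOccupancyOnly
        -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution
        -Xloggc:/usr/local/storm/logs/gc.log -XX:+UseGCLogFileRotation
        -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=100M"

Whether or not it helps, the resulting gc.log should make the young-GC frequency easy to confirm.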
On Mon, Jun 15, 2015 at 9:33 PM, Binh Nguyen Van <binhn...@gmail.com> wrote:

> Can you try this? Remove the -XX:+CMSScavengeBeforeRemark flag and reduce your heap size so that YGC happens once every 2-3 seconds.
> If that fixes the issue, then I think GC is the cause of your problem.
>
> On Mon, Jun 15, 2015 at 11:56 AM, Fang Chen <fc2...@gmail.com> wrote:
>
>> We use Storm bare bones, not Trident, as it's too expensive for our use cases. The JVM options for the supervisor are listed below, but they might not be optimal in any sense.
>>
>> supervisor.childopts: "-Xms2G -Xmx2G -XX:NewSize=1G -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=6 -XX:CMSInitiatingOccupancyFraction=60 -XX:+UseCMSInitiatingOccupancyOnly -XX:+UseTLAB -XX:+UseCondCardMark -XX:CMSWaitDuration=5000 -XX:+CMSScavengeBeforeRemark -XX:+UnlockDiagnosticVMOptions -XX:ParGCCardsPerStrideChunk=4096 -XX:+ExplicitGCInvokesConcurrent -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution -XX:PrintFLSStatistics=1 -XX:+PrintPromotionFailure -XX:+PrintGCApplicationStoppedTime -XX:+PrintHeapAtGC -XX:+PrintSafepointStatistics -Xloggc:/usr/local/storm/logs/gc.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=100M -Dcom.sun.management.jmxremote.port=9998 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false"
>>
>> On Mon, Jun 15, 2015 at 11:48 AM, Binh Nguyen Van <binhn...@gmail.com> wrote:
>>
>>> Just to be sure, are you using Storm or Storm Trident?
>>> Also, can you share the current setting of your supervisor.childopts?
>>>
>>> On Mon, Jun 15, 2015 at 11:39 AM, Fang Chen <fc2...@gmail.com> wrote:
>>>
>>>> I did enable GC logging for both the worker and the supervisor and found nothing abnormal (pauses are minimal and the frequency is normal too). I tried a max spout pending of both 1000 and 500.
>>>>
>>>> Fang
>>>>
>>>> On Mon, Jun 15, 2015 at 11:36 AM, Binh Nguyen Van <binhn...@gmail.com> wrote:
>>>>
>>>>> Hi Fang,
>>>>>
>>>>> Did you check your GC log? Do you see anything abnormal?
>>>>> What is your current max spout pending setting?
>>>>>
>>>>> On Mon, Jun 15, 2015 at 11:28 AM, Fang Chen <fc2...@gmail.com> wrote:
>>>>>
>>>>>> I also did this and had no success.
>>>>>>
>>>>>> Thanks,
>>>>>> Fang
>>>>>>
>>>>>> On Fri, Jun 12, 2015 at 6:04 AM, Nathan Leung <ncle...@gmail.com> wrote:
>>>>>>
>>>>>>> After I wrote that, I realized you tried an empty topology anyway. This should reduce any GC- or worker-initialization-related failures, though they are still possible. As Erik mentioned, check ZK. Also, I'm not sure if this is still required, but it used to be helpful to make sure your Storm nodes have each other listed in /etc/hosts.
>>>>>>>
>>>>>>> On Jun 12, 2015 8:59 AM, "Nathan Leung" <ncle...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Make sure your topology is starting up in the allotted time, and if not, try increasing the startup timeout.
>>>>>>>>
>>>>>>>> On Jun 12, 2015 2:46 AM, "Fang Chen" <fc2...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Erik,
>>>>>>>>>
>>>>>>>>> Thanks for your reply! It's great to hear about real production usages. For our use case, we are really puzzled by the outcome so far. The initial investigation seems to indicate that workers don't die by themselves (I actually tried killing the supervisor, and the worker would continue running beyond 30 minutes).
>>>>>>>>>
>>>>>>>>> The sequence of events is like this: the supervisor immediately complains that the worker "still has not started" for a few seconds right after launching the worker process, then goes silent --> after 26 minutes, nimbus complains that the executors (related to that worker) are "not alive" and starts to reassign the topology --> after another ~500 milliseconds, the supervisor shuts down its worker --> other peer workers complain about Netty issues, and the loop goes on.
>>>>>>>>>
>>>>>>>>> Could you kindly tell me what version of ZooKeeper is used with 0.9.4, and how many nodes are in the ZooKeeper cluster?
>>>>>>>>>
>>>>>>>>> I wonder if this is due to ZooKeeper issues.
>>>>>>>>>
>>>>>>>>> Thanks a lot,
>>>>>>>>> Fang
>>>>>>>>>
>>>>>>>>> On Thu, Jun 11, 2015 at 10:02 PM, Erik Weathers <eweath...@groupon.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hey Fang,
>>>>>>>>>>
>>>>>>>>>> Yes, Groupon runs Storm 0.9.3 (with zeromq instead of Netty) and Storm 0.9.4 (with Netty) at scale, in clusters on the order of 30+ nodes.
>>>>>>>>>>
>>>>>>>>>> One of the challenges with Storm is figuring out what the root cause is when things go haywire. You'll wanna examine why the nimbus decided to restart your worker processes. It would happen when workers die and the nimbus notices that Storm executors aren't alive. (There are logs in nimbus for this.) Then you'll wanna dig into why the workers died by looking at the logs on the worker hosts.
>>>>>>>>>>
>>>>>>>>>> - Erik
>>>>>>>>>>
>>>>>>>>>> On Thursday, June 11, 2015, Fang Chen <fc2...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> We have been testing Storm from 0.9.0.1 through 0.9.4 (I have not tried 0.9.5 yet, but I don't see any significant differences there), and unfortunately we could not even get a clean run of over 30 minutes on a cluster of 5 high-end nodes. ZooKeeper is also set up on these nodes, but on different disks.
>>>>>>>>>>>
>>>>>>>>>>> I have had huge trouble getting my data analytics topology to run stably, so I tried the simplest topology I can think of: just an empty bolt, with no I/O except for reading from the Kafka queue.
>>>>>>>>>>>
>>>>>>>>>>> Just to report my latest testing on 0.9.4 with this empty bolt (Kafka topic partitions=1, spout task #=1, bolt #=20 with fields grouping, msg size=1 KB): after 26 minutes, nimbus orders the topology to be killed because it believes the topology is dead; then after another 2 minutes, another kill; then another after 4 more minutes, and on and on.
>>>>>>>>>>>
>>>>>>>>>>> I can understand that there might be issues in the coordination among nimbus, worker, and executor (e.g., heartbeats). But are there any doable workarounds? I hope there are, as so many of you are using it in production :-)
>>>>>>>>>>>
>>>>>>>>>>> I deeply appreciate any suggestions that could even get my toy topology working!
>>>>>>>>>>>
>>>>>>>>>>> Fang
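(A note on the timeline described above: the "still has not started" window, the point at which nimbus marks executors "not alive" and reassigns, and the ZooKeeper session each have their own knob. The fragment below is a sketch that assumes the usual 0.9.x config key names and the stock defaults as I recall them; it is worth verifying against the defaults.yaml shipped with the release in use before changing anything. Raising the two launch timeouts is roughly what "increasing the startup timeout" above refers to.)

    # Sketch -- key names/values assume Storm 0.9.x defaults.yaml; verify locally.
    supervisor.worker.start.timeout.secs: 120  # "still has not started" window after launch
    nimbus.task.launch.secs: 120               # grace period before a new executor must heartbeat
    nimbus.task.timeout.secs: 30               # after this, nimbus treats executors as "not alive"
    supervisor.worker.timeout.secs: 30         # supervisor kills workers whose heartbeats lapse
    storm.zookeeper.session.timeout: 20000     # ms; executor heartbeats go through ZooKeeper in 0.9.x
    topology.max.spout.pending: 500            # value reported in the thread (1000 also tried)

Since executor heartbeats in 0.9.x are written to ZooKeeper, a struggling ZK ensemble (for example, one sharing hosts with busy workers) can make healthy workers look dead to nimbus, which would match the "not alive" --> reassign --> Netty-error loop described in the thread.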