Re: Has anybody successfully run storm 0.9+ in production under reasonable load?

2015-06-19 Thread Devang Shah
What's your max spout pending value for the topology? Also observe the CPU usage, e.g. how many cycles it is spending on the process. Thanks and Regards, Devang On 19 Jun 2015 02:46, "Fang Chen" wrote: > Tried. No effect. > > Thanks, > Fang > > On Mon, Jun 15, 2015 at 9:33 PM, Binh Nguyen Van
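For context, topology.max.spout.pending caps the number of un-acked tuples a spout keeps in flight; a minimal sketch of where it lives, with a purely illustrative value (not taken from this thread):

    # storm.yaml or per-topology config -- value is illustrative
    topology.max.spout.pending: 500

Lowering it generally shrinks the in-flight backlog and the memory/GC pressure it creates, which is presumably why it comes up alongside the CPU question here.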

Re: Has anybody successfully run storm 0.9+ in production under reasonable load?

2015-06-18 Thread Fang Chen
Tried. No effect. Thanks, Fang On Mon, Jun 15, 2015 at 9:33 PM, Binh Nguyen Van wrote: > Can you try this? > > Remove the -XX:+CMSScavengeBeforeRemark flag and reduce your heap size so > that YGC happens once every 2-3 seconds. > If that fixes the issue, then I think GC is the cause of your problem. >

Re: Has anybody successfully run storm 0.9+ in production under reasonable load?

2015-06-15 Thread Binh Nguyen Van
Can you try this: remove the -XX:+CMSScavengeBeforeRemark flag and reduce your heap size so that YGC happens once every 2-3 seconds. If that fixes the issue, then I think GC is the cause of your problem. On Mon, Jun 15, 2015 at 11:56 AM, Fang Chen wrote: > We use Storm bare bones, not Trident, as it's t
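A sketch of what this suggestion could look like applied to the childopts quoted in the next message; the heap sizes here are illustrative guesses, and whether it belongs under supervisor.childopts or worker.childopts depends on which JVM is actually under GC pressure:

    # illustrative only: -XX:+CMSScavengeBeforeRemark dropped, heap reduced
    supervisor.childopts: "-Xms1G -Xmx1G -XX:NewSize=256M -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled"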

Re: Has anybody successfully run storm 0.9+ in production under reasonable load?

2015-06-15 Thread Fang Chen
We use Storm bare bones, not Trident, as it's too expensive for our use cases. The JVM options for the supervisor are listed below, but they might not be optimal in any sense. supervisor.childopts: "-Xms2G -Xmx2G -XX:NewSize=1G -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:Sur

Re: Has anybody successfully run storm 0.9+ in production under reasonable load?

2015-06-15 Thread Binh Nguyen Van
Just to be sure, are you using Storm or Storm Trident? Also, can you share the current setting of your supervisor.childopts? On Mon, Jun 15, 2015 at 11:39 AM, Fang Chen wrote: > I did enable GC logging for both worker and supervisor and found nothing abnormal > (pause is minimal and frequency is normal

Re: Has anybody successfully run storm 0.9+ in production under reasonable load?

2015-06-15 Thread Fang Chen
I did enable GC logging for both worker and supervisor and found nothing abnormal (pause is minimal and frequency is normal too). I tried a max spout pending of both 1000 and 500. Fang On Mon, Jun 15, 2015 at 11:36 AM, Binh Nguyen Van wrote: > Hi Fang, > > Did you check your GC log? Do you see anything
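The preview doesn't show how GC logging was turned on; a typical HotSpot (JDK 7/8) incantation would be appended to the childopts in use, with the log path being just an example:

    # standard HotSpot GC-logging flags; the path is hypothetical
    -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/storm/worker-gc.log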

Re: Has anybody successfully run storm 0.9+ in production under reasonable load?

2015-06-15 Thread Fang Chen
I found a temporary workaround which has just made my toy topology last for over 90 minutes now. I manually restarted all supervisors whenever I found a worker going into the hung state, and it seems like every component is happy now. I did this just once, so I don't know if I need to do it again :-

Re: Has anybody successfully run storm 0.9+ in production under reasonable load?

2015-06-15 Thread Binh Nguyen Van
Hi Fang, Did you check your GC log? Do you see anything abnormal? What is your current max spout pending setting? On Mon, Jun 15, 2015 at 11:28 AM, Fang Chen wrote: > I also did this and found no success. > > Thanks, > Fang > > On Fri, Jun 12, 2015 at 6:04 AM, Nathan Leung wrote: > >> After I w

Re: Has anybody successfully run storm 0.9+ in production under reasonable load?

2015-06-15 Thread Fang Chen
I also did this and found no success. Thanks, Fang On Fri, Jun 12, 2015 at 6:04 AM, Nathan Leung wrote: > After I wrote that, I realized you tried an empty topology anyway. This > should reduce any GC- or worker-initialization-related failures, though they > are still possible. As Erik mentioned, ch

Re: Has anybody successfully run storm 0.9+ in production under reasonable load?

2015-06-14 Thread Fang Chen
atter so much, but we > still have a tool that spaces out the deletes based on disk performance. > > -- > Derek

Re: Has anybody successfully run storm 0.9+ in production under reasonable load?

2015-06-14 Thread Fang Chen
I also believe something is going on but just can't figure out what. But I do observe that in all my experiments, the first worker that started to lose heartbeats is the one with the Kafka spout task (I have only one spout task). And when it happens, it seems like the whole worker process hangs; none of the bo

Re: Has anybody successfully run storm 0.9+ in production under reasonable load?

2015-06-13 Thread Derek Dagit
There is something fundamentally wrong. You need to get to the root cause of what the worker process is doing that is preventing the heartbeats from arriving. - Erik On Friday, June 12, 2015, Fang Chen wrote: I tuned up all work

Re: Has anybody successfully run storm 0.9+ in production under reasonable load?

2015-06-12 Thread Erik Weathers
There is something fundamentally wrong. You need to get to the root cause of what the worker process is doing that is preventing the heartbeats from arriving. - Erik On Friday, June 12, 2015, Fang Chen wrote: > I tuned up all worker timeouts and task timeouts to 600 seconds, and it seems > like ni

Re: Has anybody successfully run storm 0.9+ in production under reasonable load?

2015-06-12 Thread Fang Chen
I tuned up all worker timeouts and task timeouts to 600 seconds, and it seems like nimbus is happy about it after running the topology for 40 minutes. But still, one supervisor complained about a timeout from a worker and then shut it down: 2015-06-12T23:59:20.633+ b.s.d.supervisor [INFO] Shutting down and cl
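The preview doesn't say which keys were raised to 600 seconds; as a sketch, these are the standard storm.yaml keys governing worker and task liveness, shown with the value described above (the exact set Fang changed isn't visible here):

    # storm.yaml -- likely candidates for the timeouts mentioned above
    supervisor.worker.timeout.secs: 600
    nimbus.task.timeout.secs: 600
    nimbus.supervisor.timeout.secs: 600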

Re: Has anybody successfully run storm 0.9+ in production under reasonable load?

2015-06-12 Thread Fang Chen
I turned on debug and it seems like the nimbus reassignment was indeed caused by heartbeat timeouts after running the topology for about 20 minutes. You can see that those non-live executors have a ":is-timed-out true" status and the executor-reported time is about 100 seconds behind nimbus time, while other

Re: Has anybody successfully run storm 0.9+ in production under reasonable load?

2015-06-12 Thread Fang Chen
Thank you, Nathan! I will try a setup with /etc/hosts and see if that makes any difference. Thanks, Fang On Fri, Jun 12, 2015 at 6:04 AM, Nathan Leung wrote: > After I wrote that, I realized you tried an empty topology anyway. This > should reduce any GC- or worker-initialization-related failur

Re: Has anybody successfully run storm 0.9+ in production under reasonable load?

2015-06-12 Thread Fang Chen
supervisor.heartbeat.frequency.secs 5
supervisor.monitor.frequency.secs 3
task.heartbeat.frequency.secs 3
worker.heartbeat.frequency.secs 1
some nimbus parameters:
nimbus.monitor.freq.secs 120
nimbus.reassign true
nimbus.supervisor.timeout.secs 60
nimbus.task.launch.secs 120
nimbus.task.timeout.

Re: Has anybody successfully run storm 0.9+ in production under reasonable load?

2015-06-12 Thread Nathan Leung
After I wrote that, I realized you tried an empty topology anyway. This should reduce any GC- or worker-initialization-related failures, though they are still possible. As Erik mentioned, check ZK. Also, I'm not sure if this is still required, but it used to be helpful to make sure your storm nodes have
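The sentence is cut off here, but Fang's reply above ("a setup with /etc/hosts") suggests the advice concerns hostname resolution between nodes; a hypothetical example, with made-up addresses and hostnames:

    # /etc/hosts on every Storm and ZooKeeper node (example entries only)
    10.0.0.11  storm-nimbus-1
    10.0.0.21  storm-worker-1
    10.0.0.22  storm-worker-2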

Re: Has anybody successfully run storm 0.9+ in production under reasonable load?

2015-06-12 Thread Nathan Leung
Make sure your topology is starting up in the allotted time, and if not, try increasing the startup timeout. On Jun 12, 2015 2:46 AM, "Fang Chen" wrote: > Hi Erik, > > Thanks for your reply! It's great to hear about real production usage. > For our use case, we are really puzzled by the outcome s
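As a sketch, the settings that usually control this startup window in storm.yaml are the worker-start and task-launch timeouts; the values below are illustrative (the 0.9.x defaults are lower):

    # storm.yaml -- time allowed for workers/tasks to come up before being killed
    supervisor.worker.start.timeout.secs: 300
    nimbus.task.launch.secs: 300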

Re: Has anybody successfully run storm 0.9+ in production under reasonable load?

2015-06-12 Thread Erik Weathers
I'll have to look later; I think we are using ZooKeeper v3.3.6 (something like that). Some clusters have 3 ZK hosts, some 5. The way the nimbus detects that the executors are not alive is by not seeing heartbeats updated in ZK. There has to be some cause for the heartbeats not being updated. Mo
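Since nimbus decides liveness from heartbeats written to ZooKeeper, one way to confirm whether workers are still writing them is to inspect the znodes directly; a sketch, assuming the default storm.zookeeper.root of /storm (the topology id is a placeholder):

    # run against one of the ZK servers; paths assume the default /storm root
    zkCli.sh -server localhost:2181
    ls /storm/workerbeats
    ls /storm/workerbeats/<topology-id>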

Re: Has anybody successfully run storm 0.9+ in production under reasonable load?

2015-06-11 Thread Fang Chen
Hi Erik, Thanks for your reply! It's great to hear about real production usage. For our use case, we are really puzzled by the outcome so far. The initial investigation seems to indicate that workers don't die by themselves (I actually tried killing the supervisor and the worker would continue r

Re: Has anybody successfully run storm 0.9+ in production under reasonable load?

2015-06-11 Thread Erik Weathers
Yes, the netty errors from a large set of worker deaths really obscure the original root cause. Again you need to diagnose that. - Erik On Thursday, June 11, 2015, Fang Chen wrote: > Forgot to add, one complication of this problem is that, after several > rounds of killing, workers re-spawned

Re: Has anybody successfully run storm 0.9+ in production under reasonable load?

2015-06-11 Thread Erik Weathers
Hey Fang, Yes, Groupon runs storm 0.9.3 (with zeromq instead of netty) and storm 0.9.4 (with netty) at scale, in clusters on the order of 30+ nodes. One of the challenges with storm is figuring out what the root cause is when things go haywire. You'll wanna examine why the nimbus decided to rest

Re: Has anybody successfully run storm 0.9+ in production under reasonable load?

2015-06-11 Thread Fang Chen
Forgot to add: one complication of this problem is that, after several rounds of killing, the re-spawned workers can no longer talk to their peers, with all sorts of netty exceptions. On Thu, Jun 11, 2015 at 9:51 PM, Fang Chen wrote: > We have been testing Storm from 0.9.0.1 through 0.9.4 (I have not

Has anybody successfully run storm 0.9+ in production under reasonable load?

2015-06-11 Thread Fang Chen
We have been testing Storm from 0.9.0.1 through 0.9.4 (I have not tried 0.9.5 yet, but I don't see any significant differences there), and unfortunately we could not even get a clean run for over 30 minutes on a cluster of 5 high-end nodes. ZooKeeper is also set up on these nodes but on different dis