What's your max spout pending value for the topology?
Also observe the CPU usage, i.e. how many CPU cycles the worker process is
spending.
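For example (just a sketch; <worker-pid> is a placeholder for the PID you can
get from the supervisor log or from "ps aux | grep backtype.storm.daemon.worker"):

  top -H -p <worker-pid>       # per-thread CPU usage of the worker JVM
  pidstat -p <worker-pid> 1    # per-second CPU breakdown (needs the sysstat package)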
Thanks and Regards,
Devang
On 19 Jun 2015 02:46, "Fang Chen" wrote:
tried. no effect.
Thanks,
Fang
On Mon, Jun 15, 2015 at 9:33 PM, Binh Nguyen Van wrote:
Can you try this?
Remove the -XX:+CMSScavengeBeforeRemark flag and reduce your heap size so that
a YGC happens once every 2-3 seconds.
If that fixes the issue then I think GC is the cause of your problem.
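For example, something along these lines in storm.yaml (the sizes are
illustrative only; pick whatever makes a young GC happen every few seconds for
your allocation rate, and note that the worker JVMs are configured via
worker.childopts, while supervisor.childopts only affects the supervisor
daemon itself):

  worker.childopts: "-Xms768m -Xmx768m -XX:NewSize=256m -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"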
On Mon, Jun 15, 2015 at 11:56 AM, Fang Chen wrote:
We use Storm bare bones, not Trident, as it's too expensive for our use
cases. The JVM options for the supervisor are listed below, but they might not
be optimal in any sense.
supervisor.childopts: "-Xms2G -Xmx2G -XX:NewSize=1G -XX:+UseParNewGC
-XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:Sur[...]
Just to be sure, are you using Storm or Storm Trident?
Also, can you share the current setting of your supervisor.childopts?
On Mon, Jun 15, 2015 at 11:39 AM, Fang Chen wrote:
I did enable GC logging for both worker and supervisor and found nothing
abnormal (pauses are minimal and frequency is normal too). I tried max spout
pending of both 1000 and 500.
Fang
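(For reference, a sketch of how that setting is usually expressed, with 500
simply being the value mentioned above; it is applied per spout task:)

  topology.max.spout.pending: 500

or, in the topology code, conf.setMaxSpoutPending(500).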
On Mon, Jun 15, 2015 at 11:36 AM, Binh Nguyen Van wrote:
> Hi Fang,
>
> Did you check your GC log? Do you see anything [...]
I found a temporary workaround which has kept my toy topology running for
over 90 minutes now. I manually restarted all supervisors whenever I found
that a worker went into the hung state, and it seems like every component is
happy now. I did this just once so I don't know if I will need to do it again :-)
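(In case anyone wants to script that workaround, a rough sketch, assuming the
supervisors are started by hand rather than under a process manager;
<supervisor-pid> is a placeholder:)

  # on each supervisor node; the workers keep running, only the supervisor daemon restarts
  kill <supervisor-pid>
  nohup bin/storm supervisor > supervisor.out 2>&1 &

If the supervisors run under supervisord/monit/runit, restart them through
that instead.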
Hi Fang,
Did you check your GC log? Do you see anything abnormal?
What is your current max spout pending setting?
On Mon, Jun 15, 2015 at 11:28 AM, Fang Chen wrote:
I also did this and found no success.
Thanks,
Fang
On Fri, Jun 12, 2015 at 6:04 AM, Nathan Leung wrote:
> After I wrote that I realized you tried empty topology anyways. This
> should reduce any gc or worker initialization related failures though they
> are still possible. As Erik mentioned check ZK. [...]
> [...]atter so much, but we still have a tool that spaces out the deletes
> based on disk performance.
>
> --
> Derek
I also believe something is going on but just can't figure out what.
But I do observe in all my experiments that the first worker to start losing
heartbeats is the one with the Kafka spout task (I have only one spout task).
And when it happens, it seems like the whole worker process hangs; none of
the bolts [...]
> [...]successfully run storm 0.9+ in production under reasonable load?
There is something fundamentally wrong. You need to get to the root cause of
what the worker process is doing that is preventing the heartbeats from
arriving.
- Erik
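One way to see what a hung worker is doing at that moment (a sketch;
<worker-pid> is a placeholder) is to take a couple of thread dumps a few
seconds apart and compare them:

  jstack -l <worker-pid> > /tmp/worker-threads-1.txt
  sleep 10
  jstack -l <worker-pid> > /tmp/worker-threads-2.txt

If jstack itself hangs, jstack -F <worker-pid> or kill -3 <worker-pid> (the
dump then goes to the worker's stdout log) are the usual fallbacks.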
On Friday, June 12, 2015, Fang Chen wrote:
I tuned up all worker timeouts and task timeouts to 600 seconds, and it seems
like nimbus is happy about it after running the topology for 40 minutes. But
still one supervisor complained about a timeout from a worker and then shut it
down:
2015-06-12T23:59:20.633+ b.s.d.supervisor [INFO] Shutting down and cl[...]
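For what it's worth, the config keys I would expect those two changes to map
to, plus the ones the supervisor itself uses when deciding to shut a worker
down (values are just the 600s mentioned above; if the supervisor is still
killing the worker, the supervisor-side ones probably need raising too):

  nimbus.task.timeout.secs: 600
  nimbus.task.launch.secs: 600
  supervisor.worker.timeout.secs: 600
  supervisor.worker.start.timeout.secs: 600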
I turned on debug and it seems like the nimbus reassign was indeed caused by
heartbeat timeouts after running the topology for about 20 minutes. You can
see that those non-live executors have a ":is-timed-out true" status and the
executor reported time is about 100 seconds behind nimbus time, while other [...]
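(For anyone trying to reproduce this, one way to pull those entries out,
assuming the debug output lands in the nimbus log under logs/:)

  grep "is-timed-out true" logs/nimbus.log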
Thank you Nathan!
I will try a setup with /etc/hosts and see if that makes any difference.
Thanks,
Fang
On Fri, Jun 12, 2015 at 6:04 AM, Nathan Leung wrote:
> After I wrote that I realized you tried empty topology anyways. This
> should reduce any gc or worker initialization related failur
supervisor.heartbeat.frequency.secs 5
supervisor.monitor.frequency.secs 3
task.heartbeat.frequency.secs 3
worker.heartbeat.frequency.secs 1
some nimbus parameters:
nimbus.monitor.freq.secs 120
nimbus.reassign true
nimbus.supervisor.timeout.secs 60
nimbus.task.launch.secs 120
nimbus.task.timeout.
After I wrote that I realized you tried empty topology anyways. This
should reduce any gc or worker initialization related failures though they
are still possible. As Erik mentioned check ZK. Also I'm not sure if this
is still required but it used to be helpful to make sure your storm nodes
have [...]
Make sure your topology is starting up in the allotted time, and if not try
increasing the startup timeout.
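(A sketch of both suggestions; the truncated advice above appears to be about
/etc/hosts, which Fang mentions elsewhere in the thread. The host names and
addresses below are made up, and I believe the startup timeout Nathan means is
nimbus.task.launch.secs:)

  # /etc/hosts on every Storm and ZooKeeper node
  10.0.0.11  storm-nimbus-01
  10.0.0.21  storm-worker-01
  10.0.0.22  storm-worker-02

  # storm.yaml (per node)
  storm.local.hostname: "storm-worker-01"
  nimbus.task.launch.secs: 240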
On Jun 12, 2015 2:46 AM, "Fang Chen" wrote:
> Hi Erik
>
> Thanks for your reply! It's great to hear about real production usages.
> For our use case, we are really puzzled by the outcome s
I'll have to look later, I think we are using ZooKeeper v3.3.6 (something
like that). Some clusters have 3 ZK hosts, some 5.
The way the nimbus detects that the executors are not alive is by not
seeing heartbeats updated in ZK. There has to be some cause for the
heartbeats not being updated. Mo[...]
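(If it helps, those heartbeats can be inspected directly with the ZooKeeper
CLI. The paths below assume the default storm.zookeeper.root of /storm;
<topology-id> and <node-port> are placeholders:)

  bin/zkCli.sh -server <zk-host>:2181
  ls /storm/workerbeats
  ls /storm/workerbeats/<topology-id>
  stat /storm/workerbeats/<topology-id>/<node-port>   # mtime shows when that worker last wrote a heartbeat

Comparing that mtime with the nimbus clock is essentially what the
":is-timed-out true" check mentioned above is doing.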
Hi Erik,
Thanks for your reply! It's great to hear about real production usage.
For our use case, we are really puzzled by the outcome so far. The initial
investigation seems to indicate that workers don't die by themselves (I
actually tried killing the supervisor and the worker would continue running [...]
Yes, the netty errors from a large set of worker deaths really obscure the
original root cause. Again you need to diagnose that.
- Erik
On Thursday, June 11, 2015, Fang Chen wrote:
> Forgot to add, one complication of this problem is that, after several
> rounds of killing, workers re-spawned
Hey Fang,
Yes, Groupon runs storm 0.9.3 (with zeromq instead of netty) and storm
0.9.4 (with netty) at scale, in clusters on the order of 30+ nodes.
One of the challenges with storm is figuring out what the root cause is
when things go haywire. You'll wanna examine why the nimbus decided to
rest
Forgot to add: one complication of this problem is that, after several
rounds of killing, the re-spawned workers can no longer talk to their peers,
with all sorts of netty exceptions.
On Thu, Jun 11, 2015 at 9:51 PM, Fang Chen wrote:
We have been testing Storm from 0.9.0.1 until 0.9.4 (I have not tried 0.9.5
yet but I don't see any significant differences there), and unfortunately we
could not even get a clean run for over 30 minutes on a cluster of 5 high-end
nodes. ZooKeeper is also set up on these nodes but on different dis[...]