Github user revans2 commented on the issue:
https://github.com/apache/storm/pull/2241
Issues that need to be addressed to remove `max.spout.pending` (sorry about
the wall of text).
1. Worker-to-worker backpressure. The netty client and server have no
backpressure in them. If for some reason the network cannot keep up, the netty
client will continue to buffer messages internally in netty until you get
GC/OOM issues, as described in the design doc. We ran into this when using
the ZK-based backpressure instead of `max.spout.pending` on a topology with a
very high throughput and there was a network glitch.
2. Single choke point getting messages off of a worker. Currently there is
a single disruptor queue and a single thread responsible for routing all of
the messages from within the worker to external workers. If any of the
clients sending messages to other workers blocked (backpressure), it would
stop all other messages from leaving the worker. In practice this negates the
"a choke up in a single bolt will not put the brakes on all the topology
components" goal from your design, and as the number of workers in a topology
grows, the impact when this does happen will grow too.
3. Load-aware grouping needs to be back on by default and probably improved
some. Again, one of the stated goals of your design for backpressure is "a
choke up in a single bolt will not put the brakes on all the topology
components". Most of the existing groupings effectively negate this, as each
upstream component will on average send a message to all downstream components
relatively frequently. Unless we can route around these backed-up components,
one totally blocked bolt will stop all of a topology from functioning. If
throughput drops by 20% when this is on, then we probably want to do some
profiling and understand why that happens.
4. Timer tick tuples. There are a number of things that rely on timer
ticks: system-critical things, like ackers and spouts timing out tuples and
metrics (at least for now), and code within bolts and spouts that wants to
do something periodically without having a separate thread. Right now there is
a single thread that does these ticks for each worker. In the past it was a
real pain to try to debug failures when it would block trying to insert a
message into a queue that was full. Metrics stopped flowing, spouts stopped
timing things out, and other really odd things happened that I cannot remember
off the top of my head. I don't want to have to go back to that.
5. Cycles (non-DAG topologies). I know we strongly discourage users from
building these types of things, and quite honestly I am happy to deprecate
support for them, but we currently do support them. So if we are going to stop
supporting them, let's get on with deprecating them in 1.x with clear warnings,
etc. Otherwise, when this goes in, there will be angry customers with
deadlocked topologies.
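To make item 1 concrete, here is a minimal sketch (not Storm's actual Netty client; the class and method names are hypothetical) of the missing piece: a sender that exposes a writability signal once its outbound buffer crosses a high-water mark, so the caller can push backpressure upstream instead of buffering until GC/OOM. Netty itself exposes a similar signal via `Channel.isWritable()` and write-buffer watermarks.

```java
// Hypothetical sketch of bounded outbound buffering with a
// writability signal; not Storm or Netty code.
public class BoundedSender {
    private final int highWaterMarkBytes;
    private int bufferedBytes = 0;

    public BoundedSender(int highWaterMarkBytes) {
        this.highWaterMarkBytes = highWaterMarkBytes;
    }

    // Analogous to Netty's Channel.isWritable(): false once the
    // outbound buffer crosses the high-water mark.
    public boolean isWritable() {
        return bufferedBytes < highWaterMarkBytes;
    }

    // Refuses the message instead of buffering without bound,
    // pushing backpressure to the caller.
    public boolean trySend(byte[] msg) {
        if (!isWritable()) {
            return false;
        }
        bufferedBytes += msg.length;
        return true;
    }

    // Simulates the network draining some bytes off the buffer.
    public void drained(int bytes) {
        bufferedBytes = Math.max(0, bufferedBytes - bytes);
    }
}
```

The key point is that `trySend` returning false is the backpressure signal the current client lacks; today the equivalent write always "succeeds" by queueing in memory.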
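Item 2 is classic head-of-line blocking, which a tiny simulation (hypothetical code, not Storm's transfer path) makes obvious: one transfer queue drained in order by one thread means that as soon as the send to any one destination blocks, messages bound for perfectly healthy destinations stall behind it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.function.Predicate;

// Hypothetical simulation of head-of-line blocking on a single
// worker transfer queue drained by a single thread.
public class TransferSim {
    public static class Msg {
        final String dest;
        final String payload;
        public Msg(String dest, String payload) {
            this.dest = dest;
            this.payload = payload;
        }
    }

    // canSend(dest) models whether that destination's client is
    // currently writable. Returns the payloads that made it out
    // before the drain thread hit a blocked destination.
    public static List<String> drain(Queue<Msg> transferQueue,
                                     Predicate<String> canSend) {
        List<String> sent = new ArrayList<>();
        while (!transferQueue.isEmpty()) {
            Msg m = transferQueue.peek();
            if (!canSend.test(m.dest)) {
                break; // the single drain thread is now stuck here
            }
            transferQueue.poll();
            sent.add(m.payload);
        }
        return sent;
    }
}
```

With per-destination queues (or a drain loop that can skip blocked destinations), the messages behind the stuck one would still flow; with today's single queue they cannot.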
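The routing-around behavior item 3 asks for can be sketched in a few lines. This is a hypothetical chooser, not Storm's actual `LoadAwareShuffleGrouping`: given reported load per downstream task, pick the least-loaded one so a choked instance (load near 1.0) stops receiving traffic as long as alternatives exist.

```java
// Hypothetical load-aware task chooser; not Storm's implementation.
public class LoadAwareChooser {
    // loads[i] in [0, 1]: fraction of downstream task i's input
    // queue currently in use. Returns the index of the least-loaded
    // task, routing around backed-up instances.
    public static int chooseTask(double[] loads) {
        int best = 0;
        for (int i = 1; i < loads.length; i++) {
            if (loads[i] < loads[best]) {
                best = i;
            }
        }
        return best;
    }
}
```

A real implementation would mix in randomness to avoid herding every upstream task onto the same target, which may be part of why the 20% throughput cost is worth profiling.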
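For item 4, one mitigation (a sketch under my own naming, not existing Storm code) is to make the timer thread incapable of blocking: use `offer()` with a drop counter instead of a blocking insert, so a full queue costs a lost tick surfaced in metrics rather than stalling every consumer of ticks at once.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical non-blocking tick emitter: the single per-worker
// timer thread must never wedge on a full consumer queue.
public class NonBlockingTicker {
    private final BlockingQueue<String> queue;
    private int droppedTicks = 0;

    public NonBlockingTicker(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    // Called from the timer thread; never blocks. A dropped tick is
    // counted so it can be reported, instead of silently hanging
    // tuple timeouts and metrics as described above.
    public void tick(String tuple) {
        if (!queue.offer(tuple)) {
            droppedTicks++;
        }
    }

    public int droppedTicks() { return droppedTicks; }
    public BlockingQueue<String> queue() { return queue; }
}
```

Dropping a periodic tick is usually harmless (the next one arrives shortly); blocking the shared timer thread is not, as the debugging stories above show.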
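If item 5 goes the deprecation route, a submit-time check could at least warn before a cyclic topology deadlocks under bounded queues. A standard DFS back-edge check over the component graph is enough; this helper is hypothetical, not an existing Storm API.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical submit-time cycle detector for a topology's
// component graph (component name -> downstream component names).
public class CycleCheck {
    public static boolean hasCycle(Map<String, List<String>> edges) {
        Set<String> done = new HashSet<>();
        Set<String> inStack = new HashSet<>();
        for (String node : edges.keySet()) {
            if (dfs(node, edges, done, inStack)) {
                return true;
            }
        }
        return false;
    }

    private static boolean dfs(String node, Map<String, List<String>> edges,
                               Set<String> done, Set<String> inStack) {
        if (inStack.contains(node)) {
            return true; // back edge: this topology is not a DAG
        }
        if (done.contains(node)) {
            return false;
        }
        inStack.add(node);
        for (String next : edges.getOrDefault(node, List.of())) {
            if (dfs(next, edges, done, inStack)) {
                return true;
            }
        }
        inStack.remove(node);
        done.add(node);
        return false;
    }
}
```

Nimbus could run this when a topology is submitted and log the clear deprecation warning argued for above.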