[ https://issues.apache.org/jira/browse/STORM-406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Rick Kellogg updated STORM-406:
-------------------------------
    Labels:   (was: b)

> Trident topologies getting stuck when using Netty transport (reproducible)
> --------------------------------------------------------------------------
>
>                 Key: STORM-406
>                 URL: https://issues.apache.org/jira/browse/STORM-406
>             Project: Apache Storm
>          Issue Type: Bug
>          Components: storm-core
>    Affects Versions: 0.9.2-incubating, 0.9.1-incubating, 0.9.0.1
>         Environment: Linux, OpenJDK 7
>            Reporter: Danijel Schiavuzzi
>            Assignee: Kishor Patil
>            Priority: Blocker
>             Fix For: 0.9.3
>
>
> When using the new, default Netty transport, Trident topologies sometimes
> get stuck, while under ZeroMQ everything works fine.
>
> I can reliably reproduce this issue by killing a Storm worker on a running
> Trident topology. If the worker is re-spawned on the same slot (port), the
> topology stops processing. If the worker is re-spawned on a different port,
> topology processing continues normally.
>
> The Storm cluster configuration is pretty standard: there are two
> Supervisor nodes, one of which also runs Nimbus, UI and DRPC. I have four
> slots per Supervisor, and run my test topology with setNumWorkers set to 8
> so that it occupies all eight slots across the cluster. Killing a worker in
> this configuration always re-spawns the worker on the same node and slot
> (port), thus causing the topology to stop processing. This is 100%
> reproducible on a few Storm clusters of mine, across multiple Storm
> versions (0.9.0.1, 0.9.1, 0.9.2).
>
> I have reproduced this with multiple Trident topologies, the simplest of
> which is the TridentWordCount topology from storm-starter. I've modified it
> slightly, adding a Trident filter that logs the tuple throughput:
> https://github.com/dschiavu/storm-trident-stuck-topology
>
> Non-transactional Trident topologies just silently stop processing.
> Transactional topologies continuously retry the batches: the spout re-emits
> them, but the next bolts in the chain never process them, so they time out.
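For readers who want to observe the stall without opening the linked repository, the following is a minimal sketch of the kind of throughput counter such a logging filter could use. It is not taken from the reporter's code: the class name `ThroughputCounter`, its methods, and the one-second interval are all hypothetical, and the Storm wiring is left out so the example runs on plain Java. In a Trident filter, `isKeep` would presumably call `record(System.currentTimeMillis())`, log any non-null result, and return `true`.

```java
// Hypothetical helper, not from the linked repository: counts tuples and
// reports a rate (tuples per second) once per reporting interval.
class ThroughputCounter {
    private final long intervalMillis; // minimum time between two reports
    private long windowStart;          // timestamp at which the current window began
    private long count;                // tuples seen in the current window

    ThroughputCounter(long intervalMillis, long startMillis) {
        this.intervalMillis = intervalMillis;
        this.windowStart = startMillis;
        this.count = 0;
    }

    // Records one tuple seen at nowMillis. Returns the rate for the window
    // that just ended if at least intervalMillis has elapsed, otherwise null.
    Double record(long nowMillis) {
        count++;
        long elapsed = nowMillis - windowStart;
        if (elapsed < intervalMillis) {
            return null;
        }
        double rate = count * 1000.0 / elapsed;
        windowStart = nowMillis;
        count = 0;
        return rate;
    }

    public static void main(String[] args) {
        // Simulated timestamps: one tuple every 250 ms for two seconds.
        ThroughputCounter counter = new ThroughputCounter(1000L, 0L);
        for (long t = 250; t <= 2000; t += 250) {
            Double rate = counter.record(t);
            if (rate != null) {
                System.out.println("throughput: " + rate + " tuples/s");
            }
        }
    }
}
```

Because `record` only runs when a tuple arrives, a stuck topology would show up as the log lines stopping altogether rather than as a reported rate of zero.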