[ https://issues.apache.org/jira/browse/STORM-329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14356730#comment-14356730 ]
ASF GitHub Bot commented on STORM-329:
--------------------------------------

GitHub user miguno opened a pull request:

    https://github.com/apache/storm/pull/463

    Client (Netty): improving logging to help troubleshooting connection woes

    These logging statements are not on a hot path, and `INFO` is Storm's default log level. The new messages fill a gap and make it easier to understand connection woes in a Storm cluster (cf. our work on STORM-329).

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/miguno/storm improve-closeChannelAndReconnect-logging

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/storm/pull/463.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #463

----
commit 50aac686de68f783c5767a281922f05ce05e4127
Author: Michael G. Noll <mn...@verisign.com>
Date:   2015-03-11T11:07:11Z

    Add logging to closeChannelAndReconnect() to help with connection troubleshooting

commit ea2a61d91310835a78354052889e37761a68f5cf
Author: Michael G. Noll <mn...@verisign.com>
Date:   2015-03-11T11:14:18Z

    Add logging to connect() for corner cases (e.g. client is being closed)

----

> Fix cascading Storm failure by improving reconnection strategy and buffering messages
> -------------------------------------------------------------------------------------
>
>                 Key: STORM-329
>                 URL: https://issues.apache.org/jira/browse/STORM-329
>             Project: Apache Storm
>          Issue Type: Improvement
>    Affects Versions: 0.9.2-incubating, 0.9.3
>            Reporter: Sean Zhong
>            Assignee: Michael Noll
>              Labels: Netty
>             Fix For: 0.10.0, 0.9.4
>
>         Attachments: storm-329.patch, worker-kill-recover3.jpg
>
>
> _Note: The original title of this ticket was: "Add Option to Config Message handling strategy when connection timeout"._
> This is to address a [concern brought up|https://github.com/apache/incubator-storm/pull/103#issuecomment-43632986] during the work on STORM-297:
> {quote}
> [~revans2] wrote: Your logic makes sense to me on why these calls are blocking. My biggest concern around the blocking is the case of a worker crashing. If a single worker crashes, this can block the entire topology from executing until that worker comes back up. In some cases I can see that being something that you would want. In other cases I can see speed being the primary concern, and some users would like to get partial data fast rather than accurate data later.
> Could we make it configurable in a follow-up JIRA, where we can have a max limit to the buffering that is allowed before we block, or throw data away (which is what zeromq does)?
> {quote}
> If a worker crashes suddenly, how should we handle the messages that were supposed to be delivered to it?
> 1. Should we buffer all messages indefinitely?
> 2. Should we block message sending until the connection is restored?
> 3. Should we configure a buffer limit, buffer messages up to that limit, and then block?
> 4. Should we neither block nor buffer too much, but instead drop the messages and rely on Storm's built-in failover mechanism?
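The bounded-buffer options above (3 and 4) can be sketched with a `java.util.concurrent.LinkedBlockingQueue`, whose `put` blocks when the capacity is reached while `offer` fails fast. This is only a minimal illustration of the trade-off under discussion, not Storm's actual Netty `Client` implementation; all names here (`PendingMessages`, `enqueueBlocking`, `tryEnqueue`) are hypothetical.

```java
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch of options 3 and 4, NOT Storm's actual Client code:
// a bounded buffer that either blocks or drops once the limit is reached.
public class PendingMessages {
    private final LinkedBlockingQueue<byte[]> buffer;
    private long dropped = 0;

    public PendingMessages(int capacity) {
        // Option 3: a hard buffer limit. A capacity of Integer.MAX_VALUE
        // would approximate option 1 (buffer "infinitely").
        this.buffer = new LinkedBlockingQueue<>(capacity);
    }

    // Option 3: buffer until the limit is met, then block the sender
    // until the reconnect logic drains the queue.
    public void enqueueBlocking(byte[] msg) throws InterruptedException {
        buffer.put(msg);
    }

    // Option 4: never block; drop the message when the buffer is full,
    // relying on Storm's ack/fail replay to redeliver the lost tuple.
    public boolean tryEnqueue(byte[] msg) {
        boolean accepted = buffer.offer(msg);
        if (!accepted) {
            dropped++;
        }
        return accepted;
    }

    public int size() { return buffer.size(); }
    public long droppedCount() { return dropped; }
}
```

Option 2 is the degenerate case of option 3 with a capacity of zero messages buffered before blocking; the discussion above converged on making the limit configurable rather than hard-coding one policy.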
-- This message was sent by Atlassian JIRA (v6.3.4#6332)