[GitHub] storm pull request: [STORM-763] nimbus reassigned worker A to anot...
Github user miguno commented on the pull request: https://github.com/apache/storm/pull/568#issuecomment-108339300 And like Bobby already said: many thanks for your continued work and improvements, @eshioji! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] storm pull request: [STORM-763] nimbus reassigned worker A to anot...
Github user miguno commented on the pull request: https://github.com/apache/storm/pull/568#issuecomment-108338846 @eshioji wrote:

> Also I have a question, maybe @miguno could help; I've removed the graceful shutdown which tries to flush all pending messages before the Client is closed, mostly to make it easier for me to fix the deadlock. However now I'm worried I might have removed something significant. Do you think I should bring it back?

IIRC the graceful shutdown was primarily (but not exclusively) for non-acking topologies to minimize any potential data loss. I think it would be preferable if we'd continue to allow for graceful shutdowns, if possible.
[GitHub] storm pull request: STORM-512 KafkaBolt doesn't handle ticks prope...
Github user miguno commented on the pull request: https://github.com/apache/storm/pull/275#issuecomment-93694857 I created [STORM-786](https://issues.apache.org/jira/browse/STORM-786) to track the tick tuple acking. Pull request is already sent.
[GitHub] storm pull request: STORM-786: KafkaBolt should ack tick tuples
GitHub user miguno opened a pull request: https://github.com/apache/storm/pull/522 STORM-786: KafkaBolt should ack tick tuples [STORM-512](https://issues.apache.org/jira/browse/STORM-512) (KafkaBolt doesn't handle ticks properly) adds special-casing of tick tuples. What is missing in the patch is that the input tuple, when it is a tick tuple, should be properly acked like normal tuples. You can merge this pull request into a Git repository by running: $ git pull https://github.com/miguno/storm STORM-786 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/storm/pull/522.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #522 commit dfe3fbe144ed6398c2d2cc6210454d4c561e042e Author: Michael G. Noll mich...@michael-noll.com Date: 2015-04-16T09:44:40Z STORM-786: KafkaBolt should ack tick tuples
[GitHub] storm pull request: STORM-512 KafkaBolt doesn't handle ticks prope...
Github user miguno commented on the pull request: https://github.com/apache/storm/pull/275#issuecomment-93699373 My understanding is yes, you do need to ack tick tuples. See @nathanmarz's [comment](https://groups.google.com/forum/#!topic/storm-user/ZEJabXT5nQA) from some time back:

> Do I need to ack tick tuples? Since they're system generated I think yes but I'm not sure.

Nathan's answer:

> You should ack all tuples.
[GitHub] storm pull request: STORM-512 KafkaBolt doesn't handle ticks prope...
Github user miguno commented on the pull request: https://github.com/apache/storm/pull/275#issuecomment-93690832 Since KafkaBolt is extending BaseRichBolt, I think we should perform a `collector.ack(input)` before `return`. Tick tuples must be acked like normal tuples.

```
public class KafkaBolt<K, V> extends BaseRichBolt {

    public void execute(Tuple input) {
        if (TupleUtils.isTick(input)) {
            return; // Do not try to send ticks to Kafka
        }
        ...
    }
}
```
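To make the suggested ordering concrete (ack first, then return), here is a minimal, self-contained sketch. Note that `Tuple`, `OutputCollector`, `isTick`, and `KafkaBoltSketch` below are simplified stand-ins for illustration only, not the real Storm API:

```java
import java.util.ArrayList;
import java.util.List;

public class TickAckSketch {

    // Stand-in for Storm's Tuple (real ticks are identified by stream id).
    static class Tuple {
        final boolean tick;
        Tuple(boolean tick) { this.tick = tick; }
    }

    // Stand-in for Storm's OutputCollector; records acked tuples.
    static class OutputCollector {
        final List<Tuple> acked = new ArrayList<Tuple>();
        void ack(Tuple t) { acked.add(t); }
    }

    static boolean isTick(Tuple t) { return t.tick; }

    // Sketch of the proposed execute() behavior: ack the tick before returning.
    static class KafkaBoltSketch {
        final OutputCollector collector;
        KafkaBoltSketch(OutputCollector collector) { this.collector = collector; }

        void execute(Tuple input) {
            if (isTick(input)) {
                collector.ack(input); // ack ticks like normal tuples
                return;               // do not try to send ticks to Kafka
            }
            // ... send the message to Kafka, then ack ...
            collector.ack(input);
        }
    }

    public static void main(String[] args) {
        OutputCollector collector = new OutputCollector();
        new KafkaBoltSketch(collector).execute(new Tuple(true));
        if (collector.acked.size() != 1) {
            throw new AssertionError("tick tuple was not acked");
        }
    }
}
```

Without the ack, tick tuples would be tracked by the ackers until they time out, which is why acking them like any other tuple matters.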
[GitHub] storm pull request: STORM-329 : buffer message in client and recon...
Github user miguno commented on the pull request: https://github.com/apache/storm/pull/268#issuecomment-82217610 @clockfly We may also want to check why we don't have the permissions to close this PR ourselves.
[GitHub] storm pull request: STORM-329 : buffer message in client and recon...
Github user miguno commented on the pull request: https://github.com/apache/storm/pull/268#issuecomment-81736347 @d2r And btw I don't have permissions on GitHub to close this PR.
[GitHub] storm pull request: STORM-707: Client (Netty): improve logging to ...
Github user miguno commented on the pull request: https://github.com/apache/storm/pull/463#issuecomment-81739459 I created [STORM-707](https://issues.apache.org/jira/browse/STORM-707) to track this.
[GitHub] storm pull request: Client (Netty): improving logging to help trou...
GitHub user miguno opened a pull request: https://github.com/apache/storm/pull/463 Client (Netty): improving logging to help troubleshooting connection woes These logging statements are not on a hot path, and `INFO` is the default log level of Storm. This logging fills a gap and facilitates understanding connection woes in a Storm cluster (cf. our work on STORM-329). You can merge this pull request into a Git repository by running: $ git pull https://github.com/miguno/storm improve-closeChannelAndReconnect-logging Alternatively you can review and apply these changes as the patch at: https://github.com/apache/storm/pull/463.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #463 commit 50aac686de68f783c5767a281922f05ce05e4127 Author: Michael G. Noll mn...@verisign.com Date: 2015-03-11T11:07:11Z Add logging to closeChannelAndReconnect() to help with connection troubleshooting commit ea2a61d91310835a78354052889e37761a68f5cf Author: Michael G. Noll mn...@verisign.com Date: 2015-03-11T11:14:18Z Add logging to connect() for corner cases (e.g. client is being closed)
[GitHub] storm pull request: [STORM-695] storm CLI tool reports zero exit c...
Github user miguno commented on a diff in the pull request: https://github.com/apache/storm/pull/456#discussion_r25774104

```
--- Diff: bin/storm ---
@@ -183,7 +209,16 @@ def exec_storm_class(klass, jvmtype="-server", jvmopts=[], extrajars=[], args=[]
         os.spawnvp(os.P_WAIT, JAVA_CMD, all_args)
     else:  # handling whitespaces in JAVA_CMD
-        sub.call(all_args)
+        try:
+            ret = sub.check_output(all_args, stderr=sub.STDOUT)
+            print(ret)
+        except sub.CalledProcessError as e:
+            # Handling exception with stdout and stderr
+            if e.returncode != 0:
```

--- End diff --

I think this check is redundant, given what the Python docs say about `subprocess.check_output()`: "[...] If the return code was non-zero it raises a CalledProcessError."
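For reference, a small runnable sketch of why the `if e.returncode != 0` check is redundant: `subprocess.check_output` raises `CalledProcessError` only when the exit code is non-zero, so inside the `except` handler the return code is never 0. The child commands below are illustrative stand-ins:

```python
import subprocess as sub
import sys

# A child process that exits non-zero: check_output raises CalledProcessError.
try:
    sub.check_output([sys.executable, "-c", "import sys; sys.exit(3)"],
                     stderr=sub.STDOUT)
    raised = False
    returncode_seen = 0
except sub.CalledProcessError as e:
    raised = True
    returncode_seen = e.returncode  # guaranteed non-zero inside this handler

# A child process that exits zero: no exception is raised at all.
out = sub.check_output([sys.executable, "-c", "print('ok')"],
                       stderr=sub.STDOUT)
```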
[GitHub] storm pull request: [STORM-695] storm CLI tool reports zero exit c...
Github user miguno commented on the pull request: https://github.com/apache/storm/pull/456#issuecomment-77159580 Many thanks for the PR, Kai! However, if the topology is submitted with `storm shell`, we cannot track exceptions this way because `os.spawnvp` seems to return no stdout and stderr. See [my comment](https://issues.apache.org/jira/browse/STORM-695?focusedCommentId=14340333&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14340333) in the STORM-695 JIRA ticket. `os.spawnvp` returns the exit code of the process: ```python # return code can be 0 or non-zero, i.e. there will be no exception thrown like for sub.check_output return_code = os.spawnvp(os.P_WAIT, JAVA_CMD, all_args) ``` However the problem is that Storm is currently not returning any non-zero return codes, even in the face of errors. I haven't had the chance to test-drive your code yet, but the code in the PR might not work for the same reason: ```python try: ret = sub.check_output(all_args, stderr=sub.STDOUT) print(ret) except sub.CalledProcessError as e: ``` If I understand the Python docs correctly, the `CalledProcessError` will only be thrown if the process returns with a non-zero exit code, and with current Storm this won't happen -- the return code will always be zero, I think. One workaround could be to slightly adjust the code in the PR to simply grep for error messages in `STDOUT` / `STDERR` -- regardless of what the return code of the process is -- and, if there is a match, consider the process/command failed. (And please correct me if I'm wrong.)
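A minimal sketch of the grep-for-errors workaround described above. The `run_and_check` helper and the error markers are hypothetical illustrations, not part of `bin/storm`:

```python
import subprocess as sub
import sys

def run_and_check(args, error_markers=("Exception", "Error")):
    """Run a command, capture combined stdout/stderr, and consider the run
    failed if any error marker appears in the output -- regardless of the
    process's exit code. Hypothetical helper for illustration only."""
    proc = sub.Popen(args, stdout=sub.PIPE, stderr=sub.STDOUT)
    out, _ = proc.communicate()
    text = out.decode("utf-8", errors="replace")
    ok = proc.returncode == 0 and not any(m in text for m in error_markers)
    return ok, text

# This child exits with code 0, but its output contains a stack-trace-like
# marker, so the command is still treated as failed:
ok, output = run_and_check(
    [sys.executable, "-c", "print('java.lang.Exception: boom')"])
```

This sidesteps the always-zero exit code, at the cost of depending on the wording of Storm's error output.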
[GitHub] storm pull request: [STORM-695] storm CLI tool reports zero exit c...
Github user miguno commented on a diff in the pull request: https://github.com/apache/storm/pull/456#discussion_r25774112

```
--- Diff: bin/storm ---
@@ -183,7 +209,16 @@ def exec_storm_class(klass, jvmtype="-server", jvmopts=[], extrajars=[], args=[]
         os.spawnvp(os.P_WAIT, JAVA_CMD, all_args)
     else:  # handling whitespaces in JAVA_CMD
-        sub.call(all_args)
+        try:
+            ret = sub.check_output(all_args, stderr=sub.STDOUT)
+            print(ret)
```

--- End diff --

Why the print?
[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...
Github user miguno commented on the pull request: https://github.com/apache/storm/pull/429#issuecomment-75713928 Oh, Taylor. Could you also update STORM-404 and STORM-510 as appropriate?
[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...
Github user miguno commented on the pull request: https://github.com/apache/storm/pull/429#issuecomment-75634123 Phew. :-) Thanks for merging, Taylor! On 23.02.2015, at 22:06, P. Taylor Goetz notificati...@github.com wrote: Disregard last message. It was a merge mistake (picked right when I should have picked left). All tests are passing now.
[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...
Github user miguno commented on the pull request: https://github.com/apache/storm/pull/429#issuecomment-74927415 @nathanmarz Thanks for the detailed feedback on the max-retries issue. As Bobby suggested, would you mind if we decouple the work on max-retries (tracked at STORM-677) from this pull request / STORM-329? The max-retries issue has been a problem in Storm for a while, since before this pull request. Decoupling would also help us with backtracking any future issues to either ticket. Right now, I feel there's almost too much meat in this pull request already (as we kinda covered STORM-329, STORM-404, and STORM-510 in one big swing). Would that be ok for you?
[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...
Github user miguno commented on the pull request: https://github.com/apache/storm/pull/429#issuecomment-74894542 FYI: I created [STORM-677: Maximum retries strategy may cause data loss](https://issues.apache.org/jira/browse/STORM-677) to address the issue that Bobby brought up in this discussion.
[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...
Github user miguno commented on the pull request: https://github.com/apache/storm/pull/429#issuecomment-74895041 PS: We may also want to update the original [STORM-329](https://issues.apache.org/jira/browse/STORM-329) ticket description to reflect the changes in this PR.
[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...
Github user miguno commented on the pull request: https://github.com/apache/storm/pull/429#issuecomment-74898690 Thanks, Taylor! Let me know if I can help with sorting out the test failures. Also regarding JIRA: I forgot to mention that it looks like we need to update STORM-404 and STORM-510 as well, since STORM-329 has been said to cover those issues, too. It looks to me like this is actually the case, but I'd prefer another pair of eyes to be sure we're not mistakenly closing two related tickets.
[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...
Github user miguno commented on the pull request: https://github.com/apache/storm/pull/429#issuecomment-74702215 And FWIW, with the code in this PR the total test suite takes about 5 minutes to complete.

```
$ mvn clean install
...
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] Storm ............................................. SUCCESS [  1.696 s]
[INFO] maven-shade-clojure-transformer ................... SUCCESS [  1.930 s]
[INFO] storm-maven-plugins ............................... SUCCESS [  2.033 s]
[INFO] Storm Core ........................................ SUCCESS [03:56 min]
[INFO] storm-starter ..................................... SUCCESS [  6.616 s]
[INFO] storm-kafka ....................................... SUCCESS [ 48.469 s]
[INFO] storm-hdfs ........................................ SUCCESS [  1.875 s]
[INFO] storm-hbase ....................................... SUCCESS [  2.271 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 05:02 min
[INFO] Finished at: 2015-02-17T17:49:19+01:00
[INFO] Final Memory: 67M/282M
[INFO] ------------------------------------------------------------------------
```
[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...
Github user miguno commented on the pull request: https://github.com/apache/storm/pull/429#issuecomment-74701138

> I am seeing a lot of tests timing out with this change. Has anyone else seen this?

Hmm. All the tests are passing for me (and they have been for a while). Do you have any pointers?
[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...
Github user miguno commented on a diff in the pull request: https://github.com/apache/storm/pull/429#discussion_r24738502

```
--- Diff: storm-core/src/jvm/backtype/storm/messaging/netty/Client.java ---
@@ -42,344 +42,577 @@
 import org.slf4j.LoggerFactory;

 import backtype.storm.Config;
+import backtype.storm.messaging.ConnectionWithStatus;
 import backtype.storm.metric.api.IStatefulObject;
-import backtype.storm.messaging.IConnection;
 import backtype.storm.messaging.TaskMessage;
 import backtype.storm.utils.StormBoundedExponentialBackoffRetry;
 import backtype.storm.utils.Utils;

-public class Client implements IConnection, IStatefulObject{
+/**
+ * A Netty client for sending task messages to a remote destination (Netty server).
+ *
+ * Implementation details:
+ *
+ * - Sending messages, i.e. writing to the channel, is performed asynchronously.
+ * - Messages are sent in batches to optimize for network throughput at the expense of network latency. The message
+ *   batch size is configurable.
+ * - Connecting and reconnecting are performed asynchronously.
+ *   - Note: The current implementation drops any messages that are being enqueued for sending if the connection to
+ *     the remote destination is currently unavailable.
+ * - A background flusher thread is run in the background. It will, at fixed intervals, check for any pending messages
+ *   (i.e. messages buffered in memory) and flush them to the remote destination iff background flushing is currently
+ *   enabled.
+ */
+public class Client extends ConnectionWithStatus implements IStatefulObject {
+
     private static final Logger LOG = LoggerFactory.getLogger(Client.class);
     private static final String PREFIX = "Netty-Client-";
-    private final int max_retries;
-    private final int base_sleep_ms;
-    private final int max_sleep_ms;
+    private static final long NO_DELAY_MS = 0L;
+    private static final long MINIMUM_INITIAL_DELAY_MS = 3L;
+    private static final long PENDING_MESSAGES_FLUSH_TIMEOUT_MS = 60L;
+    private static final long PENDING_MESSAGES_FLUSH_INTERVAL_MS = 1000L;
+    private static final long DISTANT_FUTURE_TIME_MS = Long.MAX_VALUE;
+
     private final StormBoundedExponentialBackoffRetry retryPolicy;
-    private AtomicReference<Channel> channelRef;
     private final ClientBootstrap bootstrap;
-    private InetSocketAddress remote_addr;
-
-    private AtomicInteger totalReconnects;
-    private AtomicInteger messagesSent;
-    private AtomicInteger messagesLostReconnect;
-    private final Random random = new Random();
-    private final ChannelFactory factory;
-    private final int buffer_size;
-    private boolean closing;
-
-    private int messageBatchSize;
-
-    private AtomicLong pendings;
-
-    Map storm_conf;
+    private final InetSocketAddress dstAddress;
+    protected final String dstAddressPrefixedName;
+
+    /**
+     * The channel used for all write operations from this client to the remote destination.
+     */
+    private final AtomicReference<Channel> channelRef = new AtomicReference<Channel>(null);
+
+    /**
+     * Maximum number of reconnection attempts we will perform after a disconnect before giving up.
+     */
+    private final int maxReconnectionAttempts;
+
+    /**
+     * Total number of connection attempts.
+     */
+    private final AtomicInteger totalConnectionAttempts = new AtomicInteger(0);
+
+    /**
+     * Number of connection attempts since the last disconnect.
+     */
+    private final AtomicInteger connectionAttempts = new AtomicInteger(0);
+
+    /**
+     * Number of messages successfully sent to the remote destination.
+     */
+    private final AtomicInteger messagesSent = new AtomicInteger(0);
+
+    /**
+     * Number of messages that could not be sent to the remote destination.
+     */
+    private final AtomicInteger messagesLost = new AtomicInteger(0);
+
+    /**
+     * Number of messages buffered in memory.
+     */
+    private final AtomicLong pendingMessages = new AtomicLong(0);
+
+    /**
+     * This flag is set to true if and only if a client instance is being closed.
+     */
+    private volatile boolean closing = false;
+
+    /**
+     * When set to true, then the background flusher thread will flush any pending messages on its next run.
+     */
+    private final AtomicBoolean backgroundFlushingEnabled = new AtomicBoolean(false);
+
+    /**
+     * The absolute time (in ms) when the next background flush should be performed.
+     *
+     * Note
```
[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...
Github user miguno commented on the pull request: https://github.com/apache/storm/pull/429#issuecomment-74147926 If you need at-least-once processing you must use an acking topology, which will allow Storm to replay lost messages. If instead you go with an unacking topology (= no guaranteed message processing) then you may run into data loss. There are pros and cons for each variant, and e.g. in our case we use both depending on the use case. Also: The semantics described above have been in Storm right from the beginning. None of these have been changed by this pull request. On 12.02.2015, at 20:01, Daniel Schonfeld notificati...@github.com wrote: Doesn't dropping the messages coming from a non ack/fail caring spout negate the 'at least once' attempt of storm? I mean doesn't that kinda force you to make all your spouts ack/fail aware where before you could have gotten away without it? In other words: There is a chance that if the worker that died is the one containing the spout and if the first bolt is located on another worker, that technically at-least-once wasn't tried but rather fell to the floor right away.
[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...
Github user miguno commented on the pull request: https://github.com/apache/storm/pull/429#issuecomment-74032874 This patch allows a worker to properly detect that the connection to a peer becomes unavailable -- for whatever reason (the remote worker is dead or restarting, there was a network glitch, etc). Also, any reconnection attempts are now async so that reconnecting will not block other activities of the worker (such as sending messages to other workers it is still connected to). So to your question: this patch includes the case of the worker that remained alive trying to (re)connect to a dead peer. Does that help?
[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...
GitHub user miguno opened a pull request: https://github.com/apache/storm/pull/428 STORM-329: fix cascading Storm failure by improving reconnection strategy and buffering messages This is an improved version of the original pull request discussed at https://github.com/apache/storm/pull/268. Please refer to the discussion in the link above. **Note**: Please give attribution to @tedxia when merging the pull request as he did a lot (most?) of the work in this pull request. The changes of this pull request include: - Most importantly, we fix a bug in Storm that may cause a cascading failure in a Storm cluster, to the point where the whole cluster becomes unusable. This is achieved by the work described in the next bullet points. - We refactor and improve the Netty messaging backend, notably the client. - During the initial startup of a topology, Storm will now wait until worker (Netty) connections are ready for operation. See the [original discussion thread](https://github.com/apache/storm/pull/268) for the detailed explanation and justification of this change. @clockfly, @tedxia: Please add any further comments to STORM-329 to this pull request, if possible. 
You can merge this pull request into a Git repository by running: $ git pull https://github.com/miguno/storm 0.10.0-SNAPSHOT-STORM-329 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/storm/pull/428.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #428 commit 8c52a5b518021a6beff372acbeb66a963a1d4f74 Author: xiajun xia...@xiaomi.com Date: 2014-09-24T12:39:18Z STORM-329 : buffer message in client and reconnect remote server async commit 7cc8c8b1b59d415f1fa54e081127336fcdaeb706 Author: xiajun xia...@xiaomi.com Date: 2014-09-26T02:43:55Z STORM-329: fix continue flush after client had been closed commit 9826bc9ef7aee8c83e90d56f52ac70ac7165d769 Author: xiajun xia...@xiaomi.com Date: 2014-09-30T03:25:08Z Client not clean timeout TaskMessage commit 978c969fdb4b5904b6a87c100fbd80fe26bf39cf Author: Sean Zhong clock...@gmail.com Date: 2014-10-20T01:08:43Z Merge remote-tracking branch 'upstream/master' commit 44f8260bbf489c6a2741fb4d8f9196ea6ddb51cc Author: Sean Zhong clock...@gmail.com Date: 2014-10-20T01:21:25Z STORM-404, STORM-510, STORM-329: 1. break the reconnection to target if worker is informed that the target is the down, we that we avoid RuntimeException when reconnection failed. 2. When worker get started, need to make sure all target workers are alive before launching spouts 3. When target worker is down, all messages sending to the target worker will be dropped. 
commit dea5fbe35c4d9b18a89dae320f9fc985f25bd31a Author: Sean Zhong clock...@gmail.com Date: 2014-10-20T01:24:45Z STORM-329: fix comment grammar commit 16de9f3321827624865a4450e93bd53efb75ed93 Author: Sean Zhong clock...@gmail.com Date: 2014-10-28T03:31:23Z test commit e8dcf9155c85d3541c9352faf9a0651614b93eb6 Author: Sean Zhong clock...@gmail.com Date: 2014-10-29T01:56:19Z STORM-329: avoid deadlock commit 22e7014dedfd580de9dd1d6b2083c8fb3d77d406 Author: Sean Zhong clock...@gmail.com Date: 2014-10-29T01:59:09Z Revert test This reverts commit 16de9f3321827624865a4450e93bd53efb75ed93. commit ddef6667cdc6de9aebf0d4006ad9e7df2bfbb3bb Author: Sean Zhong clock...@gmail.com Date: 2014-10-29T08:14:34Z Merge remote-tracking branch 'upstream/master' commit baf3c628db3899def89ee92c752524d041bc8b40 Author: Sean Zhong clock...@gmail.com Date: 2014-10-30T10:17:07Z STORM-329: fix UT. Add a new flag in worker data worker-active-flag, Wait connections to be ready asyncly. commit e1c463f5681dbf6a66c868d467aca14064da1e9b Author: Sean Zhong clock...@gmail.com Date: 2014-10-30T10:22:22Z STORM-329: fix comments. 
Add a description about storm.messaging.netty.max_retries, that the reconnection period should also be bigger than storm.zookeeper.session.timeout, so that the reconnection can be aborted(when target worker is dead) before the reconnection failed and throw RunTimeException commit 2d3fad121481da40258af27e6d7fbcb148365e76 Author: Sean Zhong clock...@gmail.com Date: 2014-10-30T10:22:22Z STORM-329: fix a integration issue commit 60f04f9e397cf49e4fba6fe6a2f0bfb23d5a8605 Author: Sean Zhong clock...@gmail.com Date: 2014-10-30T10:28:17Z Merge branch 'master' of https://github.com/tedxia/incubator-storm commit 41aafbecac2cf3295255c7dc9b299b8c0c555390 Author: xiajun xia...@xiaomi.com Date: 2014-11-18T06:53:14Z Merge remote-tracking branch 'remotes/apache-storm/0.9.3-branch' into ted-master Conflicts: storm-core/src/jvm/backtype/storm/messaging/netty/Client.java commit 2c39866cf8bbca136d3b88f796b9f847b282fdd7 Author: xiajun xia
[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...
Github user miguno commented on the pull request: https://github.com/apache/storm/pull/428#issuecomment-73946399 I did exactly this in #429.
[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...
Github user miguno closed the pull request at: https://github.com/apache/storm/pull/428
[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...
Github user miguno commented on the pull request: https://github.com/apache/storm/pull/428#issuecomment-73928303 I tried rebasing (also to fix the incorrect commit message that starts with STORM-32*7*) but gave up after several failed attempts. Feel free to give it a try though -- maybe your git-fu is better than mine. :-)
[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...
GitHub user miguno opened a pull request: https://github.com/apache/storm/pull/429

STORM-329: fix cascading Storm failure by improving reconnection strategy and buffering messages

**This PR contains the same code as https://github.com/apache/storm/pull/428 but as a single commit for a cleaner commit history of our Storm repo.**

This is an improved version of the original pull request discussed at https://github.com/apache/storm/pull/268. Please refer to the discussion in the link above.

**Note**: Please give attribution to @tedxia when merging the pull request as he did a lot (most?) of the work in this pull request.

The changes of this pull request include:

- Most importantly, we fix a bug in Storm that may cause a cascading failure in a Storm cluster, to the point where the whole cluster becomes unusable. This is achieved by the work described in the next bullet points.
- We refactor and improve the Netty messaging backend, notably the client.
- During the initial startup of a topology, Storm will now wait until worker (Netty) connections are ready for operation. See the [original discussion thread](https://github.com/apache/storm/pull/268) for the detailed explanation and justification of this change.

@clockfly, @tedxia: Please add any further comments on STORM-329 to this pull request, if possible.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/miguno/storm 0.10.0-SNAPSHOT-STORM-329-diff

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/storm/pull/429.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #429

commit 205eaf4ebe28ab5550a842ea9aabd23b41678743
Author: Michael G. Noll mn...@verisign.com
Date: 2015-02-11T18:55:53Z

    STORM-329: fix cascading Storm failure by improving reconnection strategy and buffering messages

    Thanks to @tedxia for the initial work on this patch, which covered a lot if not most of the work!
[GitHub] storm pull request: STORM-329 : buffer message in client and recon...
Github user miguno commented on the pull request: https://github.com/apache/storm/pull/268#issuecomment-73667341

Many thanks for your review, Sean. I addressed your comments, see the new commits in https://github.com/miguno/storm/commits/0.10.0-SNAPSHOT-STORM-392-miguno-merge.

One final question: How should we address the following TODO in `send(Iterator<TaskMessage> msgs)`? I'd appreciate additional eyeballs on this. :-) It's a scenario that we may or may not have overlooked in the original code. Note how we are NOT checking for a `WRITABLE` channel while flushing (full) message batches (cf. the `while` loop), but we do check for a `WRITABLE` channel when handling any left-over messages (cf. after the `while` loop).

```java
/**
 * Enqueue task messages to be sent to the remote destination (cf. `host` and `port`).
 */
@Override
public synchronized void send(Iterator<TaskMessage> msgs) {
    // ...some code removed to shorten this code snippet...

    Channel channel = channelRef.get();
    if (!connectionEstablished(channel)) {
        // Closing the channel and reconnecting should be done before handling the messages.
        closeChannelAndReconnect(channel);
        handleMessagesWhenConnectionIsUnavailable(msgs);
        return;
    }

    // Collect messages into batches (to optimize network throughput), then flush them.
    while (msgs.hasNext()) {
        TaskMessage message = msgs.next();
        if (messageBatch == null) {
            messageBatch = new MessageBatch(messageBatchSize);
        }

        messageBatch.add(message);
        // TODO: What shall we do if the channel is not writable?
        if (messageBatch.isFull()) {
            MessageBatch toBeFlushed = messageBatch;
            flushMessages(channel, toBeFlushed);
            messageBatch = null;
        }
    }

    // Handle any remaining messages in case the last batch was not full.
    if (containsMessages(messageBatch)) {
        if (connectionEstablished(channel) && channel.isWritable()) {
            // We can write to the channel, so we flush the remaining messages immediately to minimize latency.
            pauseBackgroundFlushing();
            MessageBatch toBeFlushed = messageBatch;
            messageBatch = null;
            flushMessages(channel, toBeFlushed);
        }
        else {
            // We cannot write to the channel, which means Netty's internal write buffer is full.
            // In this case, we buffer the remaining messages and wait for the next messages to arrive.
            //
            // Background:
            // Netty 3.x maintains an internal write buffer with a high water mark for each channel (default: 64K).
            // This represents the amount of data waiting to be flushed to operating system buffers.  If the
            // outstanding data exceeds this value then the channel is set to non-writable.  When this happens, an
            // INTEREST_CHANGED channel event is triggered.  Netty sets the channel to writable again once the data
            // has been flushed to the system buffers.
            //
            // See http://stackoverflow.com/questions/14049260
            resumeBackgroundFlushing();
            nextBackgroundFlushTimeMs.set(nowMillis() + flushCheckIntervalMs);
        }
    }
}
```
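One possible answer to the TODO is to apply the same writability check to full batches inside the loop, buffering instead of flushing when the channel is saturated. The sketch below models only that decision logic, without Netty: `FakeChannel`, `Batcher`, and the string "messages" are hypothetical stand-ins, not the actual Client code.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Minimal stand-in for a Netty channel's writability flag.
class FakeChannel {
    boolean writable = true;
    boolean isWritable() { return writable; }
}

// Sketch of the batching loop with the writability check applied to FULL
// batches as well, mirroring the check the snippet only performs for the
// left-over messages after the loop.
class Batcher {
    private final int batchSize;
    private List<String> batch = new ArrayList<>();
    final List<List<String>> flushed = new ArrayList<>(); // records each flush
    final List<String> buffered = new ArrayList<>();      // records buffered messages

    Batcher(int batchSize) { this.batchSize = batchSize; }

    void send(FakeChannel channel, Iterator<String> msgs) {
        while (msgs.hasNext()) {
            batch.add(msgs.next());
            if (batch.size() >= batchSize) {
                flushOrBuffer(channel);
            }
        }
        // Left-over messages: the check the original snippet already does.
        if (!batch.isEmpty()) {
            flushOrBuffer(channel);
        }
    }

    private void flushOrBuffer(FakeChannel channel) {
        if (channel.isWritable()) {
            flushed.add(batch);       // channel can accept data: flush now
        } else {
            buffered.addAll(batch);   // write buffer full: hold messages back
        }
        batch = new ArrayList<>();
    }
}
```

With a writable channel, three messages and a batch size of two produce two flushes (one full batch plus the left-over); with a non-writable channel, all three messages end up buffered.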
[GitHub] storm pull request: STORM-329 : buffer message in client and recon...
Github user miguno commented on the pull request: https://github.com/apache/storm/pull/268#issuecomment-73671537 PS: I just noticed that I put the wrong Storm ticket into the commit https://github.com/miguno/storm/commit/8ebaaf8dbc63df3c2691e0cc3ac5102af7721ec3. The `STORM-327` prefix of the commit message should have been `STORM-329`. I will keep the current code/commits as is in order to not disrupt the ongoing discussion in this thread. Once we're happy with the code I will create a separate pull request, using the correct `STORM-329` ticket reference.
[GitHub] storm pull request: STORM-329 : buffer message in client and recon...
Github user miguno commented on the pull request: https://github.com/apache/storm/pull/268#issuecomment-72653588 @tedxia (or @clockfly): Have you experienced similar Storm behavior as [I described above](https://github.com/apache/storm/pull/268#issuecomment-72652704) in your patched production cluster?
[GitHub] storm pull request: STORM-329 : buffer message in client and recon...
Github user miguno commented on a diff in the pull request: https://github.com/apache/storm/pull/268#discussion_r24039811

--- Diff: storm-core/src/jvm/backtype/storm/messaging/netty/Client.java ---
@@ -142,6 +147,15 @@
                 public void run() {
                 }
             };
+
+            connector = new Runnable() {
+                @Override
+                public void run() {
+                    if (!closing) {
+                        connect();
+                    }
+                }
+            };
             long initialDelay = Math.min(30L * 1000, max_sleep_ms * max_retries); //max wait for 30s
--- End diff --

Either the comment or the code is incorrect. If we want to wait a maximum of 30s, then we should use `Math.max(...)`.
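For context, a tiny arithmetic check shows how the two variants under discussion differ. The sample values for `max_sleep_ms` and `max_retries` below are hypothetical, chosen only so the product falls below 30s:

```java
public class InitialDelayCheck {
    public static void main(String[] args) {
        long maxSleepMs = 1000L;  // hypothetical value of max_sleep_ms
        int maxRetries = 5;       // hypothetical value of max_retries

        // Variant from the diff: Math.min caps the delay at 30s from above,
        // but yields a much smaller delay when the retry budget is small.
        long withMin = Math.min(30L * 1000, maxSleepMs * maxRetries);

        // Variant suggested in the review comment: Math.max instead enforces
        // a delay of at least 30s.
        long withMax = Math.max(30L * 1000, maxSleepMs * maxRetries);

        System.out.println("min variant: " + withMin + " ms"); // 5000
        System.out.println("max variant: " + withMax + " ms"); // 30000
    }
}
```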