[GitHub] storm pull request: [STORM-763] nimbus reassigned worker A to anot...

2015-06-03 Thread miguno
Github user miguno commented on the pull request:

https://github.com/apache/storm/pull/568#issuecomment-108339300
  
And like Bobby already said:  many thanks for your continued work and 
improvements, @eshioji!




[GitHub] storm pull request: [STORM-763] nimbus reassigned worker A to anot...

2015-06-03 Thread miguno
Github user miguno commented on the pull request:

https://github.com/apache/storm/pull/568#issuecomment-108338846
  
> @eshioji wrote:
> Also I have a question, maybe @miguno could help; I've removed the
> graceful shutdown which tries to flush all pending messages before the Client is
> closed, mostly to make it easier for me to fix the deadlock. However now I'm
> worried I might have removed something significant. Do you think I should bring
> it back?

IIRC the graceful shutdown was primarily (but not exclusively) intended for 
non-acking topologies, to minimize potential data loss.  I think it would be 
preferable to continue allowing graceful shutdowns, if possible.
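
For reference, a minimal sketch of what such a graceful shutdown could look 
like -- the `flushPendingMessages()` and `closeChannel()` helpers are 
hypothetical, while the timeout constants mirror the ones in `Client.java`:

```java
// Sketch only: drain buffered messages before closing, bounded by a timeout
// so that shutdown cannot hang forever on an unreachable destination.
@Override
public void close() {
    closing = true;
    long deadlineMs = System.currentTimeMillis() + PENDING_MESSAGES_FLUSH_TIMEOUT_MS;
    while (pendingMessages.get() > 0 && System.currentTimeMillis() < deadlineMs) {
        flushPendingMessages();  // hypothetical helper: flush what is buffered
        Utils.sleep(PENDING_MESSAGES_FLUSH_INTERVAL_MS);
    }
    closeChannel();  // hypothetical helper: release the Netty channel
}
```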




[GitHub] storm pull request: STORM-512 KafkaBolt doesn't handle ticks prope...

2015-04-16 Thread miguno
Github user miguno commented on the pull request:

https://github.com/apache/storm/pull/275#issuecomment-93694857
  
I created [STORM-786](https://issues.apache.org/jira/browse/STORM-786) to 
track the tick tuple acking.  The pull request has already been sent.




[GitHub] storm pull request: STORM-786: KafkaBolt should ack tick tuples

2015-04-16 Thread miguno
GitHub user miguno opened a pull request:

https://github.com/apache/storm/pull/522

STORM-786: KafkaBolt should ack tick tuples

[STORM-512](https://issues.apache.org/jira/browse/STORM-512) (KafkaBolt 
doesn't handle ticks properly) adds special-casing of tick tuples. What the 
patch is missing is that a tick tuple, like any normal input tuple, should be 
properly acked.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/miguno/storm STORM-786

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/storm/pull/522.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #522


commit dfe3fbe144ed6398c2d2cc6210454d4c561e042e
Author: Michael G. Noll mich...@michael-noll.com
Date:   2015-04-16T09:44:40Z

STORM-786: KafkaBolt should ack tick tuples






[GitHub] storm pull request: STORM-512 KafkaBolt doesn't handle ticks prope...

2015-04-16 Thread miguno
Github user miguno commented on the pull request:

https://github.com/apache/storm/pull/275#issuecomment-93699373
  
My understanding is yes, you do need to ack tick tuples.  See @nathanmarz's 
[comment](https://groups.google.com/forum/#!topic/storm-user/ZEJabXT5nQA) from 
some time back:

> Do I need to ack tick tuples?  Since they're system generated I'm thinking 
> yes but I'm not sure.

> Nathan's answer: You should ack all tuples.




[GitHub] storm pull request: STORM-512 KafkaBolt doesn't handle ticks prope...

2015-04-16 Thread miguno
Github user miguno commented on the pull request:

https://github.com/apache/storm/pull/275#issuecomment-93690832
  
Since KafkaBolt extends BaseRichBolt, I think we should perform a 
`collector.ack(input)` before `return`.  Tick tuples must be acked like 
normal tuples.

```java
public class KafkaBolt<K, V> extends BaseRichBolt {

    public void execute(Tuple input) {
        if (TupleUtils.isTick(input)) {
            collector.ack(input); // ack tick tuples like normal tuples
            return; // do not try to send ticks to Kafka
        }
        ...
    }
}
```




[GitHub] storm pull request: STORM-329 : buffer message in client and recon...

2015-03-17 Thread miguno
Github user miguno commented on the pull request:

https://github.com/apache/storm/pull/268#issuecomment-82217610
  
@clockfly We may also want to check why we don't have the permissions to 
close this PR ourselves.




[GitHub] storm pull request: STORM-329 : buffer message in client and recon...

2015-03-16 Thread miguno
Github user miguno commented on the pull request:

https://github.com/apache/storm/pull/268#issuecomment-81736347
  
@d2r And btw I don't have permissions on GitHub to close this PR.




[GitHub] storm pull request: STORM-707: Client (Netty): improve logging to ...

2015-03-16 Thread miguno
Github user miguno commented on the pull request:

https://github.com/apache/storm/pull/463#issuecomment-81739459
  
I created [STORM-707](https://issues.apache.org/jira/browse/STORM-707) to 
track this.




[GitHub] storm pull request: Client (Netty): improving logging to help trou...

2015-03-11 Thread miguno
GitHub user miguno opened a pull request:

https://github.com/apache/storm/pull/463

Client (Netty): improving logging to help troubleshooting connection woes

These logging statements are not on a hot path, and `INFO` is the default 
log level of Storm.  They fill a gap, making it easier to understand 
connection woes in a Storm cluster (cf. our work on STORM-329).
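
As an illustration, the added logging is of roughly this shape (the exact 
messages and the `scheduleConnect()` helper are hypothetical, not the literal 
patch):

```java
// Sketch only: log the normal path as well as the corner cases, so that
// connection problems can be diagnosed from INFO-level logs alone.
private void closeChannelAndReconnect(Channel channel) {
    if (channel != null) {
        LOG.info("closing channel {} to {}", channel, dstAddressPrefixedName);
        channel.close();
    }
    if (closing) {
        LOG.info("not reconnecting to {} because the client is being closed", dstAddressPrefixedName);
    } else {
        LOG.info("scheduling reconnect to {}", dstAddressPrefixedName);
        scheduleConnect();  // hypothetical helper that triggers the async reconnect
    }
}
```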

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/miguno/storm improve-closeChannelAndReconnect-logging

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/storm/pull/463.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #463


commit 50aac686de68f783c5767a281922f05ce05e4127
Author: Michael G. Noll mn...@verisign.com
Date:   2015-03-11T11:07:11Z

Add logging to closeChannelAndReconnect() to help with connection 
troubleshooting

commit ea2a61d91310835a78354052889e37761a68f5cf
Author: Michael G. Noll mn...@verisign.com
Date:   2015-03-11T11:14:18Z

Add logging to connect() for corner cases (e.g. client is being closed)






[GitHub] storm pull request: [STORM-695] storm CLI tool reports zero exit c...

2015-03-04 Thread miguno
Github user miguno commented on a diff in the pull request:

https://github.com/apache/storm/pull/456#discussion_r25774104
  
--- Diff: bin/storm ---
@@ -183,7 +209,16 @@ def exec_storm_class(klass, jvmtype="-server", jvmopts=[], extrajars=[], args=[]
         os.spawnvp(os.P_WAIT, JAVA_CMD, all_args)
     else:
         # handling whitespaces in JAVA_CMD
-        sub.call(all_args)
+        try:
+            ret = sub.check_output(all_args, stderr=sub.STDOUT)
+            print(ret)
+        except sub.CalledProcessError as e:
+            # Handling exception with stdout and stderr
+            if e.returncode != 0:
--- End diff --

I think this check is redundant, given what the Python docs say:

> subprocess.check_output(): [...] If the return code was non-zero it 
> raises a CalledProcessError.
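
A minimal illustration of that documented behavior:

```python
import subprocess as sub

sub.check_output(["true"])   # exit code 0: returns the command's output
sub.check_output(["false"])  # non-zero exit code: raises sub.CalledProcessError
```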




[GitHub] storm pull request: [STORM-695] storm CLI tool reports zero exit c...

2015-03-04 Thread miguno
Github user miguno commented on the pull request:

https://github.com/apache/storm/pull/456#issuecomment-77159580
  
Many thanks for the PR, Kai!

> However if the topology is submitted with `storm shell`, this cannot
> track the exception because spawnvp seems to return no stdout and stderr.

See [my comment](https://issues.apache.org/jira/browse/STORM-695?focusedCommentId=14340333&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14340333) 
in the STORM-695 JIRA ticket.  `os.spawnvp` returns the exit code of the 
process:

```python
# The return code can be 0 or non-zero, i.e. there will be no exception
# thrown like for sub.check_output.
return_code = os.spawnvp(os.P_WAIT, JAVA_CMD, all_args)
```

However the problem is that Storm is currently not returning any non-zero 
return codes, even in the face of errors.  I haven't had the chance to 
test-drive your code yet, but the code in the PR might not work for the same 
reason:

```python
try:
    ret = sub.check_output(all_args, stderr=sub.STDOUT)
    print(ret)
except sub.CalledProcessError as e:
```

If I understand the Python docs correctly, the `CalledProcessError` will 
only be thrown if the process returns with a non-zero exit code, and with 
current Storm this won't happen -- the return code will always be zero I think.

One workaround could be to slightly adjust the code in the PR to simply 
grep for error messages in `STDOUT` / `STDERR` -- regardless of the process's 
return code -- and, if there is a match, consider the process/command failed 
(see the sketch below).

(And please correct me if I'm wrong.)
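
A minimal sketch of that workaround -- `run_storm_class` and the error 
patterns are hypothetical, and the point is only that we match on output 
rather than trusting the exit code:

```python
import re
import subprocess as sub

def run_storm_class(all_args):
    # Storm currently exits 0 even on errors, so inspect the output
    # instead of trusting the return code.
    output = sub.check_output(all_args, stderr=sub.STDOUT)
    print(output)
    if re.search(r"Exception|Error", output.decode("utf-8", "replace")):
        raise RuntimeError("storm command failed (error message in output)")
```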




[GitHub] storm pull request: [STORM-695] storm CLI tool reports zero exit c...

2015-03-04 Thread miguno
Github user miguno commented on a diff in the pull request:

https://github.com/apache/storm/pull/456#discussion_r25774112
  
--- Diff: bin/storm ---
@@ -183,7 +209,16 @@ def exec_storm_class(klass, jvmtype="-server", jvmopts=[], extrajars=[], args=[]
         os.spawnvp(os.P_WAIT, JAVA_CMD, all_args)
     else:
         # handling whitespaces in JAVA_CMD
-        sub.call(all_args)
+        try:
+            ret = sub.check_output(all_args, stderr=sub.STDOUT)
+            print(ret)
--- End diff --

Why the print?




[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...

2015-02-24 Thread miguno
Github user miguno commented on the pull request:

https://github.com/apache/storm/pull/429#issuecomment-75713928
  
Oh, and Taylor: could you also update STORM-404 and STORM-510 as appropriate?




[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...

2015-02-23 Thread miguno
Github user miguno commented on the pull request:

https://github.com/apache/storm/pull/429#issuecomment-75634123
  
Phew. :-)

Thanks for merging, Taylor!



> On 23.02.2015, at 22:06, P. Taylor Goetz notificati...@github.com wrote:
>
> Disregard last message. It was a merge mistake (picked right when I 
> should have picked left).
>
> All tests are passing now.





[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...

2015-02-18 Thread miguno
Github user miguno commented on the pull request:

https://github.com/apache/storm/pull/429#issuecomment-74927415
  
@nathanmarz Thanks for the detailed feedback on the max-retries issue.  As 
Bobby suggested, would you mind if we decoupled the work on max-retries 
(tracked in STORM-677) from this pull request / STORM-329?  The max-retries 
issue has been a problem in Storm for a while, since before this pull request.  
Decoupling would also help us trace any future issues back to the right 
ticket.  Right now, I feel there's almost too much meat in this pull request 
already (as we kinda covered STORM-329, STORM-404, and STORM-510 in one big 
swing).

Would that be ok for you?




[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...

2015-02-18 Thread miguno
Github user miguno commented on the pull request:

https://github.com/apache/storm/pull/429#issuecomment-74894542
  
FYI: I created [STORM-677: Maximum retries strategy may cause data 
loss](https://issues.apache.org/jira/browse/STORM-677) to address the issue 
that Bobby brought up in this discussion.




[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...

2015-02-18 Thread miguno
Github user miguno commented on the pull request:

https://github.com/apache/storm/pull/429#issuecomment-74895041
  
PS: We may also want to update the original 
[STORM-329](https://issues.apache.org/jira/browse/STORM-329) ticket description 
to reflect the changes in this PR.




[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...

2015-02-18 Thread miguno
Github user miguno commented on the pull request:

https://github.com/apache/storm/pull/429#issuecomment-74898690
  
Thanks, Taylor!  Let me know if I can help with sorting out the test 
failures.

Also regarding JIRA: I forgot to mention that it looks like we need to 
update STORM-404 and STORM-510 as well, as STORM-329 has been said to cover 
those issues, too.  It looks to me like this is actually the case, but I'd 
prefer another pair of eyes to make sure we're not mistakenly closing two 
related tickets.




[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...

2015-02-17 Thread miguno
Github user miguno commented on the pull request:

https://github.com/apache/storm/pull/429#issuecomment-74702215
  
And FWIW, with the code in this PR the total test suite takes about 5mins 
to complete.

```
$ mvn clean install
...

[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] Storm .............................................. SUCCESS [  1.696 s]
[INFO] maven-shade-clojure-transformer .................... SUCCESS [  1.930 s]
[INFO] storm-maven-plugins ................................ SUCCESS [  2.033 s]
[INFO] Storm Core ......................................... SUCCESS [03:56 min]
[INFO] storm-starter ...................................... SUCCESS [  6.616 s]
[INFO] storm-kafka ........................................ SUCCESS [ 48.469 s]
[INFO] storm-hdfs ......................................... SUCCESS [  1.875 s]
[INFO] storm-hbase ........................................ SUCCESS [  2.271 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 05:02 min
[INFO] Finished at: 2015-02-17T17:49:19+01:00
[INFO] Final Memory: 67M/282M
[INFO] ------------------------------------------------------------------------
```




[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...

2015-02-17 Thread miguno
Github user miguno commented on the pull request:

https://github.com/apache/storm/pull/429#issuecomment-74701138
  
> I am seeing a lot of tests timing out with this change. Has anyone else 
> seen this?

Hmm.  All the tests are passing for me (and they have been for a while).

Do you have any pointers?




[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...

2015-02-16 Thread miguno
Github user miguno commented on a diff in the pull request:

https://github.com/apache/storm/pull/429#discussion_r24738502
  
--- Diff: storm-core/src/jvm/backtype/storm/messaging/netty/Client.java ---
@@ -42,344 +42,577 @@
 import org.slf4j.LoggerFactory;
 
 import backtype.storm.Config;
+import backtype.storm.messaging.ConnectionWithStatus;
 import backtype.storm.metric.api.IStatefulObject;
-import backtype.storm.messaging.IConnection;
 import backtype.storm.messaging.TaskMessage;
 import backtype.storm.utils.StormBoundedExponentialBackoffRetry;
 import backtype.storm.utils.Utils;
 
-public class Client implements IConnection, IStatefulObject{
+/**
+ * A Netty client for sending task messages to a remote destination (Netty 
server).
+ *
+ * Implementation details:
+ *
+ * - Sending messages, i.e. writing to the channel, is performed 
asynchronously.
+ * - Messages are sent in batches to optimize for network throughput at 
the expense of network latency.  The message
+ *   batch size is configurable.
+ * - Connecting and reconnecting are performed asynchronously.
+ * - Note: The current implementation drops any messages that are 
being enqueued for sending if the connection to
+ *   the remote destination is currently unavailable.
+ * - A background flusher thread is run in the background.  It will, at 
fixed intervals, check for any pending messages
+ *   (i.e. messages buffered in memory) and flush them to the remote 
destination iff background flushing is currently
+ *   enabled.
+ */
+public class Client extends ConnectionWithStatus implements 
IStatefulObject {
+
 private static final Logger LOG = 
LoggerFactory.getLogger(Client.class);
private static final String PREFIX = "Netty-Client-";
-private final int max_retries;
-private final int base_sleep_ms;
-private final int max_sleep_ms;
+private static final long NO_DELAY_MS = 0L;
+private static final long MINIMUM_INITIAL_DELAY_MS = 3L;
+private static final long PENDING_MESSAGES_FLUSH_TIMEOUT_MS = 60L;
+private static final long PENDING_MESSAGES_FLUSH_INTERVAL_MS = 1000L;
+private static final long DISTANT_FUTURE_TIME_MS = Long.MAX_VALUE;
+
 private final StormBoundedExponentialBackoffRetry retryPolicy;
-private AtomicReference<Channel> channelRef;
 private final ClientBootstrap bootstrap;
-private InetSocketAddress remote_addr;
-
-private AtomicInteger totalReconnects;
-private AtomicInteger messagesSent;
-private AtomicInteger messagesLostReconnect;
-private final Random random = new Random();
-private final ChannelFactory factory;
-private final int buffer_size;
-private boolean closing;
-
-private int messageBatchSize;
-
-private AtomicLong pendings;
-
-Map storm_conf;
+private final InetSocketAddress dstAddress;
+protected final String dstAddressPrefixedName;
+
+/**
+ * The channel used for all write operations from this client to the 
remote destination.
+ */
+private final AtomicReference<Channel> channelRef = new AtomicReference<Channel>(null);
+
+
+/**
+ * Maximum number of reconnection attempts we will perform after a 
disconnect before giving up.
+ */
+private final int maxReconnectionAttempts;
+
+/**
+ * Total number of connection attempts.
+ */
+private final AtomicInteger totalConnectionAttempts = new 
AtomicInteger(0);
+
+/**
+ * Number of connection attempts since the last disconnect.
+ */
+private final AtomicInteger connectionAttempts = new AtomicInteger(0);
+
+/**
+ * Number of messages successfully sent to the remote destination.
+ */
+private final AtomicInteger messagesSent = new AtomicInteger(0);
+
+/**
+ * Number of messages that could not be sent to the remote destination.
+ */
+private final AtomicInteger messagesLost = new AtomicInteger(0);
+
+/**
+ * Number of messages buffered in memory.
+ */
+private final AtomicLong pendingMessages = new AtomicLong(0);
+
+/**
+ * This flag is set to true if and only if a client instance is being 
closed.
+ */
+private volatile boolean closing = false;
+
+/**
+ * When set to true, then the background flusher thread will flush any 
pending messages on its next run.
+ */
+private final AtomicBoolean backgroundFlushingEnabled = new 
AtomicBoolean(false);
+
+/**
+ * The absolute time (in ms) when the next background flush should be 
performed.
+ *
+ * Note

[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...

2015-02-12 Thread miguno
Github user miguno commented on the pull request:

https://github.com/apache/storm/pull/429#issuecomment-74147926
  
If you need at-least-once processing you must use an acking topology, which 
allows Storm to replay lost messages.  If instead you go with a non-acking 
topology (= no guaranteed message processing), then you may run into data 
loss.  There are pros and cons to each variant, and e.g. in our case we use 
both, depending on the use case.

Also: the semantics described above have been in Storm right from the 
beginning.  None of them have been changed by this pull request.
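
For illustration, at-least-once processing in a bolt boils down to anchoring 
and acking (a sketch; `process()` is a hypothetical method):

```java
// Sketch only: anchor the emitted tuple to its input and then ack the input,
// so that a downstream failure causes the spout to replay the original tuple.
public void execute(Tuple input) {
    collector.emit(input, new Values(process(input)));  // anchored emit
    collector.ack(input);
}
```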


> On 12.02.2015, at 20:01, Daniel Schonfeld notificati...@github.com wrote:
>
> Doesn't dropping the messages coming from a non ack/fail caring spout 
> negate the 'at least once' attempt of storm? I mean doesn't that kinda force 
> you to make all your spouts ack/fail aware where before you could have gotten 
> away without it?
>
> In other words: there is a chance that if the worker that died is the one 
> containing the spout and if the first bolt is located on another worker, then 
> technically at-least-once wasn't tried but rather fell to the floor right away.





[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...

2015-02-12 Thread miguno
Github user miguno commented on the pull request:

https://github.com/apache/storm/pull/429#issuecomment-74032874
  
This patch allows a worker to properly detect that the connection to a peer 
has become unavailable -- for whatever reason (the remote worker is dead or 
restarting, there was a network glitch, etc.).  Also, any reconnection attempts 
are now asynchronous, so that reconnecting will not block other activities of 
the worker (such as sending messages to other workers it is still connected to).

So to your question: this patch covers the case of the worker that remained 
alive trying to (re)connect to a dead peer.

Does that help?
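
For context, the asynchronous reconnect is conceptually along these lines -- 
a sketch only, where the `scheduler` and the `connect()` wiring are 
assumptions:

```java
// Sketch only: schedule the reconnect on a timer thread so that the worker's
// send path is never blocked while a peer is unavailable.
private void scheduleReconnect(final int attempt) {
    long delayMs = retryPolicy.getSleepTimeMs(attempt, 0);  // bounded exponential backoff
    scheduler.schedule(new Runnable() {
        @Override
        public void run() {
            if (!closing) {
                connect();  // assumed helper performing the actual connection attempt
            }
        }
    }, delayMs, TimeUnit.MILLISECONDS);
}
```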






[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...

2015-02-11 Thread miguno
GitHub user miguno opened a pull request:

https://github.com/apache/storm/pull/428

STORM-329: fix cascading Storm failure by improving reconnection strategy 
and buffering messages

This is an improved version of the original pull request discussed at 
https://github.com/apache/storm/pull/268.  Please refer to the discussion in 
the link above.

**Note**:  Please give attribution to @tedxia when merging the pull request 
as he did a lot (most?) of the work in this pull request.

The changes of this pull request include:

- Most importantly, we fix a bug in Storm that may cause a cascading 
failure in a Storm cluster, to the point where the whole cluster becomes 
unusable.  This is achieved by the work described in the next bullet points.
- We refactor and improve the Netty messaging backend, notably the client.
- During the initial startup of a topology, Storm will now wait until worker 
(Netty) connections are ready for operation (a sketch of this check follows 
below).  See the [original discussion 
thread](https://github.com/apache/storm/pull/268) for the detailed explanation 
and justification of this change.
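
A minimal sketch of the readiness check mentioned in the last bullet, 
assuming each connection reports its status via `ConnectionWithStatus` (the 
exact polling and wiring in the patch may differ):

```java
// Sketch only: spouts are activated only once every worker connection
// reports Ready (or a startup timeout elapses).
boolean allConnectionsReady(Collection<ConnectionWithStatus> connections) {
    for (ConnectionWithStatus connection : connections) {
        if (connection.status() != ConnectionWithStatus.Status.Ready) {
            return false;
        }
    }
    return true;
}
```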

@clockfly, @tedxia: Please add any further comments to STORM-329 to this 
pull request, if possible.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/miguno/storm 0.10.0-SNAPSHOT-STORM-329

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/storm/pull/428.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #428


commit 8c52a5b518021a6beff372acbeb66a963a1d4f74
Author: xiajun xia...@xiaomi.com
Date:   2014-09-24T12:39:18Z

STORM-329 : buffer message in client and reconnect remote server async

commit 7cc8c8b1b59d415f1fa54e081127336fcdaeb706
Author: xiajun xia...@xiaomi.com
Date:   2014-09-26T02:43:55Z

STORM-329: fix continue flush after client had been closed

commit 9826bc9ef7aee8c83e90d56f52ac70ac7165d769
Author: xiajun xia...@xiaomi.com
Date:   2014-09-30T03:25:08Z

Client not clean timeout TaskMessage

commit 978c969fdb4b5904b6a87c100fbd80fe26bf39cf
Author: Sean Zhong clock...@gmail.com
Date:   2014-10-20T01:08:43Z

Merge remote-tracking branch 'upstream/master'

commit 44f8260bbf489c6a2741fb4d8f9196ea6ddb51cc
Author: Sean Zhong clock...@gmail.com
Date:   2014-10-20T01:21:25Z

STORM-404, STORM-510, STORM-329: 1. break the reconnection to the target if 
the worker is informed that the target is down, so that we avoid a 
RuntimeException when the reconnection fails. 2. When a worker gets started, 
make sure all target workers are alive before launching spouts. 3. When a 
target worker is down, all messages being sent to that target worker will be 
dropped.

commit dea5fbe35c4d9b18a89dae320f9fc985f25bd31a
Author: Sean Zhong clock...@gmail.com
Date:   2014-10-20T01:24:45Z

STORM-329: fix comment grammar

commit 16de9f3321827624865a4450e93bd53efb75ed93
Author: Sean Zhong clock...@gmail.com
Date:   2014-10-28T03:31:23Z

test

commit e8dcf9155c85d3541c9352faf9a0651614b93eb6
Author: Sean Zhong clock...@gmail.com
Date:   2014-10-29T01:56:19Z

STORM-329: avoid deadlock

commit 22e7014dedfd580de9dd1d6b2083c8fb3d77d406
Author: Sean Zhong clock...@gmail.com
Date:   2014-10-29T01:59:09Z

Revert test

This reverts commit 16de9f3321827624865a4450e93bd53efb75ed93.

commit ddef6667cdc6de9aebf0d4006ad9e7df2bfbb3bb
Author: Sean Zhong clock...@gmail.com
Date:   2014-10-29T08:14:34Z

Merge remote-tracking branch 'upstream/master'

commit baf3c628db3899def89ee92c752524d041bc8b40
Author: Sean Zhong clock...@gmail.com
Date:   2014-10-30T10:17:07Z

STORM-329: fix UT. Add a new flag `worker-active-flag` in worker data; wait 
for connections to be ready asynchronously.

commit e1c463f5681dbf6a66c868d467aca14064da1e9b
Author: Sean Zhong clock...@gmail.com
Date:   2014-10-30T10:22:22Z

STORM-329: fix comments. Add a description to 
storm.messaging.netty.max_retries noting that the reconnection period should 
also be bigger than storm.zookeeper.session.timeout, so that the reconnection 
can be aborted (when the target worker is dead) before it fails and throws a 
RuntimeException

commit 2d3fad121481da40258af27e6d7fbcb148365e76
Author: Sean Zhong clock...@gmail.com
Date:   2014-10-30T10:22:22Z

STORM-329: fix an integration issue

commit 60f04f9e397cf49e4fba6fe6a2f0bfb23d5a8605
Author: Sean Zhong clock...@gmail.com
Date:   2014-10-30T10:28:17Z

Merge branch 'master' of https://github.com/tedxia/incubator-storm

commit 41aafbecac2cf3295255c7dc9b299b8c0c555390
Author: xiajun xia...@xiaomi.com
Date:   2014-11-18T06:53:14Z

Merge remote-tracking branch 'remotes/apache-storm/0.9.3-branch' into 
ted-master

Conflicts:
storm-core/src/jvm/backtype/storm/messaging/netty/Client.java

commit 2c39866cf8bbca136d3b88f796b9f847b282fdd7
Author: xiajun xia

[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...

2015-02-11 Thread miguno
Github user miguno commented on the pull request:

https://github.com/apache/storm/pull/428#issuecomment-73946399
  
I did exactly this in #429.





[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...

2015-02-11 Thread miguno
Github user miguno closed the pull request at:

https://github.com/apache/storm/pull/428




[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...

2015-02-11 Thread miguno
Github user miguno commented on the pull request:

https://github.com/apache/storm/pull/428#issuecomment-73928303
  
I tried rebasing (also to fix the incorrect commit message that starts with 
STORM-32*7*) but gave up after several failed attempts.  Feel free to give it a 
try though -- maybe your git-fu is better than mine. :-)





[GitHub] storm pull request: STORM-329: fix cascading Storm failure by impr...

2015-02-11 Thread miguno
GitHub user miguno opened a pull request:

https://github.com/apache/storm/pull/429

STORM-329: fix cascading Storm failure by improving reconnection strategy 
and buffering messages

**This PR contains the same code as 
https://github.com/apache/storm/pull/428 but as a single commit for a cleaner 
commit history of our Storm repo.**

--

This is an improved version of the original pull request discussed at 
https://github.com/apache/storm/pull/268.  Please refer to the discussion in 
the link above.

**Note**:  Please give attribution to @tedxia when merging the pull request 
as he did a lot (most?) of the work in this pull request.

The changes of this pull request include:

- Most importantly, we fix a bug in Storm that may cause a cascading 
failure in a Storm cluster, to the point where the whole cluster becomes 
unusable.  This is achieved by the work described in the next bullet points.
- We refactor and improve the Netty messaging backend, notably the client.
- During the initial startup of a topology, Storm will now wait until 
worker (Netty) connections are ready for operation.  See the [original 
discussion thread](https://github.com/apache/storm/pull/268) for the detailed 
explanation and justification of this change.

@clockfly, @tedxia: Please add any further comments to STORM-329 to this 
pull request, if possible.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/miguno/storm 0.10.0-SNAPSHOT-STORM-329-diff

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/storm/pull/429.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #429


commit 205eaf4ebe28ab5550a842ea9aabd23b41678743
Author: Michael G. Noll mn...@verisign.com
Date:   2015-02-11T18:55:53Z

STORM-329: fix cascading Storm failure by improving reconnection strategy 
and buffering messages

Thanks to @tedxia for the initial work on this patch, which covered a
lot if not most of the work!






[GitHub] storm pull request: STORM-329 : buffer message in client and recon...

2015-02-10 Thread miguno
Github user miguno commented on the pull request:

https://github.com/apache/storm/pull/268#issuecomment-73667341
  
Many thanks for your review, Sean.  I addressed your comments; see the new 
commits in 
https://github.com/miguno/storm/commits/0.10.0-SNAPSHOT-STORM-392-miguno-merge.

One final question:  How should we address the following TODO in 
`send(Iterator<TaskMessage> msgs)`?  I'd appreciate additional eyeballs on 
this. :-)  It's a scenario that we may or may not have overlooked in the 
original code.  Note how we are NOT checking for a `WRITABLE` channel while 
flushing (full) message batches (cf. the `while` loop), but we DO check for a 
`WRITABLE` channel when handling any left-over messages (cf. after the `while` 
loop).

```java
/**
 * Enqueue task messages to be sent to the remote destination (cf. `host` and `port`).
 */
@Override
public synchronized void send(Iterator<TaskMessage> msgs) {

    // ...some code removed to shorten this code snippet...

    Channel channel = channelRef.get();
    if (!connectionEstablished(channel)) {
        // Closing the channel and reconnecting should be done before handling the messages.
        closeChannelAndReconnect(channel);
        handleMessagesWhenConnectionIsUnavailable(msgs);
        return;
    }

    // Collect messages into batches (to optimize network throughput), then flush them.
    while (msgs.hasNext()) {
        TaskMessage message = msgs.next();
        if (messageBatch == null) {
            messageBatch = new MessageBatch(messageBatchSize);
        }

        messageBatch.add(message);
        // TODO: What shall we do if the channel is not writable?
        if (messageBatch.isFull()) {
            MessageBatch toBeFlushed = messageBatch;
            flushMessages(channel, toBeFlushed);
            messageBatch = null;
        }
    }

    // Handle any remaining messages in case the last batch was not full.
    if (containsMessages(messageBatch)) {
        if (connectionEstablished(channel) && channel.isWritable()) {
            // We can write to the channel, so we flush the remaining messages
            // immediately to minimize latency.
            pauseBackgroundFlushing();
            MessageBatch toBeFlushed = messageBatch;
            messageBatch = null;
            flushMessages(channel, toBeFlushed);
        }
        else {
            // We cannot write to the channel, which means Netty's internal write
            // buffer is full.  In this case, we buffer the remaining messages and
            // wait for the next messages to arrive.
            //
            // Background:
            // Netty 3.x maintains an internal write buffer with a high water mark
            // for each channel (default: 64K).  This represents the amount of data
            // waiting to be flushed to operating system buffers.  If the
            // outstanding data exceeds this value then the channel is set to
            // non-writable.  When this happens, an INTEREST_CHANGED channel event
            // is triggered.  Netty sets the channel to writable again once the
            // data has been flushed to the system buffers.
            //
            // See http://stackoverflow.com/questions/14049260
            resumeBackgroundFlushing();
            nextBackgroundFlushTimeMs.set(nowMillis() + flushCheckIntervalMs);
        }
    }
}
```
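
One possible (hypothetical) answer to that TODO, sketched for discussion: 
mirror the left-over handling inside the loop, i.e. only flush full batches 
when the channel is writable, and otherwise fall back to background flushing:

```java
// Sketch only: also check writability when flushing full batches, instead of
// writing blindly into a channel whose write buffer may be saturated.
if (messageBatch.isFull()) {
    if (connectionEstablished(channel) && channel.isWritable()) {
        MessageBatch toBeFlushed = messageBatch;
        flushMessages(channel, toBeFlushed);
        messageBatch = null;
    } else {
        // Let the background flusher retry once Netty marks the channel writable.
        resumeBackgroundFlushing();
    }
}
```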






[GitHub] storm pull request: STORM-329 : buffer message in client and recon...

2015-02-10 Thread miguno
Github user miguno commented on the pull request:

https://github.com/apache/storm/pull/268#issuecomment-73671537
  
PS: I just noticed that I put the wrong Storm ticket into the commit 
https://github.com/miguno/storm/commit/8ebaaf8dbc63df3c2691e0cc3ac5102af7721ec3.
  The `STORM-327` prefix of the commit message should have been `STORM-329`.

I will keep the current code/commits as is in order to not disrupt the 
ongoing discussion in this thread.  Once we're happy with the code I will 
create a separate pull request, using the correct `STORM-329` ticket reference.




[GitHub] storm pull request: STORM-329 : buffer message in client and recon...

2015-02-03 Thread miguno
Github user miguno commented on the pull request:

https://github.com/apache/storm/pull/268#issuecomment-72653588
  
@tedxia (or @clockfly): Have you experienced similar Storm behavior as [I 
described 
above](https://github.com/apache/storm/pull/268#issuecomment-72652704) in your 
patched production cluster?




[GitHub] storm pull request: STORM-329 : buffer message in client and recon...

2015-02-03 Thread miguno
Github user miguno commented on a diff in the pull request:

https://github.com/apache/storm/pull/268#discussion_r24039811
  
--- Diff: storm-core/src/jvm/backtype/storm/messaging/netty/Client.java ---
@@ -142,6 +147,15 @@ public void run() {
 
             }
         };
+
+        connector = new Runnable() {
+            @Override
+            public void run() {
+                if (!closing) {
+                    connect();
+                }
+            }
+        };
 
         long initialDelay = Math.min(30L * 1000, max_sleep_ms * max_retries); //max wait for 30s
--- End diff --

Either the comment or the code is incorrect.

If we want to wait a maximum of 30s, then we should use `Math.max(...)`.

