GitHub user eshioji opened a pull request:

    https://github.com/apache/storm/pull/568

    [STORM-763] nimbus reassigned worker A to another machine, but other 
worker's netty client can't connect to the new worker A

    (This PR builds on [STORM-839](https://github.com/apache/storm/pull/566), 
so it includes its changes)
    
    I'm not 100% certain it's the same issue, but I also saw the messages 
described in that issue logged a lot (hundreds per second per machine).
    
    Upon investigation, I found that after firing a connection request, 
`Client.Connector` waits synchronously for the connection. Because there is 
only one scheduler thread per worker by default, only one connection request 
per worker process can be serviced at a time. This becomes a problem when 
connections are lost frequently for whatever reason, and it gets worse when 
each connection attempt takes a long time.
    
    Commit `aa5c2d71` changes the code so that the `Connector` fires & forgets 
the connection request, and lets Netty's callback thread handle the rest. There 
is no blocking or expensive operation done here, so using Netty's thread should 
be fine.
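
To illustrate the difference between waiting synchronously and firing and forgetting, here is a minimal sketch that uses only the JDK. It is not the Storm or Netty code: `ConnectorSketch`, `connectAsync` and `run` are hypothetical names, a `CompletableFuture` stands in for Netty's connection future, and a second executor stands in for Netty's I/O threads.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class ConnectorSketch {

    // Hypothetical stand-in for an asynchronous connect: the future completes
    // on the "I/O" executor after delayMs, simulating a slow connection attempt.
    static CompletableFuture<String> connectAsync(ScheduledExecutorService io,
                                                  String peer, long delayMs) {
        CompletableFuture<String> future = new CompletableFuture<>();
        io.schedule(() -> future.complete(peer), delayMs, TimeUnit.MILLISECONDS);
        return future;
    }

    // Submits one connect task per peer to a single scheduler thread and
    // returns the milliseconds elapsed until every peer is connected.
    static long run(boolean blocking, int peers, long delayMs) throws Exception {
        ExecutorService scheduler = Executors.newSingleThreadExecutor();
        ScheduledExecutorService io = Executors.newScheduledThreadPool(1);
        CountDownLatch connected = new CountDownLatch(peers);
        long start = System.nanoTime();
        try {
            for (int i = 0; i < peers; i++) {
                String peer = "worker-" + i;
                scheduler.submit(() -> {
                    CompletableFuture<String> future = connectAsync(io, peer, delayMs);
                    if (blocking) {
                        // Occupies the only scheduler thread until this
                        // connection attempt finishes.
                        future.join();
                        connected.countDown();
                    } else {
                        // Fire and forget: the callback runs on the I/O thread,
                        // so the scheduler thread is free for the next request.
                        future.thenRun(connected::countDown);
                    }
                });
            }
            if (!connected.await(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("timed out waiting for connections");
            }
            return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        } finally {
            scheduler.shutdown();
            io.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("blocking:        " + run(true, 4, 50) + " ms");
        System.out.println("fire-and-forget: " + run(false, 4, 50) + " ms");
    }
}
```

In the blocking variant the connection attempts are serialized on the single scheduler thread, so the total time grows with the number of peers. In the fire-and-forget variant the attempts overlap, and the callback does nothing expensive.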
    
    This change did decrease the number of ERROR logs in my topology, but I 
still saw a fair number. Upon investigation, I found that this is simply 
because each call to `send` produces an ERROR log. So `ad8112d10` changes the 
code so that a connection error is logged exactly twice: once when the 
connection error is detected, and once when the reconnection concludes 
(either a successful reconnect or a permanent failure).
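
The "log twice per connection error" idea can be sketched with a flag that is flipped atomically. This is only an illustration under assumptions, not the code in the commit: `ErrorLogSketch`, `reconnectConcluded` and the in-memory `log` list are hypothetical, and `send` takes a `channelUp` parameter in place of a real channel check.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

public class ErrorLogSketch {

    // Stand-in for a real logger so the example is self-contained.
    final List<String> log = Collections.synchronizedList(new ArrayList<>());

    // True from the moment a connection error is detected until the
    // reconnection attempt concludes.
    private final AtomicBoolean reconnecting = new AtomicBoolean(false);
    private final AtomicInteger dropped = new AtomicInteger();

    // Hypothetical send: channelUp stands in for a real channel state check.
    void send(String message, boolean channelUp) {
        if (channelUp) {
            return; // a real client would write the message to the channel here
        }
        dropped.incrementAndGet();
        // Only the first failed send after the connection is lost logs an error.
        if (reconnecting.compareAndSet(false, true)) {
            log.add("ERROR connection lost, dropping messages until reconnect concludes");
        }
    }

    // Called once when the reconnection succeeds or is abandoned.
    void reconnectConcluded(boolean success) {
        if (reconnecting.compareAndSet(true, false)) {
            String outcome = success ? "INFO reconnected" : "ERROR reconnect failed permanently";
            log.add(outcome + ", dropped " + dropped.getAndSet(0) + " message(s)");
        }
    }

    public static void main(String[] args) {
        ErrorLogSketch client = new ErrorLogSketch();
        for (int i = 0; i < 1000; i++) {
            client.send("tuple-" + i, false); // every send fails while disconnected
        }
        client.reconnectConcluded(true);
        client.log.forEach(System.out::println);
    }
}
```

However many sends fail while the connection is down, the log receives one entry when the error is first seen and one when the reconnection concludes.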
    
    
    
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/eshioji/storm STORM-763

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/storm/pull/568.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #568
    
----
commit ed8ab3ec194f19c75fc2f5c000609204f04b50e8
Author: Enno Shioji <eshi...@gmail.com>
Date:   2015-05-28T19:42:05Z

    Simplified the flow and removed the lock that was causing the deadlock

commit 91b8eb3840432e47b79f40abebec8304627732a8
Author: Enno Shioji <eshi...@gmail.com>
Date:   2015-05-28T19:46:17Z

    Bump version

commit b7d84bdc7fd3de34f45a94131cdbb6bfbd3763dc
Author: Enno Shioji <eshi...@gmail.com>
Date:   2015-05-28T21:27:31Z

    Remove background flushing because it doesn't seem necessary. Netty's 
Channel queues up written data on an unbounded buffer. The background flushing 
seems to have been added to avoid this, but in practice it was probably doing 
it anyway, because flushMessages(), which is called by send(), doesn't check 
isWritable. Moreover, queuing on an unbounded buffer seems fine because back 
pressure is provided by MAX_PENDING_TUPLE. If OOME occurs due to this buffer 
overflowing, it seems reasonable that one has to reduce MAX_PENDING_TUPLE, 
rather than Storm trying to cope with it by dropping messages.

commit 679e42bc1e38f51c2759667b03cb45322c6a793b
Author: Enno Shioji <eshi...@gmail.com>
Date:   2015-05-28T21:31:35Z

    Change to a SNAPSHOT version for deployment purposes

commit 27a92e2aa3488c0203f500306e0583ff9e7e1e82
Author: Enno Shioji <eshi...@gmail.com>
Date:   2015-05-29T09:32:16Z

    Remove (now) dead comment and code

commit 09bf6e1b5d9d351f2a60cd9a32e0239752cf437a
Author: Enno Shioji <eshi...@gmail.com>
Date:   2015-05-29T10:23:46Z

    Merge branch '0.9.x-branch' into STORM-839
    
    Conflicts:
        examples/storm-starter/pom.xml
        external/storm-hbase/pom.xml
        external/storm-hdfs/pom.xml
        external/storm-kafka/pom.xml
        pom.xml
        storm-buildtools/maven-shade-clojure-transformer/pom.xml
        storm-core/pom.xml
        storm-dist/binary/pom.xml
        storm-dist/source/pom.xml

commit fdb394c158ccd87c0a06c060239766d666a8db5a
Author: Enno Shioji <eshi...@gmail.com>
Date:   2015-05-29T10:28:20Z

    Accidentally committed generated file

commit 36eff0a409336d9247ce2e96b52fcf9630a446fb
Author: Enno Shioji <eshi...@gmail.com>
Date:   2015-05-29T10:30:38Z

    Remove dead method

commit 884f496dcd06bc9413c04fdf730aca2cfb4239c6
Author: Enno Shioji <eshi...@gmail.com>
Date:   2015-05-29T10:32:39Z

    Remove comment line I forgot to remove

commit aa5c2d719bb3913285d4274cfcf8364df958b1ff
Author: Enno Shioji <eshi...@gmail.com>
Date:   2015-05-29T13:40:47Z

    Do not block in Connector. This task runs on a single (by default) thread 
that is shared among all Clients. If the task blocks, other reconnection 
requests can't be processed, resulting in a lot of messages being dropped. By 
not blocking, the thread should be able to service reconnection requests a lot 
quicker.

commit ad8112d10d662ae81498d11f78a602b97243a142
Author: Enno Shioji <eshi...@gmail.com>
Date:   2015-05-30T23:54:31Z

    Log error message for dropping messages only once per connection error 
(logging it every time on send was flooding the log).

commit ee4e94a01c29caacea480b12d57a7d77174bb9be
Author: Enno Shioji <eshi...@gmail.com>
Date:   2015-05-30T23:54:52Z

    Remove obsolete metrics

----


