dlogothetis opened a new pull request #118: Fix issues with channel re-connection
URL: https://github.com/apache/giraph/pull/118

- The LogOnErrorChannelFutureListener is called when a channel operation completes. It checked whether the channel had failed and, if so, tried to resend any requests. Doing this required waiting until the channel had been re-established. However, a wait operation on the same thread that calls the handler causes Netty to throw a BlockingOperationException, so this was not effective.
- I removed the call to the method that waits to re-establish the connection and resend any requests. We already have a thread that periodically checks for and resends any unsent requests, and also re-establishes any closed channels.
- When a channel closes, we have logic that tries to re-open it, up to a maximum number of retries. But we also had logic in the ChannelRotater that would throw an exception if it didn't find any channel, which gave the client no opportunity to re-connect. So I removed this.
- Whenever the client closes the connection, the server catches this ("Connection reset by peer") and throws an exception as well, so the job fails immediately. This gives the client no opportunity to re-connect. I changed this so that the server does not fail when it sees a "Connection reset by peer" exception. It still fails in all other cases.

Tests
- Unit tests
- Snapshot tests
- Ran with a job that would consistently fail due to connection errors, and which now succeeds.
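
To illustrate the server-side change described in the last bullet of the description, here is a minimal sketch of how an exception could be classified as tolerable or fatal. It is not the code from this pull request: the class `ServerExceptionPolicy`, the enum `Action`, and the method `classify` are hypothetical names, and matching on the exception message text is an assumption about how "Connection reset by peer" might be detected.

```java
import java.io.IOException;

// Hypothetical helper, not part of Giraph: decides what a server-side
// exception handler should do with an exception it has caught.
final class ServerExceptionPolicy {

    enum Action { LOG_AND_CONTINUE, FAIL_JOB }

    // Assumption: a client-side close surfaces on the server as an
    // IOException whose message contains "Connection reset by peer".
    static Action classify(Throwable cause) {
        String message = cause.getMessage();
        boolean resetByPeer = cause instanceof IOException
            && message != null
            && message.contains("Connection reset by peer");
        // Tolerate only the reset case so the client can re-connect;
        // every other exception still fails the job.
        return resetByPeer ? Action.LOG_AND_CONTINUE : Action.FAIL_JOB;
    }

    public static void main(String[] args) {
        // A reset from the client side is logged, not fatal.
        System.out.println(classify(new IOException("Connection reset by peer")));
        // Anything else keeps the fail-fast behaviour.
        System.out.println(classify(new IllegalStateException("corrupt request")));
    }
}
```

Keeping the tolerated case this narrow preserves the fail-fast behaviour for every other error, which matches the "still fails in all other cases" note in the description.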