The previous fix did not improve anything on 2-miniute-timeout. On Wed, Sep 5, 2018 at 10:52 AM, Anthony Baker <aba...@pivotal.io> wrote:
> Gester, > > Clearly the prior implementation had some problems, but except in > pathological cases it provided the behavior users expected. That’s why I > think we need a characterization test(s) to show exactly what we want the > behavior to be. Merging in changes that make the user experience worse in > the more common scenarios isn’t a good tradeoff IMO. I see this work as > integral to GEODE-5591 and shouldn’t be deferred to a separate ticket. > > Anthony > > > > On Sep 5, 2018, at 10:43 AM, Xiaojian Zhou <gz...@pivotal.io> wrote: > > > > The fix intend to resolve 2 issues: > > 1) change the exception handling (for a linux version). > > 2) prevent random picking port number to loop forever. In old code, for > > example, if the range only contains one port, random will always pick the > > same port and it will loop forever. The fix will stop after all available > > ports in the range are tried. There's a test > > > > test_ValidateGatewayReceiverAttributes_WrongBindAddress > > > > > > For 2-minute-wait, it's still possible. The fix did not resolve it > > (when random() happened to return same port for different receiver in > > the same member), but I did not make things worse either. > > > > > > There's discussion on if we can reduce the 2-minute-timeout to a few > > second. This is definitely another ticket. > > > > Regards > > > > Gester > > > > > > On Wed, Sep 5, 2018 at 10:35 AM, Anthony Baker <aba...@pivotal.io> > wrote: > > > >> Before this improvement is re-merged I’d like to see: > >> > >> 1) A test that characterizes the current behavior (e.g. doesn’t wait 2 > min > >> when there’s a port conflict) > >> 2) A test that demonstrates how the current logic is insufficient > >> > >> Anthony > >> > >> > >>> On Sep 5, 2018, at 10:20 AM, Nabarun Nag <n...@apache.org> wrote: > >>> > >>> GEODE-5591 has been reverted in develop > >>> ref: 901da27f227a8ce2b7d6b681619782a1accd9330 > >>> > >>> Regards > >>> Nabarun Nag > >>> > >>> On Wed, Sep 5, 2018 at 10:14 AM Ryan McMahon <rmcma...@pivotal.io> > >> wrote: > >>> > >>>> +1 for reverting in both places. > >>>> > >>>> I see that there is already an isGatewayReceiver flag in the > >> AcceptorImpl > >>>> constructor. It's not ideal, but could we use this flag to prevent > the > >> 2 > >>>> minute retry logic for happening if this flag is true? > >>>> > >>>> Ryan > >>>> > >>>> On Wed, Sep 5, 2018 at 10:01 AM, Lynn Hughes-Godfrey < > >>>> lhughesgodf...@pivotal.io> wrote: > >>>> > >>>>> +1 for reverting in both places. > >>>>> > >>>>> On Wed, Sep 5, 2018 at 9:50 AM, Dan Smith <dsm...@pivotal.io> wrote: > >>>>> > >>>>>> +1 for reverting in both places. The current fix is not better, > that's > >>>>> why > >>>>>> we are reverting it on the release branch! > >>>>>> > >>>>>> -Dan > >>>>>> > >>>>>> On Wed, Sep 5, 2018 at 9:47 AM, Jacob Barrett <jbarr...@pivotal.io> > >>>>> wrote: > >>>>>> > >>>>>>> I’m not ok with reverting in develop. Revert in 1.7 and modify in > >>>>>> develop. > >>>>>>> We shouldn’t go backwards in develop. The current fix is better > than > >>>>> the > >>>>>>> bug it fixes. > >>>>>>> > >>>>>>>> On Sep 5, 2018, at 9:40 AM, Nabarun Nag <n...@apache.org> wrote: > >>>>>>>> > >>>>>>>> If everyone is okay with it, I will revert that change in develop > >>>> and > >>>>>>> then > >>>>>>>> cherry pick it to release/1.7.0 branch. > >>>>>>>> Please do comment. > >>>>>>>> > >>>>>>>> Regards > >>>>>>>> Nabarun Nag > >>>>>>>> > >>>>>>>> > >>>>>>>>> On Wed, Sep 5, 2018 at 9:30 AM Dan Smith <dsm...@pivotal.io> > >>>> wrote: > >>>>>>>>> > >>>>>>>>> +1 to yank it and rework the fix. > >>>>>>>>> > >>>>>>>>> Gester's change helps, but it just means that you will sometimes > >>>>>>> randomly > >>>>>>>>> have a 2 minute delay starting up a gateway receiver. I don't > >>>> think > >>>>>>> that is > >>>>>>>>> a great user experience either. > >>>>>>>>> > >>>>>>>>> -Dan > >>>>>>>>> > >>>>>>>>> On Wed, Sep 5, 2018 at 8:20 AM, Bruce Schuchardt < > >>>>>>> bschucha...@pivotal.io> > >>>>>>>>> wrote: > >>>>>>>>> > >>>>>>>>>> Let's yank it > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>>> On 9/4/18 5:04 PM, Sean Goller wrote: > >>>>>>>>>>> > >>>>>>>>>>> If it's to get the release out, I'm fine with reverting. I > don't > >>>>>> like > >>>>>>>>> it, > >>>>>>>>>>> but I'm not willing to die on that hill. :) > >>>>>>>>>>> > >>>>>>>>>>> -S. > >>>>>>>>>>> > >>>>>>>>>>> On Tue, Sep 4, 2018 at 4:38 PM Dan Smith <dsm...@pivotal.io> > >>>>> wrote: > >>>>>>>>>>> > >>>>>>>>>>> Spitting this into a separate thread. > >>>>>>>>>>>> > >>>>>>>>>>>> I see the issue. The two minute timeout is the constructor for > >>>>>>>>>>>> AcceptorImpl, where it retries to bind for 2 minutes. > >>>>>>>>>>>> > >>>>>>>>>>>> That behavior makes sense for CacheServer.start. > >>>>>>>>>>>> > >>>>>>>>>>>> But it doesn't make sense for the new logic in > >>>>>>> GatewayReceiver.start() > >>>>>>>>>>>> from > >>>>>>>>>>>> GEODE-5591. That code is trying to use CacheServer.start to > >>>> scan > >>>>>> for > >>>>>>> an > >>>>>>>>>>>> available port, trying each port in a range. That free port > >>>>> finding > >>>>>>>>> logic > >>>>>>>>>>>> really doesn't want to have two minutes of retries for each > >>>> port. > >>>>>> It > >>>>>>>>>>>> seems > >>>>>>>>>>>> like we need to rework the fix for GEODE-5591. > >>>>>>>>>>>> > >>>>>>>>>>>> Does it make sense to hold up the release to rework this fix, > >>>> or > >>>>>>> should > >>>>>>>>>>>> we > >>>>>>>>>>>> just revert it? Have we switched concourse over to using > alpine > >>>>>>> linux, > >>>>>>>>>>>> which I think was the original motivation for this fix? > >>>>>>>>>>>> > >>>>>>>>>>>> -Dan > >>>>>>>>>>>> > >>>>>>>>>>>> On Tue, Sep 4, 2018 at 4:25 PM, Dan Smith <dsm...@pivotal.io> > >>>>>> wrote: > >>>>>>>>>>>> > >>>>>>>>>>>> Why is it waiting at all in this case? Where is this 2 minute > >>>>>> timeout > >>>>>>>>>>>>> coming from? > >>>>>>>>>>>>> > >>>>>>>>>>>>> -Dan > >>>>>>>>>>>>> > >>>>>>>>>>>>> On Tue, Sep 4, 2018 at 4:12 PM, Sai Boorlagadda < > >>>>>>>>>>>>> > >>>>>>>>>>>> sai.boorlaga...@gmail.com > >>>>>>>>>>>> > >>>>>>>>>>>>> wrote: > >>>>>>>>>>>>>> So the issue is that it takes longer to start than previous > >>>>>>> releases? > >>>>>>>>>>>>>> Also, is this wait time only when using Gfsh to create > >>>>>>>>>>>>>> gateway-receiver? > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> On Tue, Sep 4, 2018 at 4:03 PM Nabarun Nag <n...@apache.org > > > >>>>>>> wrote: > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Currently we have a minor issue in the release branch as > >>>>> pointed > >>>>>>> out > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>> by > >>>>>>>>>>>> > >>>>>>>>>>>>> Barry O. > >>>>>>>>>>>>>>> We will wait till a resolution is figured out for this > >>>> issue. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Steps: > >>>>>>>>>>>>>>> 1. create locator > >>>>>>>>>>>>>>> 2. start server --name=server1 --server-port=40404 > >>>>>>>>>>>>>>> 3. start server --name=server2 --server-port=40405 > >>>>>>>>>>>>>>> 4. create gateway-receiver --member=server1 > >>>>>>>>>>>>>>> 5. create gateway-receiver --member=server2 `This gets > stuck > >>>>>> for 2 > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>> minutes` > >>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Is the 2 minute wait time acceptable? Should we document > it? > >>>>>> When > >>>>>>> we > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>> revert > >>>>>>>>>>>>>> > >>>>>>>>>>>>>>> GEODE-5591, this issue does not happen. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Regards > >>>>>>>>>>>>>>> Nabarun Nag > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>>>> > >>>>>>>>> > >>>>>>> > >>>>>> > >>>>> > >>>> > >> > >> > >