The thought is that a server would not complain about connection refused or inability to form a quorum during the first (say) twenty seconds of operation.
The thesis is that warnings from these causes during that time are spurious.

As I mentioned, I don't see this as urgent or even necessarily a good idea. I
completely reboot a ZK cluster once every year or three. When I am doing a
rolling upgrade, I *want* to see alerts when I bounce a machine. If I don't
want to see those alerts, my monitoring system allows me to put a machine into
maintenance mode for a short period of time to temporarily suppress the
warnings.

All I was doing was translating and elaborating the original poster's
suggestion, not so much endorsing it.

On Thu, Aug 18, 2011 at 8:54 AM, Flavio Junqueira <[email protected]> wrote:

> Hi Ted, I don't see how one can automate the distinction between a machine
> that is down because it crashed and a machine that is down because it hasn't
> started yet. Assuming that we are logging the machine unavailability as we
> are doing currently, one can always look at the timestamp of the warning and
> remember that this is the time the machines were bootstrapping.
> Consequently, I don't really see the point of reducing the number of
> warnings, unless the warnings are really polluting the logs. I typically
> don't see so many that prevents me from reading the rest, but you may have a
> different perception. Also, recall that we back off, so the warnings become
> less frequent over time.
>
> I'm open to ideas, though. If you see anything wrong in my rationale or if
> you have an idea of how to do it differently, then I'd be happy to hear.
> However, if the idea is simply to add a parameter that configures the time
> for leader election to start, then I'm currently not in favor.
>
> -Flavio
>
> On Aug 18, 2011, at 5:39 PM, Ted Dunning wrote:
>
> > Flavio,
> >
> > What you say is correct, but the original poster does have a point that
> > many of these warnings are to be expected and there is a heuristic that
> > might assist in distinguishing some of these cases so that false alarms in
> > the logs could be decreased.
> >
> > That doesn't seem like a big deal to me, but different people have
> > different itches. In my experience, restarting a ZK cluster from zero
> > almost never happens.
> >
> > On Thu, Aug 18, 2011 at 8:36 AM, Ted Dunning <[email protected]> wrote:
> >
> > > On Thu, Aug 18, 2011 at 12:15 AM, Sampath Perera <[email protected]> wrote:
> > >
> > > > Hhmmm, I think this is a bit different isn't it? Here we know that the
> > > > first server to come will be failing to connect to the other as they
> > > > are not yet up. Anyway our real issue is the warning.
> > >
> > > We know that.
> > >
> > > But how does the server know that it is the first server? That is the
> > > whole point of the leader election. You might just have a server
> > > rejoining a cluster. Or you might have a cluster that has been turned
> > > off. Or a cluster with 2 out of 5 machines off and we tried to touch the
> > > other down machine before the others.
> > >
> > > > > Would you like to suggest a patch?
> > > >
> > > > Of course I do.. will prepare a patch and attach.
> > >
> > > Great!
>
> *flavio* *junqueira*
> research scientist
> [email protected]
> direct +34 93-183-8828
> avinguda diagonal 177, 8th floor, barcelona, 08018, es
> phone (408) 349 3300 fax (408) 349 3301
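
For concreteness, the heuristic discussed in this thread (staying quiet about connection-refused and no-quorum conditions for the first twenty seconds or so after a server starts) could be sketched roughly as below. This is an illustration only, not ZooKeeper code: the class name `StartupGracePeriod`, the method `shouldWarn`, and the injected clock are all hypothetical, and the twenty-second figure is just the "say" value from the discussion. As Flavio points out, a timer like this cannot tell a crashed peer from one that has not started yet; it only changes when the warning is emitted.

```java
import java.util.function.LongSupplier;

/**
 * Hypothetical helper, not part of ZooKeeper: decides whether a connection
 * failure should be reported as a warning, based only on how long this
 * server has been running.
 */
final class StartupGracePeriod {
    private final long startMillis;   // time at which the server started
    private final long graceMillis;   // how long to stay quiet after startup
    private final LongSupplier clock; // injected so the logic can be tested

    StartupGracePeriod(long graceMillis, LongSupplier clock) {
        this.graceMillis = graceMillis;
        this.clock = clock;
        this.startMillis = clock.getAsLong();
    }

    /** Returns true once the grace period has elapsed since startup. */
    boolean shouldWarn() {
        return clock.getAsLong() - startMillis >= graceMillis;
    }
}
```

A caller would construct it once at startup, for example `new StartupGracePeriod(20_000L, System::currentTimeMillis)`, and consult `shouldWarn()` before logging at warning level, perhaps logging at a lower level otherwise so the timestamps Flavio mentions are still available.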
