Jason, yeah I think raising to 3 minutes (or better yet 5 mins) would be
better since fewer people would hit it. I do think guessing is going to be
kind of annoying here since you will eventually hit the limit in prod and
curse us since it wasn't a limit you intended to set, but at least setting
a high default should make this unlikely.

Ewen, I agree people should ideally think about and set this limit. I guess
my feeling is that for people who do think about it, the default can be
infinite with no damage, they will set something reasonable and know what
to do if they hit it. For those who don't I just think our guessing a value
for all applications and not telling you is kind of annoying since it will
go off, probably in production, and you will be confused as to what is
happening.

One thing maybe we can agree is that it would help if the error message
tells you what config controls this so you have some hope of figuring out
what happened. Typically we have good intentions with these things but all
a reasonably informed person who didn't write the code can actually
conclude is something is broken, probably because Kafka is buggy (this was
the problem with the consumer originally--most of the confused people
weren't running up against the impossibility of configuring this value,
they just couldn't figure out what the heck was going on).

Also, checked exceptions? Really Ewen??? :-)

-Jay



On Mon, Jun 20, 2016 at 11:58 AM, Jason Gustafson <ja...@confluent.io>
wrote:

> Hey Jay,
>
> Thanks for the comments. I'd be sorely tempted to default to an infinite
> value for max.poll.interval.ms if we offered a separate rebalance timeout,
> but it seems a little dangerous as long as we use the same setting for
> both. In the worst case, a live-locked process could indefinitely keep the
> entire group from rebalancing. I agree with Ewen as well that users really
> ought to be thinking about liveness. But I think it would be reasonable to
> increase the default I've suggested (one minute) up to 2-3 minutes. That
> combined with a sensible default for max.poll.records (which is also part
> of this KIP) should provide reasonable out-of-the-box behavior for most
> users.
>
> Thanks,
> Jason
>
> On Sun, Jun 19, 2016 at 7:28 PM, Ewen Cheslack-Postava <e...@confluent.io>
> wrote:
>
> > +1 on the KIP.
> >
> > On Sat, Jun 18, 2016 at 11:52 AM, Ismael Juma <ism...@juma.me.uk> wrote:
> >
> > > If we do that, shouldn't `max.poll.records` remain with the current
> > default
> > > of `Integer.MAX_VALUE`?
> > >
> > > On Sat, Jun 18, 2016 at 5:18 PM, Jay Kreps <j...@confluent.io> wrote:
> > >
> > > > +1
> > > >
> > > > One small but important thing I think we should consider changing: I
> > > think
> > > > we should consider setting the default for max.poll.interval to
> > infinite.
> > > > Previously our definition of alive was "polls within session
> timeout".
> > > Now
> > > > our definition of alive is "pings from b/g thread w/in session
> > timeout".
> > > > The reality is that anything we set here as a default is going to be
> > too
> > > > low for some people and those people will be confused. Previously
> that
> > > was
> > > > okay because the definition of liveness required you to understand
> and
> > > > think about this setting, so getting an error meant you needed to
> think
> > > > harder. But now this is really an optional setting for people who
> care
> > to
> > > > enforce some bound, so introducing a possible failure/instability
> mode
> > by
> > > > guess that bound seems wrong. Instead I think users that know and
> care
> > > > about this should set a thoughtful max.poll.interval, but we should
> not
> > > try
> > > > to guess for those who don't.
> > > >
> > > > This seems minor but I think we've found that defaults matter more
> than
> > > > configs so it's worth being thoughtful.
> > >
> >
> > This is very, very not minor. This is a really important default and I
> > strongly disagree that setting it to infinite is a good idea. Liveness
> > isn't something you can ignore. People like to ignore it because it makes
> > writing code easier, but that just means they write broken code. We can't
> > avoid giving them the rope to hang themselves (they can always override
> the
> > setting to a very large value), but we shouldn't encourage them to do it.
> >
> > Looking at various connectors (beyond just Kafka itself, as I was looking
> > at other frameworks), there was quite a bit of code structured roughly
> as:
> >
> > while(true) {
> >   try {
> >     conn = open(url);
> >     records = consumer.poll();
> >     for (ConsumerRecord record : records) {
> >       sendRecord(conn, record);
> >     }
> >   } catch (ConnectionException e) {
> >     //ignore and retry
> >   }
> > }
> >
> > In other words, code which will never make progress under a configuration
> > or deployment error. Handling errors by ignoring them isn't rare, which
> > means you'll probably get no real indication of liveness unless you make
> it
> > explicit (i.e. you *must* call something periodically). Of course, users
> > can always continue to screw themselves over by also ignoring errors *we*
> > produce and still keeping their app alive, but at that point we've done
> all
> > we can and at least we'll have attempted to shift work to another
> instance.
> >
> > I think setting a *larger* default timeout could make sense -- there's no
> > reasonable way to set a default since the message size, number of
> messages,
> > and message processing time will all vary widely by application. But
> really
> > the important distinction is (reasonable) finite timeout vs infinite. We
> > shouldn't default to something that never lets you figure out that
> > something is wrong. (If we could come up with a finite maximum value, I
> > would absolutely want to add that as a strict maximum, but there's no
> such
> > value.) If someone exceeds what we choose as a default maximum, it is
> > *really* in their interest to understand why they need such a large
> timeout
> > and whether there are better solutions to their problem.
> >
> > I'm sure it comes across as condescending, but we actually do know better
> > than many application developers. There are implications of just dropping
> > these timeouts that don't end well for the app. Ignoring those error
> cases
> > and liveness issues works fine 99% of the time, but it's important if you
> > really care about resiliency of your app and when things break they'll
> ask
> > why we set a default value that leads to such bad results. I staunchly
> > believe that it's better to explain why there is complexity and how to
> deal
> > with it than to try to hide it when it can't really be hidden.
> >
> > -Ewen
> >
> > P.S. I would invoke checked exceptions as another case where framework
> devs
> > try to encourage app developers to avoid hanging themselves yet app devs
> > are given enough rope and end up hanging themselves, but the Kafka
> codebase
> > eschews checked exceptions, so....
> >
> >
> >
> > > >
> > > > -Jay
> > > >
> > > > On Thu, Jun 16, 2016 at 11:44 AM, Jason Gustafson <
> ja...@confluent.io>
> > > > wrote:
> > > >
> > > > > Hi All,
> > > > >
> > > > > I'd like to open the vote for KIP-62. This proposal attempts to
> > address
> > > > one
> > > > > of the recurring usability problems that users of the new consumer
> > have
> > > > > faced with as little impact as possible. You can read the full
> > details
> > > > > here:
> > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-62%3A+Allow+consumer+to+send+heartbeats+from+a+background+thread
> > > > > .
> > > > >
> > > > > After some discussion on this list, I think we were in agreement
> that
> > > > this
> > > > > change addresses a major part of the problem and we've left the
> door
> > > open
> > > > > for further improvements, such as adding a heartbeat() API or a
> > > > separately
> > > > > configured rebalance timeout. Thanks in advance to everyone who
> > helped
> > > > > review the proposal.
> > > > >
> > > > > -Jason
> > > > >
> > > >
> > >
> >
> >
> >
> > --
> > Thanks,
> > Ewen
> >
>

Reply via email to