Hey Jay,

Thanks for the comments. I'd be sorely tempted to default to an infinite
value for max.poll.interval.ms if we offered a separate rebalance timeout,
but it seems a little dangerous as long as we use the same setting for
both. In the worst case, a live-locked process could indefinitely keep the
entire group from rebalancing. I agree with Ewen as well that users really
ought to be thinking about liveness. But I think it would be reasonable to
increase the default I've suggested (one minute) up to 2-3 minutes. That
combined with a sensible default for max.poll.records (which is also part
of this KIP) should provide reasonable out-of-the-box behavior for most
users.

Thanks,
Jason

On Sun, Jun 19, 2016 at 7:28 PM, Ewen Cheslack-Postava <e...@confluent.io>
wrote:

> +1 on the KIP.
>
> On Sat, Jun 18, 2016 at 11:52 AM, Ismael Juma <ism...@juma.me.uk> wrote:
>
> > If we do that, shouldn't `max.poll.records` remain with the current
> default
> > of `Integer.MAX_VALUE`?
> >
> > On Sat, Jun 18, 2016 at 5:18 PM, Jay Kreps <j...@confluent.io> wrote:
> >
> > > +1
> > >
> > > One small but important thing I think we should consider changing: I
> > think
> > > we should consider setting the default for max.poll.interval to
> infinite.
> > > Previously our definition of alive was "polls within session timeout".
> > Now
> > > our definition of alive is "pings from b/g thread w/in session
> timeout".
> > > The reality is that anything we set here as a default is going to be
> too
> > > low for some people and those people will be confused. Previously that
> > was
> > > okay because the definition of liveness required you to understand and
> > > think about this setting, so getting an error meant you needed to think
> > > harder. But now this is really an optional setting for people who care
> to
> > > enforce some bound, so introducing a possible failure/instability mode
> by
> > > guess that bound seems wrong. Instead I think users that know and care
> > > about this should set a thoughtful max.poll.interval, but we should not
> > try
> > > to guess for those who don't.
> > >
> > > This seems minor but I think we've found that defaults matter more than
> > > configs so it's worth being thoughtful.
> >
>
> This is very, very not minor. This is a really important default and I
> strongly disagree that setting it to infinite is a good idea. Liveness
> isn't something you can ignore. People like to ignore it because it makes
> writing code easier, but that just means they write broken code. We can't
> avoid giving them the rope to hang themselves (they can always override the
> setting to a very large value), but we shouldn't encourage them to do it.
>
> Looking at various connectors (beyond just Kafka itself, as I was looking
> at other frameworks), there was quite a bit of code structured roughly as:
>
> while(true) {
>   try {
>     conn = open(url);
>     records = consumer.poll();
>     for (ConsumerRecord record : records) {
>       sendRecord(conn, record);
>     }
>   } catch (ConnectionException e) {
>     //ignore and retry
>   }
> }
>
> In other words, code which will never make progress under a configuration
> or deployment error. Handling errors by ignoring them isn't rare, which
> means you'll probably get no real indication of liveness unless you make it
> explicit (i.e. you *must* call something periodically). Of course, users
> can always continue to screw themselves over by also ignoring errors *we*
> produce and still keeping their app alive, but at that point we've done all
> we can and at least we'll have attempted to shift work to another instance.
>
> I think setting a *larger* default timeout could make sense -- there's no
> reasonable way to set a default since the message size, number of messages,
> and message processing time will all vary widely by application. But really
> the important distinction is (reasonable) finite timeout vs infinite. We
> shouldn't default to something that never lets you figure out that
> something is wrong. (If we could come up with a finite maximum value, I
> would absolutely want to add that as a strict maximum, but there's no such
> value.) If someone exceeds what we choose as a default maximum, it is
> *really* in their interest to understand why they need such a large timeout
> and whether there are better solutions to their problem.
>
> I'm sure it comes across as condescending, but we actually do know better
> than many application developers. There are implications of just dropping
> these timeouts that don't end well for the app. Ignoring those error cases
> and liveness issues works fine 99% of the time, but it's important if you
> really care about resiliency of your app and when things break they'll ask
> why we set a default value that leads to such bad results. I staunchly
> believe that it's better to explain why there is complexity and how to deal
> with it than to try to hide it when it can't really be hidden.
>
> -Ewen
>
> P.S. I would invoke checked exceptions as another case where framework devs
> try to encourage app developers to avoid hanging themselves yet app devs
> are given enough rope and end up hanging themselves, but the Kafka codebase
> eschews checked exceptions, so....
>
>
>
> > >
> > > -Jay
> > >
> > > On Thu, Jun 16, 2016 at 11:44 AM, Jason Gustafson <ja...@confluent.io>
> > > wrote:
> > >
> > > > Hi All,
> > > >
> > > > I'd like to open the vote for KIP-62. This proposal attempts to
> address
> > > one
> > > > of the recurring usability problems that users of the new consumer
> have
> > > > faced with as little impact as possible. You can read the full
> details
> > > > here:
> > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-62%3A+Allow+consumer+to+send+heartbeats+from+a+background+thread
> > > > .
> > > >
> > > > After some discussion on this list, I think we were in agreement that
> > > this
> > > > change addresses a major part of the problem and we've left the door
> > open
> > > > for further improvements, such as adding a heartbeat() API or a
> > > separately
> > > > configured rebalance timeout. Thanks in advance to everyone who
> helped
> > > > review the proposal.
> > > >
> > > > -Jason
> > > >
> > >
> >
>
>
>
> --
> Thanks,
> Ewen
>

Reply via email to