Re: [VOTE] KIP-62: Allow consumer to send heartbeats from a background thread

Ewen Cheslack-Postava Sun, 19 Jun 2016 19:29:33 -0700

+1 on the KIP.

On Sat, Jun 18, 2016 at 11:52 AM, Ismael Juma <ism...@juma.me.uk> wrote:

> If we do that, shouldn't `max.poll.records` remain with the current default
> of `Integer.MAX_VALUE`?
>
> On Sat, Jun 18, 2016 at 5:18 PM, Jay Kreps <j...@confluent.io> wrote:
>
> > +1
> >
> > One small but important thing I think we should consider changing: I
> think
> > we should consider setting the default for max.poll.interval to infinite.
> > Previously our definition of alive was "polls within session timeout".
> Now
> > our definition of alive is "pings from b/g thread w/in session timeout".
> > The reality is that anything we set here as a default is going to be too
> > low for some people and those people will be confused. Previously that
> was
> > okay because the definition of liveness required you to understand and
> > think about this setting, so getting an error meant you needed to think
> > harder. But now this is really an optional setting for people who care to
> > enforce some bound, so introducing a possible failure/instability mode by
> > guess that bound seems wrong. Instead I think users that know and care
> > about this should set a thoughtful max.poll.interval, but we should not
> try
> > to guess for those who don't.
> >
> > This seems minor but I think we've found that defaults matter more than
> > configs so it's worth being thoughtful.
>

This is very, very not minor. This is a really important default and I
strongly disagree that setting it to infinite is a good idea. Liveness
isn't something you can ignore. People like to ignore it because it makes
writing code easier, but that just means they write broken code. We can't
avoid giving them the rope to hang themselves (they can always override the
setting to a very large value), but we shouldn't encourage them to do it.

Looking at various connectors (beyond just Kafka itself, as I was looking
at other frameworks), there was quite a bit of code structured roughly as:

while(true) {
  try {
    conn = open(url);
    records = consumer.poll();
    for (ConsumerRecord record : records) {
      sendRecord(conn, record);
    }
  } catch (ConnectionException e) {
    //ignore and retry
  }
}

In other words, code which will never make progress under a configuration
or deployment error. Handling errors by ignoring them isn't rare, which
means you'll probably get no real indication of liveness unless you make it
explicit (i.e. you *must* call something periodically). Of course, users
can always continue to screw themselves over by also ignoring errors *we*
produce and still keeping their app alive, but at that point we've done all
we can and at least we'll have attempted to shift work to another instance.

I think setting a *larger* default timeout could make sense -- there's no
reasonable way to set a default since the message size, number of messages,
and message processing time will all vary widely by application. But really
the important distinction is (reasonable) finite timeout vs infinite. We
shouldn't default to something that never lets you figure out that
something is wrong. (If we could come up with a finite maximum value, I
would absolutely want to add that as a strict maximum, but there's no such
value.) If someone exceeds what we choose as a default maximum, it is
*really* in their interest to understand why they need such a large timeout
and whether there are better solutions to their problem.

I'm sure it comes across as condescending, but we actually do know better
than many application developers. There are implications of just dropping
these timeouts that don't end well for the app. Ignoring those error cases
and liveness issues works fine 99% of the time, but it's important if you
really care about resiliency of your app and when things break they'll ask
why we set a default value that leads to such bad results. I staunchly
believe that it's better to explain why there is complexity and how to deal
with it than to try to hide it when it can't really be hidden.

-Ewen

P.S. I would invoke checked exceptions as another case where framework devs
try to encourage app developers to avoid hanging themselves yet app devs
are given enough rope and end up hanging themselves, but the Kafka codebase
eschews checked exceptions, so....

> >
> > -Jay
> >
> > On Thu, Jun 16, 2016 at 11:44 AM, Jason Gustafson <ja...@confluent.io>
> > wrote:
> >
> > > Hi All,
> > >
> > > I'd like to open the vote for KIP-62. This proposal attempts to address
> > one
> > > of the recurring usability problems that users of the new consumer have
> > > faced with as little impact as possible. You can read the full details
> > > here:
> > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-62%3A+Allow+consumer+to+send+heartbeats+from+a+background+thread
> > > .
> > >
> > > After some discussion on this list, I think we were in agreement that
> > this
> > > change addresses a major part of the problem and we've left the door
> open
> > > for further improvements, such as adding a heartbeat() API or a
> > separately
> > > configured rebalance timeout. Thanks in advance to everyone who helped
> > > review the proposal.
> > >
> > > -Jason
> > >
> >
>

-- 
Thanks,
Ewen

Re: [VOTE] KIP-62: Allow consumer to send heartbeats from a background thread

Reply via email to