C0urante commented on PR #14562:
URL: https://github.com/apache/kafka/pull/14562#issuecomment-1771701339

   Thanks Greg. To be clear, this isn't really a first pass--it's the first 
published one :)
   
   With regards to giving each request a single scope--this would fail to 
capture blocking operations outside of the request (such as other requests, 
ensuring group membership, and reading to the end of the config topic). I 
suppose we could do something similar where we decompose `tick()` into a series 
of either 1) method invocations that have a single scope or 2) herder requests 
that have their own scopes (at least one, possibly more).
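   To make that concrete, here's a rough sketch of what option 1 could look like (all names here are made up for illustration, not code from this PR):

   ```java
   // Illustrative only: tick() becomes a series of method invocations that each
   // record a single stage while they run, cleared once the step finishes.
   public class StagedTickSketch {

       // volatile so a REST thread can read the stage recorded by the tick thread
       private volatile String currentStage;

       public void tick() {
           runWithStage("ensuring group membership", this::ensureMembership);
           runWithStage("reading to the end of the config topic", this::readConfigToEnd);
           runWithStage("handling external requests", this::handleExternalRequests);
       }

       private void runWithStage(String stage, Runnable step) {
           currentStage = stage;
           try {
               step.run();
           } finally {
               currentStage = null;
           }
       }

       public String currentStage() {
           return currentStage;
       }

       // Placeholders standing in for the worker's potentially-blocking operations
       private void ensureMembership() { }
       private void readConfigToEnd() { }
       private void handleExternalRequests() { }
   }
   ```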
   
   As far as breaking down herder requests into several smaller requests goes, 
I don't think it's wise to do this when not absolutely necessary. The only 
reason we added that kind of decomposition in 
https://github.com/apache/kafka/pull/8069 was to avoid blocking on interactions 
with connector plugins that could otherwise disable the herder's work thread. 
If there are functional benefits to decomposing operations further, then we can 
and should explore those, but the risk of regression alone outweighs any purely cosmetic benefit, not to mention the additional implementation complexity and the hit to readability. I don't think this 
rules out the possibility of trying to add some more scope-based structure to 
herder requests so that, for example, every request is a series of single-scope 
method invocations, but I'm also not sure that that approach is realistic since 
it may increase the odds of inaccurate error messages.
   
   I'm also trying to optimize for user benefit with our error messages here. 
Telling someone that the worker is blocked reading to the end of the config 
topic is useful; telling someone that the worker is blocked while deleting a 
connector is less useful, especially if that error message is delivered in 
response to the request that triggered connector deletion. If we do opt for a 
different approach, I'd like it if we could add fine-grained error messages 
with the first PR, and not as a follow-up item.
   
   With regards to the so-called "smells":
   
   > I think having the stage be both a property of the callback and the herder 
was a bit odd; perhaps the HTTP request thread can just query the herder 
directly when the timeout occurs.
   
   I explored this approach initially, but it comes with a pretty severe 
drawback: there's no guarantee that a request that took too long timed out 
because it was blocked on the herder thread. Connector validation is the 
biggest example of this, though there are also connector restarts, which currently block on the connector actually completing startup. In cases like 
these, it could be misleading to users to tell them what the herder thread is 
doing when that information has no bearing on why their request timed out.
   
   > Related, the linear "set the Stage for all queued requests" in seems bad 
to me. The queue is supposed to be small of course, but if timeouts are 
happening, it may be happening due to high tick thread contention/request load. 
In that situation, having every bit of progress on the tick thread need to 
update the queued requests is ultimately a quadratic cost, potentially making 
tick thread contention worse.
   
   I'm happy to do some benchmarking if we're worried about the performance 
impact of this approach. I'm personally a little skeptical that setting a new value for the `ConvertingFutureCallback.currentStage` field will have a significant cost, even if the field is marked `volatile`. It's also a little strange to 
refer to this as a quadratic cost, since I can't envision a reasonable version 
of this approach that records any more than a few dozen herder tick thread 
stages, even with maximum granularity.
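   For concreteness, the cost we're discussing is roughly one volatile write per queued callback each time the tick thread moves to a new stage. Something along these lines (hypothetical names, not the PR's actual classes):

   ```java
   import java.util.Queue;
   import java.util.concurrent.ConcurrentLinkedQueue;

   // Illustrative only: one volatile field per queued callback, written by the
   // tick thread on each stage transition and read by the HTTP thread on timeout.
   class StagedCallback {
       volatile String currentStage;
   }

   class StageRecorder {
       private final Queue<StagedCallback> queuedRequests = new ConcurrentLinkedQueue<>();

       void add(StagedCallback callback) {
           queuedRequests.add(callback);
       }

       // Called by the tick thread whenever it starts a new stage
       void recordStage(String stage) {
           for (StagedCallback callback : queuedRequests)
               callback.currentStage = stage;
       }
   }
   ```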
   
   > Setting the stage to null after completing some operation completes leaves 
the opportunity for race conditions to degrade the error messages. Between slow 
operations (presumably when the thread is Running) the error message wouldn't 
have the context for what slow operation just finished, even though if the 
timeout happened a fraction of a second earlier, it would be displayed. This is 
acceptable if the wait times are large, the time spent Running is small, and we 
never forget to add a Stage for a slow operation. But if any of those aren't 
true, the diagnostic power of this feature decreases.
   
   I think the key point here is that we need to be careful to record stages 
for potentially-slow operations, which I've tried to do here. But it's true 
that in the future, if we move things around and forget to wrap a blocking 
operation with, e.g., a `TickThreadStage`, then users will lose insight into those operations. I haven't found a good alternative yet that doesn't run the risk of 
reporting incorrect operations. Perhaps we could augment the error message not 
just with when the operation started, but also, if applicable, when it 
completed? That way, in a racy scenario like one where we block for 89 seconds reading from the config topic and then time out a second later, users could 
still get reasonable insight into the problem ("the last thing the worker was 
working on was <some operation>, which began at <time> and ended at 
<time>"). Regardless, I do acknowledge this as a drawback of the current 
approach.
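   For the start/end-time idea above, something along these lines could work (purely illustrative; these aren't the PR's classes):

   ```java
   import java.time.Instant;

   // Illustrative only: a stage that keeps its start time and, once finished, its
   // end time, so the timeout message can describe an operation that completed
   // just before the timeout fired.
   class Stage {
       final String description;
       final Instant started;
       volatile Instant completed; // null while the stage is still in progress

       Stage(String description, Instant started) {
           this.description = description;
           this.started = started;
       }

       String summarize() {
           if (completed == null)
               return "The worker is currently " + description + ", which began at " + started;
           return "The last thing the worker was working on was " + description
                   + ", which began at " + started + " and ended at " + completed;
       }
   }
   ```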
   
   > Statistically, the operation which is running when the timeout occurs is 
the most likely to have caused the timeout, but all of the actions running 
since the request started will contribute. To use an analogy: If you're 
experiencing slowness on your computer, you can sample the CPU and figure out 
that Program A happens to be running at that moment. But wouldn't it be more 
informative to know that over the past 30 seconds, 99% of the CPU was spent on 
program B, and you just got "unlucky" and observed program A was running? If 
you can only sample the CPU, you'd have to take multiple samples (issue 
multiple REST requests) to learn what the real problem was.
   
   Aha! I think the last sentence is actually a great, succinct way of 
justifying this approach. The first reaction most people have to a timeout 
message like this is to assume (or hope, or pray) that it's a transient 
problem, and issue one or more requests to retry the operation. The idea here 
is to provide more and more benefit as the number of retries increases. If you 
see a single timeout error and then the next request succeeds, it doesn't 
matter too much why the first one timed out. On the other hand, if your worker 
is borked and every request is hanging, then this approach (especially with the 
granularity it provides and the emphasis it places on capturing 
potentially-blocking operations) gives a nice head start to diagnosing the 
problem without having to pore through worker logs first.
   
   Still, if we want to be even more informative, we could augment the currently-proposed error message not just with the current stage, but with a history of stages and their start (and possibly end) times. IMO this is a bit much for 
the first patch and it'd be nice to wait for user feedback to see if it's 
actually necessary, but I wouldn't be opposed to it if others (including you) 
feel differently.
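   If we did go that route, one possible shape (again, purely hypothetical) would be an append-only history rendered into the timeout message:

   ```java
   import java.time.Instant;
   import java.util.List;
   import java.util.concurrent.CopyOnWriteArrayList;

   // Illustrative only: the callback keeps every stage it has seen, not just the
   // current one, so a single timed-out request can show everything the tick
   // thread did while the request was queued.
   class StageHistory {
       record Entry(String description, Instant started, Instant ended) { }

       private final List<Entry> entries = new CopyOnWriteArrayList<>();

       void record(String description, Instant started, Instant ended) {
           entries.add(new Entry(description, started, ended));
       }

       String summarize() {
           StringBuilder result = new StringBuilder("Stages since the request was received:");
           for (Entry entry : entries) {
               result.append("\n  - ").append(entry.description())
                     .append(" (").append(entry.started()).append(" to ").append(entry.ended()).append(")");
           }
           return result.toString();
       }
   }
   ```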
   
   
   Overall, I think it might be helpful to provide a few principles I've tried 
to satisfy with this approach and see if they're agreeable:
   
   - It's acceptable to sometimes provide no error message
   - It's much less acceptable to provide an inaccurate error message
   - We should optimize for cases when the worker has been blocked for a long 
time over cases when the first request times out

