C0urante commented on PR #14562: URL: https://github.com/apache/kafka/pull/14562#issuecomment-1771701339
Thanks Greg. To be clear, this isn't really a first pass--it's the first published one :)

With regards to giving each request a single scope--this would fail to capture blocking operations outside of the request (such as other requests, ensuring group membership, and reading to the end of the config topic). I suppose we could do something similar where we decompose `tick()` into a series of either 1) method invocations that have a single scope or 2) herder requests that have their own scopes (at least one, possibly more).

As far as breaking down herder requests into several smaller requests goes, I don't think it's wise to do this when not absolutely necessary. The only reason we added that kind of decomposition in https://github.com/apache/kafka/pull/8069 was to avoid blocking on interactions with connector plugins that could otherwise disable the herder's work thread. If there are functional benefits to decomposing operations further, then we can and should explore those, but the risk of regression alone outweighs the benefit of doing so solely for cosmetic reasons, not to mention the additional implementation complexity and the hit to readability. I don't think this rules out the possibility of adding some more scope-based structure to herder requests so that, for example, every request is a series of single-scope method invocations, but I'm also not sure that approach is realistic, since it may increase the odds of inaccurate error messages.

I'm also trying to optimize for user benefit with our error messages here. Telling someone that the worker is blocked reading to the end of the config topic is useful; telling someone that the worker is blocked while deleting a connector is less useful, especially if that error message is delivered in response to the request that triggered connector deletion. If we do opt for a different approach, I'd like it if we could add fine-grained error messages with the first PR, and not as a follow-up item.
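To make the trade-off concrete, here's a minimal sketch of what stage-scoped decomposition of `tick()` could look like. All class, method, and stage names here are illustrative only, not Kafka's actual API: the thread records its stage before each potentially blocking call, so a timed-out REST request can report what the herder thread was doing at the time.

```java
// Hedged sketch of a stage-scoped tick loop; names are hypothetical.
public class TickSketch {
    public enum Stage { ENSURING_MEMBERSHIP, READING_CONFIG_TOPIC_TO_END, HANDLING_REQUESTS, IDLE }

    // volatile so the HTTP request thread sees the latest value without locking
    private volatile Stage currentStage = Stage.IDLE;

    public Stage currentStage() {
        return currentStage;
    }

    public void tick() {
        currentStage = Stage.ENSURING_MEMBERSHIP;
        ensureMembership();                 // may block on the group coordinator
        currentStage = Stage.READING_CONFIG_TOPIC_TO_END;
        readConfigTopicToEnd();             // may block on the config topic
        currentStage = Stage.HANDLING_REQUESTS;
        handleRequests();                   // herder requests, each with its own scope(s)
        currentStage = Stage.IDLE;
    }

    // no-op stand-ins for the real blocking operations
    private void ensureMembership() { }
    private void readConfigTopicToEnd() { }
    private void handleRequests() { }
}
```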
With regards to the so-called "smells":

> I think having the stage be both a property of the callback and the herder was a bit odd; perhaps the HTTP request thread can just query the herder directly when the timeout occurs.

I explored this approach initially, but it comes with a pretty severe drawback: there's no guarantee that a request that took too long timed out because it was blocked on the herder thread. Connector validation is the biggest example of this, though there are also connector restarts, which currently block on the connector actually completing startup. In cases like these, it could be misleading to tell users what the herder thread is doing when that information has no bearing on why their request timed out.

> Related, the linear "set the Stage for all queued requests" seems bad to me. The queue is supposed to be small of course, but if timeouts are happening, it may be happening due to high tick thread contention/request load. In that situation, having every bit of progress on the tick thread need to update the queued requests is ultimately a quadratic cost, potentially making tick thread contention worse.

I'm happy to do some benchmarking if we're worried about the performance impact of this approach. I'm personally a little skeptical that setting a new value for the `ConvertingFutureCallback.currentStage` field is likely to be significant, even if the field is marked `volatile`. It's also a little strange to refer to this as a quadratic cost, since I can't envision a reasonable version of this approach that records any more than a few dozen herder tick thread stages, even with maximum granularity.

> Setting the stage to null after some operation completes leaves the opportunity for race conditions to degrade the error messages.
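For reference on the cost question, the propagation step under discussion amounts to something like the following sketch (simplified, hypothetical names; the real class is `ConvertingFutureCallback`). Each stage transition costs one volatile write per queued callback, so with a small queue and a few dozen stages per tick, the total work is bounded and modest, which is the claim worth benchmarking.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hedged sketch of "set the stage for all queued requests"; illustrative only.
public class StagePropagation {
    public static class Callback {
        // volatile so the HTTP request thread reading it on timeout sees the latest stage
        volatile String currentStage;
    }

    private final Queue<Callback> queued = new ArrayDeque<>();

    public void enqueue(Callback cb) {
        queued.add(cb);
    }

    // Called from the tick thread whenever it enters a new stage
    public void recordStage(String stage) {
        for (Callback cb : queued) {
            cb.currentStage = stage;   // a single volatile write per queued callback
        }
    }
}
```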
> Between slow operations (presumably when the thread is Running) the error message wouldn't have the context for what slow operation just finished, even though if the timeout happened a fraction of a second earlier, it would be displayed. This is acceptable if the wait times are large, the time spent Running is small, and we never forget to add a Stage for a slow operation. But if any of those aren't true, the diagnostic power of this feature decreases.

I think the key point here is that we need to be careful to record stages for potentially-slow operations, which I've tried to do here. But it's true that if, in the future, we move things around and forget to wrap a blocking operation with, e.g., a `TickThreadStage`, then users will lose insight into those operations. I haven't found a good alternative yet that doesn't run the risk of reporting incorrect operations. Perhaps we could augment the error message not just with when the operation started, but also, if applicable, when it completed? That way, in a racy scenario like one where we block for 89 seconds reading from the config topic and then time out a second later, users could still get reasonable insight into the problem ("the last thing the worker was working on was <some operation>, which began at <time> and ended at <time>"). Regardless, I do acknowledge this as a drawback of the current approach.

> Statistically, the operation which is running when the timeout occurs is the most likely to have caused the timeout, but all of the actions running since the request started will contribute. To use an analogy: if you're experiencing slowness on your computer, you can sample the CPU and figure out that Program A happens to be running at that moment. But wouldn't it be more informative to know that over the past 30 seconds, 99% of the CPU was spent on Program B, and you just got "unlucky" and observed Program A was running?
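The "record completion instead of nulling" idea could look roughly like this sketch (hypothetical names, not part of the patch): when an operation finishes, its end time is recorded rather than clearing the stage, so a request that times out just after a long stage can still report when that stage began and ended.

```java
import java.time.Instant;

// Hedged sketch of a stage that keeps its completion time; illustrative only.
public class RecordedStage {
    private final String description;
    private final Instant started;
    private volatile Instant completed;   // null while the stage is in progress

    public RecordedStage(String description, Instant started) {
        this.description = description;
        this.started = started;
    }

    // Called when the operation finishes, instead of nulling the stage
    public void complete(Instant when) {
        this.completed = when;
    }

    // Fragment for the timeout error message
    public String summary() {
        if (completed == null) {
            return description + ", which began at " + started;
        }
        return description + ", which began at " + started + " and ended at " + completed;
    }
}
```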
> If you can only sample the CPU, you'd have to take multiple samples (issue multiple REST requests) to learn what the real problem was.

Aha! I think that last sentence is actually a great, succinct way of justifying this approach. The first reaction most people have to a timeout message like this is to assume (or hope, or pray) that it's a transient problem, and to issue one or more requests to retry the operation. The idea here is to provide more and more benefit as the number of retries increases. If you see a single timeout error and then the next request succeeds, it doesn't matter too much why the first one timed out. On the other hand, if your worker is borked and every request is hanging, then this approach (especially with the granularity it provides and the emphasis it places on capturing potentially-blocking operations) gives a nice head start for diagnosing the problem without having to pore through worker logs first.

Still, if we want to be even more informative, we could augment the currently-proposed error message not just with the current stage, but with a history of stages and their start (and possibly end) times. IMO this is a bit much for the first patch and it'd be nice to wait for user feedback to see if it's actually necessary, but I wouldn't be opposed to it if others (including you) feel differently.

Overall, I think it might be helpful to lay out a few principles I've tried to satisfy with this approach and see if they're agreeable:
- It's acceptable to sometimes provide no error message
- It's much less acceptable to provide an inaccurate error message
- We should optimize for cases where the worker has been blocked for a long time over cases where the first request times out
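If we ever do want the history-of-stages variant, a small bounded buffer would probably suffice. This is a hedged sketch under assumed names (not part of the proposed patch): keep the last N stages so the timeout message can show where the time actually went, rather than a single sample.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hedged sketch of a bounded stage history; illustrative only.
public class StageHistory {
    private final int maxEntries;
    private final Deque<String> entries = new ArrayDeque<>();

    public StageHistory(int maxEntries) {
        this.maxEntries = maxEntries;
    }

    // Called from the tick thread on each stage transition
    public synchronized void record(String stage) {
        if (entries.size() == maxEntries) {
            entries.removeFirst();   // drop the oldest entry to bound memory
        }
        entries.addLast(stage);
    }

    // Read by the HTTP request thread when building the timeout error message
    public synchronized List<String> snapshot() {
        return new ArrayList<>(entries);
    }
}
```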