Thanks, that makes sense, and I agree the raw signals are computed from real data.
I probably phrased my point unclearly. I did not mean that the model is guessing the PR count, commit count, reviewer diversity, or other measured values. My concern is more about the next step: how those measured values are interpreted for the report.

Another way to put it is: if we cannot explain how a human reviewer should interpret a signal, I am not sure we should expect the model to make that judgment reliably either. The model can help with summarising, drafting, and surfacing candidate concerns, but the meaning of those signals still needs to come from an explicit review policy or from a human reviewer.

So perhaps the goal should be to have the tool produce measured facts and possible signals, have the prompt/rules explain how those signals should usually be treated, and have the LLM mark uncertain cases as needing human confirmation rather than turning them into firm conclusions.

One related thought: if the report normally needs the same categories of input every time, perhaps the repeated multi-step MCP workflow should eventually be wrapped in a higher-level tool. For example, instead of asking the model to call several MCPs in sequence, reconcile the results, and decide which signals matter, a tool could return a report evidence pack: podling metadata, reporting window, release evidence, computed health signals, mailing-list highlights, possible concerns, and fields requiring human input.

The lower-level MCPs would still be useful for investigation, but the standard report-generation path would become more auditable and less dependent on the model orchestrating 5-10 calls correctly. The LLM could then focus on the part it is better suited for: summarising less structured information and turning the prepared evidence into readable draft prose.

Vladimir
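
P.S. To make the evidence pack idea a bit more concrete, here is a rough sketch in Python. It is only an illustration, not a proposal for the actual schema: the class names, field names, and sample values (EvidencePack, Signal, policy_note, "example-podling", and so on) are placeholders I made up, and the rule "a signal with no policy note needs human confirmation" is just one possible way to encode the idea.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Signal:
    """One measured value plus the review policy's guidance on reading it."""
    name: str                          # hypothetical name, e.g. "reviewer_diversity"
    value: float                       # computed by the tool from real data
    policy_note: Optional[str] = None  # guidance from the review policy, if any

    @property
    def needs_human(self) -> bool:
        # Assumption: with no explicit policy guidance, interpretation is
        # deferred to a human reviewer rather than left to the model.
        return self.policy_note is None


@dataclass
class EvidencePack:
    """Everything the standard report path needs, returned by one tool call."""
    podling: str
    reporting_window: str
    release_evidence: list[str] = field(default_factory=list)
    health_signals: list[Signal] = field(default_factory=list)
    mailing_list_highlights: list[str] = field(default_factory=list)
    possible_concerns: list[str] = field(default_factory=list)

    def fields_requiring_human_input(self) -> list[str]:
        # Names of signals the policy does not tell us how to interpret.
        return [s.name for s in self.health_signals if s.needs_human]


# Made-up sample data, purely for illustration.
pack = EvidencePack(
    podling="example-podling",
    reporting_window="2025-Q1",
    health_signals=[
        Signal("pr_count", 42, policy_note="informational only"),
        Signal("reviewer_diversity", 0.3),  # no policy guidance yet
    ],
)
print(pack.fields_requiring_human_input())  # ['reviewer_diversity']
```

The point of the sketch is that the "needs human confirmation" decision lives in data and policy, so it can be audited, and the LLM only receives the pack and drafts prose from it.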
