Thanks, that makes sense, and I agree the raw signals are computed from
real data.

I probably phrased my point unclearly. I did not mean that the model is
guessing the PR count, commit count, reviewer diversity, or other measured
values. My concern is more about the next step: how those measured values
are interpreted for the report.

Another way to put it is: if we cannot explain how a human reviewer should
interpret a signal, I am not sure we should expect the model to make that
judgment reliably either. The model can help with summarising, drafting,
and surfacing candidate concerns, but the meaning of those signals still
needs to come from an explicit review policy or from a human reviewer.

So perhaps the goal should be to have the tool produce measured facts and
possible signals, have the prompt/rules explain how those signals should
usually be treated, and have the LLM mark uncertain cases as needing human
confirmation rather than turning them into firm conclusions.
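
To make that split a bit more concrete, here is a minimal sketch of what I have in mind. Everything in it is hypothetical: the signal names, the `Rule` shape, and the thresholds are placeholders for illustration, not a proposal for actual values. The point is only that the interpretation lives in an explicit policy table, and anything the policy does not clearly cover is flagged for a human instead of being decided by the model.

```python
from dataclasses import dataclass
from enum import Enum


class Treatment(Enum):
    OK = "ok"
    CONCERN = "possible concern"
    NEEDS_HUMAN = "needs human confirmation"


@dataclass(frozen=True)
class Rule:
    # Hypothetical policy entry: values at or above `ok_at` are treated as
    # fine, values at or below `concern_at` are surfaced as a possible concern.
    ok_at: float
    concern_at: float


# Illustrative thresholds only; real ones would come from the review policy.
POLICY = {
    "reviewer_diversity": Rule(ok_at=5, concern_at=1),
    "commit_count": Rule(ok_at=50, concern_at=5),
}


def classify(signal: str, value: float) -> Treatment:
    """Map a measured value to a treatment using the explicit policy only."""
    rule = POLICY.get(signal)
    if rule is None:
        # No policy for this signal: do not guess, ask a human.
        return Treatment.NEEDS_HUMAN
    if value >= rule.ok_at:
        return Treatment.OK
    if value <= rule.concern_at:
        return Treatment.CONCERN
    # Between the two thresholds the policy is silent, so defer to a human.
    return Treatment.NEEDS_HUMAN
```

With something like this, the LLM would receive the treatment alongside the measured value and would only need to phrase it, not decide it.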

One related thought: if the report normally needs the same categories of
input every time, perhaps the repeated multi-step MCP workflow should
eventually be wrapped in a higher-level tool.

For example, instead of asking the model to call several MCPs in sequence,
reconcile the results, and decide which signals matter, a tool could return
a report evidence pack: podling metadata, reporting window, release
evidence, computed health signals, mailing-list highlights, possible
concerns, and fields requiring human input.
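
As a rough illustration of the shape such a pack could take, here is a sketch in Python. The class name, field names, and types are my assumptions, chosen to mirror the list above; they do not describe any existing tool or MCP.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import date


@dataclass
class EvidencePack:
    """Hypothetical container for everything one report draft needs."""

    podling: dict                     # podling metadata
    window_start: date                # reporting window
    window_end: date
    release_evidence: list = field(default_factory=list)
    health_signals: dict = field(default_factory=dict)
    mailing_list_highlights: list = field(default_factory=list)
    possible_concerns: list = field(default_factory=list)
    needs_human_input: list = field(default_factory=list)

    def to_json(self) -> str:
        # Dates are not JSON-serialisable by default, so fall back to str().
        return json.dumps(asdict(self), default=str, indent=2, sort_keys=True)


# Example with made-up values, only to show the structure.
pack = EvidencePack(
    podling={"name": "example-podling"},
    window_start=date(2024, 1, 1),
    window_end=date(2024, 3, 31),
    health_signals={"reviewer_diversity": 3},
    needs_human_input=["community health narrative"],
)
print(pack.to_json())
```

A single serialised object like this could be stored next to the generated report, so a reviewer can see exactly which evidence the draft was written from.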

The lower-level MCPs would still be useful for investigation, but the
standard report-generation path would become more auditable and less
dependent on the model orchestrating 5-10 calls correctly. The LLM could
then focus on the part it is better suited for: summarising less structured
information and turning the prepared evidence into readable draft prose.

Vladimir
