Stepping back for a moment, I think we're facing two key issues here: The first key issue is that the docs for std.benchmark don't adequately explain Andre's intended charter/scope for it, its methodology, or the rationale for its methodology. So people see "benchmark" and they think "oh, ok, for timing stuff", but it appears to be intended for very specific use-cases. I think this entire discussion serves as evidence that, at the very least, it needs to communicate that scope/methodology/rationale better than it currently does. If all of us are having trouble "getting it", then others certainly will too.
Aside from that, there's the second key issue: whether the current intended scope is sufficient. Should it be more general in scope and not so specialized? Personally, I would tend to think so, and I think that seems to be the popular notion. But I don't know for sure. If it should be more generalized, then does it need to be so for the first iteration, or can it be done later after being added to Phobos? That, I have no idea.