Has anyone done any large scale audits of XHTML in the wild to
determine the percentage that parse correctly?

Yes, Ian Hickson at Google did a survey of about 1B pages and found that over 90% had *well-formedness* errors. I can't find a reference offhand, but it may be buried somewhere in [#webstats].

I'm curious about the assumptions one could make when assuming that
XHTML is well formed.

Specifically, the probability that a naive non-XML parser can make
while indexing the content.

I'm not sure what you mean here, but I'd recommend against using an XML parser on web content; use something like the HTML5 parsing algorithm [#html5-parsing] instead.
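
To make that concrete, here is a rough sketch using only Python's standard library. The sample markup and the helper names (`parses_as_xml`, `TextCollector`, `extract_text`) are made up for illustration. Note that `html.parser` is only a tolerant tokenizer, not an implementation of the HTML5 parsing algorithm; a library such as html5lib would be closer to what I mean, but the stdlib is enough to show the difference in failure behaviour.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical page fragment: <br> and the <p> elements are never closed,
# so it is not well-formed XML.
SAMPLE = "<html><body><p>First<br><p>Second</body></html>"


def parses_as_xml(markup):
    """Return True only if the markup is well-formed XML."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


class TextCollector(HTMLParser):
    """Collects text nodes without requiring well-formed input."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def extract_text(markup):
    """Return the non-empty text chunks found in the markup."""
    collector = TextCollector()
    collector.feed(markup)
    collector.close()
    return collector.chunks


if __name__ == "__main__":
    print("well-formed XML:", parses_as_xml(SAMPLE))
    print("text recovered by tolerant parser:", extract_text(SAMPLE))
```

The XML parser rejects the whole document at the first error, so an indexer built on it gets nothing from such a page, while the tolerant parser still recovers the text.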


[webstats]: http://code.google.com/webstats/
[html5-parsing]: http://whatwg.org/specs/web-apps/current-work/#parsing
