On Jan 16, 2008, at 12:41 AM, Kevin Burton wrote:

Has anyone done any large scale audits of XHTML in the wild to
determine the percentage that parse correctly?

Yes, Ian Hickson at Google did a survey of about 1B pages and found that over 90% had *well-formedness* errors. I can't find a reference offhand, but it may be buried somewhere in [#webstats].

I'm thinking about deploying one in Spinn3r but I'd rather focus on
other tasks if this has already been done.

I'd suggest working on other tasks. :)

I'm curious about the assumptions one could make when assuming that
XHTML is well formed.

You know what they say about assumptions.

Specifically, the probability that a naive non-XML parser can make
while indexing the content.

I'm not sure what you mean here, but I'd recommend against running an XML parser over web content. Use something like the HTML5 parsing algorithm [#html5-parsing] instead.
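
To make the difference concrete, here is a rough sketch in Python. It assumes a made-up sample page (`SAMPLE`) and uses the standard library's `html.parser` as a stand-in for a lenient parser. That module is not an implementation of the HTML5 parsing algorithm; it only illustrates that a strict XML parser rejects the kind of markup a tolerant parser can still index.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical sample: an unclosed <br> and <p>, typical of pages in the wild.
SAMPLE = "<html><body><p>Hello<br>world</body></html>"


def parses_as_xml(markup: str) -> bool:
    """Return True only if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


class TextCollector(HTMLParser):
    """Lenient parser that gathers text nodes and tolerates tag soup."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def extract_text(markup: str) -> str:
    """Collect the non-empty text chunks of the markup, joined by spaces."""
    collector = TextCollector()
    collector.feed(markup)
    collector.close()
    return " ".join(c.strip() for c in collector.chunks if c.strip())


if __name__ == "__main__":
    print("well-formed XML:", parses_as_xml(SAMPLE))
    print("indexable text:", extract_text(SAMPLE))
```

For real indexing work you would want a parser that actually follows the HTML5 algorithm, but the shape of the problem is the same: the strict path throws the page away, the lenient path still gives you text.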

-ryan

[webstats]: http://code.google.com/webstats/
[html5-parsing]: http://whatwg.org/specs/web-apps/current-work/#parsing
_______________________________________________
microformats-discuss mailing list
microformats-discuss@microformats.org
http://microformats.org/mailman/listinfo/microformats-discuss
