On 2017-08-13 19:51, Adam D. Ruppe wrote:
On Sunday, 13 August 2017 at 15:54:45 UTC, Faux Amis wrote:
Just curious, but is there a spec of sorts which defines which errors
should be fixed and such?
The HTML5 spec describes how you are supposed to parse various things,
including the recovery paths for broken markup.
My module, however, isn't so formal. I just used it for a web scraping
thing at work that hit a few hundred sites and fixed bugs as they came
up to give good enough results for me... (one thing I found is that a lot of
sites claiming to be UTF-8 are actually latin-1, so it validates the input and
falls back to latin-1 when that fails. My http thing, while buggier, is similar: I
hit a server once that ignored the Accept-Encoding header and always sent gzip
anyway, so I had to handle that... and I noticed curl actually didn't!)
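The two workarounds described above can be sketched roughly as follows. This is only an illustration in Python, not the code from the module under discussion; the function names `maybe_gunzip` and `decode_text` are made up for the example, and the strategy (sniff the gzip magic bytes, try strict UTF-8, then fall back to latin-1) is an assumption about one reasonable way to do it.

```python
import gzip
import zlib

# The first two bytes of every gzip stream.
GZIP_MAGIC = b"\x1f\x8b"


def maybe_gunzip(body: bytes) -> bytes:
    """Decompress the body if it looks like gzip, whatever we asked for.

    Hypothetical helper: it inspects the payload itself instead of trusting
    that the server honoured the request headers.
    """
    if body[:2] != GZIP_MAGIC:
        return body
    try:
        return gzip.decompress(body)
    except (OSError, EOFError, zlib.error):
        # The magic bytes matched by coincidence or the stream is damaged;
        # hand back the raw bytes rather than failing outright.
        return body


def decode_text(body: bytes) -> str:
    """Decode as strict UTF-8, falling back to latin-1 if that fails.

    Hypothetical helper: latin-1 maps every byte to a code point, so the
    fallback cannot raise, although it may still guess the wrong charset.
    """
    try:
        return body.decode("utf-8")
    except UnicodeDecodeError:
        return body.decode("latin-1")


if __name__ == "__main__":
    # A page that claims UTF-8 but is really latin-1, sent gzipped.
    raw = gzip.compress("<p>café</p>".encode("latin-1"))
    print(decode_text(maybe_gunzip(raw)))
```

The fallback order matters: strict UTF-8 rejects most latin-1 text containing non-ASCII bytes, whereas latin-1 accepts anything, so trying latin-1 first would never reach the UTF-8 path.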
So on the one hand, there are surely still bugs and weird cases, but on
the other hand, it did get a fair chunk of real-world use so I am fairly
confident it will be ok for most things.
Sounds good!
(Although following the spec would be the first step to a D HTML layout
engine :D )