Re: html fetcher/parser

Faux Amis via Digitalmars-d-learn Mon, 14 Aug 2017 16:17:01 -0700

On 2017-08-13 19:51, Adam D. Ruppe wrote:

On Sunday, 13 August 2017 at 15:54:45 UTC, Faux Amis wrote:
Just curious, but is there a spec of sorts which defines which errors should be fixed and such?
The HTML5 spec describes how you are supposed to parse various things, including the recovery paths for broken markup.
My module, however, isn't so formal. I just used it for a web scraping thing at work that hit a few hundred sites and fixed bugs as they came up to give good enough results for me... (one thing I found is a lot of sites claiming to be UTF-8 are actually latin-1, so it validates and falls back to handle that. My http thing, while buggier, is similar - I hit a server once that ignored the accept gzip header and always sent it anyway, so I had to handle that... and I noticed curl actually didn't!)
So on the one hand, there's surely still bugs and weird cases, but on the other hand, it did get a fair chunk of real-world use so I am fairly confident it will be ok for most things.
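
The two recovery tricks mentioned above (validate as UTF-8 and fall back to latin-1, and cope with a body that arrives gzipped regardless of what was requested) can be sketched roughly as follows. This is only an illustration in Python, not the code from the module under discussion; the function names `maybe_gunzip`, `decode_html` and `read_page` are made up for the example, and sniffing the gzip magic bytes is an assumption about how one might detect the unrequested compression.

```python
import gzip
import zlib

# First two bytes of every gzip stream.
GZIP_MAGIC = b"\x1f\x8b"


def maybe_gunzip(body: bytes) -> bytes:
    """Decompress the body if it looks like gzip, whatever the headers said."""
    if body[:2] != GZIP_MAGIC:
        return body
    try:
        return gzip.decompress(body)
    except (OSError, EOFError, zlib.error):
        # Started with the magic bytes but is not a valid gzip stream:
        # hand back the raw bytes unchanged.
        return body


def decode_html(body: bytes) -> str:
    """Trust the UTF-8 claim only if the bytes actually validate."""
    try:
        return body.decode("utf-8")
    except UnicodeDecodeError:
        # latin-1 maps every byte to a code point, so this cannot fail.
        return body.decode("latin-1")


def read_page(body: bytes) -> str:
    """Undo unrequested gzip first, then decode with the latin-1 fallback."""
    return decode_html(maybe_gunzip(body))
```

For example, `read_page("café".encode("latin-1"))` yields the same string as `read_page("café".encode("utf-8"))`, because the lone 0xE9 byte is not valid UTF-8 and triggers the fallback. The fallback never raises, so a page in some third encoding would decode to wrong characters rather than fail.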


Sounds good!

(Although following the spec would be the first step to a D html layout engine :D )
