The following module was proposed for inclusion in the Module List:
modid: HTML::Dirty
DSLIP: bdpOp
description: Parser for dirty, messed up HTML
userid: MIKO (Miko O'Sullivan)
chapterid: 15 (World_Wide_Web_HTML_HTTP_CGI)
communities:
similar:
HTML::Parser
rationale:
I created HTML::Dirty while trying to parse pages on the web whose
sloppy, syntactically messed-up markup HTML::Parser couldn't handle.
When I found a page that displayed several hundred links in Netscape
and IE, but on which HTML::Parser found only two of them, I decided
to write my own parser.
The concept of parsing HTML that is known to be non-conforming is,
admittedly, almost a contradiction: if it's non-conforming, how do
you know how to parse it? There are two answers to this question.
First, HTML::Dirty doesn't attempt to build a full element tree out
of the tags. It just creates an array of tokens representing the
text, start tags, end tags, declarations, and comments. I've found
that array quite sufficient for my HTML parsing needs. Second,
HTML::Dirty was designed to parse HTML the same way the popular
browsers do. Right or wrong, the popular browsers set the de facto
standard for how HTML is written, and if you're going to parse HTML
from public web pages, you'll have to deal with the mess that's out
there.
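As a rough illustration of the flat-token idea described above, here
is a minimal sketch in Python. It is not HTML::Dirty's code or
interface: the names TOKEN_RE and tokenize, the token kind labels, and
the recovery rules (an unclosed tag ends at the next "<", an unclosed
comment runs to the end of the input, a stray "<" is kept as text) are
all assumptions made for this example, and real browsers recover from
errors in more ways than this.

```python
import re

# Illustrative only: these kinds and rules are assumptions for this sketch,
# not the behaviour of HTML::Dirty or of any particular browser.
TOKEN_RE = re.compile(
    r"""
    (?P<comment><!--.*?(?:-->|\Z))     # comment; runs to end of input if unclosed
  | (?P<declaration><![^<>]*>?)        # declaration such as a DOCTYPE
  | (?P<endtag></[A-Za-z][^<>]*>?)     # end tag; closing bracket optional
  | (?P<tag><[A-Za-z][^<>]*>?)         # start tag; closing bracket optional
  | (?P<text>[^<]+|<)                  # plain text, or a stray opening bracket
    """,
    re.VERBOSE | re.DOTALL,
)


def tokenize(html):
    """Split html into a flat list of (kind, raw_text) tuples.

    No element tree is built and nothing is rejected: every character of
    the input ends up in exactly one token.
    """
    tokens = []
    for match in TOKEN_RE.finditer(html):
        kind, raw = match.lastgroup, match.group()
        # Merge neighbouring text pieces (e.g. around a stray "<").
        if kind == "text" and tokens and tokens[-1][0] == "text":
            tokens[-1] = ("text", tokens[-1][1] + raw)
        else:
            tokens.append((kind, raw))
    return tokens


if __name__ == "__main__":
    sample = '<!DOCTYPE html><p>1 < 2 <a href="x.html">link</A><!-- oops'
    for kind, raw in tokenize(sample):
        print(kind, repr(raw))
```

Because the result is only a list, counting links or pulling out text
is a simple loop over the tokens, and joining the raw pieces back
together reproduces the input exactly.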
enteredby: MIKO (Miko O'Sullivan)
enteredon: Thu Dec 13 00:52:34 2001 GMT
The resulting entry would be:
  HTML::
  ::Dirty           bdpOp Parser for dirty, messed up HTML             MIKO
Thanks for registering,
The Pause Team
PS: The following links are only valid for module list maintainers:
Registration form with editing capabilities:
https://pause.perl.org/pause/authenquery?ACTION=add_mod&USERID=58200000_b3bc0601f6901f31&SUBMIT_pause99_add_mod_preview=1
Immediate (one click) registration:
https://pause.perl.org/pause/authenquery?ACTION=add_mod&USERID=58200000_b3bc0601f6901f31&SUBMIT_pause99_add_mod_insertit=1