Module submission HTML::Dirty

Perl Authors Upload Server Wed, 12 Dec 2001 16:51:41 -0800


The following module was proposed for inclusion in the Module List:


  modid:       HTML::Dirty
  DSLIP:       bdpOp
  description: Parser for dirty, messed up HTML
  userid:      MIKO (Miko O'Sullivan)
  chapterid:   15 (World_Wide_Web_HTML_HTTP_CGI)
  communities:

  similar:
    HTML::Parser

  rationale:

    I created HTML::Dirty while trying to parse pages from the web that
    HTML::Parser couldn't handle because their markup was sloppy and
    syntactically broken. When I found a page that displayed several
    hundred links in Netscape and IE, but in which HTML::Parser found
    only two, I decided to write my own parser.

    Parsing HTML that is known to be non-conforming is, admittedly,
    almost a contradiction: if it's non-conforming, how do you know how
    to parse it? There are two answers to this question. First,
    HTML::Dirty doesn't attempt to build a full element tree out of the
    tags. It just creates an array of tokens representing the text,
    tags, end tags, declarations, and comments. I've found that the
    array is quite sufficient for my HTML parsing needs. Second,
    HTML::Dirty was designed to parse HTML the same way the popular
    browsers do. Right or wrong, the popular browsers set the de facto
    standard for how HTML is written, and if you're going to parse HTML
    from public web pages you'll have to deal with the mess that's out
    there.
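
The flat token array described in the rationale can be illustrated with a
minimal sketch. It is written in Python purely for illustration and is not
HTML::Dirty's code or API: the `tokenize` function, the `TOKEN_RE` pattern,
and the leniency rules (an unterminated tag or comment runs to the end of
the input, and a stray `<` stays in the text) are all assumptions.

```python
import re

# Hypothetical, deliberately lenient token pattern (an assumption, not
# HTML::Dirty's actual rules). Each alternative accepts either a proper
# terminator or the end of the input, so unterminated markup still
# yields a token instead of an error.
TOKEN_RE = re.compile(
    r"(?P<comment><!--.*?(?:-->|\Z))"
    r"|(?P<declaration><![^>]*(?:>|\Z))"
    r"|(?P<endtag></[^>]*(?:>|\Z))"
    r"|(?P<tag><[A-Za-z][^>]*(?:>|\Z))",
    re.DOTALL,
)


def tokenize(html):
    """Return a flat list of (kind, raw_text) tokens; no element tree is built."""
    tokens = []
    pos = 0
    for match in TOKEN_RE.finditer(html):
        # Anything between two markup tokens is plain text.
        if match.start() > pos:
            tokens.append(("text", html[pos:match.start()]))
        # lastgroup is the name of the alternative that matched.
        tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    # Trailing text after the last markup token.
    if pos < len(html):
        tokens.append(("text", html[pos:]))
    return tokens
```

With a flat list like this, a link extractor only has to filter for `tag`
tokens whose raw text starts with `<a`; it never needs the tags to nest or
close correctly.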

  enteredby:   MIKO (Miko O'Sullivan)
  enteredon:   Thu Dec 13 00:52:34 2001 GMT

The resulting entry would be:

HTML::
::Dirty           bdpOp Parser for dirty, messed up HTML             MIKO


Thanks for registering,
The Pause Team

PS: The following links are only valid for module list maintainers:

Registration form with editing capabilities:
  https://pause.perl.org/pause/authenquery?ACTION=add_mod&USERID=58200000_b3bc0601f6901f31&SUBMIT_pause99_add_mod_preview=1
Immediate (one click) registration:
  https://pause.perl.org/pause/authenquery?ACTION=add_mod&USERID=58200000_b3bc0601f6901f31&SUBMIT_pause99_add_mod_insertit=1
