"David Hofmann" <[EMAIL PROTECTED]> writes:

> I'm currently using your perl module for processing input from a
> spider I wrote.
> 
> The problem I'm encountering is some pages have <> in the title.
> 
> Example HTML:
> 
> <TITLE>274500 - XL: "Save Changes in <Bookname>" Prompt Even If No
> Changes Are Made</TITLE>
> 
> The result I get back is "XL: "Save Changes in ". Also the
> description, keywords and last-modified come back blank on these pages
> if they were after the title on the page.

It looks like most browsers parse the content of <title> in what the
HTML::Parser sources call literal mode, where markup inside the
element is treated as plain text.  I've now applied the following
patch to my sources, but I'm not really sure this is a good idea.  I
might still decide to revert it before release.

Index: hparser.c
===================================================================
RCS file: /cvsroot/libwww-perl/html-parser/hparser.c,v
retrieving revision 2.98
retrieving revision 2.99
diff -u -p -u -r2.98 -r2.99
--- hparser.c   11 Nov 2004 10:12:51 -0000      2.98
+++ hparser.c   15 Nov 2004 22:19:49 -0000      2.99
@@ -1,4 +1,4 @@
-/* $Id: hparser.c,v 2.98 2004/11/11 10:12:51 gisle Exp $
+/* $Id: hparser.c,v 2.99 2004/11/15 22:19:49 gisle Exp $
  *
  * Copyright 1999-2002, Gisle Aas
  * Copyright 1999-2000, Michael A. Chase
@@ -27,6 +27,7 @@ literal_mode_elem[] =
     {5, "style", 1},
     {3, "xmp", 1},
     {9, "plaintext", 1},
+    {5, "title", 0},
     {8, "textarea", 0},
     {0, 0, 0}
 };

The problem here is that browsers seem to switch into a mode where
tags inside <title> are still recognized if no </title> end tag is
found in the document.  HTML-Parser does not have the brains to do
something like this.  It tries to parse the document in a stream-like
fashion, and buffering all of it just to figure out which quirk mode
to parse in does not seem attractive.
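
To make the trade-off concrete, here is a rough, standalone sketch in C
of what literal mode amounts to.  It is not the code in hparser.c: the
struct field names, the helper functions, and the reading of the third
table column as an "is CDATA" flag are assumptions made for
illustration, and the end-tag check is simplified to require ">"
directly after the name.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical layout; the patch only shows three unnamed values. */
struct literal_elem {
    int len;           /* length of the element name */
    const char *name;  /* element name in lower case */
    int is_cdata;      /* assumed meaning: 1 = content is raw CDATA */
};

/* Only the entries visible in the patch context, plus the new one. */
static const struct literal_elem literal_mode_elem[] = {
    {5, "style", 1},
    {3, "xmp", 1},
    {9, "plaintext", 1},
    {5, "title", 0},
    {8, "textarea", 0},
    {0, NULL, 0}
};

/* Compare n bytes of s against a lower-case name, ignoring case. */
static int name_eq(const char *s, const char *lower, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (tolower((unsigned char)s[i]) != lower[i])
            return 0;
    return 1;
}

/* Return the table entry for a start tag name, or NULL if the
 * element does not switch the parser into literal mode. */
static const struct literal_elem *find_literal(const char *tag, size_t len)
{
    assert(tag != NULL);
    for (const struct literal_elem *e = literal_mode_elem; e->len; e++)
        if ((size_t)e->len == len && name_eq(tag, e->name, len))
            return e;
    return NULL;
}

/* While in literal mode, everything is text until "</name>" shows up.
 * Returns a pointer to that end tag, or NULL if it is not in [s, end).
 * A streaming parser that gets NULL can only wait for more input; it
 * cannot go back and re-parse the text as markup. */
static const char *find_literal_end(const char *s, const char *end,
                                    const struct literal_elem *e)
{
    size_t n = (size_t)e->len;
    for (; s + 3 + n <= end; s++)
        if (s[0] == '<' && s[1] == '/' && name_eq(s + 2, e->name, n)
            && s[2 + n] == '>')
            return s;
    return NULL;
}
```

With the "title" entry present, find_literal("TITLE", 5) returns an
entry, and scanning the example title text stops only at </TITLE>, so
<Bookname> stays part of the text.  When no </title> exists at all, the
scan returns NULL and everything after <title> ends up as title text,
which is the case where browsers apparently fall back to recognizing
tags again.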

Regards,
Gisle
