On Mon, Sep 1, 2008 at 11:15 AM, Phil Archer <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I've used LWP in several apps in which the key bit of information I'm after
> is the headers. I've therefore got used to the fact that if the returned
> resource is HTML, one of the triggers for "OK, that's all the headers and
> everything else must be content" is the presence of anything in the <head>
> section of the document that LWP doesn't recognise.
>
> Take this, for example:
>
> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
> "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
> <html
> xmlns:creativeCommons='http://backend.userland.com/creativeCommonsRssModule'
> xmlns="http://www.w3.org/1999/xhtml" dir="ltr" lang="en-US">
>
> <creativeCommons:license>http://creativecommons.org/licenses/by-nc-nd/3.0/</creativeCommons:license>
>
> <head profile="http://gmpg.org/xfn/11">
> ...
>
> Perfectly valid XHTML - but... LWP doesn't recognise the <creativecommons...
> tag and so stops parsing the headers.
>
> The User Agent package I'm using is version 2.31
>
> So, some questions:
>
> 1. Which modules need updating so that LWP can recognise this kind of thing
> as valid <head> content

It's the HTML::HeadParser module in the HTML-Parser dist.

> 2. Has anyone written such a module?

Not that I know about.

>
> As a demonstration, [1] and [2] show the status line, headers_as_string and
> content from two versions of the same document, the only difference between
> the two being that in [2], the <creativecommons..> tag is commented out. You
> can get this output from any URI using the form at [3].
>
> Thanks for any help
>
> Phil.
>
> [1]
> http://www.icra.org/cgi-bin/HTTP_Headers.cgi?url=http%3A%2F%2Fwww.icra.org%2Flabel%2FHTTP-Test%2Fspace.htm
> [2]
> http://www.icra.org/cgi-bin/HTTP_Headers.cgi?url=http%3A%2F%2Fwww.icra.org%2Flabel%2FHTTP-Test%2Fspace-mod.htm
> [3] http://www.icra.org/label/HTTP-Test/
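
For anyone who wants to reproduce the behaviour without going through LWP, here is a minimal sketch that feeds two variants of a small document straight to HTML::HeadParser. The documents are cut-down stand-ins for the test pages, not the originals, and head_title() is a hypothetical helper written only for this illustration. The assumption being tested is the one described in the question: an element the parser does not recognise ends head parsing, so the <title> after it should never become a Title header.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::HeadParser;

# Hypothetical helper (not part of any module): run HTML::HeadParser over a
# document and return the Title header it extracted, or undef if none was set.
sub head_title {
    my ($html) = @_;
    my $p = HTML::HeadParser->new;
    $p->parse($html);
    return scalar $p->header('Title');
}

# Simplified stand-ins for the two test documents; they differ only in the
# namespaced element placed before <head>.
my $prefix = qq{<html xmlns="http://www.w3.org/1999/xhtml">\n};
my $extra  = qq{<creativeCommons:license>}
           . qq{http://creativecommons.org/licenses/by-nc-nd/3.0/}
           . qq{</creativeCommons:license>\n};
my $head   = qq{<head><title>Test page</title></head>\n}
           . qq{<body><p>Hello</p></body></html>\n};

my $with    = head_title($prefix . $extra . $head);
my $without = head_title($prefix . $head);

printf "with extra element:    %s\n",
    defined $with ? $with : '(no Title header)';
printf "without extra element: %s\n",
    defined $without ? $without : '(no Title header)';
```

If only the real HTTP headers matter, LWP::UserAgent also has a parse_head attribute; setting it to a false value ($ua->parse_head(0)) should keep LWP from looking at the <head> section at all.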