I've run into an issue with extracting links from <frame src="xxx"> elements inside a <frameset>. There are two problems:

1. Currently <frameset> and <frame> elements are discarded.

2. If I fix #1, XHTMLContentHandler still assumes every document has a <body> and wraps the content in one, so you get invalid XHTML that looks like:

<html>
        <body>
                <frameset>

I can tweak XHTMLContentHandler to do the right thing, but first I wanted to see if anybody had an objection to emitting

<html>
        <frameset>
                ...

...for these cases.
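
To make the proposal concrete, here is a rough sketch of the idea. It is not the actual XHTMLContentHandler code: LazyBodyEmitter and its methods are hypothetical names, it ignores <head> handling entirely, and it just records tags as strings. The point is only that the implicit <body> gets deferred until the first element arrives, and is skipped when that element is a <frameset>.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, NOT Tika's XHTMLContentHandler. It records the tags
// it would emit so the wrapping decision is easy to inspect.
final class LazyBodyEmitter {
    private final List<String> out = new ArrayList<>();
    private boolean decided = false;   // has the first element been seen?
    private boolean bodyOpen = false;  // did we emit an implicit <body>?

    void startDocument() {
        out.add("<html>");
    }

    // Called for each element reported by the upstream HTML parser.
    void startElement(String name) {
        if (!decided) {
            decided = true;
            // Only wrap in <body> when the document is not frameset-based.
            if (!"frameset".equalsIgnoreCase(name)) {
                bodyOpen = true;
                out.add("<body>");
            }
        }
        out.add("<" + name + ">");
    }

    void endElement(String name) {
        out.add("</" + name + ">");
    }

    void endDocument() {
        if (!decided) {
            // Empty document: fall back to the usual empty <body>.
            out.add("<body>");
            bodyOpen = true;
        }
        if (bodyOpen) {
            out.add("</body>");
        }
        out.add("</html>");
    }

    String result() {
        return String.join("", out);
    }
}
```

With this shape, a document whose first element is a <frameset> comes out as <html><frameset>...</frameset></html>, and everything else keeps the current <html><body>...</body></html> wrapping.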

This probably still won't do the right thing for broken HTML where the original source has a <frameset> inside a <body>, as previously discussed on the list. With a bit more work I could probably handle that too, but probably not today.
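
For reference, once <frame> elements make it into well-formed output, collecting the links is simple. The sketch below uses only the JDK's SAX parser on XHTML that is already well-formed; FrameLinkCollector is a hypothetical name, and this is an illustration of the goal rather than how Tika's link extraction is implemented.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: collects the src attribute of every <frame> element.
final class FrameLinkCollector extends DefaultHandler {
    private final List<String> links = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // The default factory is not namespace-aware, so match on qName.
        if ("frame".equalsIgnoreCase(qName)) {
            String src = atts.getValue("src");
            if (src != null && !src.isEmpty()) {
                links.add(src);
            }
        }
    }

    // Assumes the input is well-formed XHTML; malformed input is rejected.
    static List<String> extract(String xhtml) {
        FrameLinkCollector collector = new FrameLinkCollector();
        try {
            SAXParserFactory.newInstance()
                    .newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), collector);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XHTML", e);
        }
        return collector.links;
    }
}
```

Calling FrameLinkCollector.extract(...) on the <html><frameset>...</frameset></html> form returns the frame URLs in document order.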

-- Ken

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g



