For what it's worth, I just committed some patches to Tika that should
improve its ability to extract HTML outlinks (from <img> and <frame>
elements, at least). Support for <iframe> should be coming soon :)
These changes are in 0.8-SNAPSHOT. There's still one troubling parse
issue I'm tracking down, but I think Tika is getting closer to being
usable by Nutch for typical web crawling.
-- Ken
--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c w e b m i n i n g