For what it's worth, I just committed some patches to Tika that should improve its ability to extract HTML outlinks (from <img> and <frame> elements, at least). Support for <iframe> should be coming soon :)

These changes are in 0.8-SNAPSHOT. There's still one troubling parse issue I'm tracking down, but I think Tika is getting closer to being usable by Nutch for typical web crawling.

-- Ken

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g



