Hi all,
I'm using Nutch to crawl the web, and one of its built-in HTML parsers uses Tika and its LinkContentHandler. I'm interested in collecting *all* links on a web page, but I'm surprised that LinkContentHandler doesn't treat <script> tags as links. When a <script> tag has a "src" attribute, the attribute should specify a URI and the tag should not contain any content.

Is there any particular reason LinkContentHandler doesn't handle <script> tags, or is it just that I'm the first to look for this functionality? I can ping the dev mailing list too if necessary.

Nutch's other built-in HTML parser collects all "outlinks", including those from <script> tags, but I'd prefer to use Tika and Boilerpipe.

Thanks,
Joe Naegele
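
P.S. To make the request concrete, here is a rough sketch of the behavior I have in mind. It is not Tika code: ScriptLinkSketch, ScriptSrcHandler, and collectScriptSrc are hypothetical names, and it uses only the JDK's SAX parser on well-formed XHTML, so it says nothing about how Tika's HTML parser actually delivers (or drops) <script> elements. It only shows the idea of treating a "src" attribute on <script> as a link.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class ScriptLinkSketch {

    /** Hypothetical handler (not part of Tika): records the src of every script element. */
    static class ScriptSrcHandler extends DefaultHandler {
        final List<String> links = new ArrayList<>();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            // A non-namespace-aware parser leaves localName empty, so fall back to qName.
            String name = (localName == null || localName.isEmpty()) ? qName : localName;
            if ("script".equalsIgnoreCase(name)) {
                String src = atts.getValue("src");
                // Inline scripts have no src attribute and are skipped.
                if (src != null && !src.isEmpty()) {
                    links.add(src);
                }
            }
        }
    }

    /** Parses well-formed XHTML and returns the src values of its script elements. */
    static List<String> collectScriptSrc(String xhtml) throws Exception {
        ScriptSrcHandler handler = new ScriptSrcHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.links;
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><head><script src=\"a.js\"></script>"
                + "<script>var x = 1;</script></head>"
                + "<body><a href=\"b.html\">b</a></body></html>";
        System.out.println(collectScriptSrc(page));
    }
}
```

In this sketch only the external script is recorded; the inline <script> and the <a> element are ignored, since the point is just the "src" case.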