script tags in LinkContentHandler

Joseph Naegele Tue, 05 Apr 2016 12:28:10 -0700

Hi all,


I'm using Nutch to crawl the web, and one of its built-in HTML parsers
uses Tika and its LinkContentHandler. I'm interested in collecting *all*
links on a web page, but I'm surprised that the LinkContentHandler doesn't
treat <script> tags as links. When a <script> tag has a "src" attribute,
the attribute should specify a URI and the tag should not contain any
content.
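
For illustration, here is a minimal sketch of the behavior described above:
a SAX handler that records the "src" value of every <script> element it
sees. It uses only the JDK's org.xml.sax classes. The class name
ScriptSrcHandler and its getScriptLinks() method are hypothetical and not
part of Tika or Nutch. Whether Tika's HTML parser forwards <script> elements
to a downstream handler depends on its HTML mapping configuration, which is
an assumption to verify.

```java
import java.util.ArrayList;
import java.util.List;

import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler (not a Tika class): collects the "src" URI of each
// <script> element delivered as a SAX startElement event.
class ScriptSrcHandler extends DefaultHandler {

    private final List<String> scriptLinks = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes attributes) {
        // A parser that is not namespace-aware may leave localName empty,
        // so fall back to the qualified name.
        String name = (localName == null || localName.isEmpty()) ? qName : localName;
        if ("script".equalsIgnoreCase(name)) {
            String src = attributes.getValue("src");
            // Inline scripts have no "src" and are skipped.
            if (src != null && !src.trim().isEmpty()) {
                scriptLinks.add(src.trim());
            }
        }
    }

    // Returns the collected script URIs in document order.
    public List<String> getScriptLinks() {
        return scriptLinks;
    }
}
```

Such a handler could be fed by any SAX event source. The collected values
are raw attribute strings, so relative URIs would still need to be resolved
against the page's base URL.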

Is there any particular reason the LinkContentHandler doesn't parse <script>
tags, or is it just that I'm the first to look for this functionality? I can
ping the dev mailing list too if necessary.

Nutch's other built-in HTML parser collects all "outlinks", including
<script> tags, but I'd prefer to use Tika and Boilerpipe.

Thanks,

Joe Naegele
