Hi Joe,

I was looking at the version of this file in the (git) Tika-2.0 branch, not the (svn) trunk, and that change isn’t yet in 2.0 - my mistake.
I’d rolled in Markus’s patch directly to support these other link types, but I wish I’d remembered the old TIKA-503 discussion. It would have been better to make that support conditional on using a different constructor, since it’s usually not a good idea to surprise consumers of parse output with new types of data (links).

I’ll take this discussion over to TIKA-1835 now.

-- Ken

> On Apr 5, 2016, at 12:53pm, Joseph Naegele <jnaeg...@grierforensics.com> wrote:
>
> Thanks Ken,
>
> I'm confused though. The LinkContentHandler in 1.12 now collects <a>, <link>, <iframe> and <img>, since https://issues.apache.org/jira/browse/TIKA-1835. In my opinion, <script src="…"> belongs in there with the rest of them. What do you think?
>
> Joe
>
> From: Ken Krugler [mailto:kkrugler_li...@transpac.com]
> Sent: Tuesday, April 05, 2016 3:48 PM
> To: user@tika.apache.org
> Subject: Re: script tags in LinkContentHandler
>
> Hi Joe,
>
>> On Apr 5, 2016, at 12:27pm, Joseph Naegele <jnaeg...@grierforensics.com> wrote:
>>
>> Hi all,
>>
>> I'm using Nutch for crawling the web, and one of its built-in HTML parsers uses Tika and its LinkContentHandler. I'm interested in collecting *all* links on a web page, but I'm surprised the LinkContentHandler doesn't parse <script> tags as links. When a <script> tag contains the "src" attribute, the attribute should specify a URI and the tag should not contain any content.
>>
>> Is there any particular reason the LinkContentHandler doesn't parse <script> tags, or is it just that I'm the first to look for this functionality? I can ping the dev mailing list too if necessary.
>
> I don’t think there’s a specific reason it’s not included, though see my comment on https://issues.apache.org/jira/browse/TIKA-503
>
> e.g. what about <link> elements?
>
> -- Ken
>
>>
>> Nutch's other built-in HTML parser collects all "outlinks", including <script> tags, but I'd prefer to use Tika and Boilerpipe.
>>
>> Thanks,
>> Joe Naegele

----------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr
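
For anyone reading this thread later, here is a rough sketch of the "conditional on a different constructor" idea discussed above. It is not Tika's LinkContentHandler and does not use any Tika API: OptInLinkHandler, its includeScripts flag, and the collect helper are hypothetical names, built on the JDK's own SAX parser and fed well-formed XHTML as an assumption. The default constructor keeps reporting only <a>, <link>, <img> and <iframe>, so existing consumers see no new link types, while callers who want <script src="…"> opt in explicitly.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class ScriptLinkSketch {

    /** Hypothetical stand-in for a link-collecting handler; NOT Tika's LinkContentHandler. */
    static final class OptInLinkHandler extends DefaultHandler {
        private final boolean includeScripts;
        private final List<String> links = new ArrayList<>();

        /** Default behaviour: no <script> links, so existing callers are not surprised. */
        OptInLinkHandler() {
            this(false);
        }

        /** Opt-in constructor (an assumption for illustration): also report <script src>. */
        OptInLinkHandler(boolean includeScripts) {
            this.includeScripts = includeScripts;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            // The default SAXParserFactory is not namespace-aware, so qName holds the tag name.
            String name = qName.toLowerCase(Locale.ROOT);
            String target = null;
            if (name.equals("a") || name.equals("link")) {
                target = atts.getValue("href");
            } else if (name.equals("img") || name.equals("iframe")) {
                target = atts.getValue("src");
            } else if (includeScripts && name.equals("script")) {
                // Inline scripts have no src attribute and are skipped by the null check below.
                target = atts.getValue("src");
            }
            if (target != null && !target.isEmpty()) {
                links.add(name + ":" + target);
            }
        }

        List<String> getLinks() {
            return links;
        }
    }

    /** Parses well-formed XHTML and returns the collected links as "tag:target" strings. */
    static List<String> collect(String xhtml, boolean includeScripts) throws Exception {
        OptInLinkHandler handler = new OptInLinkHandler(includeScripts);
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.getLinks();
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><head><script src=\"app.js\"></script></head>"
                + "<body><a href=\"index.html\">home</a><img src=\"logo.png\"/></body></html>";
        System.out.println(collect(page, false)); // script link left out
        System.out.println(collect(page, true));  // script link included
    }
}
```

With the flag off, the sample page yields only the <a> and <img> links; with it on, the <script> link appears first because it comes first in document order. Whether a boolean constructor argument, a set of accepted tag names, or a separate handler class is the better shape is a question for the JIRA discussion.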