Hi Joe,

I was looking at the version of this file in the (git) Tika-2.0 branch, not the 
(svn) trunk, and that change isn’t yet in 2.0 - my mistake.

I’d rolled in Markus’s patch directly to support these other link types, but I 
wish I’d remembered the old TIKA-503 discussion. It would have been better to 
make that support conditional on using a different constructor, since it’s 
usually not a good idea to surprise consumers of parse output with new types of 
data (links).
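
To make the constructor idea concrete, here is a minimal sketch of an opt-in 
handler. It is plain SAX and is not Tika's LinkContentHandler; the class name 
OptInLinkCollector, the includeScripts flag and the collect helper are all 
hypothetical names chosen for illustration, and the input is assumed to be 
well-formed XHTML.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch, NOT Tika's LinkContentHandler: a plain SAX handler
// where <script src> collection is opt-in through a second constructor.
public class OptInLinkCollector extends DefaultHandler {
    private final boolean includeScripts;
    private final List<String> links = new ArrayList<>();

    // Default constructor keeps the existing behavior: no <script> links.
    public OptInLinkCollector() {
        this(false);
    }

    // Callers that want the new link type must ask for it explicitly.
    public OptInLinkCollector(boolean includeScripts) {
        this.includeScripts = includeScripts;
    }

    public List<String> getLinks() {
        return links;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // The default SAX parser is not namespace-aware, so use qName.
        String name = qName.toLowerCase(Locale.ROOT);
        String target = null;
        if (name.equals("a") || name.equals("link")) {
            target = atts.getValue("href");
        } else if (name.equals("img") || name.equals("iframe")) {
            target = atts.getValue("src");
        } else if (includeScripts && name.equals("script")) {
            target = atts.getValue("src");
        }
        if (target != null) {
            links.add(target);
        }
    }

    // Convenience helper (assumption: input is well-formed XHTML).
    public static List<String> collect(String xhtml, boolean includeScripts) throws Exception {
        OptInLinkCollector handler = new OptInLinkCollector(includeScripts);
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.getLinks();
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><head><script src=\"app.js\"></script></head>"
                + "<body><a href=\"next.html\">next</a></body></html>";
        System.out.println(collect(page, false)); // prints [next.html]
        System.out.println(collect(page, true));  // prints [app.js, next.html]
    }
}
```

With this shape, existing callers of the default constructor see the same 
output as before, and only callers that pass the flag get the extra links.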

I’ll take this discussion over to TIKA-1835 now.

— Ken 


> On Apr 5, 2016, at 12:53pm, Joseph Naegele <jnaeg...@grierforensics.com> 
> wrote:
> 
> Thanks Ken,
>  
> I'm confused though. The LinkContentHandler in 1.12 now collects <a>, <link>, 
> <iframe> and <img>, since https://issues.apache.org/jira/browse/TIKA-1835. 
> In my opinion, <script src="…"> belongs in there with the rest of them. What 
> do you think?
>  
> Joe
>  
> From: Ken Krugler [mailto:kkrugler_li...@transpac.com] 
> Sent: Tuesday, April 05, 2016 3:48 PM
> To: user@tika.apache.org
> Subject: Re: script tags in LinkContentHandler
>  
> Hi Joe,
>  
>> On Apr 5, 2016, at 12:27pm, Joseph Naegele <jnaeg...@grierforensics.com> 
>> wrote:
>>  
>> Hi all,
>>  
>> I'm using Nutch for crawling the web, and one of its built-in HTML parsers 
>> uses Tika and its LinkContentHandler. I'm interested in collecting *all* 
>> links on a web page, but I'm surprised the LinkContentHandler doesn't parse 
>> <script> tags as links. When a <script> tag contains the "src" attribute, 
>> the attribute should specify a URI and the tag should not contain any 
>> content.
>>  
>> Is there any particular reason the LinkContentHandler doesn't parse <script> 
>> tags, or is it just that I'm the first to look for this functionality? I can 
>> ping the dev mailing list too if necessary.
>  
> I don’t think there’s a specific reason it’s not included, though see my 
> comment on https://issues.apache.org/jira/browse/TIKA-503
>  
> e.g., what about <link> elements?
>  
> — Ken
> 
> 
>>  
>> Nutch's other built-in HTML parser collects all "outlinks", including 
>> <script> tags, but I'd prefer to use Tika and Boilerpipe.
>>  
>> Thanks,
>> Joe Naegele
> 
> ----------------

Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr




