Re: script tags in LinkContentHandler

Ken Krugler Wed, 06 Apr 2016 15:10:37 -0700

Hi Joe,

In that case, I’d file a Jira issue with two test docs attached, one with a 
regular <script> in the body, and another with <script> in the <head> section.
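
For illustration only, here is a minimal sketch of what the two test docs and the expected result could look like. It is not Tika code: the class name ScriptLinkSketch, the method scriptSources, and the file names body.js and head.js are all hypothetical, and it uses only the JDK's built-in SAX parser on well-formed XHTML rather than Tika's HtmlParser or LinkContentHandler. The point is just that a handler collecting <script src="..."> should report the link whether the element sits in <head> or in <body>.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch, not Tika code: JDK SAX parser on well-formed XHTML.
class ScriptLinkSketch {

    // Test doc 1 (assumed content): a regular <script src> in the body.
    static final String BODY_DOC =
        "<html><head><title>t</title></head>"
      + "<body><script src=\"body.js\"></script><p>text</p></body></html>";

    // Test doc 2 (assumed content): a <script src> in the <head> section.
    static final String HEAD_DOC =
        "<html><head><script src=\"head.js\"></script></head>"
      + "<body><p>text</p></body></html>";

    // Collects the "src" attribute of every <script> element,
    // regardless of where in the document the element appears.
    static List<String> scriptSources(String xhtml) throws Exception {
        List<String> sources = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                if ("script".equalsIgnoreCase(qName)) {
                    String src = atts.getValue("src");
                    if (src != null) {
                        sources.add(src);
                    }
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xhtml)), handler);
        return sources;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(scriptSources(BODY_DOC)); // prints [body.js]
        System.out.println(scriptSources(HEAD_DOC)); // prints [head.js]
    }
}
```

A real regression test attached to the Jira issue would feed the same two documents through Tika's HTML parser with a LinkContentHandler and check that both links come back.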


Regards,

— Ken

> On Apr 6, 2016, at 3:01pm, Joseph Naegele <jnaeg...@grierforensics.com> wrote:
> 
> IdentityHtmlMapper solves the problem of elements being discarded. There is 
> another problem with extracting <script> links, however:
> 
> HtmlHandler only checks for "META", "BASE", and "LINK" within <head>. See the 
> "if (bodylevel == 0 && discardLevel == 0)" section in HtmlHandler's 
> startElement() method.
> 
> I don't mean to drag out this topic, but I only want to report actual issues. 
> In this case I think HtmlHandler is missing at least a check for "SCRIPT" 
> tags in the HTML header.
> 
> - Joe
> 
> 
> From: Luís Filipe Nassif [mailto:lfcnas...@gmail.com] 
> Sent: Wednesday, April 06, 2016 5:21 PM
> To: user@tika.apache.org
> Subject: Re: script tags in LinkContentHandler
> 
> Hi,
> 
> I'm one of those from forensic world and, of course, my use case needs to 
> extract everything.
> 
> I have already tried IdentityHtmlMapper to extract "value" attributes from 
> "input" elements with no luck. It is not extracted by DefaultHtmlMapper and 
> is rendered by browsers, so I think DefaultHtmlMapper needs some improvement. 
> But is HtmlMapper the correct place to configure that, or must something be 
> done with HTMLSchema (I've tried that too, but I am not an HTML expert)?
> 
> Thanks,
> Luis
> 
> 2016-04-06 17:33 GMT-03:00 Allison, Timothy B. <talli...@mitre.org>:
> On #2, I'd prefer not skipping elements.  I definitely understand the use 
> case to extract what a human can see, but I suspect if your email address 
> ends in 'forensics.com', you'd probably like to see everything as well.
> 
> -----Original Message-----
> From: Joseph Naegele [mailto:jnaeg...@grierforensics.com]
> Sent: Wednesday, April 06, 2016 4:14 PM
> To: user@tika.apache.org
> Subject: RE: script tags in LinkContentHandler
> 
> Great, sounds good. Would you like me to open a ticket?
> 
> With respect to parsing outlinks in Nutch, there are actually two problems:
> 
> 1) <script> missing in LinkContentHandler
> 2) HtmlParser's DefaultHtmlMapper considers <script> a discardable element so 
> it's discarded during the parse, similarly to <style>.
> 
> Does anyone have opinions on #2?
> 
> - Joe
> 
> -----Original Message-----
> From: Markus Jelsma [mailto:markus.jel...@openindex.io]
> Sent: Wednesday, April 06, 2016 9:26 AM
> To: user@tika.apache.org
> Subject: RE: script tags in LinkContentHandler
> 
> Yes indeed! Script is missing and that's a mistake. See discussion at 
> TIKA-1835. We should open a new ticket for it.
> Markus
> 
> 
> 
> -----Original message-----
>> From:Ken Krugler <kkrugler_li...@transpac.com>
>> Sent: Tuesday 5th April 2016 22:24
>> To: user@tika.apache.org
>> Subject: Re: script tags in LinkContentHandler
>> 
>> Hi Joe,
>> 
>> I was looking at the version of this file in the (git) Tika-2.0 branch, not 
>> the (svn) trunk, and that change isn’t yet in 2.0 - my mistake.
>> 
>> I’d rolled in Markus’s patch directly to support these other link types, but 
>> I wish I’d remembered the old TIKA-503 discussion, as it would have been 
>> better to make that support conditional on using a different constructor, as 
>> it’s usually not a good idea to surprise consumers of parse output with new 
>> types of data (links).
>> 
>> I’ll take this discussion over to TIKA-1835 now.
>> 
>> -- Ken
>> 
>> On Apr 5, 2016, at 12:53pm, Joseph Naegele <jnaeg...@grierforensics.com> wrote:
>> 
>> Thanks Ken, I'm confused though. The LinkContentHandler in 1.12 now collects 
>> <a>, <link>, <iframe> and <img>, since 
>> https://issues.apache.org/jira/browse/TIKA-1835. In my opinion, <script 
>> src="…"> belongs in there with the rest of them. What do you think?
>> 
>> Joe
>> 
>> From: Ken Krugler [mailto:kkrugler_li...@transpac.com]
>> Sent: Tuesday, April 05, 2016 3:48 PM
>> To: user@tika.apache.org
>> Subject: Re: script tags in LinkContentHandler
>> 
>> Hi Joe,
>> 
>> On Apr 5, 2016, at 12:27pm, Joseph Naegele <jnaeg...@grierforensics.com> wrote:
>> 
>> Hi all,
>> 
>> I'm using Nutch for crawling the web, and one of its built-in HTML parsers 
>> uses Tika and its LinkContentHandler. I'm interested in collecting *all* 
>> links on a web page, but I'm surprised the LinkContentHandler doesn't parse 
>> <script> tags as links. When a <script> tag contains the "src" attribute, 
>> the attribute should specify a URI and the tag should not contain any 
>> content.
>> 
>> Is there any particular reason the LinkContentHandler doesn't parse <script> 
>> tags, or is it just that I'm the first to look for this functionality? I can 
>> ping the dev mailing list too if necessary.
>> 
>> I don’t think there’s a specific reason it’s not included, though see my 
>> comment on https://issues.apache.org/jira/browse/TIKA-503, e.g. what about 
>> <link> elements?
>> 
>> -- Ken
>> 
>> Nutch's other built-in HTML parser collects all "outlinks", including 
>> <script> tags, but I'd prefer to use Tika and Boilerpipe.
>> 
>> Thanks,
>> Joe Naegele
> 



--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr
