RE: script tags in LinkContentHandler

Joseph Naegele Wed, 06 Apr 2016 15:01:50 -0700

IdentityHtmlMapper solves the problem of elements being discarded. There is
another problem with extracting <script> links, however:

HtmlHandler only checks for "META", "BASE", and "LINK" within <head>. See the 
"if (bodylevel == 0 && discardLevel == 0)" section in HtmlHandler's 
startElement() method.

I don't mean to drag out this topic, but I only want to report actual issues. 
In this case I think HtmlHandler is missing at least a check for "SCRIPT" tags 
in the HTML header.
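
To illustrate the kind of check I mean, here is a toy model in plain Java. It
is a sketch only, not Tika's actual HtmlHandler code: the class name, the two
name sets, and the helper method are all hypothetical, and it assumes no
element is being discarded (as with IdentityHtmlMapper).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy model of a head-level element check; NOT Tika's actual HtmlHandler code.
public class HeadCheckSketch {
    // Names checked while still outside <body>, as described above.
    static final Set<String> CURRENT_CHECK = Set.of("META", "BASE", "LINK");
    // Hypothetical extension that also lets SCRIPT through.
    static final Set<String> WITH_SCRIPT = Set.of("META", "BASE", "LINK", "SCRIPT");

    private final Set<String> headNames;
    private final List<String> forwarded = new ArrayList<>();
    private int bodyLevel = 0;

    HeadCheckSketch(Set<String> headNames) {
        this.headNames = headNames;
    }

    // Called once per start tag, in document order. End tags are not modeled,
    // so bodyLevel only ever grows in this sketch.
    void startElement(String name) {
        String upper = name.toUpperCase(Locale.ROOT);
        if ("BODY".equals(upper) || bodyLevel > 0) {
            bodyLevel++; // at or below <body>: the head-level check no longer applies
            return;
        }
        if (headNames.contains(upper)) {
            forwarded.add(upper); // head element passes the check
        }
    }

    // Feeds a sequence of start tags through the check and returns what passed.
    static List<String> forwardedHeadElements(Set<String> headNames, String... tags) {
        HeadCheckSketch handler = new HeadCheckSketch(headNames);
        for (String tag : tags) {
            handler.startElement(tag);
        }
        return handler.forwarded;
    }

    public static void main(String[] args) {
        String[] tags = {"html", "head", "meta", "script", "link", "body", "script"};
        System.out.println(forwardedHeadElements(CURRENT_CHECK, tags));
        System.out.println(forwardedHeadElements(WITH_SCRIPT, tags));
    }
}
```

With the first set, the <script> in the head never reaches downstream
handlers; with the second it does. The <script> inside <body> is outside the
scope of this check in both cases.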

- Joe


From: Luís Filipe Nassif [mailto:[email protected]] 
Sent: Wednesday, April 06, 2016 5:21 PM
To: [email protected]
Subject: Re: script tags in LinkContentHandler

Hi,

I'm one of those from the forensic world and, of course, my use case needs to
extract everything.

I have already tried IdentityHtmlMapper to extract "value" attributes from
"input" elements, with no luck. The value is not extracted by
DefaultHtmlMapper, yet browsers render it, so I think DefaultHtmlMapper needs
some improvement. But is HtmlMapper the correct place to configure that, or
must something be done with HTMLSchema (I've tried that too, but I am not an
HTML expert)?

Thanks,
Luis

2016-04-06 17:33 GMT-03:00 Allison, Timothy B. <[email protected]>:
On #2, I'd prefer not skipping elements.  I definitely understand the use case 
to extract what a human can see, but I suspect if your email address ends in 
'forensics.com', you'd probably like to see everything as well.

-----Original Message-----
From: Joseph Naegele [mailto:[email protected]]
Sent: Wednesday, April 06, 2016 4:14 PM
To: [email protected]
Subject: RE: script tags in LinkContentHandler

Great, sounds good. Would you like me to open a ticket?

With respect to parsing outlinks in Nutch, there are actually two problems:

1) <script> missing in LinkContentHandler
2) HtmlParser's DefaultHtmlMapper considers <script> a discardable element so 
it's discarded during the parse, similarly to <style>.
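
For #1, this is roughly the behavior I would expect, sketched with the JDK's
own SAX parser rather than Tika so it runs without dependencies. The class
name and the element-to-attribute table are assumptions for illustration, not
LinkContentHandler's real implementation, and the input must be well-formed
XHTML because no HTML cleanup is done here.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative link collector; NOT Tika's LinkContentHandler.
public class ScriptLinkSketch extends DefaultHandler {
    // Element name -> attribute that carries the link target.
    // The "script" entry is the addition proposed in this thread.
    private static final Map<String, String> LINK_ATTRIBUTES = Map.of(
            "a", "href",
            "link", "href",
            "iframe", "src",
            "img", "src",
            "script", "src");

    private final List<String> links = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // The default JDK parser is not namespace-aware, so qName holds the tag name.
        String attribute = LINK_ATTRIBUTES.get(qName.toLowerCase(Locale.ROOT));
        if (attribute == null) {
            return; // not a link-bearing element
        }
        String target = atts.getValue(attribute);
        if (target != null) {
            links.add(target); // inline <script> without "src" is skipped
        }
    }

    // Parses well-formed XHTML and returns link targets in document order.
    public static List<String> collect(String xhtml) throws Exception {
        ScriptLinkSketch handler = new ScriptLinkSketch();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.links;
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><head><script src=\"app.js\"></script></head>"
                + "<body><a href=\"next.html\">next</a>"
                + "<script>var x = 1;</script></body></html>";
        System.out.println(collect(page));
    }
}
```

Only <script> elements that carry a "src" attribute produce a link; inline
scripts contribute nothing, which matches the point that a <script> with
"src" should not contain content.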

Does anyone have opinions on #2?

- Joe

-----Original Message-----
From: Markus Jelsma [mailto:[email protected]]
Sent: Wednesday, April 06, 2016 9:26 AM
To: [email protected]
Subject: RE: script tags in LinkContentHandler

Yes indeed! Script is missing and that's a mistake. See discussion at 
TIKA-1835. We should open a new ticket for it.
Markus



-----Original message-----
> From: Ken Krugler <[email protected]>
> Sent: Tuesday 5th April 2016 22:24
> To: [email protected]
> Subject: Re: script tags in LinkContentHandler
>
> Hi Joe,
>
> I was looking at the version of this file in the (git) Tika-2.0 branch, not
> the (svn) trunk, and that change isn’t yet in 2.0 - my mistake.
>
> I’d rolled in Markus’s patch directly to support these other link types, but
> I wish I’d remembered the old TIKA-503 discussion, as it would have been
> better to make that support conditional on using a different constructor, as
> it’s usually not a good idea to surprise consumers of parse output with new
> types of data (links).
>
> I’ll take this discussion over to TIKA-1835 now.
>
> - Ken
>
> On Apr 5, 2016, at 12:53pm, Joseph Naegele <[email protected]> wrote:
>
> Thanks Ken, I'm confused
> though. The LinkContentHandler in 1.12 now collects <a>, <link>, <iframe> and 
> <img>, since https://issues.apache.org/jira/browse/TIKA-1835. In my opinion,
> <script src="…"> belongs in there with the rest of them. What do you think?
>
> Joe
>
> From: Ken Krugler [mailto:[email protected]]
> Sent: Tuesday, April 05, 2016 3:48 PM
> To: [email protected]
> Subject: Re: script tags in LinkContentHandler
>
> Hi Joe,
>
> On Apr 5, 2016, at 12:27pm, Joseph Naegele <[email protected]> wrote:
>
> Hi all,
>
> I'm using Nutch for crawling the web, and one of its built-in HTML parsers
> uses Tika and its LinkContentHandler. I'm interested in collecting *all*
> links on a web page, but I'm surprised the LinkContentHandler doesn't parse
> <script> tags as links. When a <script> tag contains the "src" attribute,
> the attribute should specify a URI and the tag should not contain any content.
>
> Is there any particular reason the LinkContentHandler doesn't parse <script>
> tags, or is it just that I'm the first to look for this functionality? I can
> ping the dev mailing list too if necessary.
>
> I don’t think there’s a specific reason it’s not included, though see my
> comment on https://issues.apache.org/jira/browse/TIKA-503, e.g. what about
> <link> elements?
>
> - Ken
>
> Nutch's other built-in HTML parser collects all "outlinks", including
> <script> tags, but I'd prefer to use Tika and Boilerpipe.
>
> Thanks,
> Joe Naegele
> ----------------
> Ken Krugler
> +1 530-210-6378
> http://www.scaleunlimited.com
> custom big data solutions & training
> Hadoop, Cascading, Cassandra & Solr
