IdentityHtmlMapper solves the problem of elements being discarded. There is another problem with extracting <script> links, however:
HtmlHandler only checks for "META", "BASE", and "LINK" within <head>. See the "if (bodylevel == 0 && discardLevel == 0)" section in HtmlHandler's startElement() method. I don't mean to drag out this topic, but I only want to report actual issues. In this case I think HtmlHandler is missing at least a check for "SCRIPT" tags in the HTML header.

- Joe

From: Luís Filipe Nassif [mailto:lfcnas...@gmail.com]
Sent: Wednesday, April 06, 2016 5:21 PM
To: user@tika.apache.org
Subject: Re: script tags in LinkContentHandler

Hi,

I'm one of those from the forensic world and, of course, my use case needs to extract everything. I have already tried IdentityHtmlMapper to extract "value" attributes from "input" elements, with no luck. They are not extracted by DefaultHtmlMapper, yet they are rendered by browsers, so I think DefaultHtmlMapper needs some improvement. But is HtmlMapper the correct place to configure that, or must something be done with HTMLSchema (I've tried that too, but I am not an HTML expert)?

Thanks,
Luis

2016-04-06 17:33 GMT-03:00 Allison, Timothy B. <talli...@mitre.org>:

On #2, I'd prefer not skipping elements. I definitely understand the use case to extract what a human can see, but I suspect if your email address ends in 'forensics.com', you'd probably like to see everything as well.

-----Original Message-----
From: Joseph Naegele [mailto:jnaeg...@grierforensics.com]
Sent: Wednesday, April 06, 2016 4:14 PM
To: user@tika.apache.org
Subject: RE: script tags in LinkContentHandler

Great, sounds good. Would you like me to open a ticket?

With respect to parsing outlinks in Nutch, there are actually two problems:

1) <script> is missing in LinkContentHandler
2) HtmlParser's DefaultHtmlMapper considers <script> a discardable element, so it's discarded during the parse, similarly to <style>.

Does anyone have opinions on #2?
- Joe

-----Original Message-----
From: Markus Jelsma [mailto:markus.jel...@openindex.io]
Sent: Wednesday, April 06, 2016 9:26 AM
To: user@tika.apache.org
Subject: RE: script tags in LinkContentHandler

Yes indeed! Script is missing and that's a mistake. See discussion at TIKA-1835. We should open a new ticket for it.

Markus

-----Original message-----
> From: Ken Krugler <kkrugler_li...@transpac.com>
> Sent: Tuesday 5th April 2016 22:24
> To: user@tika.apache.org
> Subject: Re: script tags in LinkContentHandler
>
> Hi Joe,
>
> I was looking at the version of this file in the (git) Tika-2.0 branch, not the (svn) trunk, and that change isn’t yet in 2.0 - my mistake.
>
> I’d rolled in Markus’s patch directly to support these other link types, but I wish I’d remembered the old TIKA-503 discussion, as it would have been better to make that support conditional on using a different constructor, as it’s usually not a good idea to surprise consumers of parse output with new types of data (links).
>
> I’ll take this discussion over to TIKA-1835 now.
>
> - Ken
>
> On Apr 5, 2016, at 12:53pm, Joseph Naegele <jnaeg...@grierforensics.com> wrote:
>
> Thanks Ken, I'm confused though. The LinkContentHandler in 1.12 now collects <a>, <link>, <iframe> and <img>, since https://issues.apache.org/jira/browse/TIKA-1835. In my opinion, <script src="…"> belongs in there with the rest of them. What do you think?
> Joe
>
> From: Ken Krugler [mailto:kkrugler_li...@transpac.com]
> Sent: Tuesday, April 05, 2016 3:48 PM
> To: user@tika.apache.org
> Subject: Re: script tags in LinkContentHandler
>
> Hi Joe,
>
> On Apr 5, 2016, at 12:27pm, Joseph Naegele <jnaeg...@grierforensics.com> wrote:
>
> Hi all,
>
> I'm using Nutch for crawling the web, and one of its built-in HTML parsers uses Tika and its LinkContentHandler. I'm interested in collecting *all* links on a web page, but I'm surprised the LinkContentHandler doesn't parse <script> tags as links. When a <script> tag contains the "src" attribute, the attribute should specify a URI and the tag should not contain any content.
>
> Is there any particular reason the LinkContentHandler doesn't parse <script> tags, or is it just that I'm the first to look for this functionality? I can ping the dev mailing list too if necessary.
>
> I don’t think there’s a specific reason it’s not included, though see my comment on https://issues.apache.org/jira/browse/TIKA-503, e.g. what about <link> elements?
>
> - Ken
>
> Nutch's other built-in HTML parser collects all "outlinks", including <script> tags, but I'd prefer to use Tika and Boilerpipe.
>
> Thanks,
> Joe Naegele
>
> ----------------
> Ken Krugler
> +1 530-210-6378
> http://www.scaleunlimited.com
> custom big data solutions & training
> Hadoop, Cascading, Cassandra & Solr
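
The behavior requested in this thread (treating <script src="..."> as a link, whether it appears in <head> or <body>) can be illustrated with a minimal sketch. It uses only the JDK's SAX API, not Tika, and it is not how LinkContentHandler or HtmlHandler is actually implemented. The class name ScriptLinkCollector and its collect() helper are hypothetical, and the input is assumed to be well-formed XHTML.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration only: a plain SAX handler, not a Tika class.
public class ScriptLinkCollector extends DefaultHandler {
    private final List<String> links = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Without namespace awareness localName may be empty, so fall back to qName.
        String name = (localName == null || localName.isEmpty()) ? qName : localName;
        if ("script".equalsIgnoreCase(name)) {
            // Only <script> elements that carry a non-empty src attribute count as links;
            // inline scripts have nothing to link to.
            String src = atts.getValue("src");
            if (src != null && !src.isEmpty()) {
                links.add(src);
            }
        }
    }

    public List<String> getLinks() {
        return links;
    }

    // Convenience helper (an assumption of this sketch): parse a well-formed
    // XHTML string and return the script URIs in document order.
    public static List<String> collect(String xhtml) throws Exception {
        ScriptLinkCollector handler = new ScriptLinkCollector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.getLinks();
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><head><script src=\"head.js\"></script></head>"
                + "<body><script>var inline = 1;</script>"
                + "<script src=\"body.js\"></script></body></html>";
        System.out.println(collect(page)); // prints [head.js, body.js]
    }
}
```

In a Tika setup, comparable logic would presumably only see <script> elements if the mapper does not discard them first, which is the point made about IdentityHtmlMapper at the top of the thread.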