> On Apr 6, 2016, at 1:33pm, Allison, Timothy B. <talli...@mitre.org> wrote:
>
> On #2, I'd prefer not skipping elements. I definitely understand the use
> case to extract what a human can see, but I suspect if your email address
> ends in 'forensics.com', you'd probably like to see everything as well.

I’m not sure I see the issue.

The _default_ implementation is for the parser to be configured to extract what a person can see, which is what you’d typically want.

IdentityHtmlMapper is a way to be more lenient, in that it gets you back more of “stuff that can be rendered as valid XHTML 1.0”.

But if you need access to a very specific element in the HTML, which isn’t text content, then what you really want to do is run the raw data through TagSoup/JSoup, then into Dom4J or equivalent, and use XPath queries to extract specific elements.

-- Ken

> -----Original Message-----
> From: Joseph Naegele [mailto:jnaeg...@grierforensics.com]
> Sent: Wednesday, April 06, 2016 4:14 PM
> To: user@tika.apache.org
> Subject: RE: script tags in LinkContentHandler
>
> Great, sounds good. Would you like me to open a ticket?
>
> With respect to parsing outlinks in Nutch, there's actually two problems:
>
> 1) <script> missing in LinkContentHandler
> 2) HtmlParser's DefaultHtmlMapper considers <script> a discardable element so
> it's discarded during the parse, similarly to <style>.
>
> Does anyone have opinions on #2?
>
> - Joe
>
> -----Original Message-----
> From: Markus Jelsma [mailto:markus.jel...@openindex.io]
> Sent: Wednesday, April 06, 2016 9:26 AM
> To: user@tika.apache.org
> Subject: RE: script tags in LinkContentHandler
>
> Yes indeed! Script is missing and that's a mistake. See discussion at
> TIKA-1835. We should open a new ticket for it.
> Markus
>
> -----Original message-----
>> From: Ken Krugler <kkrugler_li...@transpac.com>
>> Sent: Tuesday 5th April 2016 22:24
>> To: user@tika.apache.org
>> Subject: Re: script tags in LinkContentHandler
>>
>> Hi Joe,
>>
>> I was looking at the version of this file in the (git)
>> Tika-2.0 branch, not the (svn) trunk, and that change isn’t yet in 2.0 - my
>> mistake.
>>
>> I’d rolled in Markus’s patch directly to support these other
>> link types, but I wish I’d remembered the old TIKA-503 discussion, as it
>> would have been better to make that support conditional on using a different
>> constructor, as it’s usually not a good idea to surprise consumers of parse
>> output with new types of data (links).
>>
>> I’ll take this discussion over to TIKA-1835 now.
>>
>> -- Ken
>>
>> On Apr 5, 2016, at 12:53pm, Joseph Naegele
>> <jnaeg...@grierforensics.com> wrote:
>>
>> Thanks Ken, I'm confused though. The LinkContentHandler in 1.12 now
>> collects <a>, <link>, <iframe> and <img>, since
>> https://issues.apache.org/jira/browse/TIKA-1835. In my opinion, <script
>> src="…"> belongs in there with the rest of them. What do you think?
>>
>> Joe
>>
>> From: Ken Krugler [mailto:kkrugler_li...@transpac.com]
>> Sent: Tuesday, April 05, 2016 3:48 PM
>> To: user@tika.apache.org
>> Subject: Re: script tags in LinkContentHandler
>>
>> Hi Joe,
>>
>> On Apr 5, 2016, at 12:27pm, Joseph Naegele
>> <jnaeg...@grierforensics.com> wrote:
>>
>> Hi all,
>>
>> I'm using Nutch for crawling the web, and one of its built-in HTML parsers
>> uses Tika and its LinkContentHandler. I'm interested in collecting *all*
>> links on a web page, but I'm surprised the LinkContentHandler doesn't parse
>> <script> tags as links. When a <script> tag contains the "src" attribute,
>> the attribute should specify a URI and the tag should not contain any
>> content.
>>
>> Is there any particular reason the LinkContentHandler doesn't parse <script>
>> tags, or is it just that I'm the first to look for this functionality? I can
>> ping the dev mailing list too if necessary.
>>
>> I don’t think there’s a specific reason it’s not included, though see my
>> comment on https://issues.apache.org/jira/browse/TIKA-503,
>> e.g. what about <link> elements?
>>
>> -- Ken
>>
>> Nutch's other built-in HTML parser collects
>> all "outlinks", including <script> tags, but I'd prefer to use Tika and
>> Boilerpipe.
>>
>> Thanks,
>>
>> Joe Naegele
>>
>> ----------------
>> Ken Krugler
>> +1 530-210-6378
>> http://www.scaleunlimited.com
>> custom big data solutions & training
>> Hadoop, Cascading, Cassandra & Solr

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr
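
P.S. For the TagSoup/JSoup-then-XPath route described at the top of this message, here is a rough sketch of the XPath step only, using nothing but the JDK. It assumes the raw HTML has already been cleaned into well-formed XML (the TagSoup/JSoup step is not shown), and it uses the JDK's built-in DOM rather than Dom4J. The names ScriptSrcExtractor and extractScriptSrcs are made up for illustration; they are not part of Tika or Nutch.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper (not a Tika class): collects <script src="..."> values
// from markup that is ASSUMED to be well-formed XML already.
final class ScriptSrcExtractor {

    static List<String> extractScriptSrcs(String wellFormedXhtml) {
        try {
            // Namespace awareness is left off (the JDK default), so plain
            // element names such as "script" can be used in the XPath.
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(wellFormedXhtml)));

            // Select the src attribute of every <script> element that has one;
            // inline scripts without src are simply not matched.
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList attrs = (NodeList) xpath.evaluate("//script/@src", doc, XPathConstants.NODESET);

            List<String> srcs = new ArrayList<>();
            for (int i = 0; i < attrs.getLength(); i++) {
                srcs.add(attrs.item(i).getNodeValue());
            }
            return srcs;
        } catch (Exception e) {
            // Raw, uncleaned HTML will usually end up here.
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><head><script src=\"a.js\"></script></head>"
                + "<body><script>var inline = 1;</script>"
                + "<script src=\"http://example.com/b.js\"></script></body></html>";
        System.out.println(extractScriptSrcs(page));
    }
}
```

Swapping the XPath expression (say, to //link/@href) would pull out other specific elements the same way; relative URIs would still need resolving against the page's base URL.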