Re: script tags in LinkContentHandler

2016-04-06 Thread Ken Krugler
Hi Joe, In that case, I’d file a Jira issue with two test docs attached, one with a regular

RE: script tags in LinkContentHandler

2016-04-06 Thread Joseph Naegele
IdentityHtmlMapper solves the problem of elements being discarded. There is another problem with extracting

Re: script tags in LinkContentHandler

2016-04-06 Thread Ken Krugler
-----Original Message----- > From: Joseph Naegele [mailto:jnaeg...@grierforensics.com] > Sent: Wednesday, April 06, 2016 4:14 PM > To: user@tika.apache.org > Subject: RE: script tags in LinkContentHandler > > Great, sounds good. Would you like me to open a ticket? > > With respe

Re: script tags in LinkContentHandler

2016-04-06 Thread Luís Filipe Nassif
:jnaeg...@grierforensics.com] > Sent: Wednesday, April 06, 2016 4:14 PM > To: user@tika.apache.org > Subject: RE: script tags in LinkContentHandler > > Great, sounds good. Would you like me to open a ticket? > > With respect to parsing outlinks in Nutch, there's actually two problems: > > 1)

RE: script tags in LinkContentHandler

2016-04-06 Thread Allison, Timothy B.
aegele [mailto:jnaeg...@grierforensics.com] Sent: Wednesday, April 06, 2016 4:14 PM To: user@tika.apache.org Subject: RE: script tags in LinkContentHandler Great, sounds good. Would you like me to open a ticket? With respect to parsing outlinks in Nutch, there's actually two problems: 1)

RE: script tags in LinkContentHandler

2016-04-06 Thread Markus Jelsma
normal elements. M. -----Original message----- > From: Joseph Naegele > Sent: Wednesday 6th April 2016 22:13 > To: user@tika.apache.org > Subject: RE: script tags in LinkContentHandler > > Great, sounds good. Would you like me to open a ticket? > > With respect to p

RE: script tags in LinkContentHandler

2016-04-06 Thread Joseph Naegele
Great, sounds good. Would you like me to open a ticket? With respect to parsing outlinks in Nutch, there are actually two problems: 1)

RE: script tags in LinkContentHandler

2016-04-06 Thread Markus Jelsma
Yes indeed! Script is missing and that's a mistake. See discussion at TIKA-1835. We should open a new ticket for it. Markus -----Original message----- > From: Ken Krugler > Sent: Tuesday 5th April 2016 22:24 > To: user@tika.apache.org > Subject: Re: script tags in LinkConte

Re: script tags in LinkContentHandler

2016-04-05 Thread Ken Krugler
Hi Joe, I was looking at the version of this file in the (git) Tika-2.0 branch, not the (svn) trunk, and that change isn’t yet in 2.0 - my mistake. I’d rolled in Markus’s patch directly to support these other link types, but I wish I’d remembered the old TIKA-503 discussion, as it would have be

RE: script tags in LinkContentHandler

2016-04-05 Thread Joseph Naegele
Thanks Ken, I'm confused though. The LinkContentHandler in 1.12 now collects , , and , since https://issues.apache.org/jira/browse/TIKA-1835. In my opinion,

Re: script tags in LinkContentHandler

2016-04-05 Thread Ken Krugler
Hi Joe, > On Apr 5, 2016, at 12:27pm, Joseph Naegele wrote: > > Hi all, > > I'm using Nutch for crawling the web, and one of its built-in HTML parsers > uses Tika and its LinkContentHandler. I'm interested in collecting *all* > links on a web page, but I'm surprised the LinkContentHandler

script tags in LinkContentHandler

2016-04-05 Thread Joseph Naegele
Hi all, I'm using Nutch for crawling the web, and one of its built-in HTML parsers uses Tika and its LinkContentHandler. I'm interested in collecting *all* links on a web page, but I'm surprised the LinkContentHandler doesn't parse