Re: script tags in LinkContentHandler

2016-04-06 Thread Luís Filipe Nassif
Hi,

I'm one of those from the forensic world and, of course, my use case needs to
extract everything.

I have already tried IdentityHtmlMapper to extract "value" attributes from
"input" elements, with no luck. The value is not extracted by DefaultHtmlMapper,
yet it is rendered by browsers, so I think DefaultHtmlMapper needs some
improvement. But is HtmlMapper the correct place to configure that, or must
something be done with HTMLSchema (I've tried that too, but I am not an HTML
expert)?

Thanks,
Luis

2016-04-06 17:33 GMT-03:00 Allison, Timothy B. <talli...@mitre.org>:

> On #2, I'd prefer not skipping elements.  I definitely understand the use
> case to extract what a human can see, but I suspect if your email address
> ends in 'forensics.com', you'd probably like to see everything as well.
>
> -----Original Message-----
> From: Joseph Naegele [mailto:jnaeg...@grierforensics.com]
> Sent: Wednesday, April 06, 2016 4:14 PM
> To: user@tika.apache.org
> Subject: RE: script tags in LinkContentHandler
>
> Great, sounds good. Would you like me to open a ticket?
>
> With respect to parsing outlinks in Nutch, there are actually two problems:
>
> 1) 

RE: script tags in LinkContentHandler

2016-04-06 Thread Allison, Timothy B.
On #2, I'd prefer not skipping elements.  I definitely understand the use case 
to extract what a human can see, but I suspect if your email address ends in 
'forensics.com', you'd probably like to see everything as well.

-----Original Message-----
From: Joseph Naegele [mailto:jnaeg...@grierforensics.com] 
Sent: Wednesday, April 06, 2016 4:14 PM
To: user@tika.apache.org
Subject: RE: script tags in LinkContentHandler

Great, sounds good. Would you like me to open a ticket?

With respect to parsing outlinks in Nutch, there are actually two problems:

1) 

RE: script tags in LinkContentHandler

2016-04-06 Thread Markus Jelsma
Hello! Yes, please open a ticket for it.

As for #2, in Nutch you can instruct the Tika parser to use a different 
HtmlMapper. Use IdentityHtmlMapper! I forgot the property, but you can look it 
up in TikaParser.java; it is near the bottom. The default mapper is indeed bad 
if you want to grab stuff from normal elements.

M.
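
To illustrate why the choice of mapper matters, here is a minimal sketch in
plain Java. It does not use Tika: ElementMapper, WhitelistMapper and
IdentityMapper are hypothetical stand-ins, and the whitelist contents are made
up. The assumption is that a default-style mapper passes only a fixed set of
"safe" elements and returns null for everything else, while an identity-style
mapper passes every element through.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical stand-in for the element-mapping part of an HTML mapper. */
interface ElementMapper {
    /** Returns the output name for an element, or null to drop its tag. */
    String mapSafeElement(String name);
}

/** Whitelist-style mapper: only a fixed set of elements survives. */
class WhitelistMapper implements ElementMapper {
    // Illustrative whitelist only; not the list used by any real library.
    private final Set<String> safe =
            new HashSet<>(Arrays.asList("p", "a", "h1", "table"));

    @Override
    public String mapSafeElement(String name) {
        String lower = name.toLowerCase(Locale.ENGLISH);
        return safe.contains(lower) ? lower : null;
    }
}

/** Identity-style mapper: every element is kept, lower-cased. */
class IdentityMapper implements ElementMapper {
    @Override
    public String mapSafeElement(String name) {
        return name.toLowerCase(Locale.ENGLISH);
    }
}
```

With the whitelist, "input" maps to null and its tag is dropped; with the
identity mapper it survives. In Tika the real classes are DefaultHtmlMapper and
IdentityHtmlMapper, and TikaParser.java shows how Nutch selects one.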

 
 
-----Original message-----
> From: Joseph Naegele <jnaeg...@grierforensics.com>
> Sent: Wednesday 6th April 2016 22:13
> To: user@tika.apache.org
> Subject: RE: script tags in LinkContentHandler
> 
> Great, sounds good. Would you like me to open a ticket?
> 
> With respect to parsing outlinks in Nutch, there are actually two problems:
> 
> 1) 

RE: script tags in LinkContentHandler

2016-04-06 Thread Joseph Naegele
Great, sounds good. Would you like me to open a ticket?

With respect to parsing outlinks in Nutch, there are actually two problems:

1) 

RE: script tags in LinkContentHandler

2016-04-06 Thread Markus Jelsma
Yes indeed! Script is missing and that's a mistake. See discussion at 
TIKA-1835. We should open a new ticket for it.
Markus

 
 
-----Original message-----
> From: Ken Krugler <kkrugler_li...@transpac.com>
> Sent: Tuesday 5th April 2016 22:24
> To: user@tika.apache.org
> Subject: Re: script tags in LinkContentHandler
> 
> Hi Joe,
>
> I was looking at the version of this file in the (git) Tika-2.0 branch, not
> the (svn) trunk, and that change isn’t yet in 2.0 - my mistake.
>
> I’d rolled in Markus’s patch directly to support these other link types, but
> I wish I’d remembered the old TIKA-503 discussion, as it would have been
> better to make that support conditional on using a different constructor, as
> it’s usually not a good idea to surprise consumers of parse output with new
> types of data (links).
>
> I’ll take this discussion over to TIKA-1835 now.
>
> — Ken  
>
> On Apr 5, 2016, at 12:53pm, Joseph Naegele <jnaeg...@grierforensics.com>
> wrote:
>
> Thanks Ken,
>
> I'm confused though. The LinkContentHandler in 1.12 now collects , ,
>  and , since https://issues.apache.org/jira/browse/TIKA-1835. In my opinion,

Re: script tags in LinkContentHandler

2016-04-05 Thread Ken Krugler
Hi Joe,

I was looking at the version of this file in the (git) Tika-2.0 branch, not the 
(svn) trunk, and that change isn’t yet in 2.0 - my mistake.

I’d rolled in Markus’s patch directly to support these other link types, but I 
wish I’d remembered the old TIKA-503 discussion, as it would have been better 
to make that support conditional on using a different constructor, as it’s 
usually not a good idea to surprise consumers of parse output with new types of 
data (links).
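
A minimal sketch of the "different constructor" idea, using only the JDK's SAX
classes. OptInLinkCollector is a hypothetical class, not Tika's
LinkContentHandler, and treating the src attribute of script elements as the
opt-in link type is an assumption made for illustration. The no-argument
constructor keeps the old behaviour (anchors only), so existing consumers see
no new kinds of links unless they ask for them.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical link collector; not Tika's LinkContentHandler. */
class OptInLinkCollector extends DefaultHandler {
    private final boolean includeScripts;
    private final List<String> links = new ArrayList<>();

    /** Default: collect anchor hrefs only, as existing callers expect. */
    OptInLinkCollector() {
        this(false);
    }

    /** Opt-in: callers explicitly request script sources as links too. */
    OptInLinkCollector(boolean includeScripts) {
        this.includeScripts = includeScripts;
    }

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) {
        // The default JDK parser is not namespace-aware, so match on qName.
        if ("a".equals(qName)) {
            add(atts.getValue("href"));
        } else if (includeScripts && "script".equals(qName)) {
            add(atts.getValue("src"));
        }
    }

    private void add(String target) {
        if (target != null) {
            links.add(target);
        }
    }

    List<String> getLinks() {
        return links;
    }

    /** Parses well-formed XHTML and returns the links the handler collected. */
    static List<String> collect(String xhtml, OptInLinkCollector handler)
            throws Exception {
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.getLinks();
    }
}
```

For example, collect(page, new OptInLinkCollector()) returns only anchor
targets, while collect(page, new OptInLinkCollector(true)) also returns script
sources. The input must be well-formed XHTML here, since the JDK SAX parser is
not an HTML parser.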

I’ll take this discussion over to TIKA-1835 now.

— Ken 


> On Apr 5, 2016, at 12:53pm, Joseph Naegele  
> wrote:
> 
> Thanks Ken,
>  
> I'm confused though. The LinkContentHandler in 1.12 now collects , , 
>  and , since https://issues.apache.org/jira/browse/TIKA-1835. In my
> opinion, 

RE: script tags in LinkContentHandler

2016-04-05 Thread Joseph Naegele
Thanks Ken,

 

I'm confused though. The LinkContentHandler in 1.12 now collects , , 
 and , since https://issues.apache.org/jira/browse/TIKA-1835. In 
my opinion, 

Re: script tags in LinkContentHandler

2016-04-05 Thread Ken Krugler
Hi Joe,

> On Apr 5, 2016, at 12:27pm, Joseph Naegele  
> wrote:
> 
> Hi all,
>  
> I'm using Nutch for crawling the web, and one of its built-in HTML parsers 
> uses Tika and its LinkContentHandler. I'm interested in collecting *all* 
> links on a web page, but I'm surprised the LinkContentHandler doesn't parse 
> 

script tags in LinkContentHandler

2016-04-05 Thread Joseph Naegele
Hi all,

 

I'm using Nutch for crawling the web, and one of its built-in HTML parsers
uses Tika and its LinkContentHandler. I'm interested in collecting *all*
links on a web page, but I'm surprised the LinkContentHandler doesn't parse