Hi Joe,
In that case, I’d file a Jira issue with two test docs attached, one with a
regular
IdentityHtmlMapper solves the problem of elements being discarded. There is
another problem with extracting
-----Original Message-----
> From: Joseph Naegele [mailto:jnaeg...@grierforensics.com]
> Sent: Wednesday, April 06, 2016 4:14 PM
> To: user@tika.apache.org
> Subject: RE: script tags in LinkContentHandler
>
> Great, sounds good. Would you like me to open a ticket?
>
> With respect to parsing outlinks in Nutch, there's actually two problems:
>
> 1)
normal elements.
M.
-----Original message-----
> From: Joseph Naegele
> Sent: Wednesday 6th April 2016 22:13
> To: user@tika.apache.org
> Subject: RE: script tags in LinkContentHandler
>
> Great, sounds good. Would you like me to open a ticket?
>
> With respect to p
Great, sounds good. Would you like me to open a ticket?
With respect to parsing outlinks in Nutch, there's actually two problems:
1)
Yes indeed! Script is missing and that's a mistake. See discussion at
TIKA-1835. We should open a new ticket for it.
Markus
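
For readers following the thread, here is a rough, stand-alone sketch of what "collecting script as a link" could look like in a SAX handler. It is not Tika's LinkContentHandler and does not use any Tika API: the class name ScriptLinkCollector, the collect() helper, and the "element:target" output format are all made up for illustration, and it assumes well-formed XHTML input parsed with the JDK's own SAX parser.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical stand-in for a link-collecting SAX handler; NOT Tika's class. */
public class ScriptLinkCollector extends DefaultHandler {
    private final List<String> links = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // The default JDK SAX parser is not namespace-aware, so the tag name is in qName.
        String name = qName.toLowerCase(Locale.ROOT);
        String target = null;
        if (name.equals("a") || name.equals("link")) {
            target = atts.getValue("href");
        } else if (name.equals("img") || name.equals("iframe") || name.equals("script")) {
            // Treat script like the other src-bearing elements.
            target = atts.getValue("src");
        }
        // Inline scripts have no src attribute and are skipped.
        if (target != null) {
            links.add(name + ":" + target);
        }
    }

    public List<String> getLinks() {
        return links;
    }

    /** Parses a well-formed XHTML string and returns "element:target" entries in document order. */
    public static List<String> collect(String xhtml) throws Exception {
        ScriptLinkCollector handler = new ScriptLinkCollector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.getLinks();
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><head><script src=\"app.js\"></script></head>"
                + "<body><a href=\"next.html\">next</a></body></html>";
        System.out.println(collect(page));
    }
}
```

The point is only that script can be handled in the same branch as the other src-carrying elements; how Tika should actually expose it is for the new ticket to settle.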
-----Original message-----
> From: Ken Krugler
> Sent: Tuesday 5th April 2016 22:24
> To: user@tika.apache.org
> Subject: Re: script tags in LinkConte
Hi Joe,
I was looking at the version of this file in the (git) Tika-2.0 branch, not the
(svn) trunk, and that change isn’t yet in 2.0 - my mistake.
I’d rolled in Markus’s patch directly to support these other link types, but I
wish I’d remembered the old TIKA-503 discussion, as it would have be
Thanks Ken,
I'm confused though. The LinkContentHandler in 1.12 now collects , ,
and , since https://issues.apache.org/jira/browse/TIKA-1835. In
my opinion,
Hi Joe,
> On Apr 5, 2016, at 12:27pm, Joseph Naegele
> wrote:
>
> Hi all,
>
> I'm using Nutch for crawling the web, and one of its built-in HTML parsers
> uses Tika and its LinkContentHandler. I'm interested in collecting *all*
> links on a web page, but I'm surprised the LinkContentHandler
Hi all,
I'm using Nutch for crawling the web, and one of its built-in HTML parsers
uses Tika and its LinkContentHandler. I'm interested in collecting *all*
links on a web page, but I'm surprised the LinkContentHandler doesn't parse