option as
at the HTML parser for Tika.
Hope that helps,
Jim
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Friday, June 30, 2017 23:13
To: user@tika.apache.org
Subject: RE: HTML parsing, script tags,
Wait, TagSoup is not returning the start-element events in the same order as
the
Type("a", HTMLSchema.M_ANY, 65535, 0);
// Set up a parse context
ParseContext context = new ParseContext();
context.set(Schema.class, schema);
context.set(HtmlMapper.class, IdentityHtmlMapper.INSTANCE);
The changed HTMLSchema and the usage of Ident
about swapping out
the underlying html parser for something more modern.
Sorry Tika isn’t working for you on this, and thank you!
From: Jim Idle [mailto:ji...@proofpoint.com]
Sent: Friday, June 30, 2017 1:23 AM
To: user@tika.apache.org
Subject: RE: HTML parsing, script tags,
Well I got a long
Subject: Re: HTML parsing, script tags,
Hi Jim,
On Jun 28, 2017, at 12:07am, Jim Idle <ji...@proofpoint.com> wrote:
So right now it looks like the HTML parser only sends through script tags if they
have a src attribute. Is this likely to change or should I use another parser for
HTML? I
Thanks Ken, that’s probably what I need. I was trying to find a Config class
but it seems I need to use a different mapper as you say.
Jim
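
For readers following the thread, here is a minimal, self-contained sketch of the idea being discussed: a default policy that forwards script tags only when they carry a src attribute, versus an identity-style policy that forwards everything. It is a toy model only. TagPolicy, ScriptTagPolicySketch, Tag, forwarded and sample are hypothetical names made up for illustration and are not Tika's HtmlMapper or IdentityHtmlMapper classes; the sample input is invented too.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a tag-mapping policy; NOT Tika's HtmlMapper interface.
interface TagPolicy {
    boolean keep(String tagName, Map<String, String> attributes);
}

final class ScriptTagPolicySketch {

    // Models the behaviour reported in the thread:
    // a <script> tag survives only if it has a src attribute.
    static final TagPolicy SRC_ONLY =
            (name, attrs) -> !"script".equals(name) || attrs.containsKey("src");

    // Models an identity-style policy: every tag is forwarded unchanged.
    static final TagPolicy IDENTITY = (name, attrs) -> true;

    // Minimal start-tag event: an element name plus its attributes.
    static final class Tag {
        final String name;
        final Map<String, String> attrs;

        Tag(String name, Map<String, String> attrs) {
            this.name = name;
            this.attrs = attrs;
        }
    }

    // Returns the names of the tags the given policy lets through, in document order.
    static List<String> forwarded(List<Tag> tags, TagPolicy policy) {
        List<String> out = new ArrayList<>();
        for (Tag tag : tags) {
            if (policy.keep(tag.name, tag.attrs)) {
                out.add(tag.name);
            }
        }
        return out;
    }

    // Made-up sample input: one inline script, one external script, one paragraph.
    static List<Tag> sample() {
        List<Tag> tags = new ArrayList<>();
        tags.add(new Tag("script", Collections.<String, String>emptyMap()));
        tags.add(new Tag("script", Collections.singletonMap("src", "app.js")));
        tags.add(new Tag("p", Collections.<String, String>emptyMap()));
        return tags;
    }

    public static void main(String[] args) {
        System.out.println(forwarded(sample(), SRC_ONLY));  // prints [script, p]
        System.out.println(forwarded(sample(), IDENTITY));  // prints [script, script, p]
    }
}
```

Under the src-only policy the inline script is dropped, while the identity-style policy keeps it. That is the difference swapping the mapper is meant to achieve, assuming the real parser behaves as described in the thread.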
From: Ken Krugler [mailto:kkrugler_li...@transpac.com]
Sent: Wednesday, June 28, 2017 23:06
To: user@tika.apache.org
Subject: Re: HTML parsing, script tags
Hi Jim,
> On Jun 28, 2017, at 12:07am, Jim Idle wrote:
>
> So right now it looks like the HTML parser only sends through script tags if
> they have a src attribute. Is this likely to change or should I use another parser
> for HTML? I could submit a patch for this of course.
You can use a custom map
So right now it looks like the HTML parser only sends through script tags if they
have a src attribute. Is this likely to change or should I use another parser for
HTML? I could submit a patch for this of course.
Also, does anyone have an opinion if the underlying tag soup stuff is tolerant
of HTML in