RE: HTML parsing, script tags,

2017-07-02 Thread Jim Idle
option as at the HTML parser for Tika. Hope that helps, Jim From: Allison, Timothy B. [mailto:talli...@mitre.org] Sent: Friday, June 30, 2017 23:13 To: user@tika.apache.org Subject: RE: HTML parsing, script tags, Wait, Tagsoup is not returning the start element events in the same order as the

RE: HTML parsing, script tags,

2017-06-30 Thread Markus Jelsma
Type("a", HTMLSchema.M_ANY, 65535, 0);
    // Set up a parse context
    ParseContext context = new ParseContext();
    context.set(Schema.class, schema);
    context.set(HtmlMapper.class, IdentityHtmlMapper.INSTANCE);
The changed HTMLSchema and the usage of Ident

RE: HTML parsing, script tags,

2017-06-30 Thread Allison, Timothy B.
about swapping out the underlying html parser for something more modern. Sorry Tika isn’t working for you on this, and thank you! From: Jim Idle [mailto:ji...@proofpoint.com] Sent: Friday, June 30, 2017 1:23 AM To: user@tika.apache.org Subject: RE: HTML parsing, script tags, Well I got a long

RE: HTML parsing, script tags,

2017-06-29 Thread Jim Idle
Subject: Re: HTML parsing, script tags, Hi Jim, On Jun 28, 2017, at 12:07am, Jim Idle <ji...@proofpoint.com> wrote: So right now it looks like the HTML parser only sends through script tags if they have a src attribute. Is this likely to change or should I use another parser for HTML? I

RE: HTML parsing, script tags,

2017-06-28 Thread Jim Idle
Thanks Ken, that’s probably what I need. I was trying to find a Config class but it seems I need to use a different mapper as you say. Jim From: Ken Krugler [mailto:kkrugler_li...@transpac.com] Sent: Wednesday, June 28, 2017 23:06 To: user@tika.apache.org Subject: Re: HTML parsing, script tags

Re: HTML parsing, script tags,

2017-06-28 Thread Ken Krugler
Hi Jim,
> On Jun 28, 2017, at 12:07am, Jim Idle wrote:
>
> So right now it looks like the HTML parser only sends through script tags if
> they have a src attribute. Is this likely to change or should I use another
> parser for HTML? I could submit a patch for this of course.
You can use a custom map

HTML parsing, script tags,

2017-06-28 Thread Jim Idle
So right now it looks like the HTML parser only sends through script tags if they have a src attribute. Is this likely to change or should I use another parser for HTML? I could submit a patch for this of course. Also, does anyone have an opinion on whether the underlying TagSoup stuff is tolerant of HTML in