TagSoup is notorious for being utterly unmaintained, but it can be forced to do
what I, at least, needed:
// We'll change the schema to allow tables inside anchors!
Schema schema = new HTMLSchema();
// Have meta reported everywhere, also in the body
schema.elementType("meta", HTMLSchema.M_EMPTY, 65535, 0);
// https://issues.apache.org/jira/browse/TIKA-985
String[] html5Elements = { "article", "aside", "audio", "bdi",
    "command", "datalist", "details", "embed", "summary", "figure",
    "figcaption", "footer", "header", "hgroup", "keygen", "mark",
    "meter", "nav", "output", "progress", "section", "source", "time",
    "track", "video", "figurecaption" };
for (String html5Element : html5Elements) {
    schema.elementType(html5Element, HTMLSchema.M_ANY, 255, 0);
}
schema.elementType("a", HTMLSchema.M_ANY, 65535, 0);
// Set up a parse context
ParseContext context = new ParseContext();
context.set(Schema.class, schema);
context.set(HtmlMapper.class, IdentityHtmlMapper.INSTANCE);
The changed HTMLSchema and the use of IdentityHtmlMapper make it possible to
return elements that TagSoup in its default configuration does not.
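
To check which elements actually come through, and in what order, a small
recording handler can help. The following is only a rough sketch:
ElementOrderRecorder and its record() method are hypothetical names, not part
of Tika or TagSoup, and the JDK's own SAX parser stands in for the HTML parser
so that the example runs without extra dependencies. With Tika, the same
handler would presumably be passed as the ContentHandler argument of
parse(stream, handler, metadata, context), together with the context set up
above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: records the names of startElement events in arrival order.
class ElementOrderRecorder extends DefaultHandler {
    private final List<String> names = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // qName is populated even when the parser is not namespace-aware.
        names.add(qName);
    }

    List<String> names() {
        return names;
    }

    // Stand-in driver: parses a well-formed snippet with the JDK SAX parser.
    // This is an assumption for illustration, not how Tika is invoked.
    static List<String> record(String xml) {
        ElementOrderRecorder recorder = new ElementOrderRecorder();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), recorder);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample", e);
        }
        return recorder.names();
    }

    public static void main(String[] args) {
        // A table inside an anchor, plus a meta element in the body.
        System.out.println(record(
            "<html><body><a><table><tr><td>x</td></tr></table></a><meta/></body></html>"));
    }
}
```

Comparing the recorded list with the element order in the source document
should show quickly whether the schema changes have the intended effect.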
Regards,
Markus
-----Original message-----
> From: Allison, Timothy B. <[email protected]>
> Sent: Friday 30th June 2017 17:13
> To: [email protected]
> Subject: RE: HTML parsing, script tags,
>
> Wait, TagSoup is not returning the start element events in the same order as
> the HTML? I don’t think we can fix that or your other points, but would
> you be willing to share triggering documents and open an issue for each
> problem?
> We should include those issues in our ongoing conversation about swapping out
> the underlying html parser for something more modern.
> Sorry Tika isn’t working for you on this, and thank you!
> From: Jim Idle [mailto:[email protected]]
> Sent: Friday, June 30, 2017 1:23 AM
> To: [email protected]
> Subject: RE: HTML parsing, script tags,
> Well, I got a long way with the Tika wrapper around TagSoup, but then while
> chasing down a bug I realized that I was not getting the startElement events
> in the order that they are seen in the HTML file. It also ignores <!doctype>
> and unknown elements.
> I can’t see any way to change that, and since knowing the structure of the
> document is very important, I will have to stop using Tika for HTML, I
> guess, and go back to validator.nu.
> Just posting this here for posterity really.
> Jim
> From: Ken Krugler [mailto:[email protected]]
> Sent: Wednesday, June 28, 2017 23:06
> To: [email protected]
> Subject: Re: HTML parsing, script tags,
> Hi Jim,
> On Jun 28, 2017, at 12:07am, Jim Idle <[email protected]> wrote:
> So right now it looks like the HTML parser only sends through script tags if
> they have a src attribute. Is this likely to change, or should I use another
> parser for HTML? I could submit a patch for this, of course.
> You can use a custom mapper if you want to alter which tags get passed
> through.
> E.g. check out IdentityHtmlMapper in Tika for a mapper that passes through
> everything.
> Also, does anyone have an opinion on whether the underlying TagSoup stuff is
> tolerant of HTML (in a similar manner to browsers, which will try to render
> anything) or is expecting well-formed HTML? I can go look at the TagSoup
> stuff directly, of course, but just wondered if anyone has experience of
> using Tika to parse HTML.
> TagSoup (and JSoup and NekoHTML) are all Java libraries that try to fix up
> broken HTML, with varying degrees of success, depending on the way that HTML
> is broken.
> — Ken
> --------------------------
> Ken Krugler
> 1 530-210-6378
> http://www.scaleunlimited.com
> custom big data solutions & training
> Hadoop, Cascading, Cassandra & Solr