TagSoup is notorious for being utterly unmaintained, but it can be forced to do what, at least, I needed:

// Imports needed by the snippet below
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.html.HtmlMapper;
import org.apache.tika.parser.html.IdentityHtmlMapper;
import org.ccil.cowan.tagsoup.HTMLSchema;
import org.ccil.cowan.tagsoup.Schema;

// We'll change the schema to allow tables inside anchors!
Schema schema = new HTMLSchema();

// Have meta reported everywhere, also in the body
schema.elementType("meta", HTMLSchema.M_EMPTY, 65535, 0);

// Register HTML5 elements, see https://issues.apache.org/jira/browse/TIKA-985
String[] html5Elements = {
    "article", "aside", "audio", "bdi", "command", "datalist", "details",
    "embed", "summary", "figure", "figcaption", "footer", "header", "hgroup",
    "keygen", "mark", "meter", "nav", "output", "progress", "section",
    "source", "time", "track", "video", "figurecaption"
};
for (String html5Element : html5Elements) {
    schema.elementType(html5Element, HTMLSchema.M_ANY, 255, 0);
}

// Let anchors contain any element, including tables
schema.elementType("a", HTMLSchema.M_ANY, 65535, 0);

// Set up a parse context
ParseContext context = new ParseContext();
context.set(Schema.class, schema);
context.set(HtmlMapper.class, IdentityHtmlMapper.INSTANCE);

The changed HTMLSchema and the use of IdentityHtmlMapper make it possible to return elements that TagSoup with its default schema cannot.

Regards,
Markus

-----Original message-----
> From: Allison, Timothy B. <talli...@mitre.org>
> Sent: Friday 30th June 2017 17:13
> To: user@tika.apache.org
> Subject: RE: HTML parsing, script tags,
>
> Wait, TagSoup is not returning the start element events in the same order as
> the HTML? I don’t think we can fix that or your other points, but would
> you be willing to share triggering documents and open an issue for each
> problem?
>
> We should include those issues in our ongoing conversation about swapping out
> the underlying HTML parser for something more modern.
>
> Sorry Tika isn’t working for you on this, and thank you!
>
> From: Jim Idle [mailto:ji...@proofpoint.com]
> Sent: Friday, June 30, 2017 1:23 AM
> To: user@tika.apache.org
> Subject: RE: HTML parsing, script tags,
>
> Well, I got a long way with the Tika wrapper around TagSoup, but then while
> chasing down a bug I realized that I was not getting the startElement events
> in the order that they are seen in the HTML file. It also ignores <!doctype>
> and unknown elements.
>
> I can’t see any way to change that, and as knowing the structure of the
> document is very important, I will have to stop using Tika for HTML, I
> guess, and go back to validator.nu.
>
> Just posting this here for posterity really.
>
> Jim
>
> From: Ken Krugler [mailto:kkrugler_li...@transpac.com]
> Sent: Wednesday, June 28, 2017 23:06
> To: user@tika.apache.org
> Subject: Re: HTML parsing, script tags,
>
> Hi Jim,
>
> On Jun 28, 2017, at 12:07am, Jim Idle <ji...@proofpoint.com> wrote:
>
> So right now it looks like the HTML parser only sends through script tags if
> they have a src attribute. Is this likely to change or should I use another
> parser for HTML? I could submit a patch for this of course.
>
> You can use a custom mapper if you want to alter which tags get passed
> through.
>
> E.g. check out IdentityHtmlMapper in Tika for a mapper that passes through
> everything.
>
> Also, does anyone have an opinion on whether the underlying TagSoup stuff is
> tolerant of HTML in a similar manner to browsers (which will try to render
> anything) or is expecting well-formed HTML? I can go look at the TagSoup
> stuff directly of course, but just wondered if anyone has experience of
> using Tika to parse HTML.
>
> TagSoup (and JSoup and NekoHTML) are all Java libraries that try to fix up
> broken HTML, with varying degrees of success, depending on the way that HTML
> is broken.
>
> -- Ken
>
> --------------------------
> Ken Krugler
> 1 530-210-6378
> http://www.scaleunlimited.com
> custom big data solutions & training
> Hadoop, Cascading, Cassandra & Solr
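
A note on the numeric arguments in the schema.elementType(name, model, memberOf, flags) calls at the top of this message: as far as I understand TagSoup, "model" is a bitmask of the content models an element accepts as children, "memberOf" is a bitmask of the content models the element itself belongs to, and a parent may contain a child when the two masks share at least one bit. The following is a toy sketch of that idea only, not TagSoup's code. The class name, the helper canContain, and the bit values for M_INLINE, M_BLOCK and M_HEAD are all assumptions made up for illustration; only the overlap rule and the values 65535 and M_ANY mirror the snippet above.

```java
// Toy model of bitmask-based content rules. NOT TagSoup's actual code:
// the class, the helper and the three model bits below are hypothetical.
public class ContentModelSketch {
    static final int M_INLINE = 1 << 0;     // hypothetical "inline content" bit
    static final int M_BLOCK  = 1 << 1;     // hypothetical "block content" bit
    static final int M_HEAD   = 1 << 2;     // hypothetical "head-only content" bit
    static final int M_EMPTY  = 0;          // element accepts no children
    static final int M_ANY    = 0xFFFFFFFF; // element accepts any child

    // A parent accepts a child when the parent's model mask and the
    // child's memberOf mask have at least one bit in common.
    static boolean canContain(int parentModel, int childMemberOf) {
        return (parentModel & childMemberOf) != 0;
    }

    public static void main(String[] args) {
        int bodyModel = M_INLINE | M_BLOCK;

        // A meta that is only a member of the head model is rejected by body...
        System.out.println(canContain(bodyModel, M_HEAD));   // false
        // ...but with memberOf = 65535 (the low 16 bits set) it fits there too.
        System.out.println(canContain(bodyModel, 65535));    // true

        // An anchor that accepts only inline content rejects a block-level table...
        System.out.println(canContain(M_INLINE, M_BLOCK));   // false
        // ...while an anchor with model M_ANY accepts it.
        System.out.println(canContain(M_ANY, M_BLOCK));      // true
    }
}
```

Read this way, passing 65535 as memberOf makes an element acceptable almost anywhere, and passing M_ANY as model lets an element take any child, which matches the comments in the original snippet about meta in the body and tables inside anchors.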