Thanks Ken, that’s probably what I need. I was trying to find a Config class but it seems I need to use a different mapper as you say.
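For anyone finding this thread later, here is a minimal sketch of what swapping the mapper might look like. It assumes a Tika 1.x `tika-parsers` jar on the classpath; the class name `ScriptTagDemo` and the helper `parseKeepingAllTags` are made up for illustration, and I have not checked exactly what comes through for inline scripts in every Tika version.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.html.HtmlMapper;
import org.apache.tika.parser.html.HtmlParser;
import org.apache.tika.parser.html.IdentityHtmlMapper;
import org.apache.tika.sax.ToXMLContentHandler;

// Hypothetical demo class; requires Apache Tika (tika-parsers) on the classpath.
public class ScriptTagDemo {

    // Parses the given HTML with IdentityHtmlMapper registered in the
    // ParseContext, so the mapper no longer filters elements, and returns
    // the resulting SAX events serialized as XHTML text.
    static String parseKeepingAllTags(String html) throws Exception {
        ParseContext context = new ParseContext();
        // Replace the default mapper with the pass-through one Ken mentions.
        context.set(HtmlMapper.class, IdentityHtmlMapper.INSTANCE);

        ToXMLContentHandler handler = new ToXMLContentHandler();
        try (InputStream in =
                new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8))) {
            new HtmlParser().parse(in, handler, new Metadata(), context);
        }
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        String html =
            "<html><body><p>hi</p><script>var x = 1;</script></body></html>";
        System.out.println(parseKeepingAllTags(html));
    }
}
```

If passing everything through turns out to be too much, the same `context.set(HtmlMapper.class, ...)` call should accept any other `HtmlMapper` implementation, so a narrower custom mapper could be dropped in the same way.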
Jim

From: Ken Krugler [mailto:[email protected]]
Sent: Wednesday, June 28, 2017 23:06
To: [email protected]
Subject: Re: HTML parsing, script tags,

Hi Jim,

On Jun 28, 2017, at 12:07am, Jim Idle <[email protected]> wrote:

So right now it looks like the HTML parser only sends through script tags if they have a src attribute. Is this likely to change or should I use another parser for HTML? I could submit a patch for this of course.

You can use a custom mapper if you want to alter which tags get passed through. E.g. check out IdentityHtmlMapper in Tika for a mapper that passes through everything.

Also, does anyone have an opinion on whether the underlying TagSoup stuff is tolerant of HTML in a similar manner to browsers (which will try to render anything) or is expecting well-formed HTML? I can go look at the TagSoup stuff directly of course, but just wondered if anyone has experience of using Tika to parse HTML.

TagSoup (and JSoup and NekoHTML) are all Java libraries that try to fix up broken HTML, with varying degrees of success, depending on the way that HTML is broken.

-- Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr
