So I found HtmlParser.setExtractScripts(), which sounded very promising! I changed the code to use HtmlParser instead of AutoDetectParser and set the flag to true. Unfortunately, the script's contents were still not reported in the characters() method. No idea why.
I also found TagSoup's Parser.CDATAElementsFeature constant:
<https://javadoc.io/static/org.ccil.cowan.tagsoup/tagsoup/1.2.1/org/ccil/cowan/tagsoup/Parser.html#CDATAElementsFeature>

It seems to be the same as:
http://www.ccil.org/~cowan/tagsoup/features/cdata-elements

  A value of "true" indicates that the parser will process the script and style elements (or any elements with type='cdata' in the TSSL schema) as SGML CDATA elements (that is, no markup is recognized except the matching end-tag).

That sounds promising, or at least like something to try. But how exactly do we set that feature from code, or in tika-config.xml if that is the better place? It isn't really obvious at the moment.

Many thanks,
Markus

On Tue, 28 May 2024 at 12:19, Markus Jelsma <[email protected]> wrote:

> Hello,
>
> We're using Tika to parse HTML via a custom ContentHandler. This works
> really well, except that in some cases the contents of script tags in the
> head are not reported to the characters() method of the ContentHandler.
>
> We're using this code:
>
> TikaConfig tikaConfig = new TikaConfig(SAXTestCase.class.getResourceAsStream("/tika-config.xml"));
> Schema schema = new HTMLSchema();
> ParseContext context = new ParseContext();
> context.set(Schema.class, schema);
> context.set(HtmlMapper.class, IdentityHtmlMapper.INSTANCE);
> Metadata metadata = new Metadata();
> ReadableContentHandler handler = new ReadableContentHandler(url, config);
> AutoDetectParser parser = new AutoDetectParser(tikaConfig);
> InputStream stream = SAXTestCase.class.getResourceAsStream(path);
> parser.parse(stream, handler, metadata, context);
>
> If we fiddle with TagSoup's Schema, some of the bad examples suddenly
> report the characters of the script tag. But, in good tradition, other
> stuff breaks, and things like meta fields in some other HTML examples are
> no longer reported.
>
> schema.elementType("script", HTMLSchema.M_ANY, 255, 0);
>
> Now, I don't even know if changing the schema is a good idea, or if there
> is some other setting in Tika that I do not know or forgot about.
>
> Does anyone here have any ideas?
>
> Thanks,
> Markus
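For reference, a minimal sketch of what "setting the feature from code" could look like at the SAX level. It does not use Tika. It assumes the feature would be set through the standard XMLReader.setFeature() call, using the feature URI quoted above. The class name ScriptFeatureSketch and the helpers trySetFeature() and collectScriptText() are hypothetical names made up for this illustration. The demo runs against the JDK's own XML parser, which is not expected to recognize the TagSoup feature, so it only shows the mechanics: attempt to set the feature, then check what characters() reports inside script elements.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class ScriptFeatureSketch {

    // Feature URI quoted in the message above.
    static final String CDATA_ELEMENTS_FEATURE =
            "http://www.ccil.org/~cowan/tagsoup/features/cdata-elements";

    /** Collects the text reported via characters() while inside script elements. */
    static class ScriptTextCollector extends DefaultHandler {
        final List<String> scripts = new ArrayList<>();
        private StringBuilder current; // non-null only while inside a script element

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if ("script".equalsIgnoreCase(qName)) {
                current = new StringBuilder();
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (current != null) {
                current.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("script".equalsIgnoreCase(qName) && current != null) {
                scripts.add(current.toString());
                current = null;
            }
        }
    }

    /** Tries to set a SAX feature; returns false if the reader rejects it. */
    static boolean trySetFeature(XMLReader reader, String name, boolean value) {
        try {
            reader.setFeature(name, value);
            return true;
        } catch (SAXNotRecognizedException | SAXNotSupportedException e) {
            return false;
        }
    }

    /** Parses the markup with the given reader and returns the text of each script element. */
    static List<String> collectScriptText(XMLReader reader, String markup) throws Exception {
        ScriptTextCollector collector = new ScriptTextCollector();
        reader.setContentHandler(collector);
        reader.parse(new InputSource(new StringReader(markup)));
        return collector.scripts;
    }

    public static void main(String[] args) throws Exception {
        // JDK parser used only so the sketch runs without extra dependencies.
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        System.out.println("feature applied: "
                + trySetFeature(reader, CDATA_ELEMENTS_FEATURE, true));

        String xhtml = "<html><head><script>var a = 1;</script></head><body/></html>";
        System.out.println(collectScriptText(reader, xhtml));
    }
}
```

With TagSoup on the classpath, an org.ccil.cowan.tagsoup.Parser instance could be passed as the reader instead, since that class implements XMLReader. Whether the same feature can be reached through Tika's HtmlParser or tika-config.xml is exactly the open question above, and this sketch does not answer it.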
