Re: Script tag contents not always reported in ContentHandler

Tim Allison Thu, 30 May 2024 05:53:07 -0700

Markus,
  I'm sorry for my delay. We're migrating to jsoup in 3.x. I realize that
3.x isn't out yet, but I wanted to give you a heads up.


  To extract scripts in 3.x, you'd do something like this:
https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-html-module/src/test/resources/org/apache/tika/parser/html/tika-config.xml

  You should be able to swap in the HtmlParser for the JsoupParser in that
config and be good to go.

  Are you able to share an example html with me, even if only privately? I
_think_ we have a unit test for script handling in 2.x and 3.x, and it
_should_ work.
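
  A handler along these lines might help narrow down whether script text
reaches characters() for a given file. This is only a sketch under stated
assumptions: it uses the JDK's own SAX parser on well-formed XHTML as a
stand-in for Tika, and ScriptCollector and collect() are hypothetical names,
not part of Tika or TagSoup. The same handler could in principle be passed to
a Tika parser in place of the custom ContentHandler.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: records the text reported via characters()
// while the parser is inside a <script> element.
public class ScriptCollector extends DefaultHandler {
    private final List<String> scripts = new ArrayList<>();
    private StringBuilder current; // non-null only while inside <script>

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("script".equalsIgnoreCase(qName)) {
            current = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // characters() may be called several times per element, so buffer.
        if (current != null) {
            current.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("script".equalsIgnoreCase(qName) && current != null) {
            scripts.add(current.toString());
            current = null;
        }
    }

    // Assumption: the input is well-formed XHTML, since the JDK parser
    // (unlike TagSoup or jsoup) does not repair broken HTML.
    public static List<String> collect(String xhtml) {
        ScriptCollector handler = new ScriptCollector();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return handler.scripts;
    }

    public static void main(String[] args) {
        String page = "<html><head><script>var a = 1;</script></head>"
                + "<body><p>hi</p></body></html>";
        System.out.println(collect(page)); // prints [var a = 1;]
    }
}
```

  An empty list for a file that clearly contains a script element would point
at the parser configuration rather than at the handler.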

      Best,

                Tim

On Wed, May 29, 2024 at 9:37 AM Markus Jelsma <[email protected]>
wrote:

> So I found HtmlParser.setExtractScripts(), which sounds very promising!
> I changed the code to use HtmlParser instead of AutoDetectParser and set the
> flag to true. Unfortunately, the script's contents were still not reported
> in the characters() method. No idea why.
>
> I also found TagSoup's Parser.CDATAElementsFeature constant:
> https://javadoc.io/static/org.ccil.cowan.tagsoup/tagsoup/1.2.1/org/ccil/cowan/tagsoup/Parser.html#CDATAElementsFeature
> It seems to be the same as:
> http://www.ccil.org/~cowan/tagsoup/features/cdata-elements
> A value of "true" indicates that the parser will process the script and
> style elements (or any elements with type='cdata' in the TSSL schema) as
> SGML CDATA elements (that is, no markup is recognized except the matching
> end-tag).
>
> Sounds promising, or at least like something to try. But how exactly do we
> set that parameter, from code or in tika-config.xml if that is better? It
> isn't really obvious at the moment.
>
> Many thanks,
> Markus
>
>
>
> On Tue, May 28, 2024 at 12:19, Markus Jelsma <
> [email protected]> wrote:
>
>> Hello,
>>
>> We're using Tika to parse HTML via a custom ContentHandler. This works
>> really well, except that in some cases the contents of script tags in the
>> head are not reported to the characters() method of the ContentHandler.
>>
>> We're using this code:
>> TikaConfig tikaConfig = new TikaConfig(SAXTestCase.class.getResourceAsStream("/tika-config.xml"));
>> Schema schema = new HTMLSchema();
>> ParseContext context = new ParseContext();
>> context.set(Schema.class, schema);
>> context.set(HtmlMapper.class, IdentityHtmlMapper.INSTANCE);
>> Metadata metadata = new Metadata();
>> ReadableContentHandler handler = new ReadableContentHandler(url, config);
>> AutoDetectParser parser = new AutoDetectParser(tikaConfig);
>> InputStream stream = SAXTestCase.class.getResourceAsStream(path);
>> parser.parse(stream, handler, metadata, context);
>>
>> If we fiddle with TagSoup's Schema, some of the bad examples suddenly do
>> report the characters of the script tag. But, in good tradition, other
>> stuff breaks, and things like meta fields in some other HTML examples are
>> no longer reported.
>>
>> schema.elementType("script", HTMLSchema.M_ANY, 255, 0);
>>
>> Now, I don't even know if changing the schema is a good idea, or if there
>> is some other setting in Tika that I do not know or forgot about.
>>
>> Does anyone here have any ideas?
>>
>> Thanks,
>> Markus
>>
>
