Hello,

We're using Tika to parse HTML via a custom ContentHandler. This works
really well, except that in some cases the contents of script tags in the
head are not reported to the ContentHandler's characters() method.

We're using this code:
TikaConfig tikaConfig = new TikaConfig(
    SAXTestCase.class.getResourceAsStream("/tika-config.xml"));
// Hand TagSoup's default HTML schema to the HTML parser via the context.
Schema schema = new HTMLSchema();
ParseContext context = new ParseContext();
context.set(Schema.class, schema);
// Pass all elements through as-is instead of Tika's default mapping.
context.set(HtmlMapper.class, IdentityHtmlMapper.INSTANCE);
Metadata metadata = new Metadata();
// ReadableContentHandler is our custom ContentHandler.
ReadableContentHandler handler = new ReadableContentHandler(url, config);
AutoDetectParser parser = new AutoDetectParser(tikaConfig);
InputStream stream = SAXTestCase.class.getResourceAsStream(path);
parser.parse(stream, handler, metadata, context);
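
To check in isolation whether script text reaches characters() at all, a
small diagnostic handler may help. The following is only a minimal sketch
that uses nothing but the JDK's SAX classes. The class name
ScriptTextCollector and the helper collectFromXml are made up for
illustration and are not part of Tika, TagSoup, or our own code.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical diagnostic handler: records the text of every script element. */
class ScriptTextCollector extends DefaultHandler {
    private final List<String> scripts = new ArrayList<>();
    // Non-null only while the parser is inside a script element.
    private StringBuilder current;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (isScript(localName, qName)) {
            current = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // SAX may deliver one text node in several chunks, so append each chunk.
        if (current != null) {
            current.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (current != null && isScript(localName, qName)) {
            scripts.add(current.toString());
            current = null;
        }
    }

    // Check both names: localName is empty when the parser is not namespace-aware.
    private static boolean isScript(String localName, String qName) {
        return "script".equalsIgnoreCase(localName) || "script".equalsIgnoreCase(qName);
    }

    List<String> getScripts() {
        return scripts;
    }

    /** Convenience helper for well-formed sample markup, using the JDK's SAX parser. */
    static List<String> collectFromXml(String xml) {
        ScriptTextCollector collector = new ScriptTextCollector();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), collector);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse sample markup", e);
        }
        return collector.getScripts();
    }
}
```

An instance can be passed as the handler argument of parser.parse(...) in
the code above, in place of ReadableContentHandler. getScripts() then shows
which script bodies were actually delivered for a given document.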

If we fiddle with TagSoup's Schema, some of the bad examples suddenly do
report the characters of the script tag. But, in good tradition, other
stuff breaks: meta fields in some other HTML examples are no longer
reported. This is the change we tried:

schema.elementType("script", HTMLSchema.M_ANY, 255, 0);

Now, I don't even know whether changing the schema is a good idea, or
whether there is some other setting in Tika that I do not know or forgot
about.

Does anyone here have any ideas?

Thanks,
Markus
