Re: xml vs html parser

Jukka Zitting Tue, 16 Jun 2015 07:27:26 -0700

Hi,

2015-06-16 9:28 GMT-04:00 Allison, Timothy B. <[email protected]>:
> So, is there a way to make the XMLParser more lenient?


I don't think so. XML is draconian by design.

> Or is there a way to configure the HtmlParser to add spaces for
> non-html tags?

One option that wouldn't require changes in Tika code could be to use
HtmlParser with the IdentityHtmlMapper and process the output using
TextContentHandler with the addSpaceBetweenElements option enabled.

> Or, is there a better solution?

The cleanest alternative would be to come up with more accurate
heuristics for detecting SGML.

Are there some common file name patterns, DOCTYPEs or other easily
identifiable bits that could be used to improve the accuracy of type
detection?

Things like the <?xml ...?> header, presence of xmlns attributes, the
.xml file extension, etc. can be used as highly reliable signals for
XML content, so the lack of them coupled with even some fairly weak
SGML detection signals (stuff like upper case element names?) might be
enough to get significant improvements in this area.
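
As a rough sketch of that idea (this is not Tika code; the class
MarkupSniffer, its method names and the "majority of start tags are
upper case" threshold are all made up for illustration), the strong XML
signals could be checked first, and the weak SGML signal only consulted
when none of them is present:

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sniffer; not part of Tika. */
public class MarkupSniffer {

    public enum Kind { XML, SGML, UNKNOWN }

    // Strong XML signal: an xmlns or xmlns:prefix attribute.
    private static final Pattern XMLNS =
            Pattern.compile("\\sxmlns(:[A-Za-z_][\\w.-]*)?\\s*=");

    // Start tags only; skips end tags, <?...?> and <!...> constructs.
    private static final Pattern START_TAG =
            Pattern.compile("<([A-Za-z][A-Za-z0-9._-]*)[\\s>/]");

    /** True if any of the highly reliable XML signals is present. */
    public static boolean hasStrongXmlSignal(String fileName, String head) {
        // Drop a leading byte order mark, if any.
        String text = head.startsWith("\uFEFF") ? head.substring(1) : head;
        if (text.trim().startsWith("<?xml")) {
            return true;
        }
        if (fileName != null
                && fileName.toLowerCase(Locale.ROOT).endsWith(".xml")) {
            return true;
        }
        return XMLNS.matcher(text).find();
    }

    /** Weak signal: most start tags use upper case element names. */
    public static boolean hasWeakSgmlSignal(String head) {
        Matcher m = START_TAG.matcher(head);
        int total = 0;
        int upper = 0;
        while (m.find()) {
            String name = m.group(1);
            total++;
            if (name.equals(name.toUpperCase(Locale.ROOT))
                    && !name.equals(name.toLowerCase(Locale.ROOT))) {
                upper++;
            }
        }
        // Arbitrary threshold, chosen only for this sketch.
        return total > 0 && upper * 2 > total;
    }

    /** fileName may be null; head is the first chunk of the document. */
    public static Kind guess(String fileName, String head) {
        if (hasStrongXmlSignal(fileName, head)) {
            return Kind.XML;
        }
        if (hasWeakSgmlSignal(head)) {
            return Kind.SGML;
        }
        return Kind.UNKNOWN;
    }
}
```

Anything that ends up as UNKNOWN here would simply fall through to
whatever detection is already in place. The threshold would need tuning
against real SGML samples.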

BR,

Jukka Zitting
