Jukka,
  Sorry for my delay.

addSpaceBetweenElements is exactly what I was looking for.  Thank you.

  I'll send an update after further analysis of the incorrectly identified 
files to see if we can tweak our mimes.

      Cheers,

              Tim

-----Original Message-----
From: Jukka Zitting [mailto:jukka.zitt...@gmail.com] 
Sent: Tuesday, June 16, 2015 10:26 AM
To: Tika Users
Subject: Re: xml vs html parser

Hi,

2015-06-16 9:28 GMT-04:00 Allison, Timothy B. <talli...@mitre.org>:
> So, is there a way to make the XMLParser more lenient?

I don't think so. XML is draconian by design.

> Or is there a way to configure the HtmlParser to add spaces for
> non-html tags?

One option that wouldn't require changes in Tika code could be to use
HtmlParser with the IdentityHtmlMapper and process the output using
TextContentHandler with the addSpaceBetweenElements option enabled.

> Or, is there a better solution?

The cleanest alternative would be to come up with more accurate
heuristics for detecting SGML.

Are there some common file name patterns, DOCTYPEs or other easily
identifiable bits that could be used to improve the accuracy of type
detection?

Things like the <?xml ...?> header, presence of xmlns attributes, the
.xml file extension, etc. can be used as highly reliable signals for
XML content, so the lack of them coupled with even some fairly weak
SGML detection signals (stuff like upper case element names?) might be
enough to get significant improvements in this area.
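
As a rough illustration of that idea, here is a minimal sketch in plain
Java. It is not Tika's detector API: the class name MarkupSniffer, the
Guess enum, and the "more than half of the start tags are upper case"
threshold are all hypothetical choices made for this example. Any strong
XML signal wins outright; only when none is present is the weak SGML
signal consulted.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika: guesses XML vs. SGML from a
// file name and the first chunk of text in the file.
final class MarkupSniffer {
    enum Guess { XML, SGML, UNKNOWN }

    // Strong XML signals: an <?xml ...?> header or an xmlns attribute.
    private static final Pattern XML_DECL = Pattern.compile("^\\s*<\\?xml\\s");
    private static final Pattern XMLNS =
            Pattern.compile("\\sxmlns(:[\\w.-]+)?\\s*=");
    // Start tags only; end tags such as </DOC> do not match.
    private static final Pattern START_TAG =
            Pattern.compile("<([A-Za-z][\\w.-]*)[\\s>/]");

    static Guess guess(String fileName, String head) {
        boolean strongXml = XML_DECL.matcher(head).find()
                || XMLNS.matcher(head).find()
                || (fileName != null
                    && fileName.toLowerCase(Locale.ROOT).endsWith(".xml"));
        if (strongXml) {
            return Guess.XML;
        }

        // Weak SGML signal (assumed threshold): most element names are
        // written entirely in upper case.
        int tags = 0;
        int upper = 0;
        Matcher m = START_TAG.matcher(head);
        while (m.find()) {
            tags++;
            String name = m.group(1);
            if (name.equals(name.toUpperCase(Locale.ROOT))) {
                upper++;
            }
        }
        if (tags > 0 && upper * 2 > tags) {
            return Guess.SGML;
        }
        return Guess.UNKNOWN;
    }

    public static void main(String[] args) {
        String sgml = "<DOC>\n<DOCNO> 1 </DOCNO>\n<TEXT>hello</TEXT>\n</DOC>";
        System.out.println(guess("sample.sgm", sgml));
        System.out.println(guess("sample.txt", "<?xml version=\"1.0\"?><doc/>"));
    }
}
```

A real detector would also need to look at DOCTYPEs and file name
patterns, and the threshold would have to be tuned against the files
that are currently misidentified.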

BR,

Jukka Zitting
