Hi,
As of revision 1358858, Tika returns the detected character encoding as
part of the content type metadata field. For example, for a UTF-8
encoded text document the returned content type will be "text/plain;
charset=UTF-8" instead of "text/plain".
This is conceptually correct (see TIKA-431), but may confuse some
clients that depend on the exact content type string with code like
this:
String type = metadata.get(Metadata.CONTENT_TYPE);
if ("text/html".equals(type)) { ... }
To fix such code, use the MediaType class to parse the returned
content type string:
MediaType type = MediaType.parse(metadata.get(Metadata.CONTENT_TYPE));
// getBaseType() returns a MediaType, so convert it before comparing to a String
if (type != null && "text/html".equals(type.getBaseType().toString())) { ... }
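To see the failure mode without Tika on the classpath, here is a minimal plain-Java sketch. The baseType helper is hypothetical and not part of Tika; it only approximates, for simple cases, what parsing with MediaType and taking the base type gives you. Real code should use MediaType as shown above.

```java
import java.util.Locale;

class ContentTypeDemo {

    // Hypothetical helper (not a Tika API): drops any ";parameter" suffix
    // and normalizes whitespace and case of the remaining type string.
    static String baseType(String contentType) {
        if (contentType == null) {
            return null;
        }
        int semicolon = contentType.indexOf(';');
        String base = semicolon >= 0
                ? contentType.substring(0, semicolon)
                : contentType;
        return base.trim().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String returned = "text/html; charset=UTF-8";
        // The exact comparison no longer matches once a charset is appended.
        System.out.println("text/html".equals(returned));           // false
        // Comparing against the base type still works.
        System.out.println("text/html".equals(baseType(returned))); // true
    }
}
```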
Better still, instead of comparing strings directly, use the full type
inheritance logic available in the media type registry. With the
isInstanceOf helper method I just added, this becomes:
String type = metadata.get(Metadata.CONTENT_TYPE);
MediaTypeRegistry registry = ...;
if (registry.isInstanceOf(type, MediaType.TEXT_HTML)) { ... }
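For readers unfamiliar with the idea of type inheritance between media types, the following toy sketch shows the general technique: strip the parameters, then walk up a chain of supertypes. ToyTypeRegistry, addSuperType and the single parent link registered in main are assumptions made for illustration; this is not how MediaTypeRegistry is implemented, and real code should call the registry as shown above.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class ToyTypeRegistry {

    // Maps a type to its direct supertype (illustrative data only).
    private final Map<String, String> superTypes = new HashMap<>();

    void addSuperType(String type, String superType) {
        superTypes.put(type, superType);
    }

    // True if the content type, ignoring parameters such as charset,
    // equals the ancestor or has it somewhere in its supertype chain.
    boolean isInstanceOf(String contentType, String ancestor) {
        if (contentType == null) {
            return false;
        }
        int semicolon = contentType.indexOf(';');
        String current = (semicolon >= 0
                ? contentType.substring(0, semicolon)
                : contentType).trim().toLowerCase(Locale.ROOT);
        Set<String> seen = new HashSet<>(); // guards against cyclic links
        while (current != null && seen.add(current)) {
            if (current.equals(ancestor)) {
                return true;
            }
            current = superTypes.get(current);
        }
        return false;
    }

    public static void main(String[] args) {
        ToyTypeRegistry registry = new ToyTypeRegistry();
        registry.addSuperType("text/html", "text/plain"); // assumed link
        System.out.println(registry.isInstanceOf("text/html; charset=UTF-8", "text/html"));  // true
        System.out.println(registry.isInstanceOf("text/html; charset=UTF-8", "text/plain")); // true
        System.out.println(registry.isInstanceOf("text/plain", "text/html"));                // false
    }
}
```

The point of going through a registry rather than a string check is that a client asking for a general type also accepts its more specific subtypes, with or without a charset parameter.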
BR,
Jukka Zitting