FYI: text/plain and text/html media types now come with charset info

Jukka Zitting Sun, 08 Jul 2012 16:18:46 -0700

Hi,

As of revision 1358858 Tika returns the detected character encoding as
a part of the content type metadata field. For example, instead of
"text/plain" the returned content type will be "text/plain;
charset=UTF-8" for a UTF-8 encoded text document.


This is conceptually correct (see TIKA-431), but may confuse some
clients that depend on the exact content type string with code like
this:

    String type = metadata.get(Metadata.CONTENT_TYPE);
    if ("text/html".equals(type)) { ... }

To fix such code, use the MediaType class to parse the returned
content type string:

    MediaType type = MediaType.parse(metadata.get(Metadata.CONTENT_TYPE));
    if (type != null && MediaType.TEXT_HTML.equals(type.getBaseType())) { ... }
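
To see why the exact string comparison stops matching, the effect can be
reproduced in plain Java without Tika on the classpath. This is only a
rough sketch: baseType is a hypothetical helper that cuts the string at
the first ';'. It is a simplified stand-in for the parsing step, not
Tika's own implementation.

```java
import java.util.Locale;

public class ContentTypeCheck {

    // Hypothetical helper (not part of Tika): drops any parameters such as
    // "; charset=UTF-8" and lower-cases the rest. Returns null for null input.
    static String baseType(String contentType) {
        if (contentType == null) {
            return null;
        }
        int semicolon = contentType.indexOf(';');
        String base = semicolon >= 0
                ? contentType.substring(0, semicolon)
                : contentType;
        return base.trim().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String type = "text/html; charset=UTF-8";

        // Exact comparison no longer matches once the charset is appended.
        System.out.println("text/html".equals(type));

        // Comparing only the part before the parameters still matches.
        System.out.println("text/html".equals(baseType(type)));
    }
}
```

In real client code the MediaType class should do this work, since it
also handles details that this sketch ignores.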

Or, instead of comparing strings directly, an ideal solution would be
to leverage the full type inheritance logic available in the media
type registry. With the isInstanceOf helper method I just added, this
becomes:

    String type = metadata.get(Metadata.CONTENT_TYPE);
    MediaTypeRegistry registry = ...;
    if (registry.isInstanceOf(type, MediaType.TEXT_HTML)) { ... }
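
The idea behind an inheritance-aware check can be sketched in plain Java
as a walk up a chain of supertypes. Everything below is hypothetical and
for illustration only: ToyTypeRegistry is not a Tika class, and the
supertype entries are assumptions made up for the demo rather than a
copy of Tika's registry.

```java
import java.util.HashMap;
import java.util.Map;

public class ToyTypeRegistry {

    // Maps a type to its direct supertype. Assumed to contain no cycles.
    private final Map<String, String> supertypes = new HashMap<>();

    void addSuperType(String type, String supertype) {
        supertypes.put(type, supertype);
    }

    // True if a equals b, or if b is reached by following supertype
    // links upwards from a.
    boolean isInstanceOf(String a, String b) {
        for (String t = a; t != null; t = supertypes.get(t)) {
            if (t.equals(b)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        ToyTypeRegistry registry = new ToyTypeRegistry();
        // Illustrative entries only.
        registry.addSuperType("application/xhtml+xml", "application/xml");
        registry.addSuperType("text/html", "text/plain");

        System.out.println(registry.isInstanceOf("text/html", "text/html"));
        System.out.println(registry.isInstanceOf("text/html", "text/plain"));
        System.out.println(registry.isInstanceOf("text/plain", "text/html"));
    }
}
```

A check like this keeps working when a more specific subtype is
returned, which a plain equals() on the type string cannot do.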

BR,

Jukka Zitting
