Re: AutoDetectParser is not parsing UTF-16 content types

Ken Krugler Thu, 30 Aug 2012 14:39:48 -0700

On Aug 29, 2012, at 8:55am, chraj007 wrote:

> Hello,
>   Im trying to parse a file whose content type is UTF-16. Im unable to
> parse the document using the following code. Please Help me.
> 
>       ContentHandler textHandler = new BodyContentHandler();
>        TeeContentHandler teeHandler           =        new
> TeeContentHandler(textHandler);
>        parser.parse(input, teeHandler, metadata, context);


Note that you don't need a TeeContentHandler here; with only one handler, you 
can pass textHandler directly to parser.parse().

>        String tt = textHandler.toString();
> //to print the text
> 
> byte[] converttoBytes = tt.getBytes("UTF-16");
>        String string = new String(converttoBytes, "utf-8");

The above code won't do what I think you're hoping it will do.

The call to getBytes("UTF-16") returns the characters of the tt string encoded 
as UTF-16 bytes.

The second call then builds a string by interpreting those bytes as character 
data encoded in UTF-8 (which obviously isn't true), so the result is garbled.
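
To make the mismatch concrete, here is a small standalone sketch using only the 
JDK. The class name CharsetRoundTrip and the helper transcode() are made up for 
this illustration and are not part of Tika or of the original code.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Tika.
class CharsetRoundTrip {

    // Encode the text with one charset, then decode the bytes with another.
    static String transcode(String text, Charset encodeAs, Charset decodeAs) {
        byte[] bytes = text.getBytes(encodeAs);
        return new String(bytes, decodeAs);
    }

    public static void main(String[] args) {
        String tt = "Hello";

        // Same pattern as the quoted code: UTF-16 bytes read back as UTF-8.
        String garbled = transcode(tt, StandardCharsets.UTF_16, StandardCharsets.UTF_8);

        // Matching charsets on both sides give the original text back.
        String intact = transcode(tt, StandardCharsets.UTF_16, StandardCharsets.UTF_16);

        System.out.println(tt.equals(garbled)); // false
        System.out.println(tt.equals(intact));  // true
    }
}
```

A Java String is already decoded character data, so there is normally no need 
to convert it at all before printing.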

>       System.out.println(string);
> 
> but its printing along with all html tags.

I'm unclear on what you mean by this.

But as Jukka noted in his response, the issue is that you have a document which 
is encoded as UTF-8, but the HTML has <meta http-equiv="Content-Type" 
content="text/html; charset=UTF-16">

Currently Tika treats this meta tag charset as the truth. See 
https://issues.apache.org/jira/browse/TIKA-539 for a discussion on this issue.
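
As a rough illustration of why trusting the declared charset goes wrong, the 
sketch below decodes UTF-8 bytes as UTF-16, the way a parser would if it 
believed the meta tag. It uses only the JDK; the class MetaCharsetMismatch, the 
decode() helper, and the sample HTML are assumptions made up for this example, 
and this is not how Tika is implemented internally.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Tika.
class MetaCharsetMismatch {

    // Decode raw document bytes using the given charset.
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        // Sample document: declares UTF-16 but is actually stored as UTF-8.
        String html = "<html><head><meta http-equiv=\"Content-Type\" "
                + "content=\"text/html; charset=UTF-16\"></head>"
                + "<body>Hi</body></html>";
        byte[] raw = html.getBytes(StandardCharsets.UTF_8);

        // Believing the meta tag: each pair of ASCII bytes becomes one
        // unrelated character, so the markup and text are unrecognizable.
        String asDeclared = decode(raw, StandardCharsets.UTF_16);

        // Using the real encoding recovers the document.
        String asStored = decode(raw, StandardCharsets.UTF_8);

        System.out.println(asDeclared.contains("Hi")); // false
        System.out.println(asStored.contains("Hi"));   // true
    }
}
```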

Regards,

-- Ken

--------------------------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr
