On Aug 29, 2012, at 8:55am, chraj007 wrote:
> Hello,
> Im trying to parse a file whose content type is UTF-16. Im unable to
> parse the document using the following code. Please Help me.
>
> ContentHandler textHandler = new BodyContentHandler();
> TeeContentHandler teeHandler = new
> TeeContentHandler(textHandler);
> parser.parse(input, teeHandler, metadata, context);
Note that you don't need to use a TeeContentHandler here, since it only wraps a
single handler. You can pass textHandler directly to parser.parse().
> String tt = textHandler.toString();
> //to print the text
>
> byte[] converttoBytes = tt.getBytes("UTF-16");
> String string = new String(converttoBytes, "utf-8");
The above code won't do what I think you're hoping it will do.
The call to getBytes("UTF-16") returns the characters of the tt string encoded
as UTF-16 bytes.
The second call then says to build a string from those bytes as if they were
character data encoded using UTF-8 (which isn't true), so the result is garbled.
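To make that concrete, here is a minimal sketch using only the JDK (no Tika). The class and method names are made up for illustration, and "Hello" stands in for whatever textHandler.toString() actually returned.

```java
import java.nio.charset.StandardCharsets;

public class EncodingMismatch {
    // Encode the text as UTF-16, then (incorrectly) decode those bytes as UTF-8.
    // This mirrors the getBytes("UTF-16") / new String(..., "utf-8") pair above.
    static String wrongRoundTrip(String text) {
        byte[] utf16Bytes = text.getBytes(StandardCharsets.UTF_16);
        return new String(utf16Bytes, StandardCharsets.UTF_8);
    }

    // Encode and decode with the same charset, so the text survives unchanged.
    static String matchedRoundTrip(String text) {
        byte[] utf16Bytes = text.getBytes(StandardCharsets.UTF_16);
        return new String(utf16Bytes, StandardCharsets.UTF_16);
    }

    public static void main(String[] args) {
        String tt = "Hello"; // stand-in for the extracted text
        System.out.println("mismatched equals original: " + tt.equals(wrongRoundTrip(tt)));
        System.out.println("matched equals original:    " + tt.equals(matchedRoundTrip(tt)));
    }
}
```

A Java String is already decoded text, so if the goal is only to print it, there is no need to go through bytes at all.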
> System.out.println(string);
>
> but its printing along with all html tags.
I'm unclear on what you mean by this.
But as Jukka noted in his response, the issue is that you have a document that
is actually encoded as UTF-8, but the HTML declares a different charset via
<meta http-equiv="Content-Type" content="text/html; charset=UTF-16">
Currently Tika treats this meta tag charset as the truth. See
https://issues.apache.org/jira/browse/TIKA-539 for a discussion on this issue.
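As a rough illustration of why trusting a wrong charset declaration hurts, the sketch below decodes the same UTF-8 bytes twice using only the JDK. This is not Tika's code; the class name, the decode() helper and the sample markup are all hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MislabeledCharset {
    // Decode raw bytes using whichever charset the caller believes applies.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        String html = "<p>Hi</p>"; // hypothetical sample content
        byte[] onDisk = html.getBytes(StandardCharsets.UTF_8); // the real encoding

        // Believing a charset=UTF-16 declaration pairs up unrelated bytes.
        String trustedDeclaration = decode(onDisk, StandardCharsets.UTF_16);
        // Using the real encoding recovers the markup.
        String actualEncoding = decode(onDisk, StandardCharsets.UTF_8);

        System.out.println("declared charset recovers text: " + html.equals(trustedDeclaration));
        System.out.println("actual charset recovers text:   " + html.equals(actualEncoding));
    }
}
```

If you control the input, fixing the meta tag (or the file's real encoding) so the two agree avoids the problem.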
Regards,
-- Ken
--------------------------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr