james wrote:
Jarrett Billingsley Wrote:

On Tue, Jan 6, 2009 at 8:04 PM, james <jame...@gmail.com> wrote:
I'm writing an indexer, but I'm having a problem: on some files, reading
gives this error:

Error 4: invalid UTF-8 sequence

Is there a way to fix it?

You're probably reading a file that's encoded in some non-Unicode
encoding, like Latin-1.  You could read in the file data as byte[]
instead of as char[], but that still doesn't deal with the problem
that you have characters in your file that are outside the ASCII
range.  If you know what encoding your file uses, you could do some
transformations on it to turn it into valid Unicode, or you could just
ignore characters outside the ASCII range :P
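
A minimal sketch of the two options described above, written in Python purely for illustration (the thread itself contains no code). The helper names `decode_known` and `ascii_only` are hypothetical, and Latin-1 is only an assumed default for the "known" encoding:

```python
def decode_known(data: bytes, encoding: str = "latin-1") -> str:
    """Decode as UTF-8 when the bytes are valid UTF-8; otherwise fall back
    to the legacy encoding the caller says the file uses (assumed known)."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return data.decode(encoding)


def ascii_only(data: bytes) -> str:
    """Discard every byte outside the ASCII range (lossy)."""
    return data.decode("ascii", errors="ignore")


raw = "café".encode("latin-1")   # b'caf\xe9', which is not valid UTF-8
print(decode_known(raw))          # café
print(ascii_only(raw))            # caf
```

The first helper only helps if the fallback encoding is actually the right one; decoding Latin-1 never fails, so a wrong guess produces wrong characters rather than an error.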

Is there any library or function that can automatically convert these unknown
HTML charsets into UTF-8?

You mean that tries to work out what character set a file is in and then translates it?

(What is the current state of the art of character set detection heuristics?)

Stewart.
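
For what it's worth, a very simple detection heuristic for HTML can be sketched as follows. This is not a state-of-the-art detector, just an illustration under stated assumptions: check for a byte-order mark, then test whether the bytes are valid UTF-8, then look for a charset declared in a `<meta>` tag, and finally fall back to a default. The names `guess_encoding` and `to_utf8`, the 4096-byte scan window, and the `cp1252` fallback are all assumptions made for this example.

```python
import codecs
import re

# (BOM, codec name) pairs. The 4-byte UTF-32 marks are checked before the
# UTF-16 ones because the UTF-32 LE mark begins with the UTF-16 LE mark.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

# Matches both <meta charset="..."> and the older
# <meta http-equiv="Content-Type" content="text/html; charset=...">.
_META = re.compile(
    rb"""<meta[^>]*charset\s*=\s*["']?([A-Za-z0-9._:-]+)""", re.IGNORECASE
)


def guess_encoding(data: bytes, fallback: str = "cp1252") -> str:
    """Return a best-guess codec name for the given bytes."""
    # 1. A byte-order mark is the strongest signal.
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    # 2. Text in a legacy 8-bit encoding rarely happens to be valid UTF-8.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass
    # 3. Trust a declared charset only if it names a codec we know.
    match = _META.search(data[:4096])
    if match:
        name = match.group(1).decode("ascii").lower()
        try:
            codecs.lookup(name)
            return name
        except LookupError:
            pass
    # 4. Nothing conclusive: use the assumed default.
    return fallback


def to_utf8(data: bytes) -> bytes:
    """Transcode to UTF-8, replacing bytes the guessed codec cannot decode."""
    return data.decode(guess_encoding(data), errors="replace").encode("utf-8")
```

Pages that declare no charset, or declare the wrong one, will still be mis-decoded by a sketch like this; statistical detectors that look at byte frequencies go further, at the cost of more complexity.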
