james wrote:
Jarrett Billingsley Wrote:
On Tue, Jan 6, 2009 at 8:04 PM, james <jame...@gmail.com> wrote:
I'm writing an indexer, but I'm having a problem: on some files, reading
gives this error:
Error 4: invalid UTF-8 sequence
Is there a way to fix it?
You're probably reading a file that's encoded in some non-Unicode
encoding, like Latin-1. You could read in the file data as byte[]
instead of as char[], but that still doesn't deal with the problem
that you have characters in your file that are outside the ASCII
range. If you know what encoding your file uses, you could do some
transformations on it to turn it into valid Unicode, or you could just
ignore characters outside the ASCII range :P
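
The two options described above can be sketched roughly as follows. This is Python rather than D, purely for illustration; the function names are hypothetical, and the first function assumes the file really is Latin-1, which has to be known in advance.

```python
def latin1_to_utf8(raw: bytes) -> bytes:
    """Transcode bytes that are ASSUMED to be Latin-1 into UTF-8.

    Latin-1 maps each byte value 0-255 directly to the code point with the
    same number, so the decode step cannot fail. If the input is really in
    some other encoding, the result is silently wrong.
    """
    return raw.decode("latin-1").encode("utf-8")


def strip_non_ascii(raw: bytes) -> str:
    """Drop every byte outside the ASCII range (lossy, but never fails)."""
    return raw.decode("ascii", errors="ignore")


# Hypothetical sample: "café" as Latin-1 bytes, which is not valid UTF-8.
sample = b"caf\xe9"
as_utf8 = latin1_to_utf8(sample)
ascii_only = strip_non_ascii(sample)
```

Reading the data as raw bytes first, as suggested, is what makes either approach possible: the invalid-sequence error only appears once the bytes are interpreted as UTF-8.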
Is there any library or function that can automatically convert these unknown
HTML charsets into UTF-8?
You mean one that tries to work out what character set a file is in and then
translates it?
(What is the current state of the art of character set detection
heuristics?)
Stewart.
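
For what it is worth, a very crude detection heuristic for HTML input might look like the sketch below. It is Python rather than D and only an illustration under stated assumptions: the function name `guess_and_decode`, the 1024-byte sniffing window, and the Latin-1 fallback are all hypothetical choices, not an established algorithm. It trusts a charset declared in a meta tag if there is one, then tries strict UTF-8, and finally falls back to a single-byte encoding.

```python
import re
from typing import Tuple

# Matches charset=... inside a <meta> tag, e.g.
# <meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
_META_CHARSET = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?([A-Za-z0-9_.:-]+)""", re.IGNORECASE
)


def guess_and_decode(raw: bytes, fallback: str = "latin-1") -> Tuple[str, str]:
    """Return (decoded text, name of the encoding that was used)."""
    # 1. Trust a charset declared near the top of the document, if it works.
    match = _META_CHARSET.search(raw[:1024])
    if match:
        declared = match.group(1).decode("ascii")
        try:
            return raw.decode(declared), declared
        except (LookupError, UnicodeDecodeError):
            pass  # unknown or wrong declaration: keep guessing

    # 2. Strict UTF-8. Text in other encodings that contains non-ASCII
    #    bytes rarely happens to be valid UTF-8. "utf-8-sig" also strips
    #    a leading byte order mark if present.
    try:
        return raw.decode("utf-8-sig"), "utf-8"
    except UnicodeDecodeError:
        pass

    # 3. Last resort: a single-byte encoding that accepts any input.
    return raw.decode(fallback, errors="replace"), fallback


text, encoding = guess_and_decode(b"caf\xe9")
```

The weak point is step 3: a Latin-1 fallback never fails, so a file in some other legacy encoding is decoded into the wrong characters without any warning. Telling legacy encodings apart needs statistical analysis of byte patterns, which this sketch does not attempt.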