Hi KK,
On 5/7/2009 at 2:55 AM, KK wrote:
> In some of the pages I'm getting some \ufffd chars which I think is
> some sort of unmappable[by Java?] character, right?. Any idea on how
> to handle this? Just replacing with blank char will not do [this
> depends on the requirement, though].
From http://www.unicode.org/charts/PDF/UFFF0.pdf:
FFFD: REPLACEMENT CHARACTER: used to replace an
incoming character whose value is unknown or
unrepresentable in Unicode.
Also, from http://www.unicode.org/versions/Unicode5.1.0/:
Applications are free to use any of these noncharacter
code points internally but should never attempt to
exchange them. If a noncharacter is received in open
interchange, an application is not required to
interpret it in any way. It is good practice, however,
to recognize it as a noncharacter and to take
appropriate action, such as replacing it with U+FFFD
REPLACEMENT CHARACTER, to indicate the problem in the
text. It is not recommended to simply delete
noncharacter code points from such text, because of
the potential security issues caused by deleting
uninterpreted characters. (See conformance clause C7
in Section 3.2, Conformance Requirements, and Unicode
Technical Report #36, Unicode Security
Considerations.)
So if you're seeing \ufffd in text, you (or someone before you in the
processing chain) attempted to convert the text from some other encoding into
Unicode, and the decoder hit input it could not convert: either a byte
sequence that is invalid in the assumed source encoding, or a source character
with no corresponding Unicode character. The decoder substituted U+FFFD for
that input rather than failing. The usual cause is an incorrectly identified
source encoding, for example ISO-8859-1 bytes decoded as UTF-8.
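
To make the failure mode concrete, here is a minimal Java sketch. It is an
illustration only, not code from your crawler: the class and method names are
made up, and it assumes Java 7+ for StandardCharsets. It decodes the same
bytes once leniently (bad input silently becomes U+FFFD) and once strictly
(bad input raises an exception, so you can catch the problem at the point of
conversion instead of finding \ufffd downstream).

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; names are illustrative only.
public class ReplacementCharDemo {

    // Lenient decode: new String(byte[], Charset) replaces malformed or
    // unmappable input with the charset's replacement string (U+FFFD for UTF-8).
    static String decodeLenient(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    // Strict decode: report malformed or unmappable input as an exception
    // instead of substituting a replacement character.
    static String decodeStrict(byte[] bytes, Charset charset)
            throws CharacterCodingException {
        CharsetDecoder decoder = charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }

    public static void main(String[] args) {
        // "cafe" with e-acute, encoded as ISO-8859-1: the last byte is 0xE9.
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);

        // Wrong guess: in UTF-8, 0xE9 starts a multi-byte sequence that
        // never completes here, so the decoder substitutes U+FFFD.
        String wrong = decodeLenient(latin1, StandardCharsets.UTF_8);
        System.out.println("contains U+FFFD: " + (wrong.indexOf('\ufffd') >= 0));

        // Right guess: the text round-trips cleanly.
        String right = decodeLenient(latin1, StandardCharsets.ISO_8859_1);
        System.out.println("round trip ok: " + right.equals("caf\u00e9"));

        // Strict mode surfaces the bad guess as an exception.
        try {
            decodeStrict(latin1, StandardCharsets.UTF_8);
        } catch (CharacterCodingException e) {
            System.out.println("not valid UTF-8: " + e);
        }
    }
}
```

Note that once a U+FFFD is in the string, the original byte is gone; the fix
has to happen where the bytes are decoded, by choosing the right charset.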
Steve