Hi KK,

On 5/7/2009 at 2:55 AM, KK wrote:
> In some of the pages I'm getting some \ufffd chars which I think is
> some sort of unmappable[by Java?] character, right?. Any idea on how
> to handle this? Just replacing with blank char will not do [this
> depends on the requirement, though].

From <http://www.unicode.org/charts/PDF/UFFF0.pdf>:

  FFFD: REPLACEMENT CHARACTER: used to replace an incoming character
  whose value is unknown or unrepresentable in Unicode.

Also, from <http://www.unicode.org/versions/Unicode5.1.0/>:

  Applications are free to use any of these noncharacter code points
  internally but should never attempt to exchange them. If a
  noncharacter is received in open interchange, an application is not
  required to interpret it in any way. It is good practice, however,
  to recognize it as a noncharacter and to take appropriate action,
  such as replacing it with U+FFFD REPLACEMENT CHARACTER, to indicate
  the problem in the text. It is not recommended to simply delete
  noncharacter code points from such text, because of the potential
  security issues caused by deleting uninterpreted characters. (See
  conformance clause C7 in Section 3.2, Conformance Requirements, and
  Unicode Technical Report #36, "Unicode Security Considerations.")

So if you're seeing \ufffd in text, you (or someone before you in the
processing chain) attempted to convert the text from some other
encoding into Unicode, but the conversion failed for that character:
the decoder met a byte sequence that was invalid in the assumed source
encoding, or that had no corresponding Unicode character, and it
substituted U+FFFD. This commonly happens when the source encoding has
been identified incorrectly.

Steve
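
To make the failure mode above concrete, here is a minimal Java sketch, not a definitive fix. It assumes Java 7 or later (for StandardCharsets); the class and method names (ReplacementCharDemo, decodeLenient, decodeStrict, isValid) are made up for illustration. It decodes ISO-8859-1 bytes as if they were UTF-8, which yields U+FFFD, and then shows a strict decoder that reports the problem instead of substituting silently.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names here are illustrative only.
class ReplacementCharDemo {

    // Lenient decode: the String constructor replaces malformed or
    // unmappable input with the charset's replacement string (U+FFFD).
    static String decodeLenient(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    // Strict decode: throws CharacterCodingException rather than
    // substituting U+FFFD, so a wrong encoding guess becomes visible.
    static String decodeStrict(byte[] bytes, Charset charset)
            throws CharacterCodingException {
        CharsetDecoder decoder = charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }

    // True if the bytes decode cleanly in the given charset.
    static boolean isValid(byte[] bytes, Charset charset) {
        try {
            decodeStrict(bytes, charset);
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // "cafe" with an e-acute, encoded as ISO-8859-1: the accented
        // letter is the single byte 0xE9.
        byte[] latin1Bytes = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);

        // Wrong guess: in UTF-8, 0xE9 starts a multi-byte sequence that
        // never completes here, so the decoder substitutes U+FFFD.
        String wrong = decodeLenient(latin1Bytes, StandardCharsets.UTF_8);
        System.out.println("contains U+FFFD: " + (wrong.indexOf('\uFFFD') >= 0));

        // The strict decoder rejects the same bytes under the wrong guess
        // and accepts them under the right one.
        System.out.println("valid as UTF-8: "
                + isValid(latin1Bytes, StandardCharsets.UTF_8));
        System.out.println("valid as ISO-8859-1: "
                + isValid(latin1Bytes, StandardCharsets.ISO_8859_1));
    }
}
```

If the \ufffd characters are already present in the strings you receive, the original bytes are gone and nothing can recover them at that point; the strict decode only helps at the stage where the raw bytes are still available, for example to try a different candidate encoding.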