Re: What are the Unicode encodings supported by Solr?

2009-05-08 Thread Michael Ludwig

KK wrote:


I'd like to know about the different Unicode [or any other] encodings
supported by Solr for posting docs [through Solrj in my case]. Are only
UTF-8 and UCN supported, or are other character encodings such as
NCR (decimal), NCR (hex) etc. supported as well?


Any numeric character reference (NCR), decimal or hexadecimal, is
written in plain ASCII characters, so it is valid in UTF-8 text as long
as it maps to a valid Unicode character.
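
If the stripped page text still contains references such as `&#233;` or `&#xE9;`, they may need to be turned into the characters they denote before indexing. The following is only a minimal sketch of that step; the class name `NcrDecoder` and the method `decode` are made up for illustration and are not part of Solr or Solrj, and a real HTML parser would normally do this for you.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Solr or Solrj.
class NcrDecoder {

    // Matches decimal (&#233;) and hexadecimal (&#xE9;) character references.
    private static final Pattern NCR =
            Pattern.compile("&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));");

    // Replaces each reference that maps to a valid Unicode scalar value
    // with that character; anything else is copied through unchanged.
    static String decode(String text) {
        Matcher m = NCR.matcher(text);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(text, last, m.start());
            out.append(toCharacters(m));
            last = m.end();
        }
        out.append(text, last, text.length());
        return out.toString();
    }

    private static String toCharacters(Matcher m) {
        try {
            int cp = m.group(1) != null
                    ? Integer.parseInt(m.group(1), 16)
                    : Integer.parseInt(m.group(2), 10);
            boolean surrogate = cp >= 0xD800 && cp <= 0xDFFF;
            if (Character.isValidCodePoint(cp) && !surrogate) {
                return new String(Character.toChars(cp));
            }
        } catch (NumberFormatException e) {
            // Number too large to be a code point: fall through.
        }
        return m.group(); // leave the reference as it was
    }

    public static void main(String[] args) {
        System.out.println(decode("caf&#233; and caf&#xE9;"));
    }
}
```

References that do not map to a valid character (for example `&#x110000;`) are deliberately left in place rather than dropped, so the problem stays visible in the text.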


I found that for most of the pages the encoding is UTF-8 [in this case
searching works fine], but for others it is some other character
encoding [like NCR (dec), NCR (hex), or possibly something else; I
don't have much idea about this].


Whatever the encoding is, your application needs to know what it is when
dealing with bytes read from the network.
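
For pages fetched over HTTP, one common place to learn the encoding is the `charset` parameter of the `Content-Type` response header. The sketch below assumes the header value is already available as a string; the class `ContentTypeCharset`, the method `fromContentType`, and the fallback parameter are illustrative names, not an existing API. It does not look at `<meta>` tags or byte order marks, which a real crawler may also need to consider.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
class ContentTypeCharset {

    // Captures the value of a charset parameter, with or without quotes.
    private static final Pattern CHARSET = Pattern.compile(
            "charset\\s*=\\s*\"?([^\\s;\"]+)", Pattern.CASE_INSENSITIVE);

    // Returns the charset named in the header value, or the fallback when
    // the header is missing, has no charset parameter, or names a charset
    // this JVM does not know.
    static Charset fromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        Matcher m = CHARSET.matcher(contentType);
        if (!m.find()) {
            return fallback;
        }
        try {
            return Charset.forName(m.group(1));
        } catch (IllegalArgumentException e) {
            // Covers both illegal and unsupported charset names.
            return fallback;
        }
    }

    public static void main(String[] args) {
        System.out.println(
                fromContentType("text/html; charset=ISO-8859-1", StandardCharsets.UTF_8));
    }
}
```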


So when I fetch the page content through Java methods using
InputStreamReaders and strip the various tags, what I obtain
is raw text in some encoding that Solr does not support.


Did you make sure to not rely on your platform default encoding
(Charset) when constructing the InputStreamReader? If in doubt, take
a look at the InputStreamReader constructors.
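
A minimal sketch of what that looks like, assuming the page is known to be UTF-8. The class name `ExplicitCharsetRead` and the method `readAll` are made up for this example, and the byte array stands in for a network stream. `StandardCharsets` requires Java 7 or later; on older versions the `InputStreamReader(InputStream, String)` constructor with the charset name `"UTF-8"` serves the same purpose.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical example class, not taken from the original poster's code.
class ExplicitCharsetRead {

    // Reads the whole stream as text, decoding with the given charset
    // instead of the platform default.
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(in, charset))) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for bytes fetched from the network, encoded as UTF-8.
        byte[] bytes = "Gr\u00fc\u00dfe".getBytes(StandardCharsets.UTF_8);
        String text = readAll(new ByteArrayInputStream(bytes), StandardCharsets.UTF_8);
        System.out.println(text.length()); // number of decoded chars, not bytes
    }
}
```

The single-argument constructor `new InputStreamReader(in)` is the one to avoid here, since it decodes with whatever the platform default happens to be.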

Michael Ludwig


RE: What are the Unicode encodings supported by Solr?

2009-05-07 Thread Steven A Rowe

Hi KK,

On 5/7/2009 at 2:55 AM, KK wrote:
 In some of the pages I'm getting some \ufffd chars, which I think is
 some sort of unmappable [by Java?] character, right? Any idea on how
 to handle this? Just replacing it with a blank char will not do [this
 depends on the requirement, though].

From http://www.unicode.org/charts/PDF/UFFF0.pdf:

FFFD: REPLACEMENT CHARACTER: used to replace an
incoming character whose value is unknown or
unrepresentable in Unicode.

Also, from http://www.unicode.org/versions/Unicode5.1.0/:

Applications are free to use any of these noncharacter
code points internally but should never attempt to
exchange them. If a noncharacter is received in open
interchange, an application is not required to
interpret it in any way. It is good practice, however,
to recognize it as a noncharacter and to take
appropriate action, such as replacing it with U+FFFD
REPLACEMENT CHARACTER, to indicate the problem in the
text. It is not recommended to simply delete
noncharacter code points from such text, because of
the potential security issues caused by deleting
uninterpreted characters. (See conformance clause C7
in Section 3.2, Conformance Requirements, and Unicode
Technical Report #36, Unicode Security
Considerations.)

So if you're seeing \ufffd in text, you (or someone before you in the 
processing chain) attempted to convert the text from some other encoding into 
Unicode, but the encoding conversion failed (no target Unicode character 
corresponding to the source character).  This can happen when attempting to 
convert from an incorrectly identified source encoding.
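
To illustrate, here is a small sketch in Java that decodes ISO-8859-1 bytes as if they were UTF-8. The class and method names (`ReplacementCharDemo`, `decodeLenient`, `decodeStrict`) are invented for this example. The lenient path uses `new String(bytes, charset)`, which substitutes U+FFFD for malformed input; the strict path configures a `CharsetDecoder` to report the error instead, which is one way to notice a wrongly identified encoding before the text is indexed.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class for illustration only.
class ReplacementCharDemo {

    // Lenient decoding: malformed input is silently replaced
    // (with U+FFFD when the charset is UTF-8).
    static String decodeLenient(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    // Strict decoding: throws instead of inserting a replacement character.
    static String decodeStrict(byte[] bytes, Charset charset)
            throws CharacterCodingException {
        return charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    public static void main(String[] args) {
        // In ISO-8859-1 the last letter is the single byte 0xE9,
        // which is not a complete UTF-8 sequence on its own.
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);

        String wrong = decodeLenient(latin1, StandardCharsets.UTF_8);
        System.out.println("contains U+FFFD: " + (wrong.indexOf('\uFFFD') >= 0));

        try {
            decodeStrict(latin1, StandardCharsets.UTF_8);
        } catch (CharacterCodingException e) {
            System.out.println("input is not valid UTF-8: " + e);
        }
    }
}
```

Once U+FFFD is in the string, the original character cannot be recovered from it; the fix has to happen where the bytes are decoded.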

Steve