KK wrote:

I'd like to know about the different Unicode [or any other] encodings
supported by Solr for posting docs [through Solrj in my case]. Is it
that just UTF-8 and UCN are supported, or are other character encodings
like NCR (decimal), NCR (hex) etc. supported as well?

A numeric character reference (NCR), decimal or hexadecimal, is not a
character encoding of its own. It is written entirely in ASCII
characters, so it is valid in UTF-8 text as long as it maps to a valid
Unicode character.
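
If the text left over after stripping tags still contains NCRs, they may
need to be resolved to the characters they stand for before indexing.
The following is only a minimal sketch of that step. The class name
NcrDecoder and its decode method are made up for illustration, and it
assumes that only numeric references (not named entities like &amp;)
need handling.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NcrDecoder {
    // Matches &#NNNN; (decimal, group 2) or &#xHHHH; (hexadecimal, group 1).
    private static final Pattern NCR =
            Pattern.compile("&#(?:[xX]([0-9a-fA-F]{1,6})|([0-9]{1,7}));");

    // Replaces every numeric character reference with the character it names.
    // References that do not map to a Unicode scalar value are left untouched.
    public static String decode(String text) {
        Matcher m = NCR.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int cp = (m.group(1) != null)
                    ? Integer.parseInt(m.group(1), 16)
                    : Integer.parseInt(m.group(2), 10);
            boolean surrogate = cp >= 0xD800 && cp <= 0xDFFF;
            String replacement = (Character.isValidCodePoint(cp) && !surrogate)
                    ? new String(Character.toChars(cp))
                    : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Decimal and hexadecimal references mixed with plain text.
        System.out.println(decode("&#65;&#x42;C"));
    }
}
```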

I found that for most of the pages the encoding is UTF-8 [in this case
searching works fine], but for others the encoding is some other
character encoding [like NCR (dec), NCR (hex) or maybe something else;
I don't have much idea on this].

Whatever the encoding is, your application needs to know what it is when
dealing with bytes read from the network.
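
For pages fetched over HTTP, one common source of that information is
the charset parameter of the Content-Type header. Here is a rough sketch
of extracting it. The class ContentTypeCharset and its method are
hypothetical names, and the sketch assumes the header value is already
available as a String (for example from URLConnection.getContentType()).
It does not look at <meta> tags or byte order marks.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ContentTypeCharset {
    // Captures the value of a charset parameter, with or without quotes.
    private static final Pattern CHARSET = Pattern.compile(
            "charset\\s*=\\s*\"?([^\\s;\"]+)", Pattern.CASE_INSENSITIVE);

    // Returns the charset named in the header value, or the fallback if the
    // header is missing, has no charset parameter, or names an unknown charset.
    public static Charset fromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        Matcher m = CHARSET.matcher(contentType);
        if (!m.find()) {
            return fallback;
        }
        try {
            return Charset.forName(m.group(1));
        } catch (IllegalArgumentException e) {
            // Covers both illegal and unsupported charset names.
            return fallback;
        }
    }

    public static void main(String[] args) {
        Charset cs = fromContentType("text/html; charset=ISO-8859-1",
                StandardCharsets.UTF_8);
        System.out.println(cs.name());
    }
}
```

The fallback is a guess by definition, so a page that declares nothing
may still be decoded wrongly.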

So when I fetch the page content through Java methods using
InputStreamReaders, and after stripping various tags, what I obtain
is raw text in some encoding that is not supported by Solr.

Did you make sure to not rely on your platform default encoding
(Charset) when constructing the InputStreamReader? If in doubt, take
a look at the InputStreamReader constructors.
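
To illustrate, here is a small sketch that passes the Charset explicitly
instead of using the one-argument constructor, which falls back to the
platform default. The class and method names are invented for the
example, and a byte array stands in for the network stream.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ExplicitCharsetRead {
    // Reads the whole stream, decoding its bytes with the given charset.
    public static String readAll(InputStream in, Charset charset) {
        StringBuilder sb = new StringBuilder();
        // new InputStreamReader(in) would use the platform default instead.
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "caf" plus e-acute: 5 bytes in UTF-8, 4 characters once decoded.
        byte[] bytes = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        String text = readAll(new ByteArrayInputStream(bytes),
                StandardCharsets.UTF_8);
        System.out.println(text.length()); // prints 4
    }
}
```

Decoding the same bytes with a different charset, such as ISO-8859-1,
does not fail but yields different characters, which is the kind of
silent corruption that relying on the platform default can cause.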

Michael Ludwig
