KK wrote:
I'd like to know which Unicode (or other) encodings Solr supports for posting docs (through SolrJ in my case). Are only UTF-8 and UCN supported, or are other character encodings such as NCR (decimal), NCR (hex), etc. supported as well?
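A numeric character reference (NCR), decimal or hexadecimal, is not a character encoding of its own but an escape sequence written in plain ASCII characters. Any NCR is therefore valid UTF-8 as long as it maps to a valid Unicode character.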
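To illustrate the point, here is a minimal sketch of turning NCRs into the characters they refer to before the text is indexed. The class name `NcrDecoder` and its `decode` method are hypothetical names made up for this example and are not part of Solr or SolrJ; named entities such as `&amp;` are deliberately left alone.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Solr or SolrJ.
class NcrDecoder {
    // Group 1 captures hexadecimal digits (&#xE9;), group 2 decimal digits (&#233;).
    private static final Pattern NCR =
            Pattern.compile("&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));");

    static String decode(String text) {
        Matcher m = NCR.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = m.group(); // default: leave the reference as it is
            try {
                int codePoint = m.group(1) != null
                        ? Integer.parseInt(m.group(1), 16)
                        : Integer.parseInt(m.group(2), 10);
                boolean surrogate = codePoint >= 0xD800 && codePoint <= 0xDFFF;
                if (Character.isValidCodePoint(codePoint) && !surrogate) {
                    replacement = new String(Character.toChars(codePoint));
                }
            } catch (NumberFormatException e) {
                // Number too large for an int: keep the original text.
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String raw = "caf&#233; and caf&#xE9;";
        // The raw text is pure ASCII, so its UTF-8 byte count equals its length.
        System.out.println(raw.getBytes(StandardCharsets.UTF_8).length == raw.length());
        System.out.println(decode(raw));
    }
}
```

Whether you want this step at all depends on your pages: if the references are left in place, the index will contain the literal text `&#233;` rather than the character it stands for.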
I found that for most of the pages the encoding is UTF-8 (in this case searching works fine), but for others it is some other character encoding (such as NCR (dec), NCR (hex), or possibly something else; I don't know much about this).
Whatever the encoding is, your application needs to know what it is when dealing with bytes read from the network.
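A small sketch of what that means in Java: pass the charset explicitly when turning bytes into characters. The class `ExplicitCharsetRead` and the method `readAll` are hypothetical names for this example, and the in-memory byte array merely stands in for a network stream. In practice the charset would come from the HTTP `Content-Type` header or the page's own declaration.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only.
class ExplicitCharsetRead {
    // Decodes the whole stream using the charset the caller names,
    // never the platform default.
    static String readAll(InputStream in, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "café" encoded as UTF-8 is five bytes: c, a, f, 0xC3, 0xA9.
        byte[] utf8Bytes = "caf\u00e9".getBytes(StandardCharsets.UTF_8);

        String right = readAll(new ByteArrayInputStream(utf8Bytes), StandardCharsets.UTF_8);
        String wrong = readAll(new ByteArrayInputStream(utf8Bytes), StandardCharsets.ISO_8859_1);

        System.out.println(right.equals("caf\u00e9")); // prints "true"
        System.out.println(wrong.length());            // prints "5": each byte became one char
    }
}
```

The same five bytes yield four characters under UTF-8 and five under ISO-8859-1, which is the kind of mismatch that produces text a search index cannot match.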
So when I fetch the page content through Java methods using InputStreamReaders and strip the various tags, what I obtain is raw text in some encoding that Solr does not support.
Did you make sure not to rely on your platform default encoding (Charset) when constructing the InputStreamReader? If in doubt, take a look at the InputStreamReader constructors.

Michael Ludwig