ahmed baseet wrote:

I tried something stupid, but it works. I first converted the whole
string to a byte array and then used that byte array to create a new
UTF-8 encoded string, like this:

    // Encode in Unicode UTF-8
    byte[] utfEncodeByteArray = textOnly.getBytes();

This yields a sequence of bytes using the platform's default charset,
which may not be UTF-8. Check:

* String#getBytes()
* String#getBytes(String charsetName)
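
A minimal sketch of the explicit-charset variant (the variable name
textOnly follows the snippet above; StandardCharsets is available since
Java 7, earlier versions would pass the literal "UTF-8" instead):

```java
import java.nio.charset.StandardCharsets;

public class EncodeDemo {
    public static void main(String[] args) {
        String textOnly = "héllo"; // contains one non-ASCII character

        // Explicit charset: always UTF-8, regardless of the platform default.
        byte[] utf8Bytes = textOnly.getBytes(StandardCharsets.UTF_8);

        // 'é' takes two bytes in UTF-8, so the array is one byte
        // longer than the five-character string.
        System.out.println(utf8Bytes.length); // 6
    }
}
```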

    String utfString = new String(utfEncodeByteArray, Charset.forName("UTF-8"));

Note that strings in Java are always internally encoded in UTF-16, so it
doesn't make much sense to call it utfString, especially if you think
that it is encoded in UTF-8, which it is not.

The above operation is only guaranteed to succeed without losing data
(loss shows up as replacement characters, such as ?, in the output)
when the sequence of bytes is valid UTF-8, i.e. in this case when the
platform encoding you've relied upon is in fact UTF-8.
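
To make the failure mode concrete, here is a sketch where the charset
mismatch is deliberate: the byte 0xE9 is 'é' in ISO-8859-1 but is not a
complete UTF-8 sequence, so decoding it as UTF-8 substitutes U+FFFD,
the Unicode replacement character.

```java
import java.nio.charset.StandardCharsets;

public class LossyRoundTrip {
    public static void main(String[] args) {
        // 'é' encoded as ISO-8859-1 is the single byte 0xE9.
        byte[] latin1Bytes = "é".getBytes(StandardCharsets.ISO_8859_1);

        // Decoding that byte as UTF-8 silently replaces it with U+FFFD;
        // the original character is gone for good.
        String decoded = new String(latin1Bytes, StandardCharsets.UTF_8);
        System.out.println(decoded.equals("\uFFFD")); // true
    }
}
```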

then passed the utfString to the function for posting to Solr, and it
works perfectly.
But is there any more intelligent way of doing all this, like going
straight from the default-encoded string to a UTF-8 encoded string,
without going via a byte array?

It is a feature of java.lang.String that you don't need to know the
encoding: the string contains characters, not bytes. Only for input and
output are you concerned with encoding. So wherever you're dealing with
encodings, you're dealing with bytes.
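
A sketch of that boundary: the String itself carries no encoding to
manage; you pick one only where characters leave the program as bytes,
here an OutputStreamWriter over an in-memory stream (chosen just for
illustration, posting to Solr would work the same way):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class BoundaryDemo {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();

        // The encoding is chosen at the I/O boundary, not on the String.
        try (Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            writer.write("héllo");
        }

        // Five characters became six bytes: 'é' is two bytes in UTF-8.
        System.out.println(out.size()); // 6
    }
}
```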

And when dealing with bytes on the wire, you're likely concerned with
encodings, for example when the page you read via HTTP comes with a
Content-Type header specifying the encoding, or when you send documents
to the Solr indexer.

For more "intelligent" ways, you could take a look at the class
java.nio.charset.Charset and the methods encode, decode, newEncoder,
newDecoder.
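
A short sketch of the convenience methods on Charset itself, which go
between CharBuffer (characters) and ByteBuffer (bytes) without you
building the byte array by hand:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetDemo {
    public static void main(String[] args) {
        Charset utf8 = StandardCharsets.UTF_8;

        // encode: characters -> bytes
        ByteBuffer bytes = utf8.encode("héllo");

        // decode: bytes -> characters
        CharBuffer chars = utf8.decode(bytes);

        System.out.println(chars.toString()); // héllo
    }
}
```

newEncoder/newDecoder give you the same machinery with control over how
malformed input and unmappable characters are handled (report, replace,
or ignore), instead of the silent replacement the convenience methods use.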

Michael Ludwig
