UTF8 compatibility

2009-04-29 Thread Muhammed Sameer

Salaam,

I have a question in two parts, which are related.

We run post.jar periodically, i.e. every 15 minutes, to commit the changes. Is
this approach correct?

When I run this I get the following message
{code}
SimplePostTool: version 1.2
SimplePostTool: WARNING: Make sure your XML documents are encoded in UTF-8, 
other encodings are not currently supported
SimplePostTool: COMMITting Solr index changes..
{code}

So I tried to run the test_utf8.sh script and got the following output
{code}
Solr server is up.
HTTP GET is accepting UTF-8
HTTP POST is accepting UTF-8
HTTP POST defaults to UTF-8
ERROR: HTTP GET is not accepting UTF-8 beyond the basic multilingual plane
ERROR: HTTP POST is not accepting UTF-8 beyond the basic multilingual plane
ERROR: HTTP POST + URL params is not accepting UTF-8 beyond the basic 
multilingual plane
{code}

Are these errors normal, or do I need to change something?

Thanks for your time.

Regards,
Muhammed Sameer


  


Re: UTF8 compatibility

2009-04-29 Thread Michael Ludwig

Muhammed Sameer wrote:


> We run post.jar periodically, i.e. every 15 minutes, to commit the
> changes. Is this approach correct?


Sounds reasonable to me.


> SimplePostTool: WARNING: Make sure your XML documents are encoded in
> UTF-8, other encodings are not currently supported


That's just to remind you not to try and post documents in another
encoding. This seems to be a limitation of the SimplePostTool, not of
Solr. I guess the reason is that in order for Solr to work quickly and
reliably, it relies on the Content-Type of the request to determine the
encoding. If, for example, you send XML encoded in ISO-8859-1, you have
to specify that in two places:

* XML declaration: <?xml version="1.0" encoding="ISO-8859-1"?>
* HTTP header: Content-Type: text/xml; charset=ISO-8859-1

The SimplePostTool, however, being just what the name says, may not
bother to read the encoding from the document and bring the HTTP content
type header in line. Instead, it explicitly requests UTF-8, probably in
the interest of simplicity.

Well, that's just my theory. Can anyone confirm?
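
To illustrate the two places mentioned above, here is a minimal Python sketch that keeps the XML declaration and the Content-Type header in agreement. The function name build_update_request and the URL http://localhost:8983/solr/update are assumptions made for illustration, not anything taken from SimplePostTool. The request is only constructed, never sent.

```python
import urllib.request

# Assumed update URL for a local Solr instance; adjust to your setup.
SOLR_UPDATE_URL = "http://localhost:8983/solr/update"


def build_update_request(xml_body, charset="UTF-8", url=SOLR_UPDATE_URL):
    """Build (but do not send) a POST whose XML declaration and
    Content-Type header name the same encoding."""
    declaration = '<?xml version="1.0" encoding="%s"?>\n' % charset
    data = (declaration + xml_body).encode(charset)
    return urllib.request.Request(
        url,
        data=data,
        headers={"Content-Type": "text/xml; charset=%s" % charset},
        method="POST",
    )


# Hypothetical document with one non-ASCII character (e with acute accent).
req = build_update_request(
    '<add><doc><field name="id">1</field>'
    '<field name="title">caf\u00e9</field></doc></add>',
    charset="ISO-8859-1",
)
# urllib normalizes header names to "Content-type" internally.
print(req.get_header("Content-type"))
```

Deriving both the declaration and the header from a single charset argument is one way to avoid the mismatch described above; whether SimplePostTool does anything similar is not something this sketch claims.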


> So I tried to run the test_utf8.sh script and got the following output
> {code}
> Solr server is up.
> HTTP GET is accepting UTF-8
> HTTP POST is accepting UTF-8
> HTTP POST defaults to UTF-8
> ERROR: HTTP GET is not accepting UTF-8 beyond the basic multilingual plane
> ERROR: HTTP POST is not accepting UTF-8 beyond the basic multilingual plane
> ERROR: HTTP POST + URL params is not accepting UTF-8 beyond the basic
> multilingual plane
> {code}
>
> Are these errors normal, or do I need to change something?


I'm seeing the same output. Don't worry, these are just some tests. It is
possible to have Solr index documents containing characters outside of the
BMP (Basic Multilingual Plane), which can be verified by posting something
like this:

<add>
  <doc>
    <field name="id">1001</field>
    <field name="title">BMP plus 1 &#x1;</field>
  </doc>
</add>

Maybe the test script output says that such characters cannot be used
for querying. Hardly relevant if you consider that the BMP comprises
even languages such as Telugu, Bopomofo and French.
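
For reference, a small Python sketch of what "outside the BMP" means in practice. U+10000 is used here only as an assumed example of a supplementary character (it is the first code point past U+FFFF), and the helper name has_supplementary_chars is made up for illustration.

```python
def has_supplementary_chars(text):
    """Return True if text contains any code point above U+FFFF."""
    return any(ord(ch) > 0xFFFF for ch in text)


# One sample letter each from Telugu, Bopomofo and French; all inside the BMP.
bmp_text = "Telugu \u0c24, Bopomofo \u3105, French \u00e9"
beyond_bmp = "BMP plus 1 \U00010000"

print(has_supplementary_chars(bmp_text))    # False
print(has_supplementary_chars(beyond_bmp))  # True

# In UTF-8 a supplementary character takes four bytes;
# in UTF-16 it takes a surrogate pair (two 16-bit units).
ch = "\U00010000"
print(ch.encode("utf-8"))      # b'\xf0\x90\x80\x80'
print(ch.encode("utf-16-be"))  # b'\xd8\x00\xdc\x00'
```

The four-byte UTF-8 form and the surrogate pair are the cases a server or tool has to handle correctly for such characters to survive a round trip.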

Best,

Michael Ludwig


Re: UTF8 compatibility

2009-04-29 Thread Shalin Shekhar Mangar

On Wed, Apr 29, 2009 at 12:45 PM, Muhammed Sameer samix_...@yahoo.com wrote:


> So I tried to run the test_utf8.sh script and got the following output
> {code}
> Solr server is up.
> HTTP GET is accepting UTF-8
> HTTP POST is accepting UTF-8
> HTTP POST defaults to UTF-8
> ERROR: HTTP GET is not accepting UTF-8 beyond the basic multilingual plane
> ERROR: HTTP POST is not accepting UTF-8 beyond the basic multilingual plane
> ERROR: HTTP POST + URL params is not accepting UTF-8 beyond the basic
> multilingual plane
> {code}


Make sure your Tomcat (or whichever container you are using) is set up to
accept UTF-8 for querying. Instructions for Tomcat are at
http://wiki.apache.org/solr/SolrTomcat#head-20147ee4d9dd5ca83ed264898280ab60457847c4
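
Why the container setting matters can be sketched in Python: the client percent-encodes a query parameter as UTF-8 bytes, and if the server decodes those bytes with a different charset (ISO-8859-1 is assumed here as the non-UTF-8 default), the query text is garbled before Solr ever sees it. This only simulates the decoding step with the standard library; it does not talk to Tomcat.

```python
from urllib.parse import quote, unquote

# Hypothetical query term with one non-ASCII character.
query = "caf\u00e9"

# Client side: percent-encode the UTF-8 bytes of the query.
encoded = quote(query, encoding="utf-8")
print(encoded)  # caf%C3%A9

# Server decoding with the matching charset recovers the query.
print(unquote(encoded, encoding="utf-8") == query)  # True

# Server decoding as ISO-8859-1 yields two wrong characters instead of one.
garbled = unquote(encoded, encoding="iso-8859-1")
print(garbled == query, len(garbled))  # False 5
```
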
-- 
Regards,
Shalin Shekhar Mangar.