Sounds like one of three things:
1/ Everything is set to UTF-8, but the content is actually in another encoding.
2/ Something 'Microsoft-ish' is adding a BOM (byte order mark) that is being 
incorrectly interpreted.
3/ The byte order is wrong somewhere along the way and is not being translated 
correctly across machine/media boundaries.
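
To make the first two cases concrete, here is a minimal Python sketch. It assumes the garbling comes from UTF-8 bytes being decoded as a single-byte encoding (Latin-1 is used as a stand-in; the real culprit in the pipeline may differ), and `has_utf8_bom` is a hypothetical helper name, not part of any library.

```python
import codecs

# Sample text taken from the quoted message.
original = "映画や音楽が楽しい"

# Case 1: correct UTF-8 bytes, but decoded as a single-byte encoding.
utf8_bytes = original.encode("utf-8")
garbled = utf8_bytes.decode("latin-1")  # each byte becomes one character

# The damage is reversible as long as no bytes were dropped on the way.
repaired = garbled.encode("latin-1").decode("utf-8")

# Case 2: a UTF-8 BOM (bytes EF BB BF) in front of the data.
def has_utf8_bom(data: bytes) -> bool:
    """Hypothetical helper: True if the data starts with a UTF-8 BOM."""
    return data.startswith(codecs.BOM_UTF8)

print(repaired == original)
print(has_utf8_bom(utf8_bytes))
```

If the garbled output starts with the same 'æ' seen in the report, a wrong decode step somewhere between MySQL and Solr is the likely suspect, rather than bad source data.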


First, you need to look directly at what your source is providing, before it 
gets into the database. Then do the following.

I would open up an editor that you KNOW outputs UTF-8:

1/ Compose a web page and view it with the browser encoding set to UTF-8; that 
will tell you whether the editor is really creating UTF-8 files. (Obviously use 
some characters above 0xFF.)

2/ Build an SQL query with it that inserts one record, or many, using those 
characters. Try the command line, a server-side language, and any database 
management program as well. Make the records distinct according to where they 
are being inserted from.

3/ Select these records, view them on a web page set to UTF-8, and see if they 
come out of the database OK.

4/ Import into Solr, and view again in a browser set to UTF-8.
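
Steps 2 and 3 can be sketched in Python as below. This is only an illustration under stated assumptions: it uses an in-memory SQLite database as a stand-in for MySQL so that it runs anywhere, and the table name `enc_test`, the label `python-script`, and the function `round_trip` are all made up for the example. For the real test you would run the same insert and select against your MySQL server from each client you use.

```python
import sqlite3

# Sample text taken from the quoted message.
SAMPLE = "映画や音楽が楽しい"

def round_trip(conn, source_label, text):
    """Insert text tagged with where it came from, then read it back."""
    conn.execute(
        "INSERT INTO enc_test (source, body) VALUES (?, ?)",
        (source_label, text),
    )
    row = conn.execute(
        "SELECT body FROM enc_test WHERE source = ?", (source_label,)
    ).fetchone()
    return row[0]

# In-memory SQLite is an assumption made so the sketch is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE enc_test (source TEXT PRIMARY KEY, body TEXT)")

stored = round_trip(conn, "python-script", SAMPLE)

# Compare the text, and dump the bytes so a wrong encoding is visible.
print(stored == SAMPLE)
print(stored.encode("utf-8").hex())
```

Tagging each record with its source is what tells you which insertion path (command line, server-side code, admin tool) is the one corrupting the text.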

Dennis Gearon


--- On Fri, 10/22/10, virtas <pkaro...@gmail.com> wrote:

> From: virtas <pkaro...@gmail.com>
> Subject: Failing to successfully import international characters via DIH
> To: solr-user@lucene.apache.org
> Date: Friday, October 22, 2010, 8:20 AM
> 
> Hi, 
> 
> I wanted to share a problem I have with importing text in different
> languages. All international text looks wrong in Luke and in AJAX Solr.
> 
> 
> What I see for chinese and japanese characters is this:
> æ˜
> 画や音楽ãŒæ¥½ã—ã„ï¼AIã®ã‚µã‚¤ãƒ¢ãƒ³ã®ãƒ•ã‚¡ãƒ³ã§ã™ã€‚アダãƒ
> やマットãŒå¥½ãã§ã™ã€‚LeeDeWyze優å‹ï¼I
> 
> Although it should be:
> 映画や音楽が楽しい!AIのサイモンのファンです。アダムやマットが好きです。
> 
> My setup is Ubuntu server 10.04, Tomcat6, Solr 1.4 and
> mysql. 
> 
> Things i have configured but with no luck:
>  1. /etc/tomcat6/server.xml contains this:
> <Connector port="8080" protocol="HTTP/1.1"
>            connectionTimeout="20000"
>            URIEncoding="UTF-8"
>            redirectPort="8443" />
>  2. /etc/mysql/my.cnf contains:
>  [mysqld]
>   .... 
>  default-character-set = utf8
>   character-set-server = utf8
>   
>  3. /etc/solr/conf/data-config.xml
>  <dataConfig>
>   <dataSource type="JdbcDataSource"
>               driver="com.mysql.jdbc.Driver"
>               url="jdbc:mysql://localhost:3306/spuvocom_spuvo?characterEncoding=UTF-8"
>               encoding="UTF-8" />
>   <document>
>  4. my mysql table collation is utf8_bin   
> 
> 
> What would you recommend changing or checking?
> 
> Thanks in advance 
> 
> -- 
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Failing-to-successfully-import-international-characters-via-DIH-tp1753190p1753190.html
> Sent from the Solr - User mailing list archive at
> Nabble.com.
>
