Re: Searching of Chinese characters and English

waynelam Wed, 05 Sep 2012 21:47:34 -0700

Thank you Lance.
I just found the problem, and I am posting it in case somebody else comes across this.

It turned out that Tomcat does not decode URLs as UTF-8 by default.


http://wiki.apache.org/solr/SolrTomcat#URI_Charset_Config

I have no idea why that is the default, but after I followed the instructions in the document above, the problem was solved!
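
For anyone landing here later, the change described on that wiki page amounts to setting URIEncoding on the HTTP connector in Tomcat's conf/server.xml. Below is a minimal sketch, assuming the connector on port 8080 that my query uses; the other attribute values are the stock Tomcat ones and are only assumptions about your install, so keep whatever your own file already has:

```xml
<!-- conf/server.xml (sketch): decode request URIs, including the query
     string, as UTF-8 instead of Tomcat 6's default ISO-8859-1 -->
<Connector port="8080" protocol="HTTP/1.1"
           connectionTimeout="20000"
           redirectPort="8443"
           URIEncoding="UTF-8"/>
```

Tomcat has to be restarted for the change to take effect. Without it, the Chinese characters in fq=... are decoded with the wrong charset before Solr ever sees them, which would explain why the English-only queries worked and the Chinese ones returned nothing.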

Thanks so much for your help!


Wayne


On 6/9/2012 11:19, Lance Norskog wrote:

I believe that you should remove the Analyzer class name from the field type. I think
it overrides the stack of tokenizer/token filters. Other <fieldType>
declarations do not have both an Analyzer class and Tokenizers.
  <analyzer type="index" class="org.apache.lucene.analysis.cjk.CJKAnalyzer">
should be:
  <analyzer type="index">

This may not help with your searching problem.
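
To illustrate the suggestion above, here is a minimal sketch of a field type whose analyzers have no Analyzer class, so the tokenizer/filter chain is what gets used. It assumes the factory solr.CJKTokenizerFactory is available in your Solr 3.x install (schema.xml expects a tokenizer factory rather than a bare Lucene class such as CJKTokenizer), and the filter list is shortened for illustration only:

```xml
<!-- Sketch only: trimmed-down version of the "text" field type -->
<fieldType name="text" class="solr.TextField">
  <!-- No class attribute on <analyzer>, so the chain below is applied -->
  <analyzer type="index">
    <tokenizer class="solr.CJKTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.CJKTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

After a change like this the documents need to be reindexed so that the indexed tokens match what the query analyzer produces.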

----- Original Message -----
| From: "waynelam" <wayne...@ln.edu.hk>
| To: solr-user@lucene.apache.org
| Sent: Wednesday, September 5, 2012 8:07:36 PM
| Subject: Re: Searching of Chinese characters and English
|
| Any thoughts?
|
| It is weird: I can see the words being segmented correctly in Field
| Analysis. Almost every website I checked recommends either
| CJKAnalyzer, IKAnalyzer or SmartChineseAnalyzer. But if I can see the
| words being segmented, then the choice of Analyzer should not be the
| problem. Am I correct?
|
| Does anyone have an idea or a hint?
|
| Thanks so much
|
| Wayne
|
|
|
| On 4/9/2012 13:03, waynelam wrote:
| > Hi all,
| >
| > I tried to modify the schema.xml and solrconfig.xml that come with
| > the Drupal "search_api_solr" module so that they are suitable for a
| > CJK environment. I can see Chinese text being cut into
| > two-character tokens in "Field Analysis". If I use the following
| > query
| >
| > my_ip_address:8080/solr/select?indent=on&version=2.2&fq=t_title:"Find"&start=0&rows=10&fl=t_title
| >
| >
| > I can see it returning results. The problem is when I change the
| > search keyword for one of my fields (e.g. t_title) to Chinese
| > characters. It always shows
| >
| > <result name="response" numFound="0" start="0"/>
| >
| > in the results. It is strange because if a title contains both
| > Chinese and English (e.g. testing ??), when I search for just the
| > English part (e.g. fq=t_title:"testing"), I find the result without
| > trouble. The problem only occurs when searching for Chinese
| > characters.
| >
| >
| > I would much appreciate it if you could show me which part I did
| > wrong.
| >
| > Thanks
| >
| > Wayne
| >
| > *My Settings:*
| > Java : 1.6.0_24
| > Solr : version 3.6.1
| > tomcat: version 6.0.35
| >
| > *My schema.xml* (I highlighted the places I changed from the default)
| >
| > *<fieldType name="text" class="solr.TextField" indexed="true"
| > stored="true" multiValued="true">**
| > **      <analyzer type="index"
| > class="org.apache.lucene.analysis.cjk.CJKAnalyzer">**
| > **        <tokenizer
| > class="org.apache.lucene.analysis.cjk.CJKTokenizer"/>**
| > **        <filter class="solr.WordDelimiterFilterFactory"
| > generateWordParts="1" generateNumberParts="1" catenateWords="1"
| > catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>**
| > **        <filter class="solr.LowerCaseFilterFactory"/>**
| > **        <filter class="solr.SnowballPorterFilterFactory"
| > language="English" protected="protwords.txt"/>**
| > **        <filter
| > class="solr.RemoveDuplicatesTokenFilterFactory"/>**
| > **        <filter class="schema.UnicodeNormalizationFilterFactory"
| > version="icu4j" composed="false" remove_diacritics="true"
| > remove_modifiers="true" fold="true"/>**
| > **        <filter class="solr.ISOLatin1AccentFilterFactory"/>**
| > **      </analyzer>**
| > **      <analyzer type="query"
| > class="org.apache.lucene.analysis.cjk.CJKAnalyzer">**
| > **        <tokenizer
| > class="org.apache.lucene.analysis.cjk.CJKTokenizer"/>**
| > **        <filter class="solr.WordDelimiterFilterFactory"
| > generateWordParts="1" generateNumberParts="1" catenateWords="0"
| > catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>**
| > **        <filter class="solr.LowerCaseFilterFactory"/>**
| > **        <filter class="solr.SnowballPorterFilterFactory"
| > language="English" protected="protwords.txt"/>**
| > **        <filter
| > class="solr.RemoveDuplicatesTokenFilterFactory"/>**
| > **        <filter class="schema.UnicodeNormalizationFilterFactory"
| > version="icu4j" composed="false" remove_diacritics="true"
| > remove_modifiers="true" fold="true"/>**
| > **        <filter class="solr.ISOLatin1AccentFilterFactory"/>**
| > **      </analyzer>**
| > **    </fieldType>*
| >
| >     <fieldType name="sortString" class="solr.TextField"
| >     indexed="true"
| > stored="true" sortMissingLast="true" omitNorms="true">
| >       <analyzer>
| >
| >         <tokenizer class="solr.KeywordTokenizerFactory"/>
| >
| >         <filter class="solr.LowerCaseFilterFactory" />
| >         <filter class="solr.TrimFilterFactory" />
| >       </analyzer>
| >     </fieldType>
| >
| >     <fieldType name="rand" class="solr.RandomSortField"
| >     indexed="true" />
| >
| >     <fieldtype name="ignored" stored="true" indexed="false"
| > class="solr.StrField" />
| >  </types>
| >  <fields>
| >
| >    <field name="id"       type="string" indexed="true"
| >    stored="true"
| > required="true" />
| >    <field name="item_id"  type="string" indexed="true"
| >    stored="true"
| > required="true" />
| >    <field name="index_id" type="string" indexed="true"
| >    stored="true"
| > required="true" />
| >
| >    <copyField source="item_id" dest="ss_search_api_id" />
| >    <field name="spell" type="textSpell" indexed="true"
| >    stored="true"
| > multiValued="true"/>
| >    <copyField source="t_*" dest="spell"/>
| >
| > *<field name="t_title" type="text" indexed="true" stored="true"
| > autoGeneratePhraseQueries="false"/>*
| >    <dynamicField name="t_*" type="text" termVectors="true" />
| >    <dynamicField name="ss_*" type="sortString" multiValued="false"
| > termVectors="true" />
| >    <dynamicField name="sm_*" type="sortString" multiValued="true"
| > termVectors="true" />
| >    <dynamicField name="is_*" type="tlong" multiValued="false"
| > termVectors="true" />
| >    <dynamicField name="im_*" type="long" multiValued="true"
| > termVectors="true" />
| >    <dynamicField name="fs_*" type="tdouble" multiValued="false"
| > termVectors="true" />
| >    <dynamicField name="fm_*" type="tdouble" multiValued="true"
| > termVectors="true" />
| >    <dynamicField name="ds_*" type="tdate" multiValued="false"
| > termVectors="true" />
| >    <dynamicField name="dm_*" type="tdate" multiValued="true"
| > termVectors="true" />
| >    <dynamicField name="bs_*" type="boolean" multiValued="false"
| > termVectors="true" />
| >    <dynamicField name="bm_*" type="boolean" multiValued="true"
| > termVectors="true" />
| >    <dynamicField name="f_ss_*" type="string" multiValued="false"
| > termVectors="true" />
| >    <dynamicField name="f_sm_*" type="string" multiValued="true"
| > termVectors="true" />
| >    <copyField source="ss_*" dest="f_ss_*" />
| >    <copyField source="sm_*" dest="f_sm_*" />
| >    <dynamicField name="*" type="ignored" multiValued="true" />
| >  </fields>
| >
| >  <uniqueKey>id</uniqueKey>
| >  <solrQueryParser defaultOperator="AND"/>
| >
| > </schema>
| >
|
|
| --
| -----------------------------------------
| Wayne Lam
| Assistant Librarian II
| Systems Development & Support
| Fong Sum Wood Library
| Lingnan University
| 8 Castle Peak Road
| Tuen Mun, New Territories
| Hong Kong SAR
| China
| Phone:   +852 26168576
| Email:   wayne...@ln.edu.hk
| Website: http://www.library.ln.edu.hk
|
|



--
-----------------------------------------
Wayne Lam
Assistant Librarian II
Systems Development & Support
Fong Sum Wood Library
Lingnan University
8 Castle Peak Road
Tuen Mun, New Territories
Hong Kong SAR
China
Phone:   +852 26168576
Email:   wayne...@ln.edu.hk
Website: http://www.library.ln.edu.hk
