Re: Searching of Chinese characters and English
Thank you Lance. I just found the problem; posting it here in case somebody else comes across this. It turns out that Tomcat does not accept UTF-8 in URLs by default:

http://wiki.apache.org/solr/SolrTomcat#URI_Charset_Config

I have no idea why that is the default, but after I followed the instructions in the document above, the problem was solved! Thanks so much for your help!

Wayne

On 6/9/2012 11:19, Lance Norskog wrote:
> I believe that you should remove the Analyzer class name from the field
> type. I think it overrides the stacks of tokenizer/tokenfilter. Other
> declarations do not have an Analyzer class and Tokenizers. It should
> be: [example lost in the archive]
>
> This may not help with your searching problem.
>
> ----- Original Message -----
> | From: "waynelam"
> | To: solr-user@lucene.apache.org
> | Sent: Wednesday, September 5, 2012 8:07:36 PM
> | Subject: Re: Searching of Chinese characters and English
> |
> | [earlier messages and the quoted schema.xml trimmed; the same
> | messages appear in full below]
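For reference, the fix that wiki page describes is setting URIEncoding="UTF-8" on Tomcat's HTTP connector in conf/server.xml, so that the query string is decoded as UTF-8 rather than the ISO-8859-1 default. A minimal sketch (the port matches the one used in this thread; the other connector attributes are just Tomcat 6 defaults and may differ in your server.xml):

```xml
<!-- conf/server.xml: make Tomcat decode %-escaped URI bytes as UTF-8 -->
<Connector port="8080" protocol="HTTP/1.1"
           connectionTimeout="20000"
           redirectPort="8443"
           URIEncoding="UTF-8"/>
```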
Re: Searching of Chinese characters and English
Any thoughts?

It is weird: I can see the words being segmented correctly in Field Analysis. Almost every website I checked recommends either CJKAnalyzer, IKAnalyzer, or SmartChineseAnalyzer, but if I can see the words being segmented, the problem should not be in the choice of analyzer. Am I correct?

Does anyone have an idea or any hints?

Thanks so much

Wayne

On 4/9/2012 13:03, waynelam wrote:
> [original message, including the schema.xml excerpt, quoted in full;
> trimmed here. The original posting appears below.]

-- 
Wayne Lam
Assistant Librarian II
Systems Development & Support
Fong Sum Wood Library
Lingnan University
8 Castle Peak Road
Tuen Mun, New Territories
Hong Kong SAR China
Phone: +852 26168576
Email: wayne...@ln.edu.hk
Website: http://www.library.ln.edu.hk
Searching of Chinese characters and English
Hi all,

I tried to modify the schema.xml and solrconfig.xml that come with the Drupal "search_api_solr" module so that they are suitable for a CJK environment. In "Field Analysis" I can see Chinese words being cut into two-character tokens. If I use the following query

my_ip_address:8080/solr/select?indent=on&version=2.2&fq=t_title:"Find"&start=0&rows=10&fl=t_title

I can see it returning results. The problem is when I change the search keyword for one of my fields (e.g. t_title) to Chinese characters: the results always come back empty. It is strange, because if a title contains both Chinese and English (e.g. "testing ??"), and I search just the English part (e.g. fq=t_title:"testing"), I can find the result perfectly. The problem only happens when searching Chinese characters.

Much appreciated if you guys can show me which part I did wrong.

Thanks

Wayne

*My Settings:*
Java: 1.6.0_24
Solr: 3.6.1
Tomcat: 6.0.35

*My schema.xml* (I highlighted the places I changed from the default. The archive stripped the XML markup, so the excerpt below is reconstructed from the attributes that survived; element names not attested in the archive are inferred):

<fieldType name="text" class="solr.TextField">
  <analyzer type="index" class="org.apache.lucene.analysis.cjk.CJKAnalyzer">
    <tokenizer class="org.apache.lucene.analysis.cjk.CJKTokenizer"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt"
            enablePositionIncrements="true"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"
            protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <filter class="schema.UnicodeNormalizationFilterFactory" version="icu4j"
            composed="false" remove_diacritics="true" remove_modifiers="true"
            fold="true"/>
  </analyzer>
  <analyzer type="query" class="org.apache.lucene.analysis.cjk.CJKAnalyzer">
    <tokenizer class="org.apache.lucene.analysis.cjk.CJKTokenizer"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="0" catenateNumbers="0"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt"
            enablePositionIncrements="true"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"
            protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <filter class="schema.UnicodeNormalizationFilterFactory" version="icu4j"
            composed="false" remove_diacritics="true" remove_modifiers="true"
            fold="true"/>
  </analyzer>
</fieldType>

[The remaining field definitions (string fields, several required fields, a set of t_* text fields with termVectors="true", and the uniqueKey "id") were garbled beyond recovery in the archive.]
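The symptom described above turned out to be consistent with the UTF-8-in-URLs problem Wayne eventually found in Tomcat. To see what is actually on the wire, here is a sketch of how a client must percent-encode a Chinese query term as UTF-8 bytes before putting it in the select URL (the characters 測試 are stand-ins for the ones the archive stripped, and t_title matches the field used in the thread):

```python
from urllib.parse import quote

# A filter query like the one in the thread; 測試 ("test") stands in
# for the Chinese characters that the archive replaced with "??".
q = 't_title:"測試"'

# Percent-encode the UTF-8 bytes for use in /solr/select?fq=...
# Tomcat must be told (URIEncoding="UTF-8") to decode these bytes
# back into the same characters.
encoded = quote(q, safe=':')
print(encoded)  # → t_title:%22%E6%B8%AC%E8%A9%A6%22
```

If Tomcat decodes those bytes as ISO-8859-1 instead, the query terms no longer match the indexed tokens, which is exactly why the Chinese searches returned nothing while plain-ASCII searches worked.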
SmartChineseAnalyzer
Hi all,

I checked the documentation of SmartChineseAnalyzer, and it looks like it is for Simplified Chinese only. Has anyone tried to include Traditional Chinese characters as well? Since the analyzer is based on a dictionary from ICTCLAS 1.0, my first thought is that maybe I can get it to work by simply converting the whole dictionary to Traditional Chinese.

By the way, I checked the ICTCLAS official website, and it seems the newest version of the Java library supports GB2312, GBK, UTF-8 and BIG5. So can I expect a roadmap for SmartChineseAnalyzer to support BIG5 later?

Any hints would be much appreciated.

Regards,

Wayne
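If converting the SmartChineseAnalyzer dictionary proves awkward, the same idea can be tried one level up: convert Traditional text to Simplified before it reaches the analyzer. A minimal character-level sketch (the two-entry table is purely illustrative; a real conversion needs a full mapping table, e.g. from a project like OpenCC, and word-level conversion is more accurate because some mappings are one-to-many):

```python
# Illustrative Traditional -> Simplified table (hypothetical; a real
# table has thousands of entries).
T2S = {"萬": "万", "國": "国"}

def to_simplified(text: str) -> str:
    """Map each Traditional character to its Simplified form, leaving
    everything else (Latin text, digits, punctuation) untouched."""
    return "".join(T2S.get(ch, ch) for ch in text)

print(to_simplified("萬國abc"))  # → 万国abc
```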
Re: Searching in Traditional / Simplified Chinese Record
By "changing the record", I mean translating the records word by word using software. Sorry, I am new to this kind of modification. As for the synonyms filter: wouldn't that need a big table, and result in degraded indexing performance?

I have tried using a filter like ICUTransformFilterFactory, but it does not seem to work. My filter chain (the archive stripped the markup; reconstructed from the surviving attributes, and the ICUTransformFilterFactory line itself did not survive):

  <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
          generateNumberParts="1" catenateWords="1" catenateNumbers="1"
          catenateAll="0" splitOnCaseChange="1"/>
  <filter class="solr.StopFilterFactory" words="stopwords.txt"
          enablePositionIncrements="true"/>
  <filter class="solr.SnowballPorterFilterFactory" language="English"
          protected="protwords.txt"/>
  <filter class="schema.UnicodeNormalizationFilterFactory" version="icu4j"
          composed="false" remove_diacritics="true" remove_modifiers="true"
          fold="true"/>

Am I setting it wrong?

Regards,

Wayne

On 6/21/2011 2:30 AM, François Schiettecatte wrote:
> Wayne
>
> I am not sure what you mean by 'changing the record'. One option would
> be to implement something like the synonyms filter to generate the TC
> for the SC when you index the document, which would index both the TC
> and the SC in the same location. That way your users would be able to
> search with either TC or SC.
>
> Another option would be to use the same synonyms filter but do the
> expansion at search time.
>
> Cheers
>
> François
>
> On Jun 20, 2011, at 5:41 AM, waynelam wrote:
>
>> Hi,
>>
>> I've recently made changes to my schema.xml to support importing
>> Chinese records. What I want to do is to have a single query match
>> both Traditional Chinese (TC) (e.g. ??) and Simplified Chinese (SC)
>> (e.g. ??) records. I know I can do that by converting all SC records
>> to TC, but I want to change the way I index rather than change the
>> records. Anyone who can show me the way is much appreciated.
>>
>> Thanks
>>
>> Wayne

-- 
Wayne Lam
Assistant Library Officer I
Systems Development & Support
Fong Sum Wood Library
Lingnan University
8 Castle Peak Road
Tuen Mun, New Territories
Hong Kong SAR China
Phone: +852 26168585
Email: wayne...@ln.edu.hk
Website: http://www.library.ln.edu.hk
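For completeness, the ICU transform Wayne mentions does ship with the Lucene/Solr ICU module: ICUTransformFilterFactory with the transliterator id "Traditional-Simplified" folds TC to SC inside the analyzer chain (it must be applied at both index and query time, and requires the analysis-extras ICU jars on the classpath). Below is a sketch of both options François describes; the synonyms file name is hypothetical, and placement within the analyzer is an assumption:

```xml
<!-- Option A: fold Traditional to Simplified inside the analyzer chain -->
<filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>

<!-- Option B: synonym-style expansion at index time, so TC and SC forms
     are indexed in the same position (synonyms_tc_sc.txt is a
     hypothetical mapping file) -->
<filter class="solr.SynonymFilterFactory" synonyms="synonyms_tc_sc.txt"
        ignoreCase="false" expand="true"/>
```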
Searching in Traditional / Simplified Chinese Record
Hi,

I've recently made changes to my schema.xml to support importing Chinese records. What I want to do is to have a single query match both Traditional Chinese (TC) (e.g. ??) and Simplified Chinese (SC) (e.g. ??) records. I know I can do that by converting all SC records to TC, but I want to change the way I index rather than change the records. Anyone who can show me the way is much appreciated.

Thanks

Wayne