Re: Searching of Chinese characters and English

2012-09-05 Thread waynelam

Thank you, Lance.
I just found the problem; in case somebody else comes across this:
it turns out that Tomcat does not accept UTF-8 in URLs by default.


http://wiki.apache.org/solr/SolrTomcat#URI_Charset_Config

I have no idea why that is the default, but after I followed the
instructions in the document above, the problem was solved!
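For anyone landing here later, the fix on that wiki page boils down to adding URIEncoding="UTF-8" to the HTTP connector in Tomcat's conf/server.xml (a sketch; the port and the other attributes depend on your installation):

```xml
<!-- conf/server.xml: decode URI bytes as UTF-8 instead of the default ISO-8859-1 -->
<Connector port="8080" protocol="HTTP/1.1"
           connectionTimeout="20000"
           redirectPort="8443"
           URIEncoding="UTF-8"/>
```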

Thanks so much for your help!


Wayne


On 6/9/2012 11:19, Lance Norskog wrote:

I believe that you should remove the Analyzer class name from the field type. I think
it overrides the stack of tokenizer/token filters. Other <analyzer>
declarations do not have an Analyzer class or tokenizers.

<analyzer class="org.apache.lucene.analysis.cjk.CJKAnalyzer"> should be: <analyzer>
This may not help with your searching problem.
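Sketched against Wayne's schema (the tokenizer and filter below are illustrations, not his exact chain), Lance's suggestion amounts to this change:

```xml
<!-- before: the class attribute takes precedence, and the nested chain is not used -->
<!-- <analyzer type="index" class="org.apache.lucene.analysis.cjk.CJKAnalyzer"> ... </analyzer> -->

<!-- after: no class attribute, so the explicit tokenizer/filter chain takes effect -->
<analyzer type="index">
  <tokenizer class="solr.CJKTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
```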

- Original Message -
| From: "waynelam" 
| To: solr-user@lucene.apache.org
| Sent: Wednesday, September 5, 2012 8:07:36 PM
| Subject: Re: Searching of Chinese characters and English
|
| Any thoughts?
|
| It is weird: I can see the words being segmented correctly in Field
| Analysis. Almost every website I checked recommends either
| CJKAnalyzer, IKAnalyzer or SmartChineseAnalyzer. But if I can see the
| words being segmented, then the problem should not be the choice of
| analyzer. Am I correct?
|
| Anyone have an idea or hints?
|
| Thanks so much
|
| Wayne

Re: Searching of Chinese characters and English

2012-09-05 Thread waynelam

Any thoughts?

It is weird: I can see the words being segmented correctly in Field
Analysis. Almost every website I checked recommends either
CJKAnalyzer, IKAnalyzer or SmartChineseAnalyzer. But if I can see the
words being segmented, then the problem should not be the choice of
analyzer. Am I correct?


Anyone have an idea or hints?

Thanks so much

Wayne




--
-
Wayne Lam
Assistant Librarian II
Systems Development & Support
Fong Sum Wood Library
Lingnan University
8 Castle Peak Road
Tuen Mun, New Territories
Hong Kong SAR
China
Phone:   +852 26168576
Email:   wayne...@ln.edu.hk
Website: http://www.library.ln.edu.hk



Searching of Chinese characters and English

2012-09-03 Thread waynelam

Hi all,

I tried to modify the schema.xml and solrconfig.xml that come with the Drupal
"search_api_solr" module so that they are suitable for a CJK environment.
I can see Chinese words cut into two-character tokens in "Field Analysis".
If I use the following query


my_ip_address:8080/solr/select?indent=on&version=2.2&fq=t_title:"Find"&start=0&rows=10&fl=t_title

I can see it returning results. The problem is when I change the search
keywords for one of my fields (e.g. t_title) to Chinese characters: it
always shows an empty result list. It is strange, because if a title
contains both Chinese and English (e.g. "testing ??"), and I search just
the English part (e.g. fq=t_title:"testing"), I find the result perfectly.
The problem only happens when searching Chinese characters.
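One thing worth double-checking in this situation (a sketch, using the hypothetical Traditional Chinese term 圖書館 rather than the characters lost from the original mail): whether the Chinese query terms actually reach Solr as UTF-8 percent-escapes, since the servlet container must then decode the URI as UTF-8 as well.

```python
from urllib.parse import quote, urlencode

# Build the same kind of query string as above, with a Chinese title term.
params = {
    "indent": "on",
    "version": "2.2",
    "fq": 't_title:"圖書館"',
    "start": "0",
    "rows": "10",
    "fl": "t_title",
}
query = urlencode(params)  # percent-encodes as UTF-8 by default in Python 3

# Each Chinese character becomes three %XX escapes for its three UTF-8 bytes.
print(quote("圖書館"))  # %E5%9C%96%E6%9B%B8%E9%A4%A8
```

If the client sends raw bytes in some other encoding, or the container decodes the escapes as ISO-8859-1, the terms never match the index even though analysis looks fine.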



I would much appreciate it if you could show me which part I did wrong.

Thanks

Wayne

*My Settings:*
Java : 1.6.0_24
Solr : version 3.6.1
tomcat: version 6.0.35

*My schema.xml* (the parts below are the ones I changed from the default)

<fieldType name="text" class="solr.TextField" stored="true" multiValued="true">
  <analyzer type="index" class="org.apache.lucene.analysis.cjk.CJKAnalyzer">
    <tokenizer class="org.apache.lucene.analysis.cjk.CJKTokenizer"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"
            protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <filter class="schema.UnicodeNormalizationFilterFactory" version="icu4j"
            composed="false" remove_diacritics="true" remove_modifiers="true"
            fold="true"/>
  </analyzer>
  <analyzer type="query" class="org.apache.lucene.analysis.cjk.CJKAnalyzer">
    <tokenizer class="org.apache.lucene.analysis.cjk.CJKTokenizer"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="0" catenateNumbers="0"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"
            protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <filter class="schema.UnicodeNormalizationFilterFactory" version="icu4j"
            composed="false" remove_diacritics="true" remove_modifiers="true"
            fold="true"/>
  </analyzer>
</fieldType>

(The field, dynamicField, and copyField declarations that followed lost their
tag names in the list archive; only attributes such as stored="true",
required="true", sortMissingLast="true", omitNorms="true", termVectors="true",
and autoGeneratePhraseQueries="false" survive.)

<uniqueKey>id</uniqueKey>
 




SmartChineseAnalyzer

2011-12-09 Thread waynelam

Hi all,

I checked the documentation of SmartChineseAnalyzer, and it looks like it is
for Simplified Chinese only. Has anyone tried to include Traditional Chinese
characters as well? Since the analyzer is based on a dictionary from
ICTCLAS 1.0, my first thought is that maybe I can get it to work by simply
converting the whole dictionary to Traditional Chinese?

By the way, I checked the official ICTCLAS website, and it seems the newest
version of the Java library supports GB2312, GBK, UTF-8 and Big5.

So can I expect a roadmap for SmartChineseAnalyzer to support Big5 later?
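As a side note on those encodings (an illustrative sketch, not from the thread, using the hypothetical word 圖書館): GB2312 covers essentially only Simplified forms, while Big5 covers Traditional ones, which is why dictionary coverage and encoding support are separate questions.

```python
s = "圖書館"  # a Traditional Chinese word ("library"), chosen for illustration

big5_bytes = s.encode("big5")  # Big5 can represent Traditional characters
print(len(big5_bytes))         # 6: two bytes per character in Big5

try:
    s.encode("gb2312")         # GB2312 lacks these Traditional forms
except UnicodeEncodeError:
    print("not representable in GB2312")
```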



Any hints would be much appreciated.



Regards,

Wayne


Re: Searching in Traditional / Simplified Chinese Record

2011-06-20 Thread waynelam

By "changing the record", I mean translating them word by word using software.
Sorry, I am new to this kind of modification. For the synonyms filter, would
there be a big lookup table, resulting in degraded indexing performance?

I have tried using a filter like ICUTransformFilterFactory, but it does not
seem to work:




<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
        generateNumberParts="1" catenateWords="1" catenateNumbers="1"
        catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.StopFilterFactory" words="stopwords.txt"
        enablePositionIncrements="true"/>
<filter class="solr.SnowballPorterFilterFactory" protected="protwords.txt"/>
<filter class="schema.UnicodeNormalizationFilterFactory" version="icu4j"
        composed="false" remove_diacritics="true" remove_modifiers="true"
        fold="true"/>





Am I setting it wrong?


Regards,

Wayne



On 6/21/2011 2:30 AM, François Schiettecatte wrote:

Wayne

I am not sure what you mean by 'changing the record'.

One option would be to implement something like the synonyms filter to generate 
the TC for SC when you index the document, which would index both the TC and 
the SC in the same location. That way your users would be able to search with 
either TC or SC.

Another option would be to use the same synonyms filter but do the expansion at 
search time.

Cheers

François
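To make the index-time option concrete (a hypothetical sketch; the file name and the word pair 圖書館/图书馆 are illustrations, not from the thread), a synonyms filter in the index analyzer chain could map each Traditional form onto its Simplified form so both are indexed at the same position:

```text
# synonyms_tc_sc.txt (hypothetical): Traditional and Simplified forms
# of the same word, indexed into the same positions
圖書館,图书馆
繁體,繁体

# referenced from schema.xml inside <analyzer type="index">:
#   <filter class="solr.SynonymFilterFactory" synonyms="synonyms_tc_sc.txt"
#           ignoreCase="true" expand="true"/>
```

With expand="true" at index time, a query in either script matches the document; doing the same expansion only at query time is the lighter-weight alternative François mentions.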






--
-
Wayne Lam
Assistant Library Officer I
Systems Development&  Support
Fong Sum Wood Library
Lingnan University
8 Castle Peak Road
Tuen Mun, New Territories
Hong Kong SAR
China
Phone:   +852 26168585
Email:   wayne...@ln.edu.hk
Website: http://www.library.ln.edu.hk



Searching in Traditional / Simplified Chinese Record

2011-06-20 Thread waynelam

Hi,

I have recently made changes to my schema.xml to support importing Chinese
records. What I want to do is search both Traditional Chinese (TC) (e.g. ??)
and Simplified Chinese (SC) (e.g. ??) records in the same query. I know I can
do that by converting all SC records to TC, but I want to change the way I
index rather than change the records.

I would much appreciate it if anyone could show me the way.


Thanks

Wayne


--
-
Wayne Lam
Assistant Library Officer I
Systems Development&  Support
Fong Sum Wood Library
Lingnan University
8 Castle Peak Road
Tuen Mun, New Territories
Hong Kong SAR
China
Phone:   +852 26168585
Email:   wayne...@ln.edu.hk
Website: http://www.library.ln.edu.hk