Hi,

The result doesn't seem that good either. But you're not using the HMMChineseTokenizerFactory?
The output below is from the filters you've shown me.

"highlighting":{
  "chinese1":{
    "id":["chinese1"],
    "title":["<em>我国</em>1<em>月份的制造业产值同比仅增长</em>0"],
    "content":[",<em>但比经济师普遍预估的增长</em>3.3%<em>疲软得多</em>。<em>这也意味着</em>,<em>我国今年第一季度的经济很可能让人失望</em> \n "],
    "author":["<em>Edwin</em>"]},
  "chinese2":{
    "id":["chinese2"],
    "content":["<em>铜牌</em>,<em>让我国暂时高居奖牌荣誉榜榜首</em>。 <em>你看好新加坡在本届的东运会中</em>,<em>会夺得多少面金牌</em>? <em>请在</em>6月<em>12</em><em>日中午前</em>,<em>投票并留言为我国健将寄上祝语吧</em> \n "],
    "author":["<em>Edwin</em>"]},
  "chinese3":{
    "id":["chinese3"],
    "content":[")<em>组成的我国女队在今天的东运会保龄球女子三人赛中</em>, <em>以六局</em>3963<em>总瓶分夺冠</em>,<em>为新加坡赢得本届赛会第三枚金牌</em>。<em>队友陈诗桦</em>(Jazreel)、<em>梁蕙芬和陈诗静以</em>3707<em>总瓶分获得亚军</em>,<em>季军归菲律宾女队</em>。(<em>联合早报记者</em>:<em>郭嘉惠</em>) \n "],
    "author":["<em>Edwin</em>"]},
  "chinese4":{
    "id":["chinese4"],
    "content":[",<em>则可获得一架价值</em>309<em>元的</em>Philips Viva Collection HD9045<em>面包机</em>。 \n <em>欲订从速</em>,<em>读者可登陆</em>www.wbsub.com.sg,<em>或拨打客服专线</em>6319 1800<em>订购</em>。 \n <em>此外</em>,<em>一年一度的晚报保健美容展</em>,<em>将在本月</em><em>23</em><em>日和</em><em>24</em>日,<em>在新达新加坡会展中心</em>401、402<em>展厅举行</em>。 \n <em>现场将开设</em>《<em>联合晚报</em>》<em>订阅展摊</em>,<em>读者当场订阅晚报</em>,<em>除了可获得丰厚的赠品</em>,<em>还有机会参与</em>“<em>必胜</em>”<em>幸运抽奖</em>"],
    "author":["<em>Edwin</em>"]}}}

Regards,
Edwin

2015-06-25 17:28 GMT+08:00 Markus Jelsma <markus.jel...@openindex.io>:

> Hi - we are actually using some other filters for Chinese, although they
> are not specialized for Chinese:
>
> <tokenizer class="solr.StandardTokenizerFactory"/>
> <filter class="solr.CJKWidthFilterFactory"/>
> <filter class="solr.LowerCaseFilterFactory"/>
> <filter class="solr.CJKBigramFilterFactory"/>
>
> -----Original message-----
> > From:Zheng Lin Edwin Yeo <edwinye...@gmail.com>
> > Sent: Thursday 25th June 2015 11:24
> > To: solr-user@lucene.apache.org
> > Subject: Re: Tokenizer and Filter Factory to index Chinese characters
> >
> > Thank you.
> >
> > I've tried that, but when I do a search, it's returning many more
> > highlighted results than it is supposed to.
> >
> > For example, if I enter the following query:
> > http://localhost:8983/solr/chinese1/highlight?q=我国
> >
> > I get the following results:
> >
> > "highlighting":{
> >   "chinese1":{
> >     "id":["chinese1"],
> >     "title":["<em>我国</em>1<em>月份</em>的制造业<em>产值</em><em>同比</em>仅<em>增长</em>0"],
> >     "content":["<em>结束</em><em>连续</em>两个月的<em>萎缩</em>,但比经济师<em>普遍</em><em>预估</em>的<em>增长</em>3.3%<em>疲软</em>得多。这也意味着,<em>我国</em><em>今年</em><em>第一</em><em>季度</em>的<em>经济</em>很<em>可能</em>让人<em>失望</em> \n "],
> >     "author":["<em>Edwin</em>"]},
> >   "chinese2":{
> >     "id":["chinese2"],
> >     "content":["<em>铜牌</em>,让<em>我国</em><em>暂时</em><em>高居</em><em>奖牌</em><em>荣誉</em>榜<em>榜首</em>。 你看好新加坡在本届的东运会中,会<em>夺得</em><em>多少</em>面<em>金牌</em>? 请在6月<em>12</em>日<em>中午</em>前,<em>投票</em>并<em>留言</em>为<em>我国</em><em>健将</em>寄上<em>祝语</em>吧 \n "],
> >     "author":["<em>Edwin</em>"]},
> >   "chinese3":{
> >     "id":["chinese3"],
> >     "content":[")<em>组成</em>的<em>我国</em><em>女队</em>在<em>今天</em>的东运会保龄球<em>女子</em>三人赛中, 以六局3963总瓶分<em>夺冠</em>,为新加坡<em>赢得</em><em>本届</em><em>赛会</em>第三枚<em>金牌</em>。<em>队友</em>陈诗桦(Jazreel)、梁蕙芬和陈诗静以3707总瓶分<em>获得</em><em>亚军</em>,<em>季军</em>归菲律宾<em>女队</em>。(<em>联合</em><em>早报</em><em>记者</em>:郭嘉惠) \n "],
> >     "author":["<Edwin"]},
> >   "chinese4":{
> >     "id":["chinese4"],
> >     "content":["<em>配套</em>的<em>读者</em>,则可<em>获得</em>一架<em>价值</em>309元的Philips Viva Collection <em>HD</em>9045面<em>包机</em>。 \n 欲订从速,<em>读者</em>可<em>登陆</em>www.wbsub.com.<em>sg</em>,或拨打客服<em>专线</em>6319 1800<em>订购</em>。 \n <em>此外</em>,一年一度的<em>晚报</em><em>保健</em><em>美容</em>展,将在<em>本月</em><em>23</em>日和<em>24</em>日,在新达新加坡<em>会展</em><em>中心</em>401、402<em>展厅</em><em>举行</em>。 \n <em>现场</em>将<em>开设</em>《<em>联合</em><em>晚报</em>》<em>订阅</em>展摊,<em>读者</em><em>当场</em><em>订阅</em><em>晚报</em>,<em>除了</em>可<em>获得</em><em>丰厚</em>的<em>赠品</em>,还有<em>机会</em><em>参与</em>“"],
> >     "author":["<em>Edwin</em>"]}}}
> >
> > Is there any suitable filter factory to solve this issue?
> >
> > I've tried WordDelimiterFilterFactory, PorterStemFilterFactory
> > and StopFilterFactory, but there's no improvement in the search results.
> >
> > Regards,
> > Edwin
> >
> > On 25 June 2015 at 17:17, Markus Jelsma <markus.jel...@openindex.io> wrote:
> >
> > > Hello - you can use HMMChineseTokenizerFactory instead.
> > >
> > > http://lucene.apache.org/core/5_2_0/analyzers-smartcn/org/apache/lucene/analysis/cn/smart/HMMChineseTokenizerFactory.html
> > >
> > > -----Original message-----
> > > > From:Zheng Lin Edwin Yeo <edwinye...@gmail.com>
> > > > Sent: Thursday 25th June 2015 11:02
> > > > To: solr-user@lucene.apache.org
> > > > Subject: Tokenizer and Filter Factory to index Chinese characters
> > > >
> > > > Hi,
> > > >
> > > > Does anyone know the correct replacement for these two tokenizer and
> > > > filter factories to index Chinese into Solr?
> > > > - SmartChineseSentenceTokenizerFactory
> > > > - SmartChineseWordTokenFilterFactory
> > > >
> > > > I understand that these two factories are already deprecated in
> > > > Solr 5.1, but I can't seem to find the correct replacement.
> > > >
> > > > <fieldType name="text_smartcn" class="solr.TextField"
> > > > positionIncrementGap="0">
> > > >   <analyzer type="index">
> > > >     <tokenizer class="org.apache.lucene.analysis.cn.smart.SmartChineseSentenceTokenizerFactory"/>
> > > >     <filter class="org.apache.lucene.analysis.cn.smart.SmartChineseWordTokenFilterFactory"/>
> > > >   </analyzer>
> > > >   <analyzer type="query">
> > > >     <tokenizer class="org.apache.lucene.analysis.cn.smart.SmartChineseSentenceTokenizerFactory"/>
> > > >     <filter class="org.apache.lucene.analysis.cn.smart.SmartChineseWordTokenFilterFactory"/>
> > > >   </analyzer>
> > > > </fieldType>
> > > >
> > > > Thank you.
> > > >
> > > > Regards,
> > > > Edwin
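
For reference, Markus's suggestion applied to the field type from the original question might look roughly like the sketch below. It is not a tested configuration. The field type name "text_hmmcn" is hypothetical, the class name is taken from the Javadoc URL quoted in the thread, and it assumes the smartcn analyzer jar is already on Solr's classpath. A single <analyzer> element with no type attribute is used for both indexing and querying.

```xml
<!-- Sketch only: replaces the deprecated SmartChineseSentenceTokenizerFactory +
     SmartChineseWordTokenFilterFactory pair with the one tokenizer suggested
     in the thread. "text_hmmcn" is a made-up name for illustration. -->
<fieldType name="text_hmmcn" class="solr.TextField" positionIncrementGap="0">
  <!-- No type attribute: the same analyzer applies at index and query time. -->
  <analyzer>
    <tokenizer class="org.apache.lucene.analysis.cn.smart.HMMChineseTokenizerFactory"/>
  </analyzer>
</fieldType>
```

Whether this changes the over-highlighting reported in the thread is not something the sketch can show. It only illustrates the tokenizer swap.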