Re: Tokenizer and Filter Factory to index Chinese characters

Zheng Lin Edwin Yeo Mon, 06 Jul 2015 03:31:38 -0700

Yes, I tried that as well, but I ran into compatibility issues with Solr
5.2.1, as the libs that I found and downloaded seem to be built for Solr 3.x
versions.


I got the following error when I tried to start Solr with Paoding
configured:

java.lang.VerifyError: class net.paoding.analysis.analyzer.PaodingAnalyzerBean overrides final method tokenStream.(Ljava/lang/String;Ljava/io/Reader;)Lorg/apache/lucene/analysis/TokenStream;
        at java.lang.ClassLoader.defineClass1(Native Method)
        at java.lang.ClassLoader.defineClass(Unknown Source)
        at java.security.SecureClassLoader.defineClass(Unknown Source)
        at java.net.URLClassLoader.defineClass(Unknown Source)
        at java.net.URLClassLoader.access$100(Unknown Source)
        at java.net.URLClassLoader$1.run(Unknown Source)
        at java.net.URLClassLoader$1.run(Unknown Source)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(Unknown Source)
        at org.eclipse.jetty.webapp.WebAppClassLoader.loadClass(WebAppClassLoader.java:421)
        at org.eclipse.jetty.webapp.WebAppClassLoader.loadClass(WebAppClassLoader.java:383)
        at java.lang.ClassLoader.defineClass1(Native Method)
        at java.lang.ClassLoader.defineClass(Unknown Source)
        at java.security.SecureClassLoader.defineClass(Unknown Source)
        at java.net.URLClassLoader.defineClass(Unknown Source)
        at java.net.URLClassLoader.access$100(Unknown Source)
        at java.net.URLClassLoader$1.run(Unknown Source)
        at java.net.URLClassLoader$1.run(Unknown Source)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(Unknown Source)
        at org.eclipse.jetty.webapp.WebAppClassLoader.loadClass(WebAppClassLoader.java:421)
        at java.lang.ClassLoader.loadClass(Unknown Source)
        at java.net.FactoryURLClassLoader.loadClass(Unknown Source)
        at java.lang.ClassLoader.loadClass(Unknown Source)
        at java.net.FactoryURLClassLoader.loadClass(Unknown Source)
        at java.lang.ClassLoader.loadClass(Unknown Source)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Unknown Source)
        at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:476)
        at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:423)
        at org.apache.solr.schema.FieldTypePluginLoader.readAnalyzer(FieldTypePluginLoader.java:262)
        at org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:94)
        at org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:42)
        at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:151)
        at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:489)
        at org.apache.solr.schema.IndexSchema.<init>(IndexSchema.java:175)
        at org.apache.solr.schema.IndexSchemaFactory.create(IndexSchemaFactory.java:55)
        at org.apache.solr.schema.IndexSchemaFactory.buildIndexSchema(IndexSchemaFactory.java:69)
        at org.apache.solr.core.ConfigSetService.createIndexSchema(ConfigSetService.java:102)
        at org.apache.solr.core.ConfigSetService.getConfig(ConfigSetService.java:74)
        at org.apache.solr.core.CoreContainer.create(CoreContainer.java:516)
        at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:283)
        at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:277)
        at java.util.concurrent.FutureTask.run(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.lang.Thread.run(Unknown Source)
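
Decoding the descriptor in the first line of the trace makes the error easier
to read. The short Python sketch below (the function names are my own and
purely illustrative) turns a JVM method descriptor into a Java-style
signature. For this trace it gives TokenStream tokenStream(String, Reader).
Analyzer.tokenStream is declared final in Lucene 4.0 and later, which would
explain why an analyzer compiled against the Lucene 3.x API fails
verification under Solr 5.2.1.

```python
# Sketch: decode a JVM method descriptor such as the one in the VerifyError.
# The helper names are illustrative; the descriptor grammar is the JVM's own.

PRIMITIVES = {
    "B": "byte", "C": "char", "D": "double", "F": "float", "I": "int",
    "J": "long", "S": "short", "Z": "boolean", "V": "void",
}


def parse_type(desc, i):
    """Parse one type starting at desc[i]; return (java_name, next_index)."""
    dims = 0
    while desc[i] == "[":          # each '[' adds one array dimension
        dims += 1
        i += 1
    if desc[i] == "L":             # object type: L<binary/class/name>;
        end = desc.index(";", i)
        name = desc[i + 1:end].replace("/", ".")
        i = end + 1
    else:                          # single-letter primitive or void
        name = PRIMITIVES[desc[i]]
        i += 1
    return name + "[]" * dims, i


def describe_method(name, descriptor):
    """Render 'name' plus '(params)return' as 'return name(params)'."""
    if not descriptor.startswith("("):
        raise ValueError("method descriptor must start with '('")
    i = 1
    params = []
    while descriptor[i] != ")":
        java_type, i = parse_type(descriptor, i)
        params.append(java_type)
    return_type, _ = parse_type(descriptor, i + 1)
    return f"{return_type} {name}({', '.join(params)})"


if __name__ == "__main__":
    print(describe_method(
        "tokenStream",
        "(Ljava/lang/String;Ljava/io/Reader;)"
        "Lorg/apache/lucene/analysis/TokenStream;",
    ))
```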



Regards,
Edwin


2015-07-06 16:37 GMT+08:00 davidphilip cherian <davidphilipcher...@gmail.com>:

> Hi Edwin,
>
> Have you tried the Paoding analyzer? It does not ship with Solr out of the
> box, so you may have to download it and add it to the Solr libs.
>
> https://stanbol.apache.org/docs/trunk/components/enhancer/nlp/paoding
>
>
>
> 2015-07-06 12:29 GMT+05:30 Zheng Lin Edwin Yeo <edwinye...@gmail.com>:
>
> > I'm now using solr.ICUTokenizerFactory, and searching for Chinese
> > characters works when I use the Query tab in the Solr Admin UI.
> >
> > The Admin UI percent-encodes the Chinese characters before putting them
> > in the URL, so it looks something like this:
> >
> >
> > http://localhost:8983/solr/chinese2/select?q=%E8%83%A1%E5%A7%AC%E8%8A%B1&wt=json&indent=true&hl=true
> >
> > "highlighting":{
> >
> >     "chinese5":{
> >
> >       "text":["园将办系列活动庆祝入遗 \n 从<em>胡姬花</em>展到音
> > 乐会，为庆祝申遗成功，植物园这个月起将举办一系列活动与公众一同庆贺。
> > 本月10日开始的“新加坡植物园<em>胡姬</em>及其文化遗产”展览，将展出1万
> > 6000株<em>胡姬花</em>，这是"]},
> >
> >     "chinese3":{
> >
> >       "text":[" \n 原版为 马来语 《Majulah Singapura》，中文译为《 前  进吧，新加坡 》。 \n  \n
> > \t  国花 \n 新加坡以一种名为 卓  锦  ·  万代  兰
> > 的<em>胡姬花</em>为国花。东南亚通称兰花为<em>胡姬花</em>"]}}}
> >
> >
> >
> > However, if I enter the Chinese characters directly into the URL, the
> > results I get are wrong.
> >
> > http://localhost:8983/solr/chinese2/select?q=胡姬花&hl=true&hl.fl=text
> >
> >
> >   "highlighting":{
> >
> >     "chinese1":{
> >
> >       "text":["1月份的制造业产值同比仅增长0 \n \n   新加坡 我国1月份的制造业产值同比仅增长<em>0.9</em>％。
> > 虽然制造业结束连续两个月的萎缩，但比经济师普遍预估的增长<em>3.3</em>％疲软得多。这也意味着，我国今年第一季度的经济很可能让人失望 \n
> > "]},
> >
> >     "chinese2":{
> >
> >       "text":["Zheng <em>Lin</em> <em>Yeo</em>"]},
> >
> >     "chinese3":{
> >
> >       "text":["Zheng <em>Lin</em> <em>Yeo</em>"]},
> >
> >     "chinese4":{
> >
> >       "text":["户只要订购《联合晚报》任一种配套，就可选择下列其中一项赠品带回家。 \n 签订两年配套的读者可获得一台价值
> > <em>199</em>元的Lenovo <em>TAB</em> 2 A7-10七寸平板电脑，或者一架价值<em>249</em>元的Philips
> > Viva"]},
> >
> >     "chinese5":{
> >
> >       "text":["Zheng <em>Lin</em> <em>Yeo</em>"]}}}
> >
> >
> >
> > Why is this so?
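
For comparison, the encoded form that the Admin UI produces can be reproduced
outside the browser. The sketch below uses Python's standard urllib.parse.
The core name and parameters are copied from the URLs above, and
build_select_url is a hypothetical helper. It is only an assumption, not
something confirmed in this thread, that the raw characters typed into the
address bar are sent or decoded in an encoding other than UTF-8. The
MISDECODED value shows what the query would turn into if the same bytes were
read as ISO-8859-1.

```python
from urllib.parse import quote, unquote, urlencode

QUERY = "胡姬花"

# Percent-encode the query as UTF-8 (the default encoding for quote()).
ENCODED = quote(QUERY)


def build_select_url(base, params):
    """Hypothetical helper: append params to base, percent-encoded as UTF-8."""
    return base + "?" + urlencode(params)


# Assumed failure mode: the UTF-8 bytes are decoded as ISO-8859-1 instead.
MISDECODED = unquote(ENCODED, encoding="latin-1")

if __name__ == "__main__":
    print(ENCODED)
    print(build_select_url(
        "http://localhost:8983/solr/chinese2/select",
        {"q": QUERY, "hl": "true", "hl.fl": "text"},
    ))
    print(repr(MISDECODED))
```

Sending the already-encoded URL (for example with curl or from client code)
takes the browser's own encoding choice out of the picture.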
> >
> >
> > Regards,
> >
> > Edwin
> >
> >
> > 2015-06-25 18:54 GMT+08:00 Markus Jelsma <markus.jel...@openindex.io>:
> >
> > > You may also want to try Paoding if you have enough time to spend:
> > > https://github.com/cslinmiso/paoding-analysis
> > >
> > > -----Original message-----
> > > > From:Zheng Lin Edwin Yeo <edwinye...@gmail.com>
> > > > Sent: Thursday 25th June 2015 11:38
> > > > To: solr-user@lucene.apache.org
> > > > Subject: Re: Tokenizer and Filter Factory to index Chinese characters
> > > >
> > > > Hi, the results don't seem that good either. But you're not using the
> > > > HMMChineseTokenizerFactory?
> > > >
> > > > The output below is from the filters you've shown me.
> > > >
> > > >   "highlighting":{
> > > >     "chinese1":{
> > > >       "id":["chinese1"],
> > > >       "title":["<em>我国</em>1<em>月份的制造业产值同比仅增长</em>0"],
> > > >       "content":["，<em>但比经济师普遍预估的增长</em>3.3％<em>疲软得多</em>。<em>这也意味着</em>，<em>我国今年第一季度的经济很可能让人失望</em>
> > > > \n  "],
> > > >       "author":["<em>Edwin</em>"]},
> > > >     "chinese2":{
> > > >       "id":["chinese2"],
> > > >       "content":["<em>铜牌</em>，<em>让我国暂时高居奖牌荣誉榜榜首</em>。
> > > > <em>你看好新加坡在本届的东运会中</em>，<em>会夺得多少面金牌</em>？
> > > > <em>请在</em>6月<em>12</em><em>日中午前</em>，<em>投票并留言为我国健将寄上祝语吧</em>  \n
> > > > "],
> > > >       "author":["<em>Edwin</em>"]},
> > > >     "chinese3":{
> > > >       "id":["chinese3"],
> > > >       "content":[")<em>组成的我国女队在今天的东运会保龄球女子三人赛中</em>，
> > > > <em>以六局</em>3963<em>总瓶分夺冠</em>，<em>为新加坡赢得本届赛会第三枚金牌</em>。<em>队友陈诗桦</em>（Jazreel)、<em>梁蕙芬和陈诗静以</em>3707<em>总瓶分获得亚军</em>，<em>季军归菲律宾女队</em>。（<em>联合早报记者</em>：<em>郭嘉惠</em>)
> > > > \n  "],
> > > >       "author":["<em>Edwin</em>"]},
> > > >     "chinese4":{
> > > >       "id":["chinese4"],
> > > >       "content":["，<em>则可获得一架价值</em>309<em>元的</em>Philips Viva
> > > > Collection HD9045<em>面包机</em>。 \n
> > > > <em>欲订从速</em>，<em>读者可登陆</em>www.wbsub.com.sg，<em>或拨打客服专线</em>6319
> > > > 1800<em>订购</em>。 \n
> > > > <em>此外</em>，<em>一年一度的晚报保健美容展</em>，<em>将在本月</em><em>23</em><em>日和</em><em>24</em>日，<em>在新达新加坡会展中心</em>401、402<em>展厅举行</em>。
> > > > \n
> > > > <em>现场将开设</em>《<em>联合晚报</em>》<em>订阅展摊</em>，<em>读者当场订阅晚报</em>，<em>除了可获得丰厚的赠品</em>，<em>还有机会参与</em>“<em>必胜</em>”<em>幸运抽奖</em>"],
> > > >       "author":["<em>Edwin</em>"]}}}
> > > >
> > > >
> > > > Regards,
> > > > Edwin
> > > >
> > > >
> > > > 2015-06-25 17:28 GMT+08:00 Markus Jelsma <markus.jel...@openindex.io>:
> > > >
> > > > > Hi - we are actually using some other filters for Chinese, although
> > > > > they are not specialized for Chinese:
> > > > >
> > > > >         <tokenizer class="solr.StandardTokenizerFactory"/>
> > > > >         <filter class="solr.CJKWidthFilterFactory"/>
> > > > >         <filter class="solr.LowerCaseFilterFactory"/>
> > > > >         <filter class="solr.CJKBigramFilterFactory"/>
> > > > >
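
As a rough illustration of what the bigram step in the chain above does to
Chinese text, here is a simplified Python sketch. It is not Lucene's
implementation: cjk_bigrams is a made-up name, and it only looks at the basic
CJK Unified Ideographs range, whereas CJKBigramFilter also handles other
scripts and is configurable. The idea it shows is the same, though: each run
of adjacent Han characters becomes overlapping two-character tokens, and a
lone character is kept as a single-character token.

```python
import re

# Runs of Han characters (basic CJK Unified Ideographs block only; this
# restriction is a simplification made for the sketch).
HAN_RUN = re.compile(r"[\u4e00-\u9fff]+")


def cjk_bigrams(text):
    """Return overlapping two-character tokens for each run of Han characters."""
    tokens = []
    for run in HAN_RUN.findall(text):
        if len(run) == 1:
            tokens.append(run)  # a lone character stays a unigram
        else:
            tokens.extend(run[i:i + 2] for i in range(len(run) - 1))
    return tokens


if __name__ == "__main__":
    print(cjk_bigrams("胡姬花"))
    print(cjk_bigrams("我国1月份的"))
```

With index and query text both split this way, a two-character query such as
我国 is matched as one token rather than as two independent characters.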
> > > > >
> > > > > -----Original message-----
> > > > > > From:Zheng Lin Edwin Yeo <edwinye...@gmail.com>
> > > > > > Sent: Thursday 25th June 2015 11:24
> > > > > > To: solr-user@lucene.apache.org
> > > > > > Subject: Re: Tokenizer and Filter Factory to index Chinese characters
> > > > > >
> > > > > > Thank you.
> > > > > >
> > > > > > I've tried that, but when I do a search, it returns many more
> > > > > > highlighted results than it is supposed to.
> > > > > >
> > > > > > For example, if I enter the following query:
> > > > > > http://localhost:8983/solr/chinese1/highlight?q=我国
> > > > > >
> > > > > > I get the following results:
> > > > > >
> > > > > > "highlighting":{
> > > > > >     "chinese1":{
> > > > > >       "id":["chinese1"],
> > > > > >       "title":["<em>我国</em>1<em>月份</em>的制造业<em>产值</em><em>同比</em>仅<em>增长</em>0"],
> > > > > >       "content":["<em>结束</em><em>连续</em>两个月的<em>萎缩</em>，但比经济师<em>普遍</em><em>预估</em>的<em>增长</em>3.3％<em>疲软</em>得多。这也意味着，<em>我国</em><em>今年</em><em>第一</em><em>季度</em>的<em>经济</em>很<em>可能</em>让人<em>失望</em>
> > > > > > \n  "],
> > > > > >       "author":["<em>Edwin</em>"]},
> > > > > >     "chinese2":{
> > > > > >       "id":["chinese2"],
> > > > > >       "content":["<em>铜牌</em>，让<em>我国</em><em>暂时</em><em>高居</em><em>奖牌</em><em>荣誉</em>榜<em>榜首</em>。
> > > > > > 你看好新加坡在本届的东运会中，会<em>夺得</em><em>多少</em>面<em>金牌</em>？
> > > > > > 请在6月<em>12</em>日<em>中午</em>前，<em>投票</em>并<em>留言</em>为<em>我国</em><em>健将</em>寄上<em>祝语</em>吧
> > > > > >  \n  "],
> > > > > >       "author":["<em>Edwin</em>"]},
> > > > > >     "chinese3":{
> > > > > >       "id":["chinese3"],
> > > > > >       "content":[")<em>组成</em>的<em>我国</em><em>女队</em>在<em>今天</em>的东运会保龄球<em>女子</em>三人赛中，
> > > > > > 以六局3963总瓶分<em>夺冠</em>，为新加坡<em>赢得</em><em>本届</em><em>赛会</em>第三枚<em>金牌</em>。<em>队友</em>陈诗桦（Jazreel)、梁蕙芬和陈诗静以3707总瓶分<em>获得</em><em>亚军</em>，<em>季军</em>归菲律宾<em>女队</em>。（<em>联合</em><em>早报</em><em>记者</em>：郭嘉惠)
> > > > > > \n  "],
> > > > > >       "author":["<em>Edwin</em>"]},
> > > > > >     "chinese4":{
> > > > > >       "id":["chinese4"],
> > > > > >       "content":["<em>配套</em>的<em>读者</em>，则可<em>获得</em>一架<em>价值</em>309元的Philips
> > > > > > Viva Collection <em>HD</em>9045面<em>包机</em>。 \n
> > > > > > 欲订从速，<em>读者</em>可<em>登陆</em>www.wbsub.com.<em>sg</em>，或拨打客服<em>专线</em>6319
> > > > > > 1800<em>订购</em>。 \n
> > > > > > <em>此外</em>，一年一度的<em>晚报</em><em>保健</em><em>美容</em>展，将在<em>本月</em><em>23</em>日和<em>24</em>日，在新达新加坡<em>会展</em><em>中心</em>401、402<em>展厅</em><em>举行</em>。
> > > > > > \n
> > > > > > <em>现场</em>将<em>开设</em>《<em>联合</em><em>晚报</em>》<em>订阅</em>展摊，<em>读者</em><em>当场</em><em>订阅</em><em>晚报</em>，<em>除了</em>可<em>获得</em><em>丰厚</em>的<em>赠品</em>，还有<em>机会</em><em>参与</em>“"],
> > > > > >       "author":["<em>Edwin</em>"]}}}
> > > > > >
> > > > > >
> > > > > > Is there any suitable filter factory to solve this issue?
> > > > > >
> > > > > > I've tried WordDelimiterFilterFactory, PorterStemFilterFactory
> > > > > > and StopFilterFactory, but there's no improvement in the search
> > > > > > results.
> > > > > >
> > > > > >
> > > > > > Regards,
> > > > > > Edwin
> > > > > >
> > > > > >
> > > > > > On 25 June 2015 at 17:17, Markus Jelsma <markus.jel...@openindex.io> wrote:
> > > > > >
> > > > > > > Hello - you can use HMMChineseTokenizerFactory instead.
> > > > > > >
> > > > > > > http://lucene.apache.org/core/5_2_0/analyzers-smartcn/org/apache/lucene/analysis/cn/smart/HMMChineseTokenizerFactory.html
> > > > > > >
> > > > > > > -----Original message-----
> > > > > > > > From:Zheng Lin Edwin Yeo <edwinye...@gmail.com>
> > > > > > > > Sent: Thursday 25th June 2015 11:02
> > > > > > > > To: solr-user@lucene.apache.org
> > > > > > > > Subject: Tokenizer and Filter Factory to index Chinese characters
> > > > > > > >
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > Does anyone know the correct replacement for these 2 tokenizer and
> > > > > > > > filter factories to index Chinese into Solr?
> > > > > > > > - SmartChineseSentenceTokenizerFactory
> > > > > > > > - SmartChineseWordTokenFilterFactory
> > > > > > > >
> > > > > > > > I understand that these 2 tokenizer and filter factories are already
> > > > > > > > deprecated in Solr 5.1, but I can't seem to find the correct replacement.
> > > > > > > >
> > > > > > > >
> > > > > > > > <fieldType name="text_smartcn" class="solr.TextField" positionIncrementGap="0">
> > > > > > > >   <analyzer type="index">
> > > > > > > >     <tokenizer class="org.apache.lucene.analysis.cn.smart.SmartChineseSentenceTokenizerFactory"/>
> > > > > > > >     <filter class="org.apache.lucene.analysis.cn.smart.SmartChineseWordTokenFilterFactory"/>
> > > > > > > >   </analyzer>
> > > > > > > >   <analyzer type="query">
> > > > > > > >     <tokenizer class="org.apache.lucene.analysis.cn.smart.SmartChineseSentenceTokenizerFactory"/>
> > > > > > > >     <filter class="org.apache.lucene.analysis.cn.smart.SmartChineseWordTokenFilterFactory"/>
> > > > > > > >   </analyzer>
> > > > > > > > </fieldType>
> > > > > > > >
> > > > > > > > Thank you.
> > > > > > > >
> > > > > > > >
> > > > > > > > Regards,
> > > > > > > > Edwin
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
