RE: Tokenizer and Filter Factory to index Chinese characters
Yes, but it is a small change :)

M.

-----Original message-----
From: Zheng Lin Edwin Yeo edwinye...@gmail.com
Sent: Tuesday 7th July 2015 4:50
To: solr-user@lucene.apache.org
Subject: Re: Tokenizer and Filter Factory to index Chinese characters

So we have to recompile the analysers ourselves before we can use them in 5.x?

Regards,
Edwin

On 6 July 2015 at 18:44, Markus Jelsma markus.jel...@openindex.io wrote:

Yes, analyzers slightly changed since 5.x.
https://issues.apache.org/jira/browse/LUCENE-5388

-----Original message-----
From: Zheng Lin Edwin Yeo edwinye...@gmail.com
Sent: Monday 6th July 2015 12:31
To: solr-user@lucene.apache.org
Subject: Re: Tokenizer and Filter Factory to index Chinese characters

Yes, I tried that also, but I faced some compatibility issues with Solr 5.2.1, as the libs that I found and downloaded seem to be for Solr 3.x versions.

I got the following error when I tried to start Solr with Paoding configured:

java.lang.VerifyError: class net.paoding.analysis.analyzer.PaodingAnalyzerBean overrides final method tokenStream.(Ljava/lang/String;Ljava/io/Reader;)Lorg/apache/lucene/analysis/TokenStream;
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClass(Unknown Source)
    at java.security.SecureClassLoader.defineClass(Unknown Source)
    at java.net.URLClassLoader.defineClass(Unknown Source)
    at java.net.URLClassLoader.access$100(Unknown Source)
    at java.net.URLClassLoader$1.run(Unknown Source)
    at java.net.URLClassLoader$1.run(Unknown Source)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(Unknown Source)
    at org.eclipse.jetty.webapp.WebAppClassLoader.loadClass(WebAppClassLoader.java:421)
    at org.eclipse.jetty.webapp.WebAppClassLoader.loadClass(WebAppClassLoader.java:383)
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClass(Unknown Source)
    at java.security.SecureClassLoader.defineClass(Unknown Source)
    at java.net.URLClassLoader.defineClass(Unknown Source)
    at java.net.URLClassLoader.access$100(Unknown Source)
    at java.net.URLClassLoader$1.run(Unknown Source)
    at java.net.URLClassLoader$1.run(Unknown Source)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(Unknown Source)
    at org.eclipse.jetty.webapp.WebAppClassLoader.loadClass(WebAppClassLoader.java:421)
    at java.lang.ClassLoader.loadClass(Unknown Source)
    at java.net.FactoryURLClassLoader.loadClass(Unknown Source)
    at java.lang.ClassLoader.loadClass(Unknown Source)
    at java.net.FactoryURLClassLoader.loadClass(Unknown Source)
    at java.lang.ClassLoader.loadClass(Unknown Source)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Unknown Source)
    at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:476)
    at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:423)
    at org.apache.solr.schema.FieldTypePluginLoader.readAnalyzer(FieldTypePluginLoader.java:262)
    at org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:94)
    at org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:42)
    at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:151)
    at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:489)
    at org.apache.solr.schema.IndexSchema.init(IndexSchema.java:175)
    at org.apache.solr.schema.IndexSchemaFactory.create(IndexSchemaFactory.java:55)
    at org.apache.solr.schema.IndexSchemaFactory.buildIndexSchema(IndexSchemaFactory.java:69)
    at org.apache.solr.core.ConfigSetService.createIndexSchema(ConfigSetService.java:102)
    at org.apache.solr.core.ConfigSetService.getConfig(ConfigSetService.java:74)
    at org.apache.solr.core.CoreContainer.create(CoreContainer.java:516)
    at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:283)
    at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:277)
    at java.util.concurrent.FutureTask.run(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)

Regards,
Edwin

2015-07-06 16:37 GMT+08:00 davidphilip cherian davidphilipcher...@gmail.com:

Hi Edwin,

Have you tried the Paoding analyzer? It is not shipped out of the box with the Solr jars. You may have to download it and add it to the Solr libs.

https
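
The VerifyError quoted above says that the Paoding class overrides a method which the Lucene version on the classpath declares final, which is consistent with the jar having been compiled against an older API. As a rough illustration only, the sketch below uses reflection to ask whether a method is final. `BaseAnalyzer` is a hypothetical stand-in written for this example and is not Lucene's `Analyzer`; `FinalMethodCheck` and `isFinalMethod` are likewise made-up names.

```java
import java.io.Reader;
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;

public class FinalMethodCheck {

    /** Hypothetical stand-in for a framework base class; NOT Lucene's Analyzer. */
    public abstract static class BaseAnalyzer {
        // Declared final, so a subclass compiled against an older, non-final
        // version of this method would be rejected when the JVM loads it.
        public final String tokenStream(String fieldName, Reader reader) {
            return "tokens for " + fieldName;
        }
    }

    /** Returns true if the public method with this name and parameter list is final. */
    public static boolean isFinalMethod(Class<?> type, String name, Class<?>... params) {
        try {
            Method method = type.getMethod(name, params);
            return Modifier.isFinal(method.getModifiers());
        } catch (NoSuchMethodException e) {
            throw new IllegalArgumentException("No such public method: " + name, e);
        }
    }

    public static void main(String[] args) {
        boolean locked = isFinalMethod(BaseAnalyzer.class, "tokenStream", String.class, Reader.class);
        System.out.println("tokenStream is final: " + locked);
    }
}
```

A check like this only tells you that the base method cannot be overridden; the fix suggested in the thread is still to rebuild the analyzer against the Lucene version that ships with the Solr release in use.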
RE: Tokenizer and Filter Factory to index Chinese characters
Yes, analyzers slightly changed since 5.x. https://issues.apache.org/jira/browse/LUCENE-5388 -Original message- From:Zheng Lin Edwin Yeo edwinye...@gmail.com Sent: Monday 6th July 2015 12:31 To: solr-user@lucene.apache.org Subject: Re: Tokenizer and Filter Factory to index Chinese characters Yes, I tried that also, but I faced some compatibility issues with Solr 5.2.1, as the libs that I found and downloaded seems to be for Solr 3.x versions. I got the following error when I tried to start Solr with Paoding configured: java.lang.VerifyError: class net.paoding.analysis.analyzer.PaodingAnalyzerBean overrides final method tokenStream.(Ljava/lang/String;Ljava/io/Reader;)Lorg/apache/lucene/analysis/TokenStream; at java.lang.ClassLoader.defineClass1(Native Method) at java.lang.ClassLoader.defineClass(Unknown Source) at java.security.SecureClassLoader.defineClass(Unknown Source) at java.net.URLClassLoader.defineClass(Unknown Source) at java.net.URLClassLoader.access$100(Unknown Source) at java.net.URLClassLoader$1.run(Unknown Source) at java.net.URLClassLoader$1.run(Unknown Source) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(Unknown Source) at org.eclipse.jetty.webapp.WebAppClassLoader.loadClass(WebAppClassLoader.java:421) at org.eclipse.jetty.webapp.WebAppClassLoader.loadClass(WebAppClassLoader.java:383) at java.lang.ClassLoader.defineClass1(Native Method) at java.lang.ClassLoader.defineClass(Unknown Source) at java.security.SecureClassLoader.defineClass(Unknown Source) at java.net.URLClassLoader.defineClass(Unknown Source) at java.net.URLClassLoader.access$100(Unknown Source) at java.net.URLClassLoader$1.run(Unknown Source) at java.net.URLClassLoader$1.run(Unknown Source) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(Unknown Source) at org.eclipse.jetty.webapp.WebAppClassLoader.loadClass(WebAppClassLoader.java:421) at java.lang.ClassLoader.loadClass(Unknown 
Source) at java.net.FactoryURLClassLoader.loadClass(Unknown Source) at java.lang.ClassLoader.loadClass(Unknown Source) at java.net.FactoryURLClassLoader.loadClass(Unknown Source) at java.lang.ClassLoader.loadClass(Unknown Source) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Unknown Source) at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:476) at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:423) at org.apache.solr.schema.FieldTypePluginLoader.readAnalyzer(FieldTypePluginLoader.java:262) at org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:94) at org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:42) at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:151) at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:489) at org.apache.solr.schema.IndexSchema.init(IndexSchema.java:175) at org.apache.solr.schema.IndexSchemaFactory.create(IndexSchemaFactory.java:55) at org.apache.solr.schema.IndexSchemaFactory.buildIndexSchema(IndexSchemaFactory.java:69) at org.apache.solr.core.ConfigSetService.createIndexSchema(ConfigSetService.java:102) at org.apache.solr.core.ConfigSetService.getConfig(ConfigSetService.java:74) at org.apache.solr.core.CoreContainer.create(CoreContainer.java:516) at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:283) at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:277) at java.util.concurrent.FutureTask.run(Unknown Source) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source) Regards, Edwin 2015-07-06 16:37 GMT+08:00 davidphilip cherian davidphilipcher...@gmail.com : Hi Edwin, Have you tried the Paoding analyzer? It is not out of the box shipped with Solr jars. You may have to download it and add it to solr libs. 
https://stanbol.apache.org/docs/trunk/components/enhancer/nlp/paoding 2015-07-06 12:29 GMT+05:30 Zheng Lin Edwin Yeo edwinye...@gmail.com: I'm now using the solr.ICUTokenizerFactory, and the searching for Chinese characters can work when I use the Query tab in Solr Admin UI. In the Admin UI, it converts the Chinese characters to code before passing it to the URL, so it looks something like this: http://localhost:8983/solr/chinese2/select?q=%E8%83%A1%E5%A7%AC%E8%8A%B1wt=jsonindent=truehl=true highlighting:{ chinese5:{ text:[园将办系列活动庆祝入遗 \n
Re: Tokenizer and Filter Factory to index Chinese characters
Hi Edwin, Have you tried the Paoding analyzer? It is not out of the box shipped with Solr jars. You may have to download it and add it to solr libs. https://stanbol.apache.org/docs/trunk/components/enhancer/nlp/paoding 2015-07-06 12:29 GMT+05:30 Zheng Lin Edwin Yeo edwinye...@gmail.com: I'm now using the solr.ICUTokenizerFactory, and the searching for Chinese characters can work when I use the Query tab in Solr Admin UI. In the Admin UI, it converts the Chinese characters to code before passing it to the URL, so it looks something like this: http://localhost:8983/solr/chinese2/select?q=%E8%83%A1%E5%A7%AC%E8%8A%B1wt=jsonindent=truehl=true highlighting:{ chinese5:{ text:[园将办系列活动庆祝入遗 \n 从em胡姬花/em展到音 乐会,为庆祝申遗成功,植物园这个月起将举办一系列活动与公众一同庆贺。 本月10日开始的“新加坡植物园em胡姬/em及其文化遗产”展览,将展出1万 6000株em胡姬花/em,这是]}, chinese3:{ text:[ \n 原版为 马来语 《Majulah Singapura》,中文译为《 前 进吧,新加坡 》。 \n \n \t 国花 \n 新加坡以一种名为 卓 锦 · 万代 兰 的em胡姬花/em为国花。东南亚通称兰花为em胡姬花/em]}}} However, if I enter the Chinese characters directly into the URL, the results I get are wrong. http://localhost:8983/solr/chinese2/select?q=胡姬花hl=truehl.fl=text highlighting:{ chinese1:{ text:[1月份的制造业产值同比仅增长0 \n \n 新加坡 我国1月份的制造业产值同比仅增长em0.9/em%。 虽然制造业结束连续两个月的萎缩,但比经济师普遍预估的增长em3.3/em%疲软得多。这也意味着,我国今年第一季度的经济很可能让人失望 \n ]}, chinese2:{ text:[Zheng emLin/em emYeo/em]}, chinese3:{ text:[Zheng emLin/em emYeo/em]}, chinese4:{ text:[户只要订购《联合晚报》任一种配套,就可选择下列其中一项赠品带回家。 \n 签订两年配套的读者可获得一台价值 em199/em元的Lenovo emTAB/em 2 A7-10七寸平板电脑,或者一架价值em249/em元的Philips Viva]}, chinese5:{ text:[Zheng emLin/em emYeo/em]}}} Why is this so? Regards, Edwin 2015-06-25 18:54 GMT+08:00 Markus Jelsma markus.jel...@openindex.io: You may also want to try Paoding if you have enough time to spend: https://github.com/cslinmiso/paoding-analysis -Original message- From:Zheng Lin Edwin Yeo edwinye...@gmail.com Sent: Thursday 25th June 2015 11:38 To: solr-user@lucene.apache.org Subject: Re: Tokenizer and Filter Factory to index Chinese characters Hi, The result doesn't seems that good as well. 
But you're not using the HMMChineseTokenizerFactory? The output below is from the filters you've shown me. highlighting:{ chinese1:{ id:[chinese1], title:[em我国/em1em月份的制造业产值同比仅增长/em0], content:[,em但比经济师普遍预估的增长/em3.3%em疲软得多/em。em这也意味着/em,em我国今年第一季度的经济很可能让人失望/em \n ], author:[emEdwin/em]}, chinese2:{ id:[chinese2], content:[em铜牌/em,em让我国暂时高居奖牌荣誉榜榜首/em。 em你看好新加坡在本届的东运会中/em,em会夺得多少面金牌/em? em请在/em6月em12/emem日中午前/em,em投票并留言为我国健将寄上祝语吧/em \n ], author:[emEdwin/em]}, chinese3:{ id:[chinese3], content:[)em组成的我国女队在今天的东运会保龄球女子三人赛中/em, em以六局/em3963em总瓶分夺冠/em,em为新加坡赢得本届赛会第三枚金牌/em。em队友陈诗桦/em(Jazreel)、em梁蕙芬和陈诗静以/em3707em总瓶分获得亚军/em,em季军归菲律宾女队/em。(em联合早报记者/em:em郭嘉惠/em) \n ], author:[emEdwin/em]}, chinese4:{ id:[chinese4], content:[,em则可获得一架价值/em309em元的/emPhilips Viva Collection HD9045em面包机/em。 \n em欲订从速/em,em读者可登陆/emwww.wbsub.com.sg,em或拨打客服专线/em6319 1800em订购/em。 \n em此外/em,em一年一度的晚报保健美容展/em,em将在本月/emem23/emem日和/emem24/em日,em在新达新加坡会展中心/em401、402em展厅举行/em。 \n em现场将开设/em《em联合晚报/em》em订阅展摊/em,em读者当场订阅晚报/em,em除了可获得丰厚的赠品/em,em还有机会参与/em“em必胜/em”em幸运抽奖/em], author:[emEdwin/em]}}} Regards, Edwin 2015-06-25 17:28 GMT+08:00 Markus Jelsma markus.jel...@openindex.io: Hi - we are actually using some other filters for Chinese, although they are not specialized for Chinese: tokenizer class=solr.StandardTokenizerFactory/ filter class=solr.CJKWidthFilterFactory/ filter class=solr.LowerCaseFilterFactory/ filter class=solr.CJKBigramFilterFactory/ -Original message- From:Zheng Lin Edwin Yeo edwinye...@gmail.com Sent: Thursday 25th June 2015 11:24 To: solr-user@lucene.apache.org Subject: Re: Tokenizer and Filter Factory to index Chinese characters Thank you. I've tried that, but when I do a search, it's returning much more highlighted results that what it supposed to. 
For example, if I enter the following query: http://localhost:8983/solr/chinese1/highlight?q=我国 I get the following results: highlighting:{ chinese1:{ id:[chinese1], title:[em我国/em1em月份/em的制造业em产值/emem同比/em仅em增长/em0], content:[em结束/emem连续/em两个月的em萎缩/em,但比经济师em普遍/emem预估/em的em增长/em3.3%em疲软/em得多。这也意味着,em我国/emem今年/emem第一/emem季度/em的em经济/em很em可能/em让人em失望/em \n ], author:[emEdwin/em]}, chinese2:{ id:[chinese2], content:[em铜牌/em,让em我国/emem暂时/emem高居/emem奖牌/emem荣誉/em榜em榜首/em。 你看好新加坡在本届的东运会中,会em夺得/emem多少/em面em金牌/em? 请在6月em12/em日em中午/em前,em投票/em并em留言/em为em我国
Re: Tokenizer and Filter Factory to index Chinese characters
So we have to recompile the analysers ourselves before we can use it in 5.x? Regards, Edwin On 6 July 2015 at 18:44, Markus Jelsma markus.jel...@openindex.io wrote: Yes, analyzers slightly changed since 5.x. https://issues.apache.org/jira/browse/LUCENE-5388 -Original message- From:Zheng Lin Edwin Yeo edwinye...@gmail.com Sent: Monday 6th July 2015 12:31 To: solr-user@lucene.apache.org Subject: Re: Tokenizer and Filter Factory to index Chinese characters Yes, I tried that also, but I faced some compatibility issues with Solr 5.2.1, as the libs that I found and downloaded seems to be for Solr 3.x versions. I got the following error when I tried to start Solr with Paoding configured: java.lang.VerifyError: class net.paoding.analysis.analyzer.PaodingAnalyzerBean overrides final method tokenStream.(Ljava/lang/String;Ljava/io/Reader;)Lorg/apache/lucene/analysis/TokenStream; at java.lang.ClassLoader.defineClass1(Native Method) at java.lang.ClassLoader.defineClass(Unknown Source) at java.security.SecureClassLoader.defineClass(Unknown Source) at java.net.URLClassLoader.defineClass(Unknown Source) at java.net.URLClassLoader.access$100(Unknown Source) at java.net.URLClassLoader$1.run(Unknown Source) at java.net.URLClassLoader$1.run(Unknown Source) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(Unknown Source) at org.eclipse.jetty.webapp.WebAppClassLoader.loadClass(WebAppClassLoader.java:421) at org.eclipse.jetty.webapp.WebAppClassLoader.loadClass(WebAppClassLoader.java:383) at java.lang.ClassLoader.defineClass1(Native Method) at java.lang.ClassLoader.defineClass(Unknown Source) at java.security.SecureClassLoader.defineClass(Unknown Source) at java.net.URLClassLoader.defineClass(Unknown Source) at java.net.URLClassLoader.access$100(Unknown Source) at java.net.URLClassLoader$1.run(Unknown Source) at java.net.URLClassLoader$1.run(Unknown Source) at java.security.AccessController.doPrivileged(Native Method) at 
java.net.URLClassLoader.findClass(Unknown Source) at org.eclipse.jetty.webapp.WebAppClassLoader.loadClass(WebAppClassLoader.java:421) at java.lang.ClassLoader.loadClass(Unknown Source) at java.net.FactoryURLClassLoader.loadClass(Unknown Source) at java.lang.ClassLoader.loadClass(Unknown Source) at java.net.FactoryURLClassLoader.loadClass(Unknown Source) at java.lang.ClassLoader.loadClass(Unknown Source) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Unknown Source) at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:476) at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:423) at org.apache.solr.schema.FieldTypePluginLoader.readAnalyzer(FieldTypePluginLoader.java:262) at org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:94) at org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:42) at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:151) at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:489) at org.apache.solr.schema.IndexSchema.init(IndexSchema.java:175) at org.apache.solr.schema.IndexSchemaFactory.create(IndexSchemaFactory.java:55) at org.apache.solr.schema.IndexSchemaFactory.buildIndexSchema(IndexSchemaFactory.java:69) at org.apache.solr.core.ConfigSetService.createIndexSchema(ConfigSetService.java:102) at org.apache.solr.core.ConfigSetService.getConfig(ConfigSetService.java:74) at org.apache.solr.core.CoreContainer.create(CoreContainer.java:516) at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:283) at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:277) at java.util.concurrent.FutureTask.run(Unknown Source) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source) Regards, Edwin 2015-07-06 16:37 GMT+08:00 davidphilip cherian 
davidphilipcher...@gmail.com : Hi Edwin, Have you tried the Paoding analyzer? It is not out of the box shipped with Solr jars. You may have to download it and add it to solr libs. https://stanbol.apache.org/docs/trunk/components/enhancer/nlp/paoding 2015-07-06 12:29 GMT+05:30 Zheng Lin Edwin Yeo edwinye...@gmail.com: I'm now using the solr.ICUTokenizerFactory, and the searching for Chinese characters can work when I use the Query tab in Solr Admin UI. In the Admin UI, it converts the Chinese characters to code
Re: Tokenizer and Filter Factory to index Chinese characters
I'm now using the solr.ICUTokenizerFactory, and searching for Chinese characters works when I use the Query tab in the Solr Admin UI.

The Admin UI percent-encodes the Chinese characters before passing them to the URL, so it looks something like this:

http://localhost:8983/solr/chinese2/select?q=%E8%83%A1%E5%A7%AC%E8%8A%B1&wt=json&indent=true&hl=true

"highlighting":{
  "chinese5":{
    "text":["园将办系列活动庆祝入遗 \n 从<em>胡姬花</em>展到音 乐会,为庆祝申遗成功,植物园这个月起将举办一系列活动与公众一同庆贺。 本月10日开始的“新加坡植物园<em>胡姬</em>及其文化遗产”展览,将展出1万 6000株<em>胡姬花</em>,这是"]},
  "chinese3":{
    "text":[" \n 原版为 马来语 《Majulah Singapura》,中文译为《 前 进吧,新加坡 》。 \n \n \t 国花 \n 新加坡以一种名为 卓 锦 · 万代 兰 的<em>胡姬花</em>为国花。东南亚通称兰花为<em>胡姬花</em>"]}}}

However, if I enter the Chinese characters directly into the URL, the results I get are wrong.

http://localhost:8983/solr/chinese2/select?q=胡姬花&hl=true&hl.fl=text

"highlighting":{
  "chinese1":{
    "text":["1月份的制造业产值同比仅增长0 \n \n 新加坡 我国1月份的制造业产值同比仅增长<em>0.9</em>%。 虽然制造业结束连续两个月的萎缩,但比经济师普遍预估的增长<em>3.3</em>%疲软得多。这也意味着,我国今年第一季度的经济很可能让人失望 \n "]},
  "chinese2":{
    "text":["Zheng <em>Lin</em> <em>Yeo</em>"]},
  "chinese3":{
    "text":["Zheng <em>Lin</em> <em>Yeo</em>"]},
  "chinese4":{
    "text":["户只要订购《联合晚报》任一种配套,就可选择下列其中一项赠品带回家。 \n 签订两年配套的读者可获得一台价值 <em>199</em>元的Lenovo <em>TAB</em> 2 A7-10七寸平板电脑,或者一架价值<em>249</em>元的Philips Viva"]},
  "chinese5":{
    "text":["Zheng <em>Lin</em> <em>Yeo</em>"]}}}

Why is this so?

Regards,
Edwin

2015-06-25 18:54 GMT+08:00 Markus Jelsma markus.jel...@openindex.io:

You may also want to try Paoding if you have enough time to spend:
https://github.com/cslinmiso/paoding-analysis

-----Original message-----
From: Zheng Lin Edwin Yeo edwinye...@gmail.com
Sent: Thursday 25th June 2015 11:38
To: solr-user@lucene.apache.org
Subject: Re: Tokenizer and Filter Factory to index Chinese characters

Hi,

The result doesn't seem that good either. But you're not using the HMMChineseTokenizerFactory?

The output below is from the filters you've shown me.
highlighting:{ chinese1:{ id:[chinese1], title:[em我国/em1em月份的制造业产值同比仅增长/em0], content:[,em但比经济师普遍预估的增长/em3.3%em疲软得多/em。em这也意味着/em,em我国今年第一季度的经济很可能让人失望/em \n ], author:[emEdwin/em]}, chinese2:{ id:[chinese2], content:[em铜牌/em,em让我国暂时高居奖牌荣誉榜榜首/em。 em你看好新加坡在本届的东运会中/em,em会夺得多少面金牌/em? em请在/em6月em12/emem日中午前/em,em投票并留言为我国健将寄上祝语吧/em \n ], author:[emEdwin/em]}, chinese3:{ id:[chinese3], content:[)em组成的我国女队在今天的东运会保龄球女子三人赛中/em, em以六局/em3963em总瓶分夺冠/em,em为新加坡赢得本届赛会第三枚金牌/em。em队友陈诗桦/em(Jazreel)、em梁蕙芬和陈诗静以/em3707em总瓶分获得亚军/em,em季军归菲律宾女队/em。(em联合早报记者/em:em郭嘉惠/em) \n ], author:[emEdwin/em]}, chinese4:{ id:[chinese4], content:[,em则可获得一架价值/em309em元的/emPhilips Viva Collection HD9045em面包机/em。 \n em欲订从速/em,em读者可登陆/emwww.wbsub.com.sg,em或拨打客服专线/em6319 1800em订购/em。 \n em此外/em,em一年一度的晚报保健美容展/em,em将在本月/emem23/emem日和/emem24/em日,em在新达新加坡会展中心/em401、402em展厅举行/em。 \n em现场将开设/em《em联合晚报/em》em订阅展摊/em,em读者当场订阅晚报/em,em除了可获得丰厚的赠品/em,em还有机会参与/em“em必胜/em”em幸运抽奖/em], author:[emEdwin/em]}}} Regards, Edwin 2015-06-25 17:28 GMT+08:00 Markus Jelsma markus.jel...@openindex.io: Hi - we are actually using some other filters for Chinese, although they are not specialized for Chinese: tokenizer class=solr.StandardTokenizerFactory/ filter class=solr.CJKWidthFilterFactory/ filter class=solr.LowerCaseFilterFactory/ filter class=solr.CJKBigramFilterFactory/ -Original message- From:Zheng Lin Edwin Yeo edwinye...@gmail.com Sent: Thursday 25th June 2015 11:24 To: solr-user@lucene.apache.org Subject: Re: Tokenizer and Filter Factory to index Chinese characters Thank you. I've tried that, but when I do a search, it's returning much more highlighted results that what it supposed to. 
For example, if I enter the following query: http://localhost:8983/solr/chinese1/highlight?q=我国 I get the following results: highlighting:{ chinese1:{ id:[chinese1], title:[em我国/em1em月份/em的制造业em产值/emem同比/em仅em增长/em0], content:[em结束/emem连续/em两个月的em萎缩/em,但比经济师em普遍/emem预估/em的em增长/em3.3%em疲软/em得多。这也意味着,em我国/emem今年/emem第一/emem季度/em的em经济/em很em可能/em让人em失望/em \n ], author:[emEdwin/em]}, chinese2:{ id:[chinese2], content:[em铜牌/em,让em我国/emem暂时/emem高居/emem奖牌/emem荣誉/em榜em榜首/em。 你看好新加坡在本届的东运会中,会em夺得/emem多少/em面em金牌/em? 请在6月em12/em日em中午/em前,em投票/em并em留言/em为em我国/emem健将/em寄上em祝语/em吧 \n ], author:[emEdwin/em]}, chinese3:{ id:[chinese3], content:[)em组成/em的em我国/emem女队/em在em今天/em的东运会保龄球em女子/em三人赛中, 以六局3963总瓶分em夺冠/em,为新加坡em赢得/emem本届/emem赛会/em第三枚em金牌/em。em队友/em陈诗桦(Jazreel)、梁蕙芬和陈诗静以3707总瓶分em获得/emem亚军/em,em季军/em归菲律宾em女队/em。(em联合/emem早报/emem记者/em:郭嘉惠) \n ], author:[Edwin]}, chinese4:{ id:[chinese4], content:[em配套
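
The two URLs in the ICU message above differ in how the query term is written: the Admin UI sends 胡姬花 as UTF-8 percent-encoded bytes, while the hand-typed URL carries the raw characters. One possible explanation for the different results, not confirmed anywhere in the thread, is that the raw characters were not transmitted as UTF-8. A minimal sketch of building the encoded form in client code follows; the class and method names are made up for this example, and the host, core and parameters are simply the ones quoted in the thread, with the parameter separators written out as `&`.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class QueryUrlEncoder {

    /** Percent-encodes a query value as UTF-8 so it can be placed in a URL. */
    public static String encodeQuery(String query) {
        try {
            return URLEncoder.encode(query, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 support is mandatory on every JVM, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    /** Builds a select URL; host, port and parameters mirror those quoted in the thread. */
    public static String buildSelectUrl(String core, String query) {
        return "http://localhost:8983/solr/" + core + "/select?q=" + encodeQuery(query)
                + "&wt=json&indent=true&hl=true";
    }

    public static void main(String[] args) {
        // Unicode escapes for the three characters of the query term, used so the
        // example does not depend on the source file encoding.
        String query = "\u80e1\u59ec\u82b1";
        // The encoded term is %E8%83%A1%E5%A7%AC%E8%8A%B1, as in the Admin UI URL.
        System.out.println(buildSelectUrl("chinese2", query));
    }
}
```

Encoding the term in the client takes the browser and the servlet container's URI decoding out of the picture, which makes it easier to tell an encoding problem from an analysis problem.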
RE: Tokenizer and Filter Factory to index Chinese characters
You may also want to try Paoding if you have enough time to spend: https://github.com/cslinmiso/paoding-analysis -Original message- From:Zheng Lin Edwin Yeo edwinye...@gmail.com Sent: Thursday 25th June 2015 11:38 To: solr-user@lucene.apache.org Subject: Re: Tokenizer and Filter Factory to index Chinese characters Hi, The result doesn't seems that good as well. But you're not using the HMMChineseTokenizerFactory? The output below is from the filters you've shown me. highlighting:{ chinese1:{ id:[chinese1], title:[em我国/em1em月份的制造业产值同比仅增长/em0], content:[,em但比经济师普遍预估的增长/em3.3%em疲软得多/em。em这也意味着/em,em我国今年第一季度的经济很可能让人失望/em \n ], author:[emEdwin/em]}, chinese2:{ id:[chinese2], content:[em铜牌/em,em让我国暂时高居奖牌荣誉榜榜首/em。 em你看好新加坡在本届的东运会中/em,em会夺得多少面金牌/em? em请在/em6月em12/emem日中午前/em,em投票并留言为我国健将寄上祝语吧/em \n ], author:[emEdwin/em]}, chinese3:{ id:[chinese3], content:[)em组成的我国女队在今天的东运会保龄球女子三人赛中/em, em以六局/em3963em总瓶分夺冠/em,em为新加坡赢得本届赛会第三枚金牌/em。em队友陈诗桦/em(Jazreel)、em梁蕙芬和陈诗静以/em3707em总瓶分获得亚军/em,em季军归菲律宾女队/em。(em联合早报记者/em:em郭嘉惠/em) \n ], author:[emEdwin/em]}, chinese4:{ id:[chinese4], content:[,em则可获得一架价值/em309em元的/emPhilips Viva Collection HD9045em面包机/em。 \n em欲订从速/em,em读者可登陆/emwww.wbsub.com.sg,em或拨打客服专线/em6319 1800em订购/em。 \n em此外/em,em一年一度的晚报保健美容展/em,em将在本月/emem23/emem日和/emem24/em日,em在新达新加坡会展中心/em401、402em展厅举行/em。 \n em现场将开设/em《em联合晚报/em》em订阅展摊/em,em读者当场订阅晚报/em,em除了可获得丰厚的赠品/em,em还有机会参与/em“em必胜/em”em幸运抽奖/em], author:[emEdwin/em]}}} Regards, Edwin 2015-06-25 17:28 GMT+08:00 Markus Jelsma markus.jel...@openindex.io: Hi - we are actually using some other filters for Chinese, although they are not specialized for Chinese: tokenizer class=solr.StandardTokenizerFactory/ filter class=solr.CJKWidthFilterFactory/ filter class=solr.LowerCaseFilterFactory/ filter class=solr.CJKBigramFilterFactory/ -Original message- From:Zheng Lin Edwin Yeo edwinye...@gmail.com Sent: Thursday 25th June 2015 11:24 To: solr-user@lucene.apache.org Subject: Re: Tokenizer and Filter Factory to index Chinese characters 
Thank you. I've tried that, but when I do a search, it's returning much more highlighted results that what it supposed to. For example, if I enter the following query: http://localhost:8983/solr/chinese1/highlight?q=我国 I get the following results: highlighting:{ chinese1:{ id:[chinese1], title:[em我国/em1em月份/em的制造业em产值/emem同比/em仅em增长/em0], content:[em结束/emem连续/em两个月的em萎缩/em,但比经济师em普遍/emem预估/em的em增长/em3.3%em疲软/em得多。这也意味着,em我国/emem今年/emem第一/emem季度/em的em经济/em很em可能/em让人em失望/em \n ], author:[emEdwin/em]}, chinese2:{ id:[chinese2], content:[em铜牌/em,让em我国/emem暂时/emem高居/emem奖牌/emem荣誉/em榜em榜首/em。 你看好新加坡在本届的东运会中,会em夺得/emem多少/em面em金牌/em? 请在6月em12/em日em中午/em前,em投票/em并em留言/em为em我国/emem健将/em寄上em祝语/em吧 \n ], author:[emEdwin/em]}, chinese3:{ id:[chinese3], content:[)em组成/em的em我国/emem女队/em在em今天/em的东运会保龄球em女子/em三人赛中, 以六局3963总瓶分em夺冠/em,为新加坡em赢得/emem本届/emem赛会/em第三枚em金牌/em。em队友/em陈诗桦(Jazreel)、梁蕙芬和陈诗静以3707总瓶分em获得/emem亚军/em,em季军/em归菲律宾em女队/em。(em联合/emem早报/emem记者/em:郭嘉惠) \n ], author:[Edwin]}, chinese4:{ id:[chinese4], content:[em配套/em的em读者/em,则可em获得/em一架em价值/em309元的Philips Viva Collection emHD/em9045面em包机/em。 \n 欲订从速,em读者/em可em登陆/emwww.wbsub.com .emsg/em,或拨打客服em专线/em6319 1800em订购/em。 \n em此外/em,一年一度的em晚报/emem保健/emem美容/em展,将在em本月/emem23/em日和em24/em日,在新达新加坡em会展/emem中心/em401、402em展厅/emem举行/em。 \n em现场/em将em开设/em《em联合/emem晚报/em》em订阅/em展摊,em读者/emem当场/emem订阅/emem晚报/em,em除了/em可em获得/emem丰厚/em的em赠品/em,还有em机会/emem参与/em“], author:[emEdwin/em]}}} Is there any suitable filter factory to solve this issue? I've tried WordDelimiterFilterFactory, PorterStemFilterFactory and StopFilterFactory, but there's no improvement in the search results. Regards, Edwin On 25 June 2015 at 17:17, Markus Jelsma markus.jel...@openindex.io wrote: Hello - you can use HMMChineseTokenizerFactory instead. 
http://lucene.apache.org/core/5_2_0/analyzers-smartcn/org/apache/lucene/analysis/cn/smart/HMMChineseTokenizerFactory.html -Original message- From:Zheng Lin Edwin Yeo edwinye...@gmail.com Sent: Thursday 25th June 2015 11:02 To: solr-user@lucene.apache.org Subject: Tokenizer and Filter Factory to index Chinese characters Hi, Does anyone knows what is the correct replacement for these 2 tokenizer and filter factory to index chinese into Solr? - SmartChineseSentenceTokenizerFactory - SmartChineseWordTokenFilterFactory I understand that these 2 tokenizer and filter factory are already deprecated in Solr 5.1
Re: Tokenizer and Filter Factory to index Chinese characters
Thank you. I've tried that, but when I do a search, it's returning much more highlighted results that what it supposed to. For example, if I enter the following query: http://localhost:8983/solr/chinese1/highlight?q=我国 I get the following results: highlighting:{ chinese1:{ id:[chinese1], title:[em我国/em1em月份/em的制造业em产值/emem同比/em仅em增长/em0], content:[em结束/emem连续/em两个月的em萎缩/em,但比经济师em普遍/emem预估/em的em增长/em3.3%em疲软/em得多。这也意味着,em我国/emem今年/emem第一/emem季度/em的em经济/em很em可能/em让人em失望/em \n ], author:[emEdwin/em]}, chinese2:{ id:[chinese2], content:[em铜牌/em,让em我国/emem暂时/emem高居/emem奖牌/emem荣誉/em榜em榜首/em。 你看好新加坡在本届的东运会中,会em夺得/emem多少/em面em金牌/em? 请在6月em12/em日em中午/em前,em投票/em并em留言/em为em我国/emem健将/em寄上em祝语/em吧 \n ], author:[emEdwin/em]}, chinese3:{ id:[chinese3], content:[)em组成/em的em我国/emem女队/em在em今天/em的东运会保龄球em女子/em三人赛中, 以六局3963总瓶分em夺冠/em,为新加坡em赢得/emem本届/emem赛会/em第三枚em金牌/em。em队友/em陈诗桦(Jazreel)、梁蕙芬和陈诗静以3707总瓶分em获得/emem亚军/em,em季军/em归菲律宾em女队/em。(em联合/emem早报/emem记者/em:郭嘉惠) \n ], author:[Edwin]}, chinese4:{ id:[chinese4], content:[em配套/em的em读者/em,则可em获得/em一架em价值/em309元的Philips Viva Collection emHD/em9045面em包机/em。 \n 欲订从速,em读者/em可em登陆/emwww.wbsub.com.emsg/em,或拨打客服em专线/em6319 1800em订购/em。 \n em此外/em,一年一度的em晚报/emem保健/emem美容/em展,将在em本月/emem23/em日和em24/em日,在新达新加坡em会展/emem中心/em401、402em展厅/emem举行/em。 \n em现场/em将em开设/em《em联合/emem晚报/em》em订阅/em展摊,em读者/emem当场/emem订阅/emem晚报/em,em除了/em可em获得/emem丰厚/em的em赠品/em,还有em机会/emem参与/em“], author:[emEdwin/em]}}} Is there any suitable filter factory to solve this issue? I've tried WordDelimiterFilterFactory, PorterStemFilterFactory and StopFilterFactory, but there's no improvement in the search results. Regards, Edwin On 25 June 2015 at 17:17, Markus Jelsma markus.jel...@openindex.io wrote: Hello - you can use HMMChineseTokenizerFactory instead. 
http://lucene.apache.org/core/5_2_0/analyzers-smartcn/org/apache/lucene/analysis/cn/smart/HMMChineseTokenizerFactory.html
RE: Tokenizer and Filter Factory to index Chinese characters
Hi - we are actually using some other filters for Chinese, although they are not specialized for Chinese:

<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.CJKWidthFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.CJKBigramFilterFactory"/>
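For reference, the filter chain listed in Markus's reply could be wrapped in a complete field type roughly as follows. This is only a sketch: the name `text_cjk` and the `positionIncrementGap` value are assumptions chosen for illustration and do not come from the thread. A single `<analyzer>` element without a `type` attribute applies the same chain at both index and query time.

```xml
<!-- Sketch only: the field type name and positionIncrementGap are illustrative assumptions. -->
<fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Splits text on Unicode word boundaries; CJK characters come out as single-character tokens -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Normalizes full-width and half-width character forms -->
    <filter class="solr.CJKWidthFilterFactory"/>
    <!-- Lowercases any Latin text mixed into the content -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Joins adjacent CJK characters into overlapping two-character tokens -->
    <filter class="solr.CJKBigramFilterFactory"/>
  </analyzer>
</fieldType>
```

Because this chain forms bigrams rather than segmenting by dictionary, its tokens will not necessarily line up with Chinese word boundaries.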
RE: Tokenizer and Filter Factory to index Chinese characters
Hello - you can use HMMChineseTokenizerFactory instead.

http://lucene.apache.org/core/5_2_0/analyzers-smartcn/org/apache/lucene/analysis/cn/smart/HMMChineseTokenizerFactory.html

-Original message-
From: Zheng Lin Edwin Yeo edwinye...@gmail.com
Sent: Thursday 25th June 2015 11:02
To: solr-user@lucene.apache.org
Subject: Tokenizer and Filter Factory to index Chinese characters

Hi,

Does anyone know the correct replacement for these two tokenizer and filter factories for indexing Chinese into Solr?

- SmartChineseSentenceTokenizerFactory
- SmartChineseWordTokenFilterFactory

I understand that both are already deprecated in Solr 5.1, but I can't seem to find the correct replacement.

<fieldType name="text_smartcn" class="solr.TextField" positionIncrementGap="0">
  <analyzer type="index">
    <tokenizer class="org.apache.lucene.analysis.cn.smart.SmartChineseSentenceTokenizerFactory"/>
    <filter class="org.apache.lucene.analysis.cn.smart.SmartChineseWordTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="org.apache.lucene.analysis.cn.smart.SmartChineseSentenceTokenizerFactory"/>
    <filter class="org.apache.lucene.analysis.cn.smart.SmartChineseWordTokenFilterFactory"/>
  </analyzer>
</fieldType>

Thank you.

Regards,
Edwin
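As a rough illustration of the suggestion above, the field type from the original question might look like this with HMMChineseTokenizerFactory swapped in. This is a minimal sketch, not a tested configuration: it assumes the smartcn analyzer jar is on the core's classpath, it carries over the field type name and `positionIncrementGap` from the question, and it adds no further filters.

```xml
<!-- Sketch only: a single tokenizer stands in for the deprecated
     SmartChineseSentenceTokenizerFactory + SmartChineseWordTokenFilterFactory pair. -->
<fieldType name="text_smartcn" class="solr.TextField" positionIncrementGap="0">
  <!-- No type attribute: the same analysis is applied at index time and query time -->
  <analyzer>
    <tokenizer class="org.apache.lucene.analysis.cn.smart.HMMChineseTokenizerFactory"/>
  </analyzer>
</fieldType>
```

Documents indexed with the old chain would need to be reindexed after a change like this, since the stored tokens would otherwise no longer match what the new analyzer produces at query time.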
Re: Tokenizer and Filter Factory to index Chinese characters
Hi,

The results don't seem that good either. But you're not using the HMMChineseTokenizerFactory?

The output below is from the filters you've shown me.

"highlighting":{
  "chinese1":{
    "id":["chinese1"],
    "title":["<em>我国</em>1<em>月份的制造业产值同比仅增长</em>0"],
    "content":[",<em>但比经济师普遍预估的增长</em>3.3%<em>疲软得多</em>。<em>这也意味着</em>,<em>我国今年第一季度的经济很可能让人失望</em> \n "],
    "author":["<em>Edwin</em>"]},
  "chinese2":{
    "id":["chinese2"],
    "content":["<em>铜牌</em>,<em>让我国暂时高居奖牌荣誉榜榜首</em>。 <em>你看好新加坡在本届的东运会中</em>,<em>会夺得多少面金牌</em>? <em>请在</em>6月<em>12</em><em>日中午前</em>,<em>投票并留言为我国健将寄上祝语吧</em> \n "],
    "author":["<em>Edwin</em>"]},
  "chinese3":{
    "id":["chinese3"],
    "content":[")<em>组成的我国女队在今天的东运会保龄球女子三人赛中</em>, <em>以六局</em>3963<em>总瓶分夺冠</em>,<em>为新加坡赢得本届赛会第三枚金牌</em>。<em>队友陈诗桦</em>(Jazreel)、<em>梁蕙芬和陈诗静以</em>3707<em>总瓶分获得亚军</em>,<em>季军归菲律宾女队</em>。(<em>联合早报记者</em>:<em>郭嘉惠</em>) \n "],
    "author":["<em>Edwin</em>"]},
  "chinese4":{
    "id":["chinese4"],
    "content":[",<em>则可获得一架价值</em>309<em>元的</em>Philips Viva Collection HD9045<em>面包机</em>。 \n <em>欲订从速</em>,<em>读者可登陆</em>www.wbsub.com.sg,<em>或拨打客服专线</em>6319 1800<em>订购</em>。 \n <em>此外</em>,<em>一年一度的晚报保健美容展</em>,<em>将在本月</em><em>23</em><em>日和</em><em>24</em>日,<em>在新达新加坡会展中心</em>401、402<em>展厅举行</em>。 \n <em>现场将开设</em>《<em>联合晚报</em>》<em>订阅展摊</em>,<em>读者当场订阅晚报</em>,<em>除了可获得丰厚的赠品</em>,<em>还有机会参与</em>“<em>必胜</em>”<em>幸运抽奖</em>"],
    "author":["<em>Edwin</em>"]}}}

Regards,
Edwin

2015-06-25 17:28 GMT+08:00 Markus Jelsma markus.jel...@openindex.io:

Hi - we are actually using some other filters for Chinese, although they are not specialized for Chinese:

<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.CJKWidthFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.CJKBigramFilterFactory"/>

-Original message-
From: Zheng Lin Edwin Yeo edwinye...@gmail.com
Sent: Thursday 25th June 2015 11:24
To: solr-user@lucene.apache.org
Subject: Re: Tokenizer and Filter Factory to index Chinese characters

Thank you. I've tried that, but when I do a search, it returns far more highlighted results than it should.
For example, if I enter the following query:

http://localhost:8983/solr/chinese1/highlight?q=我国

I get the following results:

"highlighting":{
  "chinese1":{
    "id":["chinese1"],
    "title":["<em>我国</em>1<em>月份</em>的制造业<em>产值</em><em>同比</em>仅<em>增长</em>0"],
    "content":["<em>结束</em><em>连续</em>两个月的<em>萎缩</em>,但比经济师<em>普遍</em><em>预估</em>的<em>增长</em>3.3%<em>疲软</em>得多。这也意味着,<em>我国</em><em>今年</em><em>第一</em><em>季度</em>的<em>经济</em>很<em>可能</em>让人<em>失望</em> \n "],
    "author":["<em>Edwin</em>"]},
  "chinese2":{
    "id":["chinese2"],
    "content":["<em>铜牌</em>,让<em>我国</em><em>暂时</em><em>高居</em><em>奖牌</em><em>荣誉</em>榜<em>榜首</em>。 你看好新加坡在本届的东运会中,会<em>夺得</em><em>多少</em>面<em>金牌</em>? 请在6月<em>12</em>日<em>中午</em>前,<em>投票</em>并<em>留言</em>为<em>我国</em><em>健将</em>寄上<em>祝语</em>吧 \n "],
    "author":["<em>Edwin</em>"]},
  "chinese3":{
    "id":["chinese3"],
    "content":[")<em>组成</em>的<em>我国</em><em>女队</em>在<em>今天</em>的东运会保龄球<em>女子</em>三人赛中, 以六局3963总瓶分<em>夺冠</em>,为新加坡<em>赢得</em><em>本届</em><em>赛会</em>第三枚<em>金牌</em>。<em>队友</em>陈诗桦(Jazreel)、梁蕙芬和陈诗静以3707总瓶分<em>获得</em><em>亚军</em>,<em>季军</em>归菲律宾<em>女队</em>。(<em>联合</em><em>早报</em><em>记者</em>:郭嘉惠) \n "],
    "author":["Edwin"]},
  "chinese4":{
    "id":["chinese4"],
    "content":["<em>配套</em>的<em>读者</em>,则可<em>获得</em>一架<em>价值</em>309元的Philips Viva Collection <em>HD</em>9045面<em>包机</em>。 \n 欲订从速,<em>读者</em>可<em>登陆</em>www.wbsub.com.<em>sg</em>,或拨打客服<em>专线</em>6319 1800<em>订购</em>。 \n <em>此外</em>,一年一度的<em>晚报</em><em>保健</em><em>美容</em>展,将在<em>本月</em><em>23</em>日和<em>24</em>日,在新达新加坡<em>会展</em><em>中心</em>401、402<em>展厅</em><em>举行</em>。 \n <em>现场</em>将<em>开设</em>《<em>联合</em><em>晚报</em>》<em>订阅</em>展摊,<em>读者</em><em>当场</em><em>订阅</em><em>晚报</em>,<em>除了</em>可<em>获得</em><em>丰厚</em>的<em>赠品</em>,还有<em>机会</em><em>参与</em>“"],
    "author":["<em>Edwin</em>"]}}}

Is there any suitable filter factory to solve this issue? I've tried WordDelimiterFilterFactory, PorterStemFilterFactory and StopFilterFactory, but there's no improvement in the search results.

Regards,
Edwin

On 25 June 2015 at 17:17, Markus Jelsma markus.jel...@openindex.io wrote:

Hello - you can use HMMChineseTokenizerFactory instead.
http://lucene.apache.org/core/5_2_0/analyzers-smartcn/org/apache/lucene/analysis/cn/smart/HMMChineseTokenizerFactory.html -Original message- From:Zheng Lin Edwin Yeo edwinye...@gmail.com Sent: Thursday 25th June 2015 11:02 To: solr-user@lucene.apache.org Subject: Tokenizer and Filter Factory to index Chinese characters Hi, Does anyone knows what is the correct replacement for these 2 tokenizer and filter factory to index chinese into Solr? - SmartChineseSentenceTokenizerFactory - SmartChineseWordTokenFilterFactory I understand that these 2 tokenizer and filter factory are already deprecated in Solr 5.1, but I can't seem to find the correct replacement. fieldType name=text_smartcn class=solr.TextField positionIncrementGap=0 analyzer type=index tokenizer class=org.apache.lucene.analysis.cn.smart.SmartChineseSentenceTokenizerFactory/ filter class=org.apache.lucene.analysis.cn.smart.SmartChineseWordTokenFilterFactory/ /analyzer analyzer type=query tokenizer