RE: Tokenizer and Filter Factory to index Chinese characters

2015-07-07 Thread Markus Jelsma
Yes, but it is a small change :)
M.

 
 

Re: Tokenizer and Filter Factory to index Chinese characters

2015-07-06 Thread Zheng Lin Edwin Yeo
 characters directly into the URL, the
  results I get are wrong.
 
  http://localhost:8983/solr/chinese2/select?q=胡姬花&hl=true&hl.fl=text

  "highlighting":{
    "chinese1":{
      "text":["1月份的制造业产值同比仅增长0 \n \n   新加坡 我国1月份的制造业产值同比仅增长<em>0.9</em>%。
        虽然制造业结束连续两个月的萎缩,但比经济师普遍预估的增长<em>3.3</em>%疲软得多。这也意味着,我国今年第一季度的经济很可能让人失望 \n
        "]},
    "chinese2":{
      "text":["Zheng <em>Lin</em> <em>Yeo</em>"]},
    "chinese3":{
      "text":["Zheng <em>Lin</em> <em>Yeo</em>"]},
    "chinese4":{
      "text":["户只要订购《联合晚报》任一种配套,就可选择下列其中一项赠品带回家。 \n 签订两年配套的读者可获得一台价值
        <em>199</em>元的Lenovo <em>TAB</em> 2 A7-10七寸平板电脑,或者一架价值<em>249</em>元的Philips
        Viva"]},
    "chinese5":{
      "text":["Zheng <em>Lin</em> <em>Yeo</em>"]}}}
 
 
 
  Why is this so?
 
 
  Regards,
 
  Edwin
 
 

RE: Tokenizer and Filter Factory to index Chinese characters

2015-07-06 Thread Markus Jelsma
Yes, the analyzer API changed slightly in 5.x.
https://issues.apache.org/jira/browse/LUCENE-5388
 
-Original message-
 From:Zheng Lin Edwin Yeo edwinye...@gmail.com
 Sent: Monday 6th July 2015 12:31
 To: solr-user@lucene.apache.org
 Subject: Re: Tokenizer and Filter Factory to index Chinese characters
 
 Yes, I tried that as well, but I ran into compatibility issues with Solr
 5.2.1, as the libs that I found and downloaded seem to be for Solr 3.x
 versions.
 
 I got the following error when I tried to start Solr with Paoding
 configured:
 
 java.lang.VerifyError: class
 net.paoding.analysis.analyzer.PaodingAnalyzerBean overrides final
 method 
 tokenStream.(Ljava/lang/String;Ljava/io/Reader;)Lorg/apache/lucene/analysis/TokenStream;
   at java.lang.ClassLoader.defineClass1(Native Method)
   at java.lang.ClassLoader.defineClass(Unknown Source)
   at java.security.SecureClassLoader.defineClass(Unknown Source)
   at java.net.URLClassLoader.defineClass(Unknown Source)
   at java.net.URLClassLoader.access$100(Unknown Source)
   at java.net.URLClassLoader$1.run(Unknown Source)
   at java.net.URLClassLoader$1.run(Unknown Source)
   at java.security.AccessController.doPrivileged(Native Method)
   at java.net.URLClassLoader.findClass(Unknown Source)
   at 
 org.eclipse.jetty.webapp.WebAppClassLoader.loadClass(WebAppClassLoader.java:421)
   at 
 org.eclipse.jetty.webapp.WebAppClassLoader.loadClass(WebAppClassLoader.java:383)
   at java.lang.ClassLoader.defineClass1(Native Method)
   at java.lang.ClassLoader.defineClass(Unknown Source)
   at java.security.SecureClassLoader.defineClass(Unknown Source)
   at java.net.URLClassLoader.defineClass(Unknown Source)
   at java.net.URLClassLoader.access$100(Unknown Source)
   at java.net.URLClassLoader$1.run(Unknown Source)
   at java.net.URLClassLoader$1.run(Unknown Source)
   at java.security.AccessController.doPrivileged(Native Method)
   at java.net.URLClassLoader.findClass(Unknown Source)
   at 
 org.eclipse.jetty.webapp.WebAppClassLoader.loadClass(WebAppClassLoader.java:421)
   at java.lang.ClassLoader.loadClass(Unknown Source)
   at java.net.FactoryURLClassLoader.loadClass(Unknown Source)
   at java.lang.ClassLoader.loadClass(Unknown Source)
   at java.net.FactoryURLClassLoader.loadClass(Unknown Source)
   at java.lang.ClassLoader.loadClass(Unknown Source)
   at java.lang.Class.forName0(Native Method)
   at java.lang.Class.forName(Unknown Source)
   at 
 org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:476)
   at 
 org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:423)
   at 
 org.apache.solr.schema.FieldTypePluginLoader.readAnalyzer(FieldTypePluginLoader.java:262)
   at 
 org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:94)
   at 
 org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:42)
   at 
 org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:151)
   at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:489)
   at org.apache.solr.schema.IndexSchema.init(IndexSchema.java:175)
   at 
 org.apache.solr.schema.IndexSchemaFactory.create(IndexSchemaFactory.java:55)
   at 
 org.apache.solr.schema.IndexSchemaFactory.buildIndexSchema(IndexSchemaFactory.java:69)
   at 
 org.apache.solr.core.ConfigSetService.createIndexSchema(ConfigSetService.java:102)
   at 
 org.apache.solr.core.ConfigSetService.getConfig(ConfigSetService.java:74)
   at org.apache.solr.core.CoreContainer.create(CoreContainer.java:516)
   at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:283)
   at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:277)
   at java.util.concurrent.FutureTask.run(Unknown Source)
   at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
   at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
   at java.lang.Thread.run(Unknown Source)
 
 
 
 Regards,
 Edwin
 
 

Re: Tokenizer and Filter Factory to index Chinese characters

2015-07-06 Thread davidphilip cherian
Hi Edwin,

Have you tried the Paoding analyzer? It is not shipped with the Solr jars out
of the box. You may have to download it and add it to the Solr libs.

https://stanbol.apache.org/docs/trunk/components/enhancer/nlp/paoding




Re: Tokenizer and Filter Factory to index Chinese characters

2015-07-06 Thread Zheng Lin Edwin Yeo
So we have to recompile the analysers ourselves before we can use them in 5.x?

Regards,
Edwin


Re: Tokenizer and Filter Factory to index Chinese characters

2015-07-06 Thread Zheng Lin Edwin Yeo
I'm now using solr.ICUTokenizerFactory, and searching for Chinese characters
works when I use the Query tab in the Solr Admin UI.

The Admin UI percent-encodes the Chinese characters before putting them in
the URL, so it looks something like this:
http://localhost:8983/solr/chinese2/select?q=%E8%83%A1%E5%A7%AC%E8%8A%B1&wt=json&indent=true&hl=true

"highlighting":{
  "chinese5":{
    "text":["园将办系列活动庆祝入遗 \n 从<em>胡姬花</em>展到音
      乐会,为庆祝申遗成功,植物园这个月起将举办一系列活动与公众一同庆贺。 本月10日开始的“新加坡植物园<em>胡姬</em>及其文化遗产”展览,将展出1万
      6000株<em>胡姬花</em>,这是"]},
  "chinese3":{
    "text":[" \n 原版为 马来语 《Majulah Singapura》,中文译为《 前  进吧,新加坡 》。 \n  \n
      \t  国花 \n 新加坡以一种名为 卓  锦  ·  万代  兰 的<em>胡姬花</em>为国花。东南亚通称兰花为<em>胡姬花</em>"]}}}



However, if I enter the Chinese characters directly into the URL, the
results I get are wrong.

http://localhost:8983/solr/chinese2/select?q=胡姬花&hl=true&hl.fl=text


"highlighting":{
  "chinese1":{
    "text":["1月份的制造业产值同比仅增长0 \n \n   新加坡 我国1月份的制造业产值同比仅增长<em>0.9</em>%。
      虽然制造业结束连续两个月的萎缩,但比经济师普遍预估的增长<em>3.3</em>%疲软得多。这也意味着,我国今年第一季度的经济很可能让人失望 \n
      "]},
  "chinese2":{
    "text":["Zheng <em>Lin</em> <em>Yeo</em>"]},
  "chinese3":{
    "text":["Zheng <em>Lin</em> <em>Yeo</em>"]},
  "chinese4":{
    "text":["户只要订购《联合晚报》任一种配套,就可选择下列其中一项赠品带回家。 \n 签订两年配套的读者可获得一台价值
      <em>199</em>元的Lenovo <em>TAB</em> 2 A7-10七寸平板电脑,或者一架价值<em>249</em>元的Philips
      Viva"]},
  "chinese5":{
    "text":["Zheng <em>Lin</em> <em>Yeo</em>"]}}}



Why is this so?


Regards,

Edwin
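
One possible explanation, which nothing in this thread confirms, is that the raw characters in the second URL do not reach Solr as UTF-8 percent-encoded bytes, so the query is parsed as something other than 胡姬花. A minimal Java sketch of building the same request with the query value encoded explicitly is shown below. The class and method names are made up for illustration; the host, core name, and parameters are the ones quoted in the message above.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Illustrative helper (hypothetical name): builds a Solr select URL with a
// UTF-8 percent-encoded q parameter.
final class SolrQueryUrl {

    // Percent-encode a single parameter value as UTF-8.
    static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java runtime is required to support UTF-8.
            throw new IllegalStateException("UTF-8 not available", e);
        }
    }

    // Assemble the select URL; parameters are separated by '&'.
    static String selectUrl(String coreUrl, String query) {
        return coreUrl + "/select?q=" + encode(query) + "&hl=true&hl.fl=text";
    }

    public static void main(String[] args) {
        // The three code points below spell the query term from the thread,
        // written as Unicode escapes so the source file stays ASCII-only.
        String term = "\u80e1\u59ec\u82b1";
        System.out.println(selectUrl("http://localhost:8983/solr/chinese2", term));
    }
}
```

The q value this produces is %E8%83%A1%E5%A7%AC%E8%8A%B1, the same sequence the Admin UI generated, so comparing the two requests should show whether encoding is the cause.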



RE: Tokenizer and Filter Factory to index Chinese characters

2015-06-25 Thread Markus Jelsma
You may also want to try Paoding if you have enough time to spend:
https://github.com/cslinmiso/paoding-analysis
 
-Original message-
 From:Zheng Lin Edwin Yeo edwinye...@gmail.com
 Sent: Thursday 25th June 2015 11:38
 To: solr-user@lucene.apache.org
 Subject: Re: Tokenizer and Filter Factory to index Chinese characters
 
 Hi, the results don't seem that good either. But you're not using the
 HMMChineseTokenizerFactory?
 
 The output below is from the filters you've shown me.
 
 "highlighting":{
   "chinese1":{
     "id":["chinese1"],
     "title":["<em>我国</em>1<em>月份的制造业产值同比仅增长</em>0"],
     "content":[",<em>但比经济师普遍预估的增长</em>3.3%<em>疲软得多</em>。<em>这也意味着</em>,<em>我国今年第一季度的经济很可能让人失望</em>
       \n  "],
     "author":["<em>Edwin</em>"]},
   "chinese2":{
     "id":["chinese2"],
     "content":["<em>铜牌</em>,<em>让我国暂时高居奖牌荣誉榜榜首</em>。
       <em>你看好新加坡在本届的东运会中</em>,<em>会夺得多少面金牌</em>?
       <em>请在</em>6月<em>12</em><em>日中午前</em>,<em>投票并留言为我国健将寄上祝语吧</em>  \n
       "],
     "author":["<em>Edwin</em>"]},
   "chinese3":{
     "id":["chinese3"],
     "content":[")<em>组成的我国女队在今天的东运会保龄球女子三人赛中</em>,
       <em>以六局</em>3963<em>总瓶分夺冠</em>,<em>为新加坡赢得本届赛会第三枚金牌</em>。<em>队友陈诗桦</em>(Jazreel)、<em>梁蕙芬和陈诗静以</em>3707<em>总瓶分获得亚军</em>,<em>季军归菲律宾女队</em>。(<em>联合早报记者</em>:<em>郭嘉惠</em>)
       \n  "],
     "author":["<em>Edwin</em>"]},
   "chinese4":{
     "id":["chinese4"],
     "content":[",<em>则可获得一架价值</em>309<em>元的</em>Philips Viva
       Collection HD9045<em>面包机</em>。 \n
       <em>欲订从速</em>,<em>读者可登陆</em>www.wbsub.com.sg,<em>或拨打客服专线</em>6319
       1800<em>订购</em>。 \n
       <em>此外</em>,<em>一年一度的晚报保健美容展</em>,<em>将在本月</em><em>23</em><em>日和</em><em>24</em>日,<em>在新达新加坡会展中心</em>401、402<em>展厅举行</em>。
       \n
       <em>现场将开设</em>《<em>联合晚报</em>》<em>订阅展摊</em>,<em>读者当场订阅晚报</em>,<em>除了可获得丰厚的赠品</em>,<em>还有机会参与</em>“<em>必胜</em>”<em>幸运抽奖</em>"],
     "author":["<em>Edwin</em>"]}}}
 
 
 Regards,
 Edwin
 
 
 2015-06-25 17:28 GMT+08:00 Markus Jelsma markus.jel...@openindex.io:
 
  Hi - we are actually using some other filters for Chinese, although they
  are not specialized for Chinese:
 
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.CJKWidthFilterFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.CJKBigramFilterFactory"/>
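
To picture what the last filter in this chain does: CJKBigramFilter indexes overlapping pairs of adjacent CJK characters rather than dictionary words, so a run such as 我国今年 is indexed as 我国, 国今, 今年. The sketch below is a plain-Java illustration of that idea only. It is not Lucene's implementation, and the class and method names are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only (hypothetical names): overlapping bigrams over a run of
// characters, the general idea behind bigram-based CJK indexing.
final class BigramSketch {

    // Returns every pair of adjacent code points in the input, in order.
    // A single isolated character is returned on its own.
    static List<String> bigrams(String run) {
        int[] codePoints = run.codePoints().toArray();
        List<String> out = new ArrayList<>();
        if (codePoints.length == 1) {
            out.add(run);
            return out;
        }
        for (int i = 0; i + 1 < codePoints.length; i++) {
            // Build a two-character token starting at position i.
            out.add(new String(codePoints, i, 2));
        }
        return out;
    }

    public static void main(String[] args) {
        // Four-character run from the sample documents, written as Unicode
        // escapes so the source file stays ASCII-only.
        System.out.println(bigrams("\u6211\u56fd\u4eca\u5e74"));
    }
}
```

Because every adjacent pair becomes a term, a two-character query matches wherever that pair occurs, whether or not it is a word at that position.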
 
 
  -Original message-
   From:Zheng Lin Edwin Yeo edwinye...@gmail.com
   Sent: Thursday 25th June 2015 11:24
   To: solr-user@lucene.apache.org
   Subject: Re: Tokenizer and Filter Factory to index Chinese characters
  
   Thank you.
  
   I've tried that, but when I do a search, it returns many more
   highlighted results than it is supposed to.
  
   For example, if I enter the following query:
   http://localhost:8983/solr/chinese1/highlight?q=我国
  
   I get the following results:
  
   "highlighting":{
     "chinese1":{
       "id":["chinese1"],
       "title":["<em>我国</em>1<em>月份</em>的制造业<em>产值</em><em>同比</em>仅<em>增长</em>0"],
       "content":["<em>结束</em><em>连续</em>两个月的<em>萎缩</em>,但比经济师<em>普遍</em><em>预估</em>的<em>增长</em>3.3%<em>疲软</em>得多。这也意味着,<em>我国</em><em>今年</em><em>第一</em><em>季度</em>的<em>经济</em>很<em>可能</em>让人<em>失望</em>
         \n  "],
       "author":["<em>Edwin</em>"]},
     "chinese2":{
       "id":["chinese2"],
       "content":["<em>铜牌</em>,让<em>我国</em><em>暂时</em><em>高居</em><em>奖牌</em><em>荣誉</em>榜<em>榜首</em>。
         你看好新加坡在本届的东运会中,会<em>夺得</em><em>多少</em>面<em>金牌</em>?
         请在6月<em>12</em>日<em>中午</em>前,<em>投票</em>并<em>留言</em>为<em>我国</em><em>健将</em>寄上<em>祝语</em>吧
         \n  "],
       "author":["<em>Edwin</em>"]},
     "chinese3":{
       "id":["chinese3"],
       "content":[")<em>组成</em>的<em>我国</em><em>女队</em>在<em>今天</em>的东运会保龄球<em>女子</em>三人赛中,
         以六局3963总瓶分<em>夺冠</em>,为新加坡<em>赢得</em><em>本届</em><em>赛会</em>第三枚<em>金牌</em>。<em>队友</em>陈诗桦(Jazreel)、梁蕙芬和陈诗静以3707总瓶分<em>获得</em><em>亚军</em>,<em>季军</em>归菲律宾<em>女队</em>。(<em>联合</em><em>早报</em><em>记者</em>:郭嘉惠)
         \n  "],
       "author":["Edwin"]},
     "chinese4":{
       "id":["chinese4"],
       "content":["<em>配套</em>的<em>读者</em>,则可<em>获得</em>一架<em>价值</em>309元的Philips
         Viva Collection <em>HD</em>9045面<em>包机</em>。 \n
         欲订从速,<em>读者</em>可<em>登陆</em>www.wbsub.com
         .<em>sg</em>,或拨打客服<em>专线</em>6319
         1800<em>订购</em>。 \n
         <em>此外</em>,一年一度的<em>晚报</em><em>保健</em><em>美容</em>展,将在<em>本月</em><em>23</em>日和<em>24</em>日,在新达新加坡<em>会展</em><em>中心</em>401、402<em>展厅</em><em>举行</em>。
         \n
         <em>现场</em>将<em>开设</em>《<em>联合</em><em>晚报</em>》<em>订阅</em>展摊,<em>读者</em><em>当场</em><em>订阅</em><em>晚报</em>,<em>除了</em>可<em>获得</em><em>丰厚</em>的<em>赠品</em>,还有<em>机会</em><em>参与</em>“"],
       "author":["<em>Edwin</em>"]}}}
  
  
   Is there any suitable filter factory to solve this issue?
  
   I've tried WordDelimiterFilterFactory, PorterStemFilterFactory
   and StopFilterFactory, but there's no improvement in the search results.
  
  
   Regards,
   Edwin
  
  
   On 25 June 2015 at 17:17, Markus Jelsma markus.jel...@openindex.io
  wrote:
  
Hello - you can use HMMChineseTokenizerFactory instead.
   
   
  http://lucene.apache.org/core/5_2_0/analyzers-smartcn/org/apache/lucene/analysis/cn/smart/HMMChineseTokenizerFactory.html
   
-Original message-
 From:Zheng Lin Edwin Yeo edwinye...@gmail.com
 Sent: Thursday 25th June 2015 11:02
 To: solr-user@lucene.apache.org
 Subject: Tokenizer and Filter Factory to index Chinese characters

 Hi,

 Does anyone knows what is the correct replacement for these 2
  tokenizer
and
 filter factory to index chinese into Solr?
 - SmartChineseSentenceTokenizerFactory
 - SmartChineseWordTokenFilterFactory

 I understand that these 2 tokenizer and filter factory are already
 deprecated in Solr 5.1

Re: Tokenizer and Filter Factory to index Chinese characters

2015-06-25 Thread Zheng Lin Edwin Yeo
Thank you.

I've tried that, but when I do a search, it returns far more
highlighted results than it is supposed to.

For example, if I enter the following query:
http://localhost:8983/solr/chinese1/highlight?q=我国

I get the following results:

"highlighting":{
"chinese1":{
  "id":["chinese1"],
  "title":["<em>我国</em>1<em>月份</em>的制造业<em>产值</em><em>同比</em>仅<em>增长</em>0"],
  "content":["<em>结束</em><em>连续</em>两个月的<em>萎缩</em>,但比经济师<em>普遍</em><em>预估</em>的<em>增长</em>3.3%<em>疲软</em>得多。这也意味着,<em>我国</em><em>今年</em><em>第一</em><em>季度</em>的<em>经济</em>很<em>可能</em>让人<em>失望</em>
\n  "],
  "author":["<em>Edwin</em>"]},
"chinese2":{
  "id":["chinese2"],
  "content":["<em>铜牌</em>,让<em>我国</em><em>暂时</em><em>高居</em><em>奖牌</em><em>荣誉</em>榜<em>榜首</em>。
你看好新加坡在本届的东运会中,会<em>夺得</em><em>多少</em>面<em>金牌</em>?
请在6月<em>12</em>日<em>中午</em>前,<em>投票</em>并<em>留言</em>为<em>我国</em><em>健将</em>寄上<em>祝语</em>吧
 \n  "],
  "author":["<em>Edwin</em>"]},
"chinese3":{
  "id":["chinese3"],
  "content":[")<em>组成</em>的<em>我国</em><em>女队</em>在<em>今天</em>的东运会保龄球<em>女子</em>三人赛中,
以六局3963总瓶分<em>夺冠</em>,为新加坡<em>赢得</em><em>本届</em><em>赛会</em>第三枚<em>金牌</em>。<em>队友</em>陈诗桦(Jazreel)、梁蕙芬和陈诗静以3707总瓶分<em>获得</em><em>亚军</em>,<em>季军</em>归菲律宾<em>女队</em>。(<em>联合</em><em>早报</em><em>记者</em>:郭嘉惠)
\n  "],
  "author":["Edwin"]},
"chinese4":{
  "id":["chinese4"],
  "content":["<em>配套</em>的<em>读者</em>,则可<em>获得</em>一架<em>价值</em>309元的Philips
Viva Collection <em>HD</em>9045面<em>包机</em>。 \n
欲订从速,<em>读者</em>可<em>登陆</em>www.wbsub.com.<em>sg</em>,或拨打客服<em>专线</em>6319
1800<em>订购</em>。 \n
<em>此外</em>,一年一度的<em>晚报</em><em>保健</em><em>美容</em>展,将在<em>本月</em><em>23</em>日和<em>24</em>日,在新达新加坡<em>会展</em><em>中心</em>401、402<em>展厅</em><em>举行</em>。
\n
<em>现场</em>将<em>开设</em>《<em>联合</em><em>晚报</em>》<em>订阅</em>展摊,<em>读者</em><em>当场</em><em>订阅</em><em>晚报</em>,<em>除了</em>可<em>获得</em><em>丰厚</em>的<em>赠品</em>,还有<em>机会</em><em>参与</em>“"],
  "author":["<em>Edwin</em>"]}}}


Is there any suitable filter factory to solve this issue?

I've tried WordDelimiterFilterFactory, PorterStemFilterFactory
and StopFilterFactory, but there's no improvement in the search results.


Regards,
Edwin


On 25 June 2015 at 17:17, Markus Jelsma markus.jel...@openindex.io wrote:

 Hello - you can use HMMChineseTokenizerFactory instead.

 http://lucene.apache.org/core/5_2_0/analyzers-smartcn/org/apache/lucene/analysis/cn/smart/HMMChineseTokenizerFactory.html




RE: Tokenizer and Filter Factory to index Chinese characters

2015-06-25 Thread Markus Jelsma
Hi - we are actually using some other filters for Chinese, although they are 
not specialized for Chinese:

<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.CJKWidthFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.CJKBigramFilterFactory"/>
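
For reference, a minimal sketch of how this chain might sit inside a complete
field type. The name text_cjk and the positionIncrementGap value are
assumptions for illustration and are not stated in this thread; the four
factory classes are the ones listed above.

```xml
<!-- Sketch only: the field type name and positionIncrementGap are assumed -->
<fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="100">
  <!-- A single analyzer without a type attribute applies to both index and query -->
  <analyzer>
    <!-- Splits on Unicode word boundaries; Han characters come out one per token -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Normalizes full-width ASCII and half-width Katakana variants -->
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Joins adjacent CJK tokens into overlapping bigrams -->
    <filter class="solr.CJKBigramFilterFactory"/>
  </analyzer>
</fieldType>
```

With this chain used for both indexing and querying, a two-character query
such as 我国 would be analyzed into the single bigram 我国.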
 
 
 


RE: Tokenizer and Filter Factory to index Chinese characters

2015-06-25 Thread Markus Jelsma
Hello - you can use HMMChineseTokenizerFactory instead.
http://lucene.apache.org/core/5_2_0/analyzers-smartcn/org/apache/lucene/analysis/cn/smart/HMMChineseTokenizerFactory.html
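
A minimal sketch of the text_smartcn field type from the question rewritten
with this tokenizer. It assumes the Lucene analyzers-smartcn jar is already
loaded by Solr, and it uses one analyzer for both indexing and querying as a
simplification. The fully qualified class name is taken from the Javadoc URL
above; the field type name and positionIncrementGap are carried over from the
question.

```xml
<!-- Sketch only: assumes the Lucene analyzers-smartcn jar is on Solr's classpath -->
<fieldType name="text_smartcn" class="solr.TextField" positionIncrementGap="0">
  <!-- One analyzer without a type attribute is used for both indexing and querying -->
  <analyzer>
    <!-- Performs sentence and word segmentation in one step,
         taking the place of the two deprecated factories -->
    <tokenizer class="org.apache.lucene.analysis.cn.smart.HMMChineseTokenizerFactory"/>
  </analyzer>
</fieldType>
```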

-Original message-
 From:Zheng Lin Edwin Yeo edwinye...@gmail.com
 Sent: Thursday 25th June 2015 11:02
 To: solr-user@lucene.apache.org
 Subject: Tokenizer and Filter Factory to index Chinese characters
 
 Hi,
 
 Does anyone know the correct replacement for these two tokenizer and
 filter factories for indexing Chinese in Solr?
 - SmartChineseSentenceTokenizerFactory
 - SmartChineseWordTokenFilterFactory
 
 I understand that these two factories are already
 deprecated in Solr 5.1, but I can't seem to find the correct replacement.
 
 
 <fieldType name="text_smartcn" class="solr.TextField" positionIncrementGap="0">
   <analyzer type="index">
     <tokenizer class="org.apache.lucene.analysis.cn.smart.SmartChineseSentenceTokenizerFactory"/>
     <filter class="org.apache.lucene.analysis.cn.smart.SmartChineseWordTokenFilterFactory"/>
   </analyzer>
   <analyzer type="query">
     <tokenizer class="org.apache.lucene.analysis.cn.smart.SmartChineseSentenceTokenizerFactory"/>
     <filter class="org.apache.lucene.analysis.cn.smart.SmartChineseWordTokenFilterFactory"/>
   </analyzer>
 </fieldType>
 
 Thank you.
 
 
 Regards,
 Edwin
 


Re: Tokenizer and Filter Factory to index Chinese characters

2015-06-25 Thread Zheng Lin Edwin Yeo
Hi, the result doesn't seem that good either. So you're not using the
HMMChineseTokenizerFactory?

The output below is from the filters you've shown me.

"highlighting":{
"chinese1":{
  "id":["chinese1"],
  "title":["<em>我国</em>1<em>月份的制造业产值同比仅增长</em>0"],
  "content":[",<em>但比经济师普遍预估的增长</em>3.3%<em>疲软得多</em>。<em>这也意味着</em>,<em>我国今年第一季度的经济很可能让人失望</em>
\n  "],
  "author":["<em>Edwin</em>"]},
"chinese2":{
  "id":["chinese2"],
  "content":["<em>铜牌</em>,<em>让我国暂时高居奖牌荣誉榜榜首</em>。
<em>你看好新加坡在本届的东运会中</em>,<em>会夺得多少面金牌</em>?
<em>请在</em>6月<em>12</em><em>日中午前</em>,<em>投票并留言为我国健将寄上祝语吧</em>  \n
"],
  "author":["<em>Edwin</em>"]},
"chinese3":{
  "id":["chinese3"],
  "content":[")<em>组成的我国女队在今天的东运会保龄球女子三人赛中</em>,
<em>以六局</em>3963<em>总瓶分夺冠</em>,<em>为新加坡赢得本届赛会第三枚金牌</em>。<em>队友陈诗桦</em>(Jazreel)、<em>梁蕙芬和陈诗静以</em>3707<em>总瓶分获得亚军</em>,<em>季军归菲律宾女队</em>。(<em>联合早报记者</em>:<em>郭嘉惠</em>)
\n  "],
  "author":["<em>Edwin</em>"]},
"chinese4":{
  "id":["chinese4"],
  "content":[",<em>则可获得一架价值</em>309<em>元的</em>Philips Viva
Collection HD9045<em>面包机</em>。 \n
<em>欲订从速</em>,<em>读者可登陆</em>www.wbsub.com.sg,<em>或拨打客服专线</em>6319
1800<em>订购</em>。 \n
<em>此外</em>,<em>一年一度的晚报保健美容展</em>,<em>将在本月</em><em>23</em><em>日和</em><em>24</em>日,<em>在新达新加坡会展中心</em>401、402<em>展厅举行</em>。
\n
<em>现场将开设</em>《<em>联合晚报</em>》<em>订阅展摊</em>,<em>读者当场订阅晚报</em>,<em>除了可获得丰厚的赠品</em>,<em>还有机会参与</em>“<em>必胜</em>”<em>幸运抽奖</em>"],
  "author":["<em>Edwin</em>"]}}}


Regards,
Edwin


2015-06-25 17:28 GMT+08:00 Markus Jelsma markus.jel...@openindex.io:

 Hi - we are actually using some other filters for Chinese, although they
 are not specialized for Chinese:

 <tokenizer class="solr.StandardTokenizerFactory"/>
 <filter class="solr.CJKWidthFilterFactory"/>
 <filter class="solr.LowerCaseFilterFactory"/>
 <filter class="solr.CJKBigramFilterFactory"/>


 -Original message-
  From:Zheng Lin Edwin Yeo edwinye...@gmail.com
  Sent: Thursday 25th June 2015 11:24
  To: solr-user@lucene.apache.org
  Subject: Re: Tokenizer and Filter Factory to index Chinese characters
 
  Thank you.
 
  I've tried that, but when I do a search, it returns far more
  highlighted results than it is supposed to.
 
  For example, if I enter the following query:
  http://localhost:8983/solr/chinese1/highlight?q=我国
 
  I get the following results:
 
  "highlighting":{
  "chinese1":{
    "id":["chinese1"],
    "title":["<em>我国</em>1<em>月份</em>的制造业<em>产值</em><em>同比</em>仅<em>增长</em>0"],
    "content":["<em>结束</em><em>连续</em>两个月的<em>萎缩</em>,但比经济师<em>普遍</em><em>预估</em>的<em>增长</em>3.3%<em>疲软</em>得多。这也意味着,<em>我国</em><em>今年</em><em>第一</em><em>季度</em>的<em>经济</em>很<em>可能</em>让人<em>失望</em>
  \n  "],
    "author":["<em>Edwin</em>"]},
  "chinese2":{
    "id":["chinese2"],
    "content":["<em>铜牌</em>,让<em>我国</em><em>暂时</em><em>高居</em><em>奖牌</em><em>荣誉</em>榜<em>榜首</em>。
  你看好新加坡在本届的东运会中,会<em>夺得</em><em>多少</em>面<em>金牌</em>?
  请在6月<em>12</em>日<em>中午</em>前,<em>投票</em>并<em>留言</em>为<em>我国</em><em>健将</em>寄上<em>祝语</em>吧
   \n  "],
    "author":["<em>Edwin</em>"]},
  "chinese3":{
    "id":["chinese3"],
    "content":[")<em>组成</em>的<em>我国</em><em>女队</em>在<em>今天</em>的东运会保龄球<em>女子</em>三人赛中,
  以六局3963总瓶分<em>夺冠</em>,为新加坡<em>赢得</em><em>本届</em><em>赛会</em>第三枚<em>金牌</em>。<em>队友</em>陈诗桦(Jazreel)、梁蕙芬和陈诗静以3707总瓶分<em>获得</em><em>亚军</em>,<em>季军</em>归菲律宾<em>女队</em>。(<em>联合</em><em>早报</em><em>记者</em>:郭嘉惠)
  \n  "],
    "author":["Edwin"]},
  "chinese4":{
    "id":["chinese4"],
    "content":["<em>配套</em>的<em>读者</em>,则可<em>获得</em>一架<em>价值</em>309元的Philips
  Viva Collection <em>HD</em>9045面<em>包机</em>。 \n
  欲订从速,<em>读者</em>可<em>登陆</em>www.wbsub.com.<em>sg</em>,或拨打客服<em>专线</em>6319
  1800<em>订购</em>。 \n
  <em>此外</em>,一年一度的<em>晚报</em><em>保健</em><em>美容</em>展,将在<em>本月</em><em>23</em>日和<em>24</em>日,在新达新加坡<em>会展</em><em>中心</em>401、402<em>展厅</em><em>举行</em>。
  \n
  <em>现场</em>将<em>开设</em>《<em>联合</em><em>晚报</em>》<em>订阅</em>展摊,<em>读者</em><em>当场</em><em>订阅</em><em>晚报</em>,<em>除了</em>可<em>获得</em><em>丰厚</em>的<em>赠品</em>,还有<em>机会</em><em>参与</em>“"],
    "author":["<em>Edwin</em>"]}}}
 
 
  Is there any suitable filter factory to solve this issue?
 
  I've tried WordDelimiterFilterFactory, PorterStemFilterFactory
  and StopFilterFactory, but there's no improvement in the search results.
 
 
  Regards,
  Edwin
 
 
  On 25 June 2015 at 17:17, Markus Jelsma markus.jel...@openindex.io
 wrote:
 
   Hello - you can use HMMChineseTokenizerFactory instead.
  
  
 http://lucene.apache.org/core/5_2_0/analyzers-smartcn/org/apache/lucene/analysis/cn/smart/HMMChineseTokenizerFactory.html
  
   -Original message-
From:Zheng Lin Edwin Yeo edwinye...@gmail.com
Sent: Thursday 25th June 2015 11:02
To: solr-user@lucene.apache.org
Subject: Tokenizer and Filter Factory to index Chinese characters
   
Hi,
   
Does anyone know the correct replacement for these two tokenizer and
filter factories for indexing Chinese in Solr?
- SmartChineseSentenceTokenizerFactory
- SmartChineseWordTokenFilterFactory

I understand that these two factories are already
deprecated in Solr 5.1, but I can't seem to find the correct replacement.
   
   
fieldType name=text_smartcn class=solr.TextField
positionIncrementGap=0
  analyzer type=index
tokenizer
   
  
 class=org.apache.lucene.analysis.cn.smart.SmartChineseSentenceTokenizerFactory/
filter
   
  
 class=org.apache.lucene.analysis.cn.smart.SmartChineseWordTokenFilterFactory/
  /analyzer
  analyzer type=query
tokenizer