NutchAnalysis segments CJK text character by character. To make Nutch
support Chinese well, I developed a Chinese analysis plug-in.
I packaged ChineseAnalyzer.class (which extends NutchAnalyzer) and
ChineseTokenizer.class in a jar, nutch-0.8.1/plugins/analysis-zh.jar, and
configured plugin.xml and nutch-site.xml. I expected Nutch to use
ChineseAnalyzer in place of NutchDocumentAnalyzer, but it did not. What is
wrong?
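
To make the segmentation problem concrete, here is a minimal Java sketch of per-character CJK tokenization. It is an illustration only, not the NutchAnalysis grammar and not the ChineseTokenizer from the plug-in; the class and method names are hypothetical. It assumes each Han ideograph becomes its own token while runs of other letters and digits stay together.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not part of Nutch or of the analysis-zh plug-in.
class CjkUnigramSketch {

    // Emits every ideograph as a single-character token and groups
    // consecutive non-ideographic letters/digits into one token.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isIdeographic(cp)) {
                flush(word, tokens);                        // end any pending Latin word
                tokens.add(new String(Character.toChars(cp)));
            } else if (Character.isLetterOrDigit(cp)) {
                word.appendCodePoint(cp);                   // keep building the current word
            } else {
                flush(word, tokens);                        // whitespace or punctuation
            }
        }
        flush(word, tokens);
        return tokens;
    }

    // Moves the pending word, if any, into the token list.
    private static void flush(StringBuilder word, List<String> tokens) {
        if (word.length() > 0) {
            tokens.add(word.toString());
            word.setLength(0);
        }
    }
}
```

With this sketch, `CjkUnigramSketch.tokenize("Nutch 中文分词")` yields the five tokens `Nutch`, `中`, `文`, `分`, `词`, which is why a phrase search cannot tell real Chinese words apart from adjacent characters.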
plugin.xml configuration:

<?xml version="1.0" encoding="UTF-8"?>

<plugin
   id="analysis-zh"
   name="Chinese Analysis Plug-in"
   version="1.0.0"
   provider-name="org.apache.nutch">

   <runtime>
      <library name="analysis-zh.jar">
         <export name="*"/>
      </library>
   </runtime>

   <requires>
      <import plugin="nutch-extensionpoints" />
   </requires>

   <extension id="org.apache.nutch.analysis.zh"
              name="ChineseAnalyzer"
              point="org.apache.nutch.analysis.NutchAnalyzer">

      <implementation id="ChineseAnalyzer"
                      class="org.apache.nutch.analysis.zh.ChineseAnalyzer">
         <parameter name="lang" value="zh" />
      </implementation>

   </extension>

</plugin>

Here are some excerpts from nutch-site.xml:

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|analysis-zh</value>
  <description>indexing and search plugins.</description>
</property>

Here are some excerpts from the hadoop log:

2007-04-02 21:35:32,687 INFO  plugin.PluginRepository - Plugin Auto-activation mode: [true]
2007-04-02 21:35:32,687 INFO  plugin.PluginRepository - Registered Plugins:
2007-04-02 21:35:32,687 INFO  plugin.PluginRepository -  CyberNeko HTML Parser (lib-nekohtml)
2007-04-02 21:35:32,687 INFO  plugin.PluginRepository -  Site Query Filter (query-site)
2007-04-02 21:35:32,687 INFO  plugin.PluginRepository -  Html Parse Plug-in (parse-html)
2007-04-02 21:35:32,687 INFO  plugin.PluginRepository -  Regex URL Filter Framework (lib-regex-filter)
2007-04-02 21:35:32,687 INFO  plugin.PluginRepository -  Basic Indexing Filter (index-basic)
2007-04-02 21:35:32,687 INFO  plugin.PluginRepository -  Basic Summarizer Plug-in (summary-basic)
2007-04-02 21:35:32,687 INFO  plugin.PluginRepository -  Text Parse Plug-in (parse-text)
2007-04-02 21:35:32,687 INFO  plugin.PluginRepository -  JavaScript Parser (parse-js)
2007-04-02 21:35:32,687 INFO  plugin.PluginRepository -  Regex URL Filter (urlfilter-regex)
2007-04-02 21:35:32,687 INFO  plugin.PluginRepository -  Basic Query Filter (query-basic)
2007-04-02 21:35:32,687 INFO  plugin.PluginRepository -  HTTP Framework (lib-http)
2007-04-02 21:35:32,687 INFO  plugin.PluginRepository -  URL Query Filter (query-url)
2007-04-02 21:35:32,687 INFO  plugin.PluginRepository -  Chinese Analysis Plug-in (analysis-zh)

......

2007-04-02 21:36:26,234 INFO  indexer.Indexer -  Indexing [http://2008.163.com/] with analyzer [EMAIL PROTECTED] (null)
2007-04-02 21:36:26,359 INFO  indexer.Indexer -  Indexing [http://auto.163.com/] with analyzer [EMAIL PROTECTED] (null)
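
The trailing (null) on each Indexing line looks like the document's language code. If, as the lang parameter in plugin.xml suggests, the analyzer is picked by matching that code against each plug-in's lang value, then a document with no language would fall through to the default analyzer. The Java sketch below illustrates that assumed lookup only. It is not Nutch source, and the class, method, and string names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of a language-keyed analyzer lookup; an assumption
// about the selection logic, not code taken from Nutch.
class AnalyzerLookupSketch {

    // Name standing in for the analyzer used when no language matches.
    static final String DEFAULT_ANALYZER = "NutchDocumentAnalyzer";

    // Maps a language code (the plug-in's "lang" parameter) to an analyzer name.
    private final Map<String, String> byLang = new HashMap<>();

    void register(String lang, String analyzerName) {
        byLang.put(lang, analyzerName);
    }

    // Returns the analyzer registered for lang, or the default when the
    // document has no language or no plug-in claims that language.
    String pick(String lang) {
        if (lang == null) {
            return DEFAULT_ANALYZER;
        }
        return byLang.getOrDefault(lang, DEFAULT_ANALYZER);
    }
}
```

Under this assumption, registering "zh" is not enough on its own: `pick(null)` still returns the default, so ChineseAnalyzer would only be chosen once something sets the document language to zh during indexing.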
_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
