NutchAnalysis segments CJK text character by character. To make Nutch
support Chinese well, I developed an analysis plug-in for Chinese.
I put ChineseAnalyzer.class (which extends NutchAnalyzer) and
ChineseTokenizer.class in a jar, nutch-0.8.1/plugins/analysis-zh.jar, and
configured plugin.xml and nutch-site.xml. I expected Nutch to use
ChineseAnalyzer in place of NutchDocumentAnalyzer, but it did not. What is
wrong?
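To illustrate the segmentation behaviour mentioned above, here is a minimal, self-contained Java sketch. It assumes that the default analysis emits each Han character as its own token while grouping other letters and digits into words. It uses no Nutch or Lucene classes, and UnigramCjkSketch and its methods are hypothetical names chosen for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

public class UnigramCjkSketch {

    // Hypothetical stand-in for a CJK check: true for code points in the Han script.
    public static boolean isHan(int codePoint) {
        return Character.UnicodeScript.of(codePoint) == Character.UnicodeScript.HAN;
    }

    // Emits every Han character as a single token; runs of other letters
    // or digits are kept together as one token.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (isHan(cp)) {
                flush(word, tokens);
                tokens.add(new String(Character.toChars(cp)));
            } else if (Character.isLetterOrDigit(cp)) {
                word.appendCodePoint(cp);
            } else {
                flush(word, tokens); // whitespace or punctuation ends a word
            }
        }
        flush(word, tokens);
        return tokens;
    }

    // Moves any pending non-Han word into the token list.
    private static void flush(StringBuilder word, List<String> tokens) {
        if (word.length() > 0) {
            tokens.add(word.toString());
            word.setLength(0);
        }
    }

    public static void main(String[] args) {
        // "Nutch" followed by four Han characters, written as Unicode escapes.
        System.out.println(tokenize("Nutch \u4e2d\u6587\u5206\u8bcd"));
    }
}
```

With this input the four Han characters come out as four separate tokens, which is why a dictionary-based ChineseTokenizer is wanted instead.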
plugin.xml configuration:

<?xml version="1.0" encoding="UTF-8"?>
<plugin
   id="analysis-zh"
   name="Chinese Analysis Plug-in"
   version="1.0.0"
   provider-name="org.apache.nutch">

   <runtime>
      <library name="analysis-zh.jar">
         <export name="*"/>
      </library>
   </runtime>

   <requires>
      <import plugin="nutch-extensionpoints"/>
   </requires>

   <extension id="org.apache.nutch.analysis.zh"
              name="ChineseAnalyzer"
              point="org.apache.nutch.analysis.NutchAnalyzer">
      <implementation id="ChineseAnalyzer"
                      class="org.apache.nutch.analysis.zh.ChineseAnalyzer">
         <parameter name="lang" value="zh"/>
      </implementation>
   </extension>

</plugin>
Here are some excerpts from nutch-site.xml:

<property>
   <name>plugin.includes</name>
   <value>protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|analysis-zh</value>
   <description>indexing and search plugins.</description>
</property>
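The plugin.includes value is a regular expression over plug-in ids, so one quick sanity check is whether analysis-zh actually matches it. The sketch below does that with plain java.util.regex. It assumes an id must match the whole expression to be included, and PluginIncludesSketch is a hypothetical helper, not a Nutch class.

```java
import java.util.regex.Pattern;

public class PluginIncludesSketch {

    // Value copied from the plugin.includes property above.
    public static final String INCLUDES =
        "protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|"
        + "query-(basic|site|url)|summary-basic|scoring-opic|analysis-zh";

    // Assumption: a plug-in is included when its id matches the whole expression.
    public static boolean included(String pluginId) {
        return Pattern.compile(INCLUDES).matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        String[] ids = {"analysis-zh", "parse-html", "query-site", "lib-http"};
        for (String id : ids) {
            System.out.println(id + " -> " + included(id));
        }
    }
}
```

Under that assumption analysis-zh is matched, so the include list itself does not look like the problem. An id such as lib-http is not matched by the expression, yet it appears in the log, presumably because auto-activation is on.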
Here are some excerpts from the hadoop log:

2007-04-02 21:35:32,687 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true]
2007-04-02 21:35:32,687 INFO plugin.PluginRepository - Registered Plugins:
2007-04-02 21:35:32,687 INFO plugin.PluginRepository - CyberNeko HTML Parser (lib-nekohtml)
2007-04-02 21:35:32,687 INFO plugin.PluginRepository - Site Query Filter (query-site)
2007-04-02 21:35:32,687 INFO plugin.PluginRepository - Html Parse Plug-in (parse-html)
2007-04-02 21:35:32,687 INFO plugin.PluginRepository - Regex URL Filter Framework (lib-regex-filter)
2007-04-02 21:35:32,687 INFO plugin.PluginRepository - Basic Indexing Filter (index-basic)
2007-04-02 21:35:32,687 INFO plugin.PluginRepository - Basic Summarizer Plug-in (summary-basic)
2007-04-02 21:35:32,687 INFO plugin.PluginRepository - Text Parse Plug-in (parse-text)
2007-04-02 21:35:32,687 INFO plugin.PluginRepository - JavaScript Parser (parse-js)
2007-04-02 21:35:32,687 INFO plugin.PluginRepository - Regex URL Filter (urlfilter-regex)
2007-04-02 21:35:32,687 INFO plugin.PluginRepository - Basic Query Filter (query-basic)
2007-04-02 21:35:32,687 INFO plugin.PluginRepository - HTTP Framework (lib-http)
2007-04-02 21:35:32,687 INFO plugin.PluginRepository - URL Query Filter (query-url)
2007-04-02 21:35:32,687 INFO plugin.PluginRepository - Chinese Analysis Plug-in (analysis-zh)
......
2007-04-02 21:36:26,234 INFO indexer.Indexer - Indexing [http://2008.163.com/] with analyzer [EMAIL PROTECTED] (null)
2007-04-02 21:36:26,359 INFO indexer.Indexer - Indexing [http://auto.163.com/] with analyzer [EMAIL PROTECTED] (null)
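One possible reading of the trailing (null) in these Indexer lines is that the indexer looks up the analyzer by a per-document language value, and that this value is missing. That is an assumption and is not confirmed by the excerpts above. The sketch below models such a lookup in plain Java. AnalyzerLookupSketch and the string names are hypothetical and stand in for the real classes.

```java
import java.util.HashMap;
import java.util.Map;

public class AnalyzerLookupSketch {

    // Hypothetical registry: language code -> analyzer name.
    private final Map<String, String> byLang = new HashMap<>();
    private final String defaultAnalyzer;

    public AnalyzerLookupSketch(String defaultAnalyzer) {
        this.defaultAnalyzer = defaultAnalyzer;
    }

    // Mirrors the <parameter name="lang" value="zh"/> idea from plugin.xml.
    public void register(String lang, String analyzer) {
        byLang.put(lang, analyzer);
    }

    // Assumed behaviour: no language, or an unknown one, yields the default.
    public String get(String lang) {
        if (lang == null) {
            return defaultAnalyzer;
        }
        return byLang.getOrDefault(lang, defaultAnalyzer);
    }

    public static void main(String[] args) {
        AnalyzerLookupSketch factory = new AnalyzerLookupSketch("NutchDocumentAnalyzer");
        factory.register("zh", "ChineseAnalyzer");
        System.out.println("lang=zh   -> " + factory.get("zh"));
        System.out.println("lang=null -> " + factory.get(null));
    }
}
```

If the lookup works roughly like this, a registered ChineseAnalyzer would never be chosen while the language of each document stays null. It may be worth checking whether anything sets a language on the documents before indexing.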