The analyzer is selected by the "lang" property of the doc object, and that
property is set by the LanguageIdentifier plug-in. Unfortunately,
LanguageIdentifier does not support Chinese at the moment. If you only need
to handle Chinese and English documents, you can hardcode the lang property
to "zh". In "Indexer.java", modify the code as below:
// NutchAnalyzer analyzer = factory.get(doc.get("lang"));
NutchAnalyzer analyzer = factory.get("zh");
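If you would rather keep whatever language has already been identified and
only fall back to "zh" when the property is missing, the idea can be sketched
as below. This is a stand-alone illustration, not Nutch code: the class
LangFallback, the method resolveLang and the constant DEFAULT_LANG are
hypothetical names, and a plain Map stands in for the doc object.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-alone sketch: hypothetical helper, not part of the Nutch API.
class LangFallback {
    // Language assumed when the document carries no usable "lang" value.
    static final String DEFAULT_LANG = "zh";

    // Returns the detected language, or DEFAULT_LANG when it is null or blank.
    static String resolveLang(String detected) {
        if (detected == null || detected.trim().isEmpty()) {
            return DEFAULT_LANG;
        }
        return detected.trim();
    }

    public static void main(String[] args) {
        // A plain map stands in for the fields of the indexed document.
        Map<String, String> doc = new HashMap<>();
        System.out.println(resolveLang(doc.get("lang"))); // prints "zh" (no lang set)
        doc.put("lang", "en");
        System.out.println(resolveLang(doc.get("lang"))); // prints "en"
    }
}
```

In Indexer.java the lookup would then presumably read
factory.get(resolveLang(doc.get("lang"))), assuming factory.get takes the
language string as in the snippet above. Note that this only helps when the
lang property is absent; it does nothing for a Chinese page that has been
identified as some other language.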
----- Original Message -----
From: "zhao xiuwen" <[EMAIL PROTECTED]>
To: <[email protected]>
Sent: Tuesday, April 03, 2007 12:33 AM
Subject: Replace CJK language analyzer in nutch
> NutchAnalysis segments CJK terms word by word. In order to make Nutch
> support Chinese well, I developed a plug-in for Chinese.
> I placed ChineseAnalyzer.class (extends NutchAnalyzer) and
> ChineseTokenizer.class in a jar, nutch-0.8.1/plugins/analysis-zh.jar, and
> configured plugin.xml and nutch-site.xml. I expected Nutch to
> replace NutchDocumentAnalyzer with ChineseAnalyzer, but it didn't. What's
> wrong?
> plugin.xml configuration:
>
> <?xml version="1.0" encoding="UTF-8"?>
>
> <plugin
>    id="analysis-zh"
>    name="Chinese Analysis Plug-in"
>    version="1.0.0"
>    provider-name="org.apache.nutch">
>
>    <runtime>
>       <library name="analysis-zh.jar">
>          <export name="*"/>
>       </library>
>    </runtime>
>
>    <requires>
>       <import plugin="nutch-extensionpoints" />
>    </requires>
>
>    <extension id="org.apache.nutch.analysis.zh"
>               name="ChineseAnalyzer"
>               point="org.apache.nutch.analysis.NutchAnalyzer">
>
>       <implementation id="ChineseAnalyzer"
>                       class="org.apache.nutch.analysis.zh.ChineseAnalyzer">
>          <parameter name="lang" value="zh" />
>       </implementation>
>
>    </extension>
>
> </plugin>
>
> Here are some excerpts from nutch-site.xml:
>
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|analysis-zh</value>
>   <description> indexing and search plugins.
>   </description>
> </property>
>
> Here are some excerpts from the hadoop log:
>
> 2007-04-02 21:35:32,687 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true]
> 2007-04-02 21:35:32,687 INFO plugin.PluginRepository - Registered Plugins:
> 2007-04-02 21:35:32,687 INFO plugin.PluginRepository - CyberNeko HTML Parser (lib-nekohtml)
> 2007-04-02 21:35:32,687 INFO plugin.PluginRepository - Site Query Filter (query-site)
> 2007-04-02 21:35:32,687 INFO plugin.PluginRepository - Html Parse Plug-in (parse-html)
> 2007-04-02 21:35:32,687 INFO plugin.PluginRepository - Regex URL Filter Framework (lib-regex-filter)
> 2007-04-02 21:35:32,687 INFO plugin.PluginRepository - Basic Indexing Filter (index-basic)
> 2007-04-02 21:35:32,687 INFO plugin.PluginRepository - Basic Summarizer Plug-in (summary-basic)
> 2007-04-02 21:35:32,687 INFO plugin.PluginRepository - Text Parse Plug-in (parse-text)
> 2007-04-02 21:35:32,687 INFO plugin.PluginRepository - JavaScript Parser (parse-js)
> 2007-04-02 21:35:32,687 INFO plugin.PluginRepository - Regex URL Filter (urlfilter-regex)
> 2007-04-02 21:35:32,687 INFO plugin.PluginRepository - Basic Query Filter (query-basic)
> 2007-04-02 21:35:32,687 INFO plugin.PluginRepository - HTTP Framework (lib-http)
> 2007-04-02 21:35:32,687 INFO plugin.PluginRepository - URL Query Filter (query-url)
> 2007-04-02 21:35:32,687 INFO plugin.PluginRepository - Chinese Analysis Plug-in (analysis-zh)
>
> ......
>
> 2007-04-02 21:36:26,234 INFO indexer.Indexer - Indexing [http://2008.163.com/] with analyzer [EMAIL PROTECTED]<[EMAIL PROTECTED]> (null)
> 2007-04-02 21:36:26,359 INFO indexer.Indexer - Indexing [http://auto.163.com/] with analyzer [EMAIL PROTECTED]<[EMAIL PROTECTED]> (null)
>
_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers