Vishal Shah wrote:
> Hi,
>
> If I understand correctly, there is a common tokenizer for all fields
> (URL, content, meta etc.). This tokenizer does not use the underscore
> character as a separator. Since a lot of URLs use underscore to separate
> different words, it would be better if the URLs are tokenized slightly
> differently from the other fields. I tried looking at the
> NutchDocumentAnalyzer and related files, but can't figure out a clear
> way to implement a new tokenizer for URLs only. Any ideas as to how to
> go about doing this?
>
> Thanks,
>
> -vishal.
>
>
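Hi, there is no straightforward way to do this without modifying the
default tokenizing code.
First, copy NutchAnalysis.jj to URLAnalysis.jj (or any name you like),
rename the parser inside the grammar to match so that JavaCC generates
URLAnalysisTokenManager, and change
    | <#WORD_PUNCT: ("_"|"&")>
to:
    | <#WORD_PUNCT: ("&")>
so that the underscore is no longer treated as part of a word, then
recompile with JavaCC.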
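To illustrate the intended effect of that grammar change, here is a minimal hand-written sketch. It is not the JavaCC-generated tokenizer and it ignores the other rules in NutchAnalysis.jj; the class name WordPunctDemo, the tokenize method and the example URL are made up for this illustration, and it simply assumes that WORD_PUNCT characters count as word characters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustration only: a hypothetical stand-in, not the generated Nutch tokenizer.
class WordPunctDemo {

    // Splits text into lower-cased tokens. A character belongs to a token if it
    // is a letter, a digit, or listed in wordPunct; anything else is a separator.
    static List<String> tokenize(String text, String wordPunct) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean wordChar = Character.isLetterOrDigit(c) || wordPunct.indexOf(c) >= 0;
            if (wordChar) {
                current.append(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString().toLowerCase(Locale.ROOT));
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString().toLowerCase(Locale.ROOT));
        }
        return tokens;
    }

    public static void main(String[] args) {
        String url = "http://example.com/my_first_page.html";
        // WORD_PUNCT as ("_"|"&"): the underscore keeps the words glued together.
        System.out.println(tokenize(url, "_&")); // [http, example, com, my_first_page, html]
        // WORD_PUNCT as ("&"): the underscore now separates the words.
        System.out.println(tokenize(url, "&"));  // [http, example, com, my, first, page, html]
    }
}
```

With the underscore removed from the set, a query for "first" can match the URL, which is the behavior asked for in the original question.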
Then copy NutchDocumentTokenizer to URLTokenizer and change its
references to NutchAnalysisTokenManager into references to
URLAnalysisTokenManager.
Next, write an Analyzer like this:
    private static class URLAnalyzer extends Analyzer {
        public URLAnalyzer() {
        }

        public TokenStream tokenStream(String field, Reader reader) {
            return new URLTokenizer(reader);
        }
    }
Finally, in NutchDocumentAnalyzer, change
    if ("anchor".equals(fieldName))
        analyzer = ANCHOR_ANALYZER;
    else
        analyzer = CONTENT_ANALYZER;
to
    if ("anchor".equals(fieldName))
        analyzer = ANCHOR_ANALYZER;
    else if ("url".equals(fieldName))
        analyzer = URL_ANALYZER;
    else
        analyzer = CONTENT_ANALYZER;
assuming URL_ANALYZER is an instance of URLAnalyzer.
I have not tested this, but it should work as expected.
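For reference, the per-field dispatch described above can be tried out in isolation with a small sketch that does not depend on Lucene or Nutch. Everything here is a hypothetical stand-in: FieldDispatchDemo, split and analyzerFor are invented names, and the "analyzers" are plain functions that only mimic the underscore handling, not the real Analyzer classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;

// Illustration only: plain functions stand in for the real Analyzer instances.
class FieldDispatchDemo {

    // Lower-cases the text and splits it on the given separator pattern,
    // dropping the empty strings that a leading separator would produce.
    static List<String> split(String text, String separatorRegex) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split(separatorRegex)) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Assumed default behavior: "_" and "&" are kept inside words.
    static final Function<String, List<String>> CONTENT_ANALYZER =
            text -> split(text, "[^a-z0-9_&]+");
    // The anchor analyzer is not modeled here; it reuses the content rule.
    static final Function<String, List<String>> ANCHOR_ANALYZER = CONTENT_ANALYZER;
    // URL variant: "_" is a separator, "&" is still kept inside words.
    static final Function<String, List<String>> URL_ANALYZER =
            text -> split(text, "[^a-z0-9&]+");

    // Same branching as the modified NutchDocumentAnalyzer snippet.
    static Function<String, List<String>> analyzerFor(String fieldName) {
        if ("anchor".equals(fieldName)) {
            return ANCHOR_ANALYZER;
        } else if ("url".equals(fieldName)) {
            return URL_ANALYZER;
        } else {
            return CONTENT_ANALYZER;
        }
    }

    public static void main(String[] args) {
        String text = "release_notes";
        System.out.println(analyzerFor("url").apply(text));     // [release, notes]
        System.out.println(analyzerFor("content").apply(text)); // [release_notes]
    }
}
```

Only the "url" field takes the new path; every other field keeps the existing behavior, so content that was already indexed with the default tokenizer is unaffected.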
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general