[ http://issues.apache.org/jira/browse/NUTCH-389?page=comments#action_12445512 ]

Enis Soztutar commented on NUTCH-389:
-------------------------------------

Otis, you can test the tokenizer using the TestUrlTokenizer JUnit test case, and you can test the NutchDocumentTokenizer by running NutchDocumentTokenizer's main method.

NutchDocumentTokenizer tokenizes http://www.foo_bar.com/baz_bar?car&dar_mar as
http www foo_bar com baz_bar car&dar_mar

whereas urlTokenizer tokenizes the above URL as
http www foo bar com baz bar car dar mar

so it will match the queries "baz", "bar", "car", "dar" and "mar" as well.

For the URL http://www.google.com.tr/firefox?client=firefox-a&rls=org.mozilla:en-US:official

NutchDocumentTokenizer gives tokens :
http www google com tr firefox client firefox a&rls org mozilla en us official

urlTokenizer gives tokens :
http www google com tr firefox client firefox a rls org mozilla en US official

> a url tokenizer implementation for tokenizing index fields : url and host
> -------------------------------------------------------------------------
>
>                 Key: NUTCH-389
>                 URL: http://issues.apache.org/jira/browse/NUTCH-389
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer
>    Affects Versions: 0.9.0
>            Reporter: Enis Soztutar
>            Priority: Minor
>         Attachments: urlTokenizer.diff
>
>
> NutchAnalysis.jj tokenizes the input by treating & and _ as non-token
> separators, which is not appropriate in the case of URLs. So I have
> written a URL tokenizer which returns the tokens that match the regular
> expression [a-zA-Z0-9]. As stated in http://www.gbiv.com/protocols/uri/rfc/rfc3986.html,
> which describes the grammar for URIs, URLs can be tokenized with the above
> expression.
> NutchDocumentAnalyzer code is modified to use the UrlTokenizer with the
> "url", "site" and "host" fields.
> see : http://www.mail-archive.com/nutch-user@lucene.apache.org/msg06247.html
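
For readers without the patch applied, the URL tokenization described in this message can be approximated with a few lines of plain Java. This is only a rough sketch, not the UrlTokenizer from urlTokenizer.diff: the class name UrlTokenSketch and the method tokenize are made up for illustration, and it assumes the character class [a-zA-Z0-9] is applied with a + quantifier so that each run of letters and digits becomes one token. It does not lowercase anything and does not use the Lucene or Nutch analysis classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; NOT the UrlTokenizer from urlTokenizer.diff.
public class UrlTokenSketch {

    // Assumption: a token is a maximal run of ASCII letters or digits.
    // Every other character (including '&' and '_') acts as a separator.
    private static final Pattern TOKEN = Pattern.compile("[a-zA-Z0-9]+");

    // Returns the tokens of the given URL in order of appearance, case preserved.
    public static List<String> tokenize(String url) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(url);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints [http, www, foo, bar, com, baz, bar, car, dar, mar]
        System.out.println(tokenize("http://www.foo_bar.com/baz_bar?car&dar_mar"));
    }
}
```

With this sketch, foo_bar and car&dar_mar are split into separate tokens, which is why queries such as "bar" or "dar" can match the url field.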