[ 
http://issues.apache.org/jira/browse/NUTCH-389?page=comments#action_12445512 ] 
            
Enis Soztutar commented on NUTCH-389:
-------------------------------------

Otis, you can test the tokenizer using the TestUrlTokenizer JUnit test case, and 
you can test the NutchDocumentTokenizer by running the NutchDocumentTokenizer's 
main method. 

NutchDocumentTokenizer tokenizes http://www.foo_bar.com/baz_bar?car&dar_mar as 

    http www foo_bar com baz_bar car&dar_mar


whereas UrlTokenizer tokenizes the above URL as

    http www foo bar com baz bar car dar mar

so it will match the queries "baz", "bar", "car", "dar" and "mar" as well.

For the URL 
http://www.google.com.tr/firefox?client=firefox-a&rls=org.mozilla:en-US:official

NutchDocumentTokenizer gives the tokens

    http www google com tr firefox client firefox a&rls org mozilla en us official

whereas UrlTokenizer gives the tokens

    http www google com tr firefox client firefox a rls org mozilla en US official
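
The UrlTokenizer results above can be approximated outside Nutch with a few lines of Python. This is only a rough sketch, not the code from urlTokenizer.diff: the function name url_tokens is made up for illustration, and it assumes a token is a maximal run of characters from the class [a-zA-Z0-9] with case left untouched.

```python
import re

# Assumption: a token is a maximal run of ASCII letters or digits, i.e. the
# character class [a-zA-Z0-9] from the issue description, repeated one or
# more times. Everything else (. / ? & _ = : -) acts as a separator.
TOKEN = re.compile(r"[a-zA-Z0-9]+")


def url_tokens(url):
    """Return the alphanumeric runs of url, in order, with case preserved."""
    return TOKEN.findall(url)


if __name__ == "__main__":
    for url in (
        "http://www.foo_bar.com/baz_bar?car&dar_mar",
        "http://www.google.com.tr/firefox"
        "?client=firefox-a&rls=org.mozilla:en-US:official",
    ):
        # Print the tokens space-separated, in the same style as the
        # examples in this comment.
        print(" ".join(url_tokens(url)))
```

Under these assumptions the two URLs split into the same token lists quoted for UrlTokenizer, with "_" and "&" acting as separators.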



> a url tokenizer implementation for tokenizing index fields : url and host
> -------------------------------------------------------------------------
>
>                 Key: NUTCH-389
>                 URL: http://issues.apache.org/jira/browse/NUTCH-389
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer
>    Affects Versions: 0.9.0
>            Reporter: Enis Soztutar
>            Priority: Minor
>         Attachments: urlTokenizer.diff
>
>
> NutchAnalysis.jj tokenizes the input without treating & and _ as token 
> separators, which is not appropriate for URLs. So I have written a URL 
> tokenizer which emits the tokens that match the regular expression 
> [a-zA-Z0-9]. As stated in http://www.gbiv.com/protocols/uri/rfc/rfc3986.html, 
> which describes the grammar for URIs, URLs can be tokenized with the above 
> expression. 
> The NutchDocumentAnalyzer code is modified to use the UrlTokenizer for the 
> "url", "site" and "host" fields.
> See: http://www.mail-archive.com/nutch-user@lucene.apache.org/msg06247.html

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
