[jira] Commented: (NUTCH-389) a url tokenizer implementation for tokenizing index fields : url and host

Otis Gospodnetic (JIRA) Tue, 24 Oct 2006 14:37:20 -0700

    [ http://issues.apache.org/jira/browse/NUTCH-389?page=comments#action_12444510 ]
            
Otis Gospodnetic commented on NUTCH-389:
----------------------------------------


Enis:
Can you give us some examples of how URLs were tokenized before, and how they 
are tokenized with your patch?

For example:

http://www.foo_bar.com/baz_bar?car&dar_mar

How is this tokenized with your patch, and how was it done before?

Thanks.

> a url tokenizer implementation for tokenizing index fields : url and host
> -------------------------------------------------------------------------
>
>                 Key: NUTCH-389
>                 URL: http://issues.apache.org/jira/browse/NUTCH-389
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer
>    Affects Versions: 0.9.0
>            Reporter: Enis Soztutar
>            Priority: Minor
>         Attachments: urlTokenizer.diff
>
>
> NutchAnalysis.jj tokenizes the input by treating & and _ as characters that 
> do not separate tokens, which is not appropriate for URLs. So I have 
> written a URL tokenizer that emits the tokens matching the regular expression 
> [a-zA-Z0-9]. As stated in http://www.gbiv.com/protocols/uri/rfc/rfc3986.html, 
> which describes the grammar for URIs, URLs can be tokenized with the above 
> expression. 
> NutchDocumentAnalyzer code is modified to use the UrlTokenizer with the 
> "url", "site" and "host" fields.
> see : http://www.mail-archive.com/nutch-user@lucene.apache.org/msg06247.html
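
To make the question above concrete, here is a minimal sketch (not the code 
from urlTokenizer.diff) of what tokenizing on runs of [a-zA-Z0-9] would do to 
the example URL. The class name UrlTokenSketch and the comparison pattern that 
keeps & and _ inside tokens are assumptions for illustration only; the real 
NutchAnalysis.jj grammar may well behave differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UrlTokenSketch {
    // Assumption: a token is a maximal run of ASCII letters and digits,
    // i.e. the expression from the issue description applied one or more times.
    static final Pattern ALNUM = Pattern.compile("[a-zA-Z0-9]+");

    // Hypothetical stand-in for the old behavior: & and _ do not separate
    // tokens. This is NOT the actual NutchAnalysis.jj grammar.
    static final Pattern ALNUM_AMP_UNDERSCORE = Pattern.compile("[a-zA-Z0-9&_]+");

    // Collects every non-overlapping match of the pattern, in input order.
    static List<String> tokenize(Pattern pattern, String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = pattern.matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String url = "http://www.foo_bar.com/baz_bar?car&dar_mar";
        System.out.println(tokenize(ALNUM, url));
        System.out.println(tokenize(ALNUM_AMP_UNDERSCORE, url));
    }
}
```

Under these assumptions the first line printed is 
[http, www, foo, bar, com, baz, bar, car, dar, mar] and the second is 
[http, www, foo_bar, com, baz_bar, car&dar_mar], so a query for "bar" or "dar" 
would only match the URL field in the first case.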

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
