[jira] Updated: (NUTCH-389) a url tokenizer implementation for tokenizing index fields : url and host

2006-11-07 Thread Enis Soztutar (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-389?page=all ]

Enis Soztutar updated NUTCH-389:


Attachment: urlTokenizer-improved.diff

This is an improvement and a minor bug fix over the previous URL tokenizer. 
This version first decodes the percent-encoded characters, which are written 
in hexadecimal form in URLs. 

For example, the URL file:///tmp/foo%20baz%20bar/foo/baz~bar/index.html will 
first be converted to file:///tmp/foo baz bar/foo/baz~bar/index.html by 
replacing each %20 escape with a space. 
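
The decoding step described above could look roughly like the following. This 
is only a sketch, not the code from the attached diff: the class name 
PercentDecoder and the choice of UTF-8 for the decoded bytes are assumptions 
made for illustration. Malformed escapes are copied through unchanged.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not the class from the patch.
final class PercentDecoder {
    private PercentDecoder() {}

    // Returns the value of a hexadecimal digit, or -1 if b is not one.
    private static int hex(byte b) {
        if (b >= '0' && b <= '9') return b - '0';
        if (b >= 'a' && b <= 'f') return b - 'a' + 10;
        if (b >= 'A' && b <= 'F') return b - 'A' + 10;
        return -1;
    }

    // Replaces every well-formed %XX escape with the byte it encodes.
    // Assumption: the decoded bytes are interpreted as UTF-8.
    static String decode(String url) {
        byte[] in = url.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream(in.length);
        int i = 0;
        while (i < in.length) {
            if (in[i] == '%' && i + 2 < in.length) {
                int hi = hex(in[i + 1]);
                int lo = hex(in[i + 2]);
                if (hi >= 0 && lo >= 0) {
                    out.write((hi << 4) | lo);
                    i += 3;
                    continue;
                }
            }
            // Not an escape (or a malformed one): copy the byte as-is.
            out.write(in[i]);
            i++;
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

A hand-written loop is used here instead of java.net.URLDecoder because that 
class also turns '+' into a space, which is form decoding rather than plain 
percent-decoding.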

A NullPointerException that occurred when the input reader returned null for 
the URL has also been fixed. 
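
One plausible shape for such a guard is sketched below. It assumes that the 
null came from reading the URL as a single line from a java.io.Reader; the 
message does not say exactly where the null appeared, and the class and method 
names (UrlInput, readUrl) are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

// Hypothetical helper for illustration; not the code from the patch.
final class UrlInput {
    private UrlInput() {}

    // Reads the URL as one line and never returns null.
    // Assumption: a missing reader or an empty stream means "no URL",
    // which is represented here by the empty string.
    static String readUrl(Reader input) {
        if (input == null) {
            return "";
        }
        try {
            // readLine() returns null when the stream is already at its end.
            String line = new BufferedReader(input).readLine();
            return line == null ? "" : line;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Returning an empty string lets the tokenizer produce zero tokens instead of 
failing on a null reference.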

Further improvements to the URL tokenization can be discussed here. 


 a url tokenizer implementation for tokenizing index fields : url and host
 -

 Key: NUTCH-389
 URL: http://issues.apache.org/jira/browse/NUTCH-389
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: 0.9.0
Reporter: Enis Soztutar
Priority: Minor
 Attachments: urlTokenizer-improved.diff, urlTokenizer.diff


 NutchAnalysis.jj tokenizes the input by treating some characters, including 
 _, as non-separators, which is not appropriate for URLs. So I have written a 
 URL tokenizer which emits the tokens that match the regular expression 
 [a-zA-Z0-9]. As stated in http://www.gbiv.com/protocols/uri/rfc/rfc3986.html, 
 which describes the grammar for URIs, URLs can be tokenized with the above 
 expression. 
 NutchDocumentAnalyzer code is modified to use the UrlTokenizer with the 
 url, site and host fields.
 see : http://www.mail-archive.com/nutch-user@lucene.apache.org/msg06247.html
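
The tokenization described in the issue can be illustrated with a short 
sketch. It is not the UrlTokenizer from the attached diff: the class name 
UrlTokenSketch is made up, and the "+" quantifier (one or more characters per 
token) is an assumption added to the character class [a-zA-Z0-9] quoted above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-alone sketch; the real patch plugs into the analyzer.
final class UrlTokenSketch {
    // Assumption: a token is a maximal run of ASCII letters and digits.
    private static final Pattern TOKEN = Pattern.compile("[a-zA-Z0-9]+");

    private UrlTokenSketch() {}

    // Returns the alphanumeric runs of the URL, in order of appearance.
    static List<String> tokenize(String url) {
        List<String> tokens = new ArrayList<>();
        if (url == null) {
            return tokens; // no input, no tokens
        }
        Matcher m = TOKEN.matcher(url);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }
}
```

With this sketch, characters such as "-", "_", "." and "/" all act as 
separators, so a path segment like index-old_v2.html yields the tokens index, 
old, v2 and html.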

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Updated: (NUTCH-389) a url tokenizer implementation for tokenizing index fields : url and host

2006-10-20 Thread Enis Soztutar (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-389?page=all ]

Enis Soztutar updated NUTCH-389:


Attachment: urlTokenizer.diff

patch for url tokenization

 a url tokenizer implementation for tokenizing index fields : url and host
 -

 Key: NUTCH-389
 URL: http://issues.apache.org/jira/browse/NUTCH-389
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: 0.9.0
Reporter: Enis Soztutar
Priority: Minor
 Attachments: urlTokenizer.diff


 NutchAnalysis.jj tokenizes the input by treating some characters, including 
 _, as non-separators, which is not appropriate for URLs. So I have written a 
 URL tokenizer which emits the tokens that match the regular expression 
 [a-zA-Z0-9]. As stated in http://www.gbiv.com/protocols/uri/rfc/rfc3986.html, 
 which describes the grammar for URIs, URLs can be tokenized with the above 
 expression. 
 see : http://www.mail-archive.com/nutch-user@lucene.apache.org/msg06247.html





[jira] Updated: (NUTCH-389) a url tokenizer implementation for tokenizing index fields : url and host

2006-10-20 Thread Enis Soztutar (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-389?page=all ]

Enis Soztutar updated NUTCH-389:


Description: 
NutchAnalysis.jj tokenizes the input by treating some characters, including _, 
as non-separators, which is not appropriate for URLs. So I have written a URL 
tokenizer which emits the tokens that match the regular expression 
[a-zA-Z0-9]. As stated in http://www.gbiv.com/protocols/uri/rfc/rfc3986.html, 
which describes the grammar for URIs, URLs can be tokenized with the above 
expression. 

NutchDocumentAnalyzer code is modified to use the UrlTokenizer with the url, 
site and host fields.


see : http://www.mail-archive.com/nutch-user@lucene.apache.org/msg06247.html

  was:
NutchAnalysis.jj tokenizes the input by treating some characters, including _, 
as non-separators, which is not appropriate for URLs. So I have written a URL 
tokenizer which emits the tokens that match the regular expression 
[a-zA-Z0-9]. As stated in http://www.gbiv.com/protocols/uri/rfc/rfc3986.html, 
which describes the grammar for URIs, URLs can be tokenized with the above 
expression. 


see : http://www.mail-archive.com/nutch-user@lucene.apache.org/msg06247.html


 a url tokenizer implementation for tokenizing index fields : url and host
 -

 Key: NUTCH-389
 URL: http://issues.apache.org/jira/browse/NUTCH-389
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: 0.9.0
Reporter: Enis Soztutar
Priority: Minor
 Attachments: urlTokenizer.diff


 NutchAnalysis.jj tokenizes the input by treating some characters, including 
 _, as non-separators, which is not appropriate for URLs. So I have written a 
 URL tokenizer which emits the tokens that match the regular expression 
 [a-zA-Z0-9]. As stated in http://www.gbiv.com/protocols/uri/rfc/rfc3986.html, 
 which describes the grammar for URIs, URLs can be tokenized with the above 
 expression. 
 NutchDocumentAnalyzer code is modified to use the UrlTokenizer with the 
 url, site and host fields.
 see : http://www.mail-archive.com/nutch-user@lucene.apache.org/msg06247.html
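
The per-field choice mentioned in the description (UrlTokenizer for the url, 
site and host fields) amounts to a lookup on the field name. The sketch below 
shows only that decision, with no Lucene or Nutch types; the class and method 
names are hypothetical and the way NutchDocumentAnalyzer actually wires this 
in is not shown in these messages.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration of the field-name check only.
final class FieldTokenizerChoice {
    // Field names taken from the issue description.
    private static final Set<String> URL_FIELDS =
        Collections.unmodifiableSet(new HashSet<>(Arrays.asList("url", "site", "host")));

    private FieldTokenizerChoice() {}

    // True when the field should go through the URL tokenizer;
    // every other field would keep the default tokenization.
    static boolean usesUrlTokenizer(String fieldName) {
        return URL_FIELDS.contains(fieldName);
    }
}
```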
