[jira] Updated: (NUTCH-389) a url tokenizer implementation for tokenizing index fields : url and host
[ http://issues.apache.org/jira/browse/NUTCH-389?page=all ]

Enis Soztutar updated NUTCH-389:

Attachment: urlTokenizer-improved.diff

This is an improvement and a minor bug fix over the previous url tokenizer. This version first replaces characters that are represented in hexadecimal (percent-encoded) form in the urls. For example, the url

file:///tmp/foo%20baz%20bar/foo/baz~bar/index.html

will first be converted to

file:///tmp/foo baz bar/foo/baz~bar/index.html

by replacing each %20 escape with a space. A NullPointerException is also fixed for the case where the input reader returns null for the url.

Further improvements on the url tokenization can be discussed here.

a url tokenizer implementation for tokenizing index fields : url and host

Key: NUTCH-389
URL: http://issues.apache.org/jira/browse/NUTCH-389
Project: Nutch
Issue Type: Improvement
Components: indexer
Affects Versions: 0.9.0
Reporter: Enis Soztutar
Priority: Minor
Attachments: urlTokenizer-improved.diff, urlTokenizer.diff

NutchAnalysis.jj tokenizes the input by treating some characters, including _, as non-separators, which is not appropriate for urls. So I have written a url tokenizer which returns the tokens that match the regular expression [a-zA-Z0-9]. As stated in http://www.gbiv.com/protocols/uri/rfc/rfc3986.html, which describes the grammar for URIs, URLs can be tokenized with the above expression. The NutchDocumentAnalyzer code is modified to use the UrlTokenizer with the url, site and host fields.

see : http://www.mail-archive.com/nutch-user@lucene.apache.org/msg06247.html

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
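The two steps described in this update (replace the hexadecimal escapes, then keep only the runs of letters and digits) can be sketched roughly as below. This is not the code in urlTokenizer-improved.diff: the class name UrlTokenizerSketch and both method names are hypothetical, the token pattern is assumed to mean runs of one or more characters from [a-zA-Z0-9], and the decoder only handles single-byte %XX escapes. The null check stands in for the NullPointerException fix mentioned above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; names are illustrative, not taken from the patch.
class UrlTokenizerSketch {
    // A percent sign followed by exactly two hex digits, e.g. %20.
    private static final Pattern HEX_ESCAPE = Pattern.compile("%([0-9a-fA-F]{2})");
    // Assumption: a token is a run of one or more ASCII letters or digits.
    private static final Pattern TOKEN = Pattern.compile("[a-zA-Z0-9]+");

    // Replaces each %XX escape with the single character it encodes.
    static String decodeHexEscapes(String url) {
        Matcher m = HEX_ESCAPE.matcher(url);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            char decoded = (char) Integer.parseInt(m.group(1), 16);
            // quoteReplacement keeps '$' and '\' from being read as group references.
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(decoded)));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Decodes the url, then collects every letter/digit run as a token.
    static List<String> tokenize(String url) {
        List<String> tokens = new ArrayList<>();
        if (url == null) {
            return tokens; // a null url yields no tokens instead of an exception
        }
        Matcher m = TOKEN.matcher(decodeHexEscapes(url));
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("file:///tmp/foo%20baz%20bar/foo/baz~bar/index.html"));
    }
}
```

With the example url from this message, the decoded string contains spaces where %20 stood, so foo, baz and bar come out as separate tokens rather than being joined with the digits of the escape.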
[jira] Updated: (NUTCH-389) a url tokenizer implementation for tokenizing index fields : url and host
[ http://issues.apache.org/jira/browse/NUTCH-389?page=all ]

Enis Soztutar updated NUTCH-389:

Attachment: urlTokenizer.diff

patch for url tokenization

a url tokenizer implementation for tokenizing index fields : url and host

Key: NUTCH-389
URL: http://issues.apache.org/jira/browse/NUTCH-389
Project: Nutch
Issue Type: Improvement
Components: indexer
Affects Versions: 0.9.0
Reporter: Enis Soztutar
Priority: Minor
Attachments: urlTokenizer.diff

NutchAnalysis.jj tokenizes the input by treating some characters, including _, as non-separators, which is not appropriate for urls. So I have written a url tokenizer which returns the tokens that match the regular expression [a-zA-Z0-9]. As stated in http://www.gbiv.com/protocols/uri/rfc/rfc3986.html, which describes the grammar for URIs, URLs can be tokenized with the above expression.

see : http://www.mail-archive.com/nutch-user@lucene.apache.org/msg06247.html
[jira] Updated: (NUTCH-389) a url tokenizer implementation for tokenizing index fields : url and host
[ http://issues.apache.org/jira/browse/NUTCH-389?page=all ]

Enis Soztutar updated NUTCH-389:

Description:
NutchAnalysis.jj tokenizes the input by treating some characters, including _, as non-separators, which is not appropriate for urls. So I have written a url tokenizer which returns the tokens that match the regular expression [a-zA-Z0-9]. As stated in http://www.gbiv.com/protocols/uri/rfc/rfc3986.html, which describes the grammar for URIs, URLs can be tokenized with the above expression. The NutchDocumentAnalyzer code is modified to use the UrlTokenizer with the url, site and host fields.

see : http://www.mail-archive.com/nutch-user@lucene.apache.org/msg06247.html

was:
NutchAnalysis.jj tokenizes the input by treating some characters, including _, as non-separators, which is not appropriate for urls. So I have written a url tokenizer which returns the tokens that match the regular expression [a-zA-Z0-9]. As stated in http://www.gbiv.com/protocols/uri/rfc/rfc3986.html, which describes the grammar for URIs, URLs can be tokenized with the above expression.

see : http://www.mail-archive.com/nutch-user@lucene.apache.org/msg06247.html

a url tokenizer implementation for tokenizing index fields : url and host

Key: NUTCH-389
URL: http://issues.apache.org/jira/browse/NUTCH-389
Project: Nutch
Issue Type: Improvement
Components: indexer
Affects Versions: 0.9.0
Reporter: Enis Soztutar
Priority: Minor
Attachments: urlTokenizer.diff
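The updated description says the url tokenizer is applied only to the url, site and host fields. A minimal sketch of that per-field choice follows, written in plain Java without the Lucene or Nutch classes. The class name FieldTokenizerSketch is hypothetical, and DEFAULT_TOKEN is only a stand-in that keeps _ inside tokens; it is not the actual NutchAnalysis.jj grammar.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of choosing a tokenizer by field name.
class FieldTokenizerSketch {
    // The three fields the issue description names.
    private static final Set<String> URL_FIELDS =
            new HashSet<>(Arrays.asList("url", "site", "host"));
    // Assumption: url tokens are runs of ASCII letters and digits.
    private static final Pattern URL_TOKEN = Pattern.compile("[a-zA-Z0-9]+");
    // Stand-in for the default analysis: '_' does not split tokens here.
    private static final Pattern DEFAULT_TOKEN = Pattern.compile("[a-zA-Z0-9_]+");

    // Picks the pattern from the field name, then collects the matches.
    static List<String> tokenize(String fieldName, String text) {
        List<String> tokens = new ArrayList<>();
        if (text == null) {
            return tokens;
        }
        Pattern pattern = URL_FIELDS.contains(fieldName) ? URL_TOKEN : DEFAULT_TOKEN;
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("host", "my_site.example.org"));
        System.out.println(tokenize("content", "my_site.example.org"));
    }
}
```

For the host field the underscore splits my_site into two tokens, while any other field in this sketch keeps it as one; that difference is the behaviour the issue is asking for on url-like fields.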