[ http://issues.apache.org/jira/browse/NUTCH-389?page=all ]

Enis Soztutar updated NUTCH-389:
--------------------------------

    Attachment: urlTokenizer-improved.diff

This is an improvement and a minor bug fix over the previous url tokenizer.
This version first decodes the characters that are percent-encoded (written
in hexadecimal) in the url. For example, the url
"file:///tmp/foo%20baz%20bar/foo/baz~bar/index.html" is first converted to
"file:///tmp/foo baz bar/foo/baz~bar/index.html" by replacing each %20
sequence with a space.
It also fixes a NullPointerException that occurred when the input reader
returned null for the url.
Further improvements on the url tokenization can be discussed here.

> a url tokenizer implementation for tokenizing index fields : url and host
> -------------------------------------------------------------------------
>
>                 Key: NUTCH-389
>                 URL: http://issues.apache.org/jira/browse/NUTCH-389
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer
>    Affects Versions: 0.9.0
>            Reporter: Enis Soztutar
>            Priority: Minor
>         Attachments: urlTokenizer-improved.diff, urlTokenizer.diff
>
>
> NutchAnalysis.jj tokenizes the input by treating & and _ as non-separator
> characters, which is not appropriate for urls. So I have written a url
> tokenizer which emits the tokens made of characters that match the regular
> expression [a-zA-Z0-9]. As stated in
> http://www.gbiv.com/protocols/uri/rfc/rfc3986.html, which describes the
> grammar for URIs, URLs can be tokenized with the above expression.
> NutchDocumentAnalyzer code is modified to use the UrlTokenizer with the
> "url", "site" and "host" fields.
> see : http://www.mail-archive.com/nutch-user@lucene.apache.org/msg06247.html

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
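
For readers who want to try the behaviour described in the update above, here is a minimal standalone sketch in Java. It is not the code from urlTokenizer-improved.diff: the class name UrlTokenizerSketch and the method tokenize are hypothetical, and it works on plain strings rather than as a Lucene Tokenizer. It assumes that tokens are runs of characters matching [a-zA-Z0-9], and it uses java.net.URLDecoder for the percent-decoding step, which also turns '+' into a space (harmless here, since both are separators).

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not the class from the attached patch.
public class UrlTokenizerSketch {

    // Assumption: a token is a maximal run of ASCII letters and digits.
    private static final Pattern TOKEN = Pattern.compile("[a-zA-Z0-9]+");

    public static List<String> tokenize(String url) {
        List<String> tokens = new ArrayList<>();
        // Guard against a null url, the case behind the reported NullPointerException.
        if (url == null) {
            return tokens;
        }
        String decoded;
        try {
            // Decode percent-encoded sequences such as %20 before tokenizing.
            decoded = URLDecoder.decode(url, "UTF-8");
        } catch (UnsupportedEncodingException | IllegalArgumentException e) {
            // Malformed escape (for example a trailing '%'): fall back to the raw url.
            decoded = url;
        }
        // Every non-alphanumeric character acts as a separator.
        Matcher m = TOKEN.matcher(decoded);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(
            tokenize("file:///tmp/foo%20baz%20bar/foo/baz~bar/index.html"));
    }
}
```

With the example url from the update, main prints [file, tmp, foo, baz, bar, foo, baz, bar, index, html]. Characters such as & and _ end a token in this sketch, which is the difference from NutchAnalysis.jj that the issue description points out.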