Modifying Nutch Indexer
Hi, I need to modify the Nutch Indexer class because it would be very useful for me to add some extra fields to the generated Lucene index. While experimenting I found that it is possible to add fields to the Document with doc.addField() in the reduce function. The problem is that building those fields requires the HTML content of the web page, but it does not seem to be present in the Document yet: getField(content) throws a NullPointerException, so maybe that is not the correct way, or the correct place, to access it. So, how and where can I access the HTML content of the document, so I can add a new field to the Lucene Document and thus to the generated index? Any advice would be very helpful. Thanks in advance, Javier.
Re: implement Thai language analyzer in Nutch
I think you should learn JavaCC and then study NutchAnalysis.jj; after that the Thai case should be resolved soon. Just try it. On 11/7/06, sanjeev [EMAIL PROTECTED] wrote: Hello, After playing around with Nutch for a few months I was trying to implement the Thai language analyzer for Nutch. I downloaded the Subversion version and compiled it using ant - everything fine. Next: I didn't see any tutorial for Thai, but I did see one for Chinese at http://issues.apache.org/jira/browse/NUTCH-36?page=comments#action_62153 I tried following the same steps outlined there but ran into compiler errors: a type mismatch between the Lucene Token and the Nutch Token. Suffice to say I am back at square one as far as implementing the Thai language analyzer for Nutch goes. Can someone please outline the exact procedure for this, or point me to a tutorial which explains how? I would be highly obliged. Thanks. -- View this message in context: http://www.nabble.com/implement-thai-lanaguage-analyzer-in-nutch-tf2587282.html#a7214087 Sent from the Nutch - Dev mailing list archive at Nabble.com. -- www.babatu.com
Re: Modifying Nutch Indexer
Javier P. L. wrote: Hi, I need to modify the Nutch Indexer class because it would be very useful for me to add some extra fields to the generated Lucene index. While experimenting I found that it is possible to add fields to the Document with doc.addField() in the reduce function. The problem is that building those fields requires the HTML content of the web page, but it does not seem to be present in the Document yet: getField(content) throws a NullPointerException, so maybe that is not the correct way, or the correct place, to access it. So, how and where can I access the HTML content of the document, so I can add a new field to the Lucene Document and thus to the generated index? Any advice would be very helpful. Thanks in advance, Javier. Hi, You do not need to change the Indexer code to add new fields to the index. Instead, implement an indexing filter and add it to your configuration during indexing. You can look at the code of index-basic (BasicIndexingFilter) and index-more (MoreIndexingFilter). The IndexingFilter interface has a filter() method which takes the document, the parse, the URL, the CrawlDatum and the inlinks as arguments, so the content of the document to be indexed is readily available. You can also look at the tutorial on implementing a plugin on the wiki. Best wishes.
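As a stand-in sketch of the idea (the real Nutch types IndexingFilter, Document and Parse are deliberately not reproduced here, and the "snippet" field is purely an illustration), the field-building logic one might apply to the parsed text inside a filter() implementation could look like this:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: the kind of logic one might put inside an
// IndexingFilter.filter() implementation. It takes the parsed text of a
// page (which the real filter() receives via its Parse argument) and
// builds the value for a new index field - here, the first N words.
public class SnippetFieldSketch {

    // Return the first maxWords whitespace-separated words of the text,
    // joined by single spaces; an empty string for null or blank input.
    public static String firstWords(String text, int maxWords) {
        if (text == null) {
            return "";
        }
        List<String> words = new ArrayList<>();
        for (String w : text.trim().split("\\s+")) {
            if (w.isEmpty()) {
                continue;
            }
            words.add(w);
            if (words.size() == maxWords) {
                break;
            }
        }
        return String.join(" ", words);
    }
}
```

Inside a real filter() one would then add the computed value to the document under the chosen field name and return the document; see BasicIndexingFilter for how the existing fields are added.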
[jira] Updated: (NUTCH-389) a url tokenizer implementation for tokenizing index fields : url and host
[ http://issues.apache.org/jira/browse/NUTCH-389?page=all ] Enis Soztutar updated NUTCH-389: Attachment: urlTokenizer-improved.diff This is an improvement and a minor bug fix over the previous URL tokenizer. This version first replaces characters that are represented in hexadecimal (percent-encoded) format in the URL. For example, the URL file:///tmp/foo%20baz%20bar/foo/baz~bar/index.html is first converted to file:///tmp/foo baz bar/foo/baz~bar/index.html by replacing each %20 with a space. A NullPointerException is also corrected for the case where the input reader returns null for the URL. Further improvements to the URL tokenization can be discussed here. a url tokenizer implementation for tokenizing index fields : url and host - Key: NUTCH-389 URL: http://issues.apache.org/jira/browse/NUTCH-389 Project: Nutch Issue Type: Improvement Components: indexer Affects Versions: 0.9.0 Reporter: Enis Soztutar Priority: Minor Attachments: urlTokenizer-improved.diff, urlTokenizer.diff NutchAnalysis.jj tokenizes the input by treating - and _ as non-token separators, which is not appropriate in the case of URLs. So I have written a URL tokenizer which produces tokens that match the regular expression [a-zA-Z0-9]+. As stated in http://www.gbiv.com/protocols/uri/rfc/rfc3986.html, which describes the grammar for URIs, URLs can be tokenized with the above expression. The NutchDocumentAnalyzer code is modified to use the UrlTokenizer with the url, site and host fields. see : http://www.mail-archive.com/nutch-user@lucene.apache.org/msg06247.html -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
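A minimal, self-contained sketch of the two steps described above (percent-decoding, then emitting maximal [a-zA-Z0-9]+ runs as tokens) might look like the following; this is only an illustration of the scheme, not the attached patch, and it treats each %XX escape as a single character:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration of the tokenization scheme described in NUTCH-389:
// first decode %XX escapes (e.g. %20 -> space), then emit the maximal
// runs of [a-zA-Z0-9] as tokens. Null input yields no tokens, mirroring
// the NullPointerException fix mentioned in the update.
public class UrlTokenizerSketch {

    private static final Pattern ESCAPE = Pattern.compile("%([0-9a-fA-F]{2})");
    private static final Pattern TOKEN = Pattern.compile("[a-zA-Z0-9]+");

    // Replace each %XX escape with the character it encodes.
    static String percentDecode(String url) {
        Matcher m = ESCAPE.matcher(url);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            char c = (char) Integer.parseInt(m.group(1), 16);
            m.appendReplacement(sb, Matcher.quoteReplacement(String.valueOf(c)));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    // Decode, then return every maximal [a-zA-Z0-9]+ run as a token.
    public static List<String> tokenize(String url) {
        List<String> tokens = new ArrayList<>();
        if (url == null) {
            return tokens; // guard against a null url (the fixed NPE case)
        }
        Matcher m = TOKEN.matcher(percentDecode(url));
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }
}
```

On the example URL above this yields the tokens file, tmp, foo, baz, bar, foo, baz, bar, index, html - the %20 escapes become separators rather than leaking "20" into the token stream.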
[jira] Commented: (NUTCH-393) Indexer doesn't handle null documents returned by filters
[ http://issues.apache.org/jira/browse/NUTCH-393?page=comments#action_12447787 ] Enis Soztutar commented on NUTCH-393: - Also, IndexingException is caught by the Indexer, in which case the whole document is not added to the writer (the function returns). Indexer, line 334:

try {
  // run indexing filters
  doc = this.filters.filter(doc, parse, (UTF8)key, fetchDatum, inlinks);
} catch (IndexingException e) {
  if (LOG.isWarnEnabled()) { LOG.warn("Error indexing " + key + ": " + e); }
  return;
}

IndexingException should instead be caught in IndexingFilters.filter(), so that when an IndexingException is thrown in one indexing plugin, the other plugins can still run. Indexer doesn't handle null documents returned by filters - Key: NUTCH-393 URL: http://issues.apache.org/jira/browse/NUTCH-393 Project: Nutch Issue Type: Bug Components: indexer Affects Versions: 0.8.1 Reporter: Eelco Lempsink Attachments: NUTCH-393.patch Plugins (like IndexingFilter) may return a null value, but this isn't handled by the Indexer. A trivial adjustment is all it takes:

@@ -237,6 +237,7 @@
   if (LOG.isWarnEnabled()) { LOG.warn("Error indexing " + key + ": " + e); } return; }
+  if (doc == null) return;
   float boost = 1.0f; // run scoring filters
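The behavior suggested in the comment (catching per filter, so one failing plugin does not stop the rest) can be sketched with a stand-in filter interface; FieldFilter, FilterException and the class names below are hypothetical illustrations, not the Nutch API:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Stand-in for the per-plugin catch suggested in the comment: each
// filter runs in its own try/catch, so an exception in one plugin
// only skips that plugin instead of aborting the whole document.
public class FilterChainSketch {

    // Hypothetical stand-ins for IndexingFilter / IndexingException.
    public static class FilterException extends Exception {
        public FilterException(String msg) { super(msg); }
    }

    public interface FieldFilter {
        void filter(Map<String, String> doc) throws FilterException;
    }

    // Run every filter; log and continue on failure.
    public static Map<String, String> runAll(Map<String, String> doc,
                                             List<FieldFilter> filters) {
        for (FieldFilter f : filters) {
            try {
                f.filter(doc);
            } catch (FilterException e) {
                // In Nutch this would be a LOG.warn(...); the point is
                // that we continue with the remaining filters.
                System.err.println("Skipping filter: " + e.getMessage());
            }
        }
        return doc;
    }

    public static Map<String, String> demo() {
        Map<String, String> doc = new HashMap<>();
        List<FieldFilter> filters = Arrays.asList(
            d -> d.put("host", "example.org"),
            d -> { throw new FilterException("boom"); },  // a failing plugin
            d -> d.put("lang", "en"));                    // still runs
        return runAll(doc, filters);
    }
}
```

The follow-up message below argues the opposite design (fail the whole document if any filter fails), which would mean letting the exception propagate out of runAll instead; either choice is a one-line difference in this sketch.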
[jira] Created: (NUTCH-397) porting clustering-carrot2 plugin to carrot2 v2.0
porting clustering-carrot2 plugin to carrot2 v2.0 - Key: NUTCH-397 URL: http://issues.apache.org/jira/browse/NUTCH-397 Project: Nutch Issue Type: Improvement Reporter: Doğacan Güney Priority: Trivial A rather trivial port of clustering-carrot2 to the new carrot2. I also added the necessary jars for Polish, so that Nutch will not throw the annoying exceptions when it is initializing clustering-carrot2. There is a small problem, though. AFAICS, a small patch has to be applied to carrot2, otherwise Nutch cannot start the plugin. (I am also attaching that here.)
[jira] Commented: (NUTCH-393) Indexer doesn't handle null documents returned by filters
[ http://issues.apache.org/jira/browse/NUTCH-393?page=comments#action_12447939 ] Eelco Lempsink commented on NUTCH-393: -- I'm not sure I agree with that. After running a document through a set of filters you'd expect that all the filters ran; if not, that's an exception. For instance, your index might depend on all numbers and non-English words being stripped. When one of those filters hits an exception but the other one still runs, your index will become dirty. Indexer doesn't handle null documents returned by filters - Key: NUTCH-393 URL: http://issues.apache.org/jira/browse/NUTCH-393 Project: Nutch Issue Type: Bug Components: indexer Affects Versions: 0.8.1 Reporter: Eelco Lempsink Attachments: NUTCH-393.patch
Re: implement Thai language analyzer in Nutch
Oh, by the way - I followed the Chinese tutorial and was able to compile, and everything was fine. Let me just test whether it is working properly - however, I didn't make any changes to NutchAnalysis.jj. I need more information, please. Thanks a bunch. -- View this message in context: http://www.nabble.com/implement-thai-lanaguage-analyzer-in-nutch-tf2587282.html#a7232439 Sent from the Nutch - Dev mailing list archive at Nabble.com.
Re: implement Thai language analyzer in Nutch
Hi sanjeev and Kauu, I want to support Hindi, a language widely spoken in India. Can you guide me on what else I need to modify? I think there is currently no support for searching and indexing the Hindi language, and I want to work on this. But I need some information on what to modify and where exactly the changes are required. Can anybody help me? Thanks. ./Arun On 11/8/06, sanjeev [EMAIL PROTECTED] wrote: Oh, by the way - I followed the Chinese tutorial and was able to compile, and everything was fine. Let me just test whether it is working properly - however, I didn't make any changes to NutchAnalysis.jj. I need more information, please. Thanks a bunch. -- View this message in context: http://www.nabble.com/implement-thai-lanaguage-analyzer-in-nutch-tf2587282.html#a7232439 Sent from the Nutch - Dev mailing list archive at Nabble.com.
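As a small, self-contained illustration of where language-specific analysis usually starts (this is not the Nutch plugin mechanism, just a sketch with made-up names): unlike Thai, Hindi text in the Devanagari script is whitespace-separated, so a first step is often detecting the script in order to route the text to a language-specific analyzer:

```java
// Sketch: detect whether a string is predominantly in Devanagari, the
// script used for Hindi. A language-analysis plugin could use such a
// check to route text to a Hindi-specific analyzer; the class and
// method names here are illustrative only.
public class DevanagariSketch {

    // True if more than half of the non-whitespace characters fall in
    // the Unicode Devanagari block (U+0900..U+097F).
    public static boolean isMostlyDevanagari(String text) {
        int chars = 0, devanagari = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isWhitespace(c)) {
                continue;
            }
            chars++;
            if (Character.UnicodeBlock.of(c) == Character.UnicodeBlock.DEVANAGARI) {
                devanagari++;
            }
        }
        return chars > 0 && devanagari * 2 > chars;
    }
}
```

For wiring this into Nutch itself, the Chinese-analyzer steps in NUTCH-36 mentioned earlier in this thread are still the closest available guide.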