[jira] Commented: (NUTCH-240) Scoring API: extension point, scoring filters and an OPIC plugin
[ http://issues.apache.org/jira/browse/NUTCH-240?page=comments#action_12372379 ]

Andrzej Bialecki commented on NUTCH-240:

Yes, one of the reasons I wanted to discuss these patches is that they uncovered some of the underlying ugliness... ;)

The reason for the generator store/restore is that scoring plugins could take into account many more variables than just the score recorded in CrawlDatum.score. They could also have different strategies for prioritizing pages to be included in topN. So, it's true this is not currently used by OPIC, but I think without this it's not possible for plugins to affect the choice of topN.

Initially, I did as you suggest, i.e. I created a method to calculate one float value for the purpose of selecting topN. However, I wanted to avoid changing CrawlDatum.compareTo - if we put ScoringFilters there, it would be a big performance hit. OTOH, if we overwrite the primitive float in CrawlDatum.score, it seemed to me we should store its earlier value and then possibly restore it, as the value for selecting topN may have nothing to do with the "real" score.

passScoreBeforeParsing/passScoreAfterParsing: again, I agree it looks strange, but that's what we do at the moment, I just extracted it into an interface. I'd love to skip this altogether, if there is a way.

> Scoring API: extension point, scoring filters and an OPIC plugin
>
> Key: NUTCH-240
> URL: http://issues.apache.org/jira/browse/NUTCH-240
> Project: Nutch
> Type: Improvement
> Versions: 0.8-dev
> Reporter: Andrzej Bialecki
> Attachments: patch.txt
>
> This patch refactors all places where Nutch manipulates page scores into a plugin-based API. Using this API it's possible to implement different scoring algorithms. It is also much easier to understand how scoring works.
> Multiple scoring plugins can be run in sequence, in a manner similar to URLFilters.
> Included is also an OPICScoringFilter plugin, which contains the current implementation of the scoring algorithm. Together with the scoring API it provides fully backward-compatible scoring.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira
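
The issue description says that multiple scoring plugins can be run in sequence, in a manner similar to URLFilters. A minimal sketch of that chaining idea follows. It is not taken from patch.txt: PageRecord, SortValueFilter and SortValueFilters are hypothetical names invented for illustration, and the two example filters are arbitrary.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a page record; this is NOT Nutch's CrawlDatum.
final class PageRecord {
    final String url;
    final float score;

    PageRecord(String url, float score) {
        this.url = url;
        this.score = score;
    }
}

// Hypothetical extension point: each filter sees the value produced so far.
interface SortValueFilter {
    float sortValue(PageRecord page, float current);
}

// Runs the configured filters in order, feeding each one the previous result.
final class SortValueFilters {
    private final List<SortValueFilter> filters;

    SortValueFilters(List<SortValueFilter> filters) {
        this.filters = filters;
    }

    float sortValue(PageRecord page, float initial) {
        float value = initial;
        for (SortValueFilter filter : filters) {
            value = filter.sortValue(page, value);
        }
        return value;
    }
}

final class ScoringChainDemo {
    // Chains two illustrative filters and evaluates them for one page.
    static float demo() {
        SortValueFilter useScore = (page, current) -> page.score * current;
        SortValueFilter boostPdf =
            (page, current) -> page.url.endsWith(".pdf") ? current * 2.0f : current;
        SortValueFilters chain = new SortValueFilters(Arrays.asList(useScore, boostPdf));
        return chain.sortValue(new PageRecord("http://example.org/a.pdf", 0.5f), 1.0f);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

With an empty filter list the chain simply returns the initial value, which is one way such an API could stay neutral when no scoring plugin is enabled.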
[jira] Commented: (NUTCH-240) Scoring API: extension point, scoring filters and an OPIC plugin
[ http://issues.apache.org/jira/browse/NUTCH-240?page=comments#action_12372341 ]

Doug Cutting commented on NUTCH-240:

The generator store/restore score stuff seems ugly. And it is not used by OPIC. Could we instead have a method that computes and returns a score to be used by the generator? Then it is up to the generator to use this w/o modifying the CrawlDatum.

The passScoreBeforeParsing/passScoreAfterParsing/distributeScoreToOutlink protocol also seems awkward, although I don't yet have a suggestion for how to improve it.
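
The suggestion above, a method that returns a value for the generator to use without modifying the CrawlDatum, could look roughly like the sketch below. It is only an illustration under stated assumptions: Datum is a hypothetical stand-in for CrawlDatum, TopNSelector is an invented name, and the sort function passed in main is arbitrary. Because each sort value is computed once and held next to the datum, no plugin code would need to run inside a compareTo method.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.ToDoubleFunction;

// Hypothetical stand-in for CrawlDatum; only the fields this sketch needs.
final class Datum {
    final String url;
    final float score;        // the "real" score; never modified below
    final int daysSinceFetch; // an extra variable a plugin might consider

    Datum(String url, float score, int daysSinceFetch) {
        this.url = url;
        this.score = score;
        this.daysSinceFetch = daysSinceFetch;
    }
}

final class TopNSelector {
    // Pairs a datum with its sort value so the datum itself stays untouched.
    private static final class Ranked {
        final Datum datum;
        final double sortValue;

        Ranked(Datum datum, double sortValue) {
            this.datum = datum;
            this.sortValue = sortValue;
        }
    }

    // Computes each sort value once, then keeps the topN highest-valued entries.
    static List<Datum> selectTopN(List<Datum> candidates, int topN,
                                  ToDoubleFunction<Datum> sortValue) {
        List<Ranked> ranked = new ArrayList<>();
        for (Datum d : candidates) {
            ranked.add(new Ranked(d, sortValue.applyAsDouble(d)));
        }
        ranked.sort(Comparator.comparingDouble((Ranked r) -> r.sortValue).reversed());
        List<Datum> result = new ArrayList<>();
        for (int i = 0; i < Math.min(topN, ranked.size()); i++) {
            result.add(ranked.get(i).datum);
        }
        return result;
    }

    // Illustrative input data only.
    static List<Datum> sample() {
        List<Datum> list = new ArrayList<>();
        list.add(new Datum("a", 1.0f, 1));
        list.add(new Datum("b", 0.5f, 30));
        list.add(new Datum("c", 0.8f, 2));
        return list;
    }

    public static void main(String[] args) {
        List<Datum> top = selectTopN(sample(), 2, x -> x.score * x.daysSinceFetch);
        for (Datum d : top) {
            System.out.println(d.url + " score=" + d.score);
        }
    }
}
```

The trade-off is the one discussed in this thread: the sort value lives only for the duration of the selection, so nothing has to be stored in and later restored to the score field.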
Re: Refactoring some plugins
Jérôme Charron wrote:
> Moreover, I would like to suggest some other javadoc "improvements" (?):
> 1. Create a group for abstract plugins (like lib-http or lib-regex-filter), named for instance "Plugins API".

+1

> 2. Create a group for extension points (as far as I remember, one of the first problems when you want to extend Nutch is finding where the hooks are, i.e. what the extension points are). Once more, since the javadoc groups are filtered by package, each extension point interface must be moved to a specific package. The idea is then to move all the core extension points to a new package (for instance org.apache.nutch.api).

I'm reluctant to move the extension interface away from the parameter and return value classes used by that interface. Could we instead add a super-interface that all extension-point interfaces extend? That way all of the extension points would be listed in javadoc as implementations of this interface.

> 3. Create several javadoc plugin groups (one for each major kind of plugin: Indexing, Parsing, Protocol, Query, UrlFilter, and Misc for those that cannot be categorized).

+1

Doug
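
To make the super-interface idea concrete, here is a small sketch. The marker interface name ExtensionPoint and the two example extension points are hypothetical, chosen only for illustration. Javadoc lists the sub-interfaces of an interface on its page, and the same relationship can be checked at runtime with Class.isAssignableFrom.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical marker interface; the name is an assumption for this sketch.
interface ExtensionPoint { }

// Two illustrative extension points that opt in by extending the marker.
interface UrlFilterPoint extends ExtensionPoint {
    String filter(String url);
}

interface ParserPoint extends ExtensionPoint {
    String parse(byte[] content);
}

// An unrelated interface, to show that it is not picked up.
interface NotAnExtension { }

final class ExtensionPoints {
    // Returns the simple names of the interfaces that extend the marker.
    static List<String> namesOf(List<Class<?>> candidates) {
        List<String> names = new ArrayList<>();
        for (Class<?> c : candidates) {
            if (c.isInterface()
                    && c != ExtensionPoint.class
                    && ExtensionPoint.class.isAssignableFrom(c)) {
                names.add(c.getSimpleName());
            }
        }
        return names;
    }

    // Illustrative candidate list.
    static List<Class<?>> sample() {
        List<Class<?>> list = new ArrayList<>();
        list.add(UrlFilterPoint.class);
        list.add(ParserPoint.class);
        list.add(NotAnExtension.class);
        return list;
    }

    public static void main(String[] args) {
        System.out.println(namesOf(sample()));
    }
}
```

The interfaces keep their own packages, next to their parameter and return types; only the extends clause changes.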
Re: Refactoring some plugins
> I don't think it is upside down. Plugins should not share packages with core code, since that would permit them to use package-private APIs.
> Also, re-arranging the code to make the javadoc nice is right, since the javadoc is a primary means of describing the code.

Yes, but what I mean is that it is "strange" that a documentation issue is what raises this need for refactoring.

Moreover, I would like to suggest some other javadoc "improvements" (?):
1. Create a group for abstract plugins (like lib-http or lib-regex-filter), named for instance "Plugins API".
2. Create a group for extension points (as far as I remember, one of the first problems when you want to extend Nutch is finding where the hooks are, i.e. what the extension points are). Once more, since the javadoc groups are filtered by package, each extension point interface must be moved to a specific package. The idea is then to move all the core extension points to a new package (for instance org.apache.nutch.api).
3. Create several javadoc plugin groups (one for each major kind of plugin: Indexing, Parsing, Protocol, Query, UrlFilter, and Misc for those that cannot be categorized).

Thanks for your suggestions and comments.

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/
[jira] Commented: (NUTCH-220) PDF Box can't parse document: java.lang.NullPointerException
[ http://issues.apache.org/jira/browse/NUTCH-220?page=comments#action_12372310 ]

Richard Braman commented on NUTCH-220:

I upgraded the Nutch 0.8 trunk to PDFBox HEAD. The NullPointerException seems to be resolved by upgrading Nutch to PDFBox 0.7.3.

The major issues in upgrading Nutch to PDFBox 0.7.3 are:
1. PDFBox now depends on FontBox, which must be included as a plugin, lib-fontbox.
2. PDFBox no longer depends on log4j. When I tried to remove references to that dependency in the build.xml for parse-pdf, the build returned assorted ant errors; when I left the references to log4j in place, it built fine. Someone who has more knowledge of building Nutch needs to modify the build.xml and plugin.xml if the references to log4j should be removed.

plugin.xml for FontBox
build.xml for lib-fontbox
parse-pdf plugin.xml
parse-pdf build.xml

> PDF Box can't parse document: java.lang.NullPointerException
>
> Key: NUTCH-220
> URL: http://issues.apache.org/jira/browse/NUTCH-220
> Project: Nutch
> Type: Bug
> Environment: PDFBox 0.7.2
> Reporter: Richard Braman
>
> This error was fixed in the latest build of PDFBox, which should be tested with nutch.
> >> 060228 160354 fetch okay, but can't parse
> >> http://www.mstc.state.ms.us/info/stats/transfer/tran0704.pdf, reason:
> >> failed(2,0): Can't be handled as pdf document.
> >> java.lang.NullPointerException
> Yes, the NPE should be fixed.
> Ben
> Richard Braman wrote:
> > Hi Ben,
> >
> > We actually got to the bottom of all of them except for 1... The content truncation was due to an inconsistency bug in the nutch config. The "no permission to extract text" error is actually true: the author, the NC Department of Revenue, put this restriction on all of their files (I have asked them to remove it as it hampers public accessibility). The null pointer exception is the only one left to deal with that may be due to the parsing bug. Is this the one that you are referring to?
> >
> > -----Original Message-----
> > From: Ben Litchfield [mailto:[EMAIL PROTECTED]
> > Sent: Thursday, March 02, 2006 4:07 PM
> > To: Richard Braman
> > Cc: nutch-dev@lucene.apache.org; nutch-user@lucene.apache.org; [EMAIL PROTECTED]
> > Subject: Re: [PDFBox-user] PDF Parse Error
> >
> > I believe these errors are due to a parsing bug in PDFBox that has been fixed since the 0.7.2 release. Please give the nightly build (should be a drop in replacement) a try from http://www.pdfbox.org/dist and let me know if you are still having issues.
> >
> > Ben
[jira] Commented: (NUTCH-220) PDF Box can't parse document: java.lang.NullPointerException
[ http://issues.apache.org/jira/browse/NUTCH-220?page=comments#action_12372277 ]

Ben Litchfield commented on NUTCH-220:

Actually, now that I look at the stack trace, the NPE is not happening in PDFBox code; it appears to be in hadoop code, so I don't think that upgrading PDFBox will help.

Ben
[jira] Commented: (NUTCH-220) PDF Box can't parse document: java.lang.NullPointerException
[ http://issues.apache.org/jira/browse/NUTCH-220?page=comments#action_12372275 ]

Richard Braman commented on NUTCH-220:

PDFBox-0.7.3 no longer depends on log4j at all, so you should not be getting any log4j errors from PDFBox!

Ben

On Sun, 26 Mar 2006, Richard Braman wrote:
> Hi Ben,
>
> I noticed that nutch uses a log4j version of PDFBox.jar. I don't see this as an ant target on 0.7.3. I downloaded pdfbox from CVS HEAD.
>
> When I tried to use the PDFBox nightly it gave me a bunch of log4j errors, so I guess nutch expects the log4j version.
>
> I am trying to upgrade my nutch to 0.7.3 to see if I can get rid of the NPE error.
>
> The NPE bug I told you about a few weeks ago has a much worse effect in Nutch 0.8, as it seems to cause the fetcher to abort.
>
> 060326 142450 fetch of http://www.state.sd.us/drr2/reg/bank/Trust%20Fee%20Calculation.pdf failed with: java.lang.NullPointerException
> java.lang.NullPointerException
>     at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:180)
>     at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:171)
>     at org.apache.hadoop.mapred.MapTask$2.collect(MapTask.java:91)
>     at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:245)
>     at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:185)
> 060326 142450 SEVERE fetcher caught:java.lang.NullPointerException
>
> --
> Richard L Braman, Jr., CPA
> Tax Code Software Foundation, Inc.
> Open Source Tax Software
> http://www.taxcodesoftware.org
> [EMAIL PROTECTED]
[jira] Updated: (NUTCH-48) "Did you mean" query enhancement/refinement feature request
[ http://issues.apache.org/jira/browse/NUTCH-48?page=all ]

Aled Rhys Jones updated NUTCH-48:

Attachment: rss-spell.patch

Added patch to add spelling correction to the rss feed in the following opensearch format:

This patch must be applied after spell-check.patch.

> "Did you mean" query enhancement/refinement feature request
>
> Key: NUTCH-48
> URL: http://issues.apache.org/jira/browse/NUTCH-48
> Project: Nutch
> Type: New Feature
> Components: web gui
> Environment: All platforms
> Reporter: byron miller
> Assignee: Sami Siren
> Priority: Minor
> Attachments: rss-spell.patch, spell-check.patch
>
> Looking to implement a "Did you mean" feature for query result pages that return <= x results, to invoke a response that would recommend a fixed/related or spell-checked query to try.
> Note from Doug to users list:
> David Spencer has worked on this some.
> http://www.searchmorph.com/weblog/index.php?id=23
> I think the code on his site might be more recent than what's committed to the lucene/contrib directory.
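
The feature request describes a trigger (a result page with at most x hits) and a response (a suggested replacement query). The sketch below illustrates only that control flow. It is not the approach of spell-check.patch or of the code referenced above: the DidYouMean class, the fixed word list and the plain Levenshtein distance are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

final class DidYouMean {
    // Classic Levenshtein edit distance, computed with two rolling rows.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Suggests the closest dictionary word, but only when the query returned
    // at most maxHits results and a word lies within maxDistance edits.
    static Optional<String> suggest(String query, long hitCount, long maxHits,
                                    List<String> dictionary, int maxDistance) {
        if (hitCount > maxHits) {
            return Optional.empty();
        }
        String best = null;
        int bestDistance = maxDistance + 1;
        for (String word : dictionary) {
            int d = editDistance(query, word);
            if (d > 0 && d < bestDistance) {
                best = word;
                bestDistance = d;
            }
        }
        return Optional.ofNullable(best);
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("lucene", "nutch", "hadoop");
        System.out.println(suggest("lucnee", 0, 5, words, 2));
    }
}
```

A real implementation would more likely draw its candidate words from the index itself rather than from a fixed list.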
Re: [jira] Closed: (NUTCH-196) lib-xml and lib-log4j plugins
Jerome Charron (JIRA) wrote:
> [ http://issues.apache.org/jira/browse/NUTCH-196?page=all ]
>
> Jerome Charron closed NUTCH-196:
>
> Fix Version: 0.8-dev
> Resolution: Fixed
>
> Added a lib-xml that gathers many xml libraries previously used in parse-rss.
> (http://svn.apache.org/viewcvs?rev=389716&view=rev)

Thanks!

--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
[jira] Closed: (NUTCH-196) lib-xml and lib-log4j plugins
[ http://issues.apache.org/jira/browse/NUTCH-196?page=all ]

Jerome Charron closed NUTCH-196:

Fix Version: 0.8-dev
Resolution: Fixed

Added a lib-xml that gathers many xml libraries previously used in parse-rss.
(http://svn.apache.org/viewcvs?rev=389716&view=rev)

> lib-xml and lib-log4j plugins
>
> Key: NUTCH-196
> URL: http://issues.apache.org/jira/browse/NUTCH-196
> Project: Nutch
> Type: Improvement
> Versions: 0.8-dev
> Reporter: Andrzej Bialecki
> Assignee: Andrzej Bialecki
> Fix For: 0.8-dev
> Attachments: NUTCH-196.lib-log4j.patch
>
> Many places in Nutch use XML. Parsing XML using the JDK API is painful. I propose to add one (or more) library plugins with JDOM, DOM4J, Jaxen, etc. This should simplify the current deployment, and help plugin writers to use the existing API.
> Similarly, many plugins use log4j. Either we add it to the /lib, or we could create a lib-log4j plugin.