[jira] Commented: (NUTCH-240) Scoring API: extension point, scoring filters and an OPIC plugin

2006-03-29 Thread Andrzej Bialecki (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-240?page=comments#action_12372379 ] 

Andrzej Bialecki  commented on NUTCH-240:
-

Yes, one of the reasons I wanted to discuss these patches is that they 
uncovered some of the underlying ugliness... ;)

The reason for generator store/restore is that scoring plugins could take into 
account many more variables than just the score recorded in CrawlDatum.score. 
They could also have different strategies for prioritizing pages to be included 
in topN.

So, it's true this is not currently used by OPIC but I think without this it's 
not possible for plugins to affect the choice of topN.

Initially, I did as you suggest, i.e. I created a method to calculate one float 
value for the purpose of selecting topN. However, I wanted to avoid changing 
CrawlDatum.compareTo - if we put ScoringFilters there, it would be a big 
performance hit. OTOH, if we overwrite the primitive float in CrawlDatum.score 
it seemed to me we should store its earlier value, and then possibly restore it - 
as the value for selecting topN may have nothing to do with the "real" score.
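The performance concern above can be illustrated with a small sketch. It assumes a hypothetical `sortValue` hook and a stand-in `Entry` class (neither is the real CrawlDatum or anything from patch.txt) and simply counts how often the hook runs when it is called from inside a comparator versus once per entry before sorting.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a crawl entry; not the real CrawlDatum.
final class Entry {
    final String url;
    final float score;
    Entry(String url, float score) { this.url = url; this.score = score; }
}

final class SortCostDemo {
    // Counts how often the (pretend-expensive) scoring hook is invoked.
    static final AtomicInteger calls = new AtomicInteger();

    // Hypothetical plugin hook: derives a topN sort value from an entry.
    static float sortValue(Entry e) {
        calls.incrementAndGet();
        return e.score * 2.0f;
    }

    // Variant 1: the hook runs inside the comparator, twice per comparison.
    static int callsWhenComparing(List<Entry> input) {
        calls.set(0);
        List<Entry> copy = new ArrayList<>(input);
        copy.sort((a, b) -> Float.compare(sortValue(b), sortValue(a)));
        return calls.get();
    }

    // Variant 2: the hook runs once per entry, then plain floats are sorted.
    static int callsWhenPrecomputing(List<Entry> input) {
        calls.set(0);
        float[] keys = new float[input.size()];
        for (int i = 0; i < keys.length; i++) {
            keys[i] = sortValue(input.get(i));
        }
        Arrays.sort(keys);
        return calls.get();
    }

    // Builds n entries with deterministic, unsorted scores.
    static List<Entry> sample(int n) {
        List<Entry> out = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            out.add(new Entry("http://example.org/" + i, (i * 37) % 101));
        }
        return out;
    }
}
```

With n entries, the precomputing variant calls the hook exactly n times, while the comparator variant calls it at least twice per comparison, which is why overwriting a primitive float once looked cheaper than changing compareTo.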

passScoreBeforeParsing/passScoreAfterParsing: again, I agree it looks strange, 
but that's what we do at the moment, I just extracted it into an interface. I'd 
love to skip this altogether, if there is a way.

> Scoring API: extension point, scoring filters and an OPIC plugin
> 
>
>  Key: NUTCH-240
>  URL: http://issues.apache.org/jira/browse/NUTCH-240
>  Project: Nutch
> Type: Improvement
> Versions: 0.8-dev
> Reporter: Andrzej Bialecki 
>  Attachments: patch.txt
>
> This patch refactors all places where Nutch manipulates page scores, into a 
> plugin-based API. Using this API it's possible to implement different scoring 
> algorithms. It is also much easier to understand how scoring works.
> Multiple scoring plugins can be run in sequence, in a manner similar to 
> URLFilters.
> Included is also an OPICScoringFilter plugin, which contains the current 
> implementation of the scoring algorithm. Together with the scoring API it 
> provides a fully backward-compatible scoring.
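As a rough illustration of running several scoring plugins in sequence, here is a minimal sketch. The interface and class names are hypothetical and much simpler than whatever patch.txt actually defines; the only point is that each filter receives the score produced by the previous one.

```java
import java.util.List;

// Hypothetical, simplified filter contract; the interface in patch.txt may differ.
interface SimpleScoringFilter {
    float adjust(String url, float score);
}

// Hypothetical facade that applies the configured filters in list order.
final class SimpleScoringFilters {
    private final List<SimpleScoringFilter> filters;

    SimpleScoringFilters(List<SimpleScoringFilter> filters) {
        this.filters = filters;
    }

    // Feeds the output of each filter into the next one.
    float adjust(String url, float score) {
        float current = score;
        for (SimpleScoringFilter filter : filters) {
            current = filter.adjust(url, current);
        }
        return current;
    }
}
```

Because the order of the list determines the result, a chain like this would need a defined plugin ordering in configuration.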

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-240) Scoring API: extension point, scoring filters and an OPIC plugin

2006-03-29 Thread Doug Cutting (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-240?page=comments#action_12372341 ] 

Doug Cutting commented on NUTCH-240:


The generator store/restore score stuff seems ugly.  And it is not used by 
OPIC.  Could we instead have a method that computes and returns a score to be 
used by the generator?  Then it is up to the generator to use this w/o 
modifying the CrawlDatum.

The passScoreBeforeParsing/passScoreAfterParsing/distributeScoreToOutlink 
protocol also seems awkward, although I don't yet have a suggestion for how to 
improve it.
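One way to picture the suggestion above is the following sketch. All names (`PageDatum`, `SortValueProvider`, `TopNSelector`) are hypothetical and are not taken from the patch; the datum is immutable here to make it obvious that the generator sorts on a separately computed value and never overwrites the stored score.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for CrawlDatum, reduced to what the sketch needs.
final class PageDatum {
    final String url;
    final float score;            // the "real" score, never modified here
    final int daysSinceFetch;     // example of an extra variable a plugin might use

    PageDatum(String url, float score, int daysSinceFetch) {
        this.url = url;
        this.score = score;
        this.daysSinceFetch = daysSinceFetch;
    }
}

// Hypothetical hook: returns the value the generator should sort by.
interface SortValueProvider {
    float sortValue(PageDatum datum);
}

final class TopNSelector {
    // Pairs a datum with its precomputed sort value.
    private static final class Keyed {
        final PageDatum datum;
        final float key;
        Keyed(PageDatum datum, float key) { this.datum = datum; this.key = key; }
    }

    // Returns the topN entries ordered by descending sort value.
    static List<PageDatum> select(List<PageDatum> all, SortValueProvider provider, int topN) {
        List<Keyed> keyed = new ArrayList<>();
        for (PageDatum d : all) {
            keyed.add(new Keyed(d, provider.sortValue(d)));
        }
        keyed.sort((a, b) -> Float.compare(b.key, a.key));
        List<PageDatum> out = new ArrayList<>();
        for (int i = 0; i < Math.min(topN, keyed.size()); i++) {
            out.add(keyed.get(i).datum);
        }
        return out;
    }
}
```

The sort value is computed once per entry, so no store/restore step is needed and CrawlDatum.compareTo would not have to change.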




Re: Refactoring some plugins

2006-03-29 Thread Doug Cutting

Jérôme Charron wrote:

Moreover, I would like to suggest some other javadoc "improvements" (?):

1. Create a group for abstract plugins (like lib-http or lib-regex-filter)
named for instance "Plugins API"


+1


2. Create a group for extension points (as far as I remember, one of the
first problems when you want
to extend Nutch is finding where the hooks are, i.e. what the extension
points are). Once again, since the
javadoc groups are filtered by package, each extension point interface must
be moved to a specific package.
The idea is then to move all the core extension points to a new package
(for instance org.apache.nutch.api).


I'm reluctant to move the extension interface away from the parameter 
and return value classes used by that interface.  Could we instead add a 
super-interface that all extension-point interfaces extend?  That way 
all of the extension points would be listed in javadoc as 
implementations of this interface.
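A minimal sketch of the super-interface idea follows. The names `ExtensionPoint`, `UrlFilterPoint` and `ParserPoint` are made up for illustration and are not existing Nutch types; each extension-point interface stays in its own package and only gains an `extends` clause.

```java
// Hypothetical marker interface; the name is an assumption.
interface ExtensionPoint {
}

// Simplified stand-ins for real extension-point interfaces.
interface UrlFilterPoint extends ExtensionPoint {
    String filter(String url);
}

interface ParserPoint extends ExtensionPoint {
    String parse(byte[] content);
}

final class ExtensionPointDemo {
    // True for interfaces that extend the marker, but not for the marker itself.
    static boolean isExtensionPoint(Class<?> type) {
        return type.isInterface()
                && type != ExtensionPoint.class
                && ExtensionPoint.class.isAssignableFrom(type);
    }
}
```

Javadoc lists the known subinterfaces of an interface on its page, so a marker like this would collect the extension points in one place without moving any code.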



3. Create many javadoc plugins groups (one for each major kind of plugin :
Indexing, Parsing, Protocol, Query, UrlFilter and
Misc for those that cannot be categorized).


+1

Doug


Re: Refactoring some plugins

2006-03-29 Thread Jérôme Charron
> I don't think it's upside down.  Plugins should not share packages with
> core code, since that would permit them to use package-private APIs.
> Also, re-arranging the code to make the javadoc nice is right, since the
> javadoc is a primary means of describing the code.

Yes, but what I mean is that it is "strange" that it is a documentation issue
that raises this need for refactoring.

Moreover, I would like to suggest some other javadoc "improvements" (?):

1. Create a group for abstract plugins (like lib-http or lib-regex-filter)
named for instance "Plugins API"
2. Create a group for extension points (as far as I remember, one of the
first problems when you want
to extend Nutch is finding where the hooks are, i.e. what the extension
points are). Once again, since the
javadoc groups are filtered by package, each extension point interface must
be moved to a specific package.
The idea is then to move all the core extension points to a new package
(for instance org.apache.nutch.api).
3. Create many javadoc plugins groups (one for each major kind of plugin :
Indexing, Parsing, Protocol, Query, UrlFilter and
Misc for those that cannot be categorized).

Thanks for your suggestions and comments.

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


[jira] Commented: (NUTCH-220) PDF Box can't parse document: java.lang.NullPointerException

2006-03-29 Thread Richard Braman (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-220?page=comments#action_12372310 ] 

Richard Braman commented on NUTCH-220:
--

I upgraded the Nutch 0.8 trunk to PDFBox HEAD.

The NullPointerException seems to be resolved by upgrading Nutch to PDFBox 
0.7.3.

The major issues in upgrading Nutch to PDFBox 0.7.3 are:

1.  PDFBox now depends on FontBox, which must be included as a plugin, 
lib-fontbox.
2.  PDFBox no longer depends on log4j. When I tried to remove references to the 
dependency in the build.xml for parse-pdf, it returned assorted ant build 
errors. I left the references to log4j in and it built fine.

Someone who has more knowledge of building Nutch needs to modify the build and 
plugin.xml if references to log4j should be removed.

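plugin.xml for FontBox

(XML listing not preserved in the archive)

build.xml for lib-fontbox

(XML listing not preserved in the archive)

parse-pdf plugin.xml

(XML listing not preserved in the archive)

parse-pdf build.xml

(XML listing not preserved in the archive)
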

> PDF Box can't parse document: java.lang.NullPointerException
> 
>
>  Key: NUTCH-220
>  URL: http://issues.apache.org/jira/browse/NUTCH-220
>  Project: Nutch
> Type: Bug
>  Environment: PDFBox 0.7.2
> Reporter: Richard Braman

>
> This error was fixed in the latest build of PDFBox, which should be tested 
> with Nutch.
> >> 060228 160354 fetch okay, but can't parse
> >> http://www.mstc.state.ms.us/info/stats/transfer/tran0704.pdf, reason:
> >> failed(2,0): Can't be handled as pdf document. 
> >> java.lang.NullPointerException
> Yes, the NPE should be fixed.
>  Ben
> Richard Braman wrote:
> > Hi Ben,
> >
> > We actually got to the bottom of all of them except for 1... The 
> > content truncation was due to an inconsistency bug in the Nutch config.
> > The no permission to extract text is actually true, the author, the NC
> > Department of Revenue put this restriction on all of their files (I have
> > asked them to remove it as it hampers public accessibility).  The null
> > pointer exception is the only one to deal with that may be due to the
> > parsing bug.  Is this the one that you are referring to?
> >
> > -Original Message-
> > From: Ben Litchfield [mailto:[EMAIL PROTECTED]
> > Sent: Thursday, March 02, 2006 4:07 PM
> > To: Richard Braman
> > Cc: nutch-dev@lucene.apache.org; nutch-user@lucene.apache.org;
> > [EMAIL PROTECTED]
> > Subject: Re: [PDFBox-user] PDF Parse Error
> >
> >
> >
> > I believe these errors are due to a parsing bug in PDFBox that has 
> > been fixed since the 0.7.2 release.  Please give the nightly 
> > build(should be a drop in replacement) a try from 
> > http://www.pdfbox.org/dist and let me know if you are still having 
> > issues.
> >
> > Ben




[jira] Commented: (NUTCH-220) PDF Box can't parse document: java.lang.NullPointerException

2006-03-29 Thread Ben Litchfield (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-220?page=comments#action_12372277 ] 

Ben Litchfield commented on NUTCH-220:
--

Actually, now that I look at the stack trace, the NPE is not happening in 
PDFBox code; it appears to be in Hadoop code, so I don't think that upgrading 
PDFBox will help.  

Ben




[jira] Commented: (NUTCH-220) PDF Box can't parse document: java.lang.NullPointerException

2006-03-29 Thread Richard Braman (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-220?page=comments#action_12372275 ] 

Richard Braman commented on NUTCH-220:
--

PDFBox-0.7.3 no longer depends on log4j at all, so you should not be
getting any log4j errors from PDFBox!

Ben


On Sun, 26 Mar 2006, Richard Braman wrote:

> > Hi Ben,
> > I noticed that Nutch uses a log4j version of PDFBox.jar.  I don't
> > see this as an ant target in 0.7.3.  I downloaded PDFBox from CVS HEAD.
> >
> > When I tried to use the PDFBox nightly it gave me a bunch of log4j
> > errors, so I guess Nutch expects the log4j version.
> >
> > I am trying to upgrade my Nutch to PDFBox 0.7.3 to see if I can get rid of
> > the NPE error.
> >
> > The NPE bug I told you about a few weeks ago has a much worse effect in
> > Nutch 0.8, as it seems to cause the fetcher to abort.
> >
> > 060326 142450 fetch of
> > http://www.state.sd.us/drr2/reg/bank/Trust%20Fee%20Calculation.pdf
> > failed with: java.lang.NullPointerException
> > java.lang.NullPointerException
> > at
> > org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:180)
> > at
> > org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:171)
> > at org.apache.hadoop.mapred.MapTask$2.collect(MapTask.java:91)
> > at
> > org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:245)
> > at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:185)
> > 060326 142450 SEVERE fetcher caught:java.lang.NullPointerException
> >
> > --
> > Richard L Braman, Jr., CPA
> > Tax Code Software Foundation, Inc.
> > Open Source Tax Software
> > http://www.taxcodesoftware.org
> > [EMAIL PROTECTED]
> >





[jira] Updated: (NUTCH-48) "Did you mean" query enhancement/refignment feature request

2006-03-29 Thread Aled Rhys Jones (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-48?page=all ]

Aled Rhys Jones updated NUTCH-48:
-

Attachment: rss-spell.patch

Added patch to add spelling correction to the rss feed in the following 
opensearch format:


This patch must be applied after spell-check.patch.


> "Did you mean"  query enhancement/refignment feature request
> 
>
>  Key: NUTCH-48
>  URL: http://issues.apache.org/jira/browse/NUTCH-48
>  Project: Nutch
> Type: New Feature
>   Components: web gui
>  Environment: All platforms
> Reporter: byron miller
> Assignee: Sami Siren
> Priority: Minor
>  Attachments: rss-spell.patch, spell-check.patch
>
> Looking to implement a "Did you mean" feature for query result pages that 
> return <= x results, so that the page recommends a fixed, related, or 
> spell-checked query to try.
> Note from Doug to users list:
> David Spencer has worked on this some.
> http://www.searchmorph.com/weblog/index.php?id=23
> I think the code on his site might be more recent than what's committed
> to the lucene/contrib directory.
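The behaviour requested above can be sketched roughly as follows. This is not the code from spell-check.patch or from the Lucene contrib spell checker; it is a hypothetical, self-contained illustration that only offers a suggestion when the hit count is at or below a threshold, and picks the closest dictionary term by Levenshtein edit distance. The class name, the `maxHits` parameter and the distance cut-off of 2 are all assumptions.

```java
import java.util.List;

final class DidYouMean {
    // Classic Levenshtein edit distance using two rolling rows.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Returns a suggestion only when the query produced at most maxHits results;
    // returns null when no suggestion should be shown.
    static String suggest(String query, long hits, long maxHits, List<String> dictionary) {
        if (hits > maxHits) {
            return null;
        }
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (String term : dictionary) {
            int d = distance(query, term);
            if (d > 0 && d < bestDistance) {   // skip exact matches
                best = term;
                bestDistance = d;
            }
        }
        // Assumed cut-off: ignore terms more than two edits away.
        return bestDistance <= 2 ? best : null;
    }
}
```

A real implementation would draw the dictionary from the index terms and would likely rank candidates by term frequency as well as distance.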




Re: [jira] Closed: (NUTCH-196) lib-xml and lib-log4j plugins

2006-03-29 Thread Andrzej Bialecki

Jerome Charron (JIRA) wrote:

 [ http://issues.apache.org/jira/browse/NUTCH-196?page=all ]
 
Jerome Charron closed NUTCH-196:



Fix Version: 0.8-dev
 Resolution: Fixed

Added a lib-xml that gathers many xml libraries previously used in parse-rss.
(http://svn.apache.org/viewcvs?rev=389716&view=rev)
  


Thanks!

--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




[jira] Closed: (NUTCH-196) lib-xml and lib-log4j plugins

2006-03-29 Thread Jerome Charron (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-196?page=all ]
 
Jerome Charron closed NUTCH-196:


Fix Version: 0.8-dev
 Resolution: Fixed

Added a lib-xml that gathers many xml libraries previously used in parse-rss.
(http://svn.apache.org/viewcvs?rev=389716&view=rev)


> lib-xml and lib-log4j plugins
> -
>
>  Key: NUTCH-196
>  URL: http://issues.apache.org/jira/browse/NUTCH-196
>  Project: Nutch
> Type: Improvement
> Versions: 0.8-dev
> Reporter: Andrzej Bialecki 
> Assignee: Andrzej Bialecki 
>  Fix For: 0.8-dev
>  Attachments: NUTCH-196.lib-log4j.patch
>
> Many places in Nutch use XML. Parsing XML using the JDK API is painful. I 
> propose to add one (or more) library plugins with JDOM, DOM4J, Jaxen, etc. 
> This should simplify the current deployment, and help plugin writers to use 
> the existing API.
> Similarly, many plugins use log4j. Either we add it to the /lib, or we could 
> create a lib-log4j plugin.
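To give a sense of what parsing XML with the plain JDK API involves, here is a small sketch that reads the text of the first `title` element from a string. It uses only standard `javax.xml.parsers` and `org.w3c.dom` classes; the class name `JdkXmlDemo` and the choice of the `title` element are made up for the example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

final class JdkXmlDemo {
    // Returns the text of the first <title> element, or null if there is none.
    static String firstTitle(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList titles = doc.getElementsByTagName("title");
            return titles.getLength() == 0 ? null : titles.item(0).getTextContent();
        } catch (Exception e) {
            // Parser configuration, I/O and SAX errors all end up here.
            throw new RuntimeException("could not parse XML", e);
        }
    }
}
```

Libraries such as JDOM or DOM4J wrap this kind of setup and traversal in shorter calls, which is presumably the convenience a shared library plugin is meant to provide.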
