Hi Carl,
 
How about this:
 
Create a new plugin which will run during indexing. Its extension point will
be ScoringFilter (this should be set up in plugin.xml).
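 
For example, the plugin descriptor could look roughly like this (the plugin
id, jar name and class are placeholders here; the structure follows the
bundled scoring-opic plugin's plugin.xml):

<plugin id="scoring-topic" name="Topic Scoring Filter"
        version="1.0.0" provider-name="example.org">
   <runtime>
      <library name="scoring-topic.jar">
         <export name="*"/>
      </library>
   </runtime>
   <requires>
      <import plugin="nutch-extensionpoints"/>
   </requires>
   <extension id="org.example.scoring.topic"
              name="Topic Scoring Filter"
              point="org.apache.nutch.scoring.ScoringFilter">
      <implementation id="TopicScoringFilter"
                      class="org.example.scoring.TopicScoringFilter"/>
   </extension>
</plugin>

Don't forget to add the plugin id to the plugin.includes property in
nutch-site.xml, otherwise Nutch won't load it.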
 
Your plugin will implement the ScoringFilter interface. In your case, I guess
it will be enough to implement indexerScore().
 
A simple implementation of this method could look like this (isOk() stands for
whatever relevance check you run on the stored value):

String scoreFactorStr = parse.getData().getParseMeta().get("Score_Factor");
if (!isOk(scoreFactorStr)) {
  // content is off-topic - boost the document down to suppress its score
  return doc.getBoost() * 0.0001f;
}

return doc.getBoost();
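
Note that something has to put "Score_Factor" into the parse metadata in the
first place. One hypothetical way is an HtmlParseFilter plugin that checks the
extracted text for your topic keywords during parsing. A rough sketch -- the
class name and keyword list are only placeholders, and you should check the
method signature against the 0.9 Javadoc:

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.HtmlParseFilter;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.protocol.Content;
import org.w3c.dom.DocumentFragment;

public class TopicParseFilter implements HtmlParseFilter {

  private Configuration conf;

  public Parse filter(Content content, Parse parse,
                      HTMLMetaTags metaTags, DocumentFragment doc) {
    String text = parse.getText().toLowerCase();
    // Very naive relevance test: does the page mention any topic keyword?
    String[] keywords = { "antarctica", "antarctic", "south pole" };
    boolean onTopic = false;
    for (int i = 0; i < keywords.length; i++) {
      if (text.indexOf(keywords[i]) >= 0) { onTopic = true; break; }
    }
    // Stash the verdict where a ScoringFilter can read it at index time.
    parse.getData().getParseMeta().set("Score_Factor", onTopic ? "ok" : "bad");
    return parse;
  }

  public void setConf(Configuration conf) { this.conf = conf; }
  public Configuration getConf() { return conf; }
}

(What counts as an OK value is whatever your isOk() check expects, of course.)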
 
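As for your earlier question about getting the score back into the crawldb:
in 0.9 that normally happens during updatedb, not in passScoreAfterParsing()
itself. Following OPICScoringFilter's pattern, the value stored under
Nutch.SCORE_KEY in passScoreAfterParsing() comes back in
distributeScoreToOutlink() through the parse data's content metadata, and the
scores set on the outlink targets there are what end up in the crawldb. A
rough sketch, again worth checking against the 0.9 Javadoc:

import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.metadata.Nutch;
import org.apache.nutch.parse.ParseData;
import org.apache.nutch.scoring.ScoringFilterException;

public CrawlDatum distributeScoreToOutlink(Text fromUrl, Text toUrl,
    ParseData parseData, CrawlDatum target, CrawlDatum adjust,
    int allCount, int validCount) throws ScoringFilterException {
  // Recover the relevance score stored in passScoreAfterParsing().
  float relevance = 0.0f;
  String s = parseData.getContentMeta().get(Nutch.SCORE_KEY);
  if (s != null) {
    try {
      relevance = Float.parseFloat(s);
    } catch (NumberFormatException e) {
      // keep the default
    }
  }
  // Give each outlink a share of this page's relevance, so links found on
  // on-topic pages are fetched before links found on off-topic ones.
  target.setScore(relevance / validCount);
  return adjust;
}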
 
Milan Krendzelak 


________________________________

From: Carl Cerecke [mailto:[EMAIL PROTECTED]]
Sent: Thu 12/07/2007 05:24
To: [EMAIL PROTECTED]
Subject: Re: Restricting crawl to a certain topic



Carl Cerecke wrote:
> Andrzej Bialecki wrote:
>> Carl Cerecke wrote:
>>> Hi,
>>>
>>> I'm wondering what the best approach is to restrict a crawl to a
>>> certain topic. I know that I can restrict what is crawled by a regex
>>> on the URL, but I also need to restrict pages based on their content
>>> (whether they are on topic or not).
>>>
>>> For example, say I wanted to crawl pages about Antarctica. First I
>>> start off with a handful of pages and inject them into the crawldb,
>>> and I generate a fetchlist, and can start sucking the pages down. I
>>> update the crawldb with links from what has just been sucked down,
>>> and then during the next fetch (and subsequent fetches), I want to
>>> filter which pages end up in the segment based on their content
>>> (using, perhaps, some sort of Antarctica-related-keyword score).
>>> Somehow I also need to tell the crawldb about the URLs which I've
>>> sucked down but aren't Antarctica-related pages (so we don't suck
>>> them down again).
>>>
>>> This seems like the sort of problem other people have solved. Any
>>> pointers? Am I on the right track here? I'm using Nutch 0.9.
>>
>> The easiest way to do this is to implement a ScoringFilter plugin,
>> which promotes wanted pages and demotes unwanted ones. Please see
>> Javadoc for the ScoringFilter for details.
>
> I've given this a crack and it mostly seems to work, except I'm not sure
> how to get the score back into the crawldb. After reading the Javadoc, I
> figured that passScoreAfterParsing() was the method I need to implement.
> All others are just simple one-liners for this case. Unfortunately,
> passScoreAfterParsing() is alone in not having a CrawlDatum argument, so
> I can't call datum.setScore(); I did notice that OPICScoringFilter does
> this in passScoreAfterParsing:
> parse.getData().getContentMeta().set(Nutch.SCORE_KEY, ...); and I tried
> that in my own scoring filter, but just get the zero from
> datum.setScore(0.0f) in initialScore().
>
> Couple of questions then:
> 1. Does it make sense to put the relevancy scoring code into
> passScoreAfterParsing()
> 2. If so, how do I get the score into the crawldb?
>
> I'm a bit vague on how all these bits connect together under the hood at
> the moment...

Spent all day on this, but no luck. I'm sure I'm missing something
obvious. Glad for any pointers in the right direction.

Cheers,
Carl.

