I want to modify Nutch to increase the score of some pages in their
CrawlDatum. The goal is to mark the pages that include a certain token:
giving them a high score should make them more likely to be chosen again
in the next segment generation.
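
To make the intent concrete, here is a minimal standalone sketch (plain Java,
not Nutch code) of what I mean by a high score being chosen first: entries are
sorted by score and the top N are taken. TopNByScore, Entry and selectTopN are
hypothetical names used only for illustration, and the real generator may
apply further criteria.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class TopNByScore {
    // Hypothetical stand-in for a crawldb entry: only a URL and its score.
    public static final class Entry {
        public final String url;
        public final float score;

        public Entry(String url, float score) {
            this.url = url;
            this.score = score;
        }
    }

    // Return at most topN entries, highest score first.
    public static List<Entry> selectTopN(List<Entry> entries, int topN) {
        List<Entry> sorted = new ArrayList<>(entries);
        sorted.sort(Comparator.comparingDouble((Entry e) -> e.score).reversed());
        return new ArrayList<>(sorted.subList(0, Math.min(topN, sorted.size())));
    }

    public static void main(String[] args) {
        List<Entry> entries = Arrays.asList(
            new Entry("http://example.com/plain", 1.0f),
            new Entry("http://example.com/with-token", 500.0f),
            new Entry("http://example.com/other", 0.5f));

        // With topN = 1, only the boosted entry is selected.
        for (Entry e : selectTopN(entries, 1)) {
            System.out.println(e.url + " " + e.score);
        }
    }
}
```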

I modified the code like this:

Fetcher.java:
(...)
case ProtocolStatus.SUCCESS:        // got a page
                  // replace the fetched content with a placeholder ("0")
                  content.setContent(Integer.toString(0).getBytes());
                  if (tokenIncluded) {
                      // the page includes the token: raise its score
                      fit.datum.setScore(500.0f);
                  }
(...)

After fetching, when the crawldb has to be updated, the input entries of the
MapReduce update process are the fetched URLs. Each fetched URL appears twice,
like this:

http://bardeportes.blogspot.com/
Version: 7
Status: 2 (db_fetched)
Fetch time: Thu Mar 11 11:06:09 CET 2010
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 0 seconds (0 days)
Score: 1.0
Signature: null
Metadata: _pst_: success(1), lastModified=0

...

http://bardeportes.blogspot.com/
Version: 7
Status: 33 (fetch_success)
Fetch time: Thu Mar 11 11:06:17 CET 2010
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 0 seconds (0 days)
Score: 500.0
Signature: null
Metadata: _ngt_: 1268301973666_pst_: success(1), lastModified=0


And the final score written to the crawldb is 1.0 (never 500.0).
Any idea?
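
To show what I think I am seeing, here is a small standalone model (plain
Java, not Nutch code). It assumes the update step receives both entries for
the same URL and keeps the score of the entry already stored in the db. That
merge rule is only my guess at the behaviour, and ScoreMergeSketch, Datum and
mergeKeepingDbScore are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;

public class ScoreMergeSketch {
    // Hypothetical, simplified stand-in for a CrawlDatum: status and score only.
    public static final class Datum {
        public final String status;
        public final float score;

        public Datum(String status, float score) {
            this.status = status;
            this.score = score;
        }
    }

    // Assumed merge rule (a guess, not confirmed): when both a db entry and a
    // fetch entry exist for a URL, the result keeps the score stored in the db.
    public static Datum mergeKeepingDbScore(List<Datum> values) {
        Datum db = null;
        Datum fetch = null;
        for (Datum d : values) {
            if (d.status.equals("db_fetched")) {
                db = d;
            } else if (d.status.equals("fetch_success")) {
                fetch = d;
            }
        }
        if (fetch == null) return db;   // nothing newly fetched
        if (db == null) return fetch;   // URL not in the db yet
        return new Datum("db_fetched", db.score);
    }

    public static void main(String[] args) {
        // The two entries from the dump above, reduced to status and score.
        Datum merged = mergeKeepingDbScore(Arrays.asList(
            new Datum("db_fetched", 1.0f),
            new Datum("fetch_success", 500.0f)));
        System.out.println(merged.score); // prints 1.0
    }
}
```

If the real update works like this, the score set in the fetcher would be
dropped at this point, which would match the 1.0 I get.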

(I hope someone understands the issue)
-- 
View this message in context: 
http://old.nabble.com/Increasing-the-score-of-especific-pages-tp27861656p27861656.html
Sent from the Nutch - Dev mailing list archive at Nabble.com.