Andrzej Bialecki wrote:
When we used WebDB it was possible to overlap generate / fetch / update
cycles, because we would "lock" the pages selected by FetchListTool for
a period of time. Now we don't do this. The advantage is that we don't
have to rewrite the CrawlDB. But operations on the CrawlDB are
considerably faster than on the WebDB; perhaps we should consider going
back to this method?
Yes, this would be a good addition.
Ideally we should change Crawl.java to overlap these too. When -topN is
specified and is substantially smaller than the total size of the
crawldb, we can generate, start a fetch job, and then generate again.
As each fetch completes, we can start the next one, then run an update
and a generate based on the just-completed fetch, so that we're
constantly fetching.
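To make the overlap concrete, something like the following loop would
do it. This is just a sketch: generate(), fetch(), and updatedb() are
stand-ins for the corresponding Nutch jobs, not the real Crawl.java
code.

  import java.util.concurrent.ExecutorService;
  import java.util.concurrent.Executors;
  import java.util.concurrent.Future;

  public class OverlappedCrawlSketch {

    // Stubs standing in for the real generate/fetch/updatedb jobs.
    static String generate(int round)    { return "segment-" + round; }
    static void fetch(String segment)    { System.out.println("fetching " + segment); }
    static void updatedb(String segment) { System.out.println("updating from " + segment); }

    public static void main(String[] args) throws Exception {
      ExecutorService pool = Executors.newSingleThreadExecutor();
      final int rounds = 3;

      String next = generate(0);                   // first fetch list
      String fetching = null;
      Future<?> running = null;

      for (int i = 1; i <= rounds; i++) {
        if (running != null) {
          running.get();                           // wait for the previous fetch
        }
        String done = fetching;
        fetching = next;
        final String seg = fetching;
        running = pool.submit(() -> fetch(seg));   // start the next fetch at once
        if (done != null) {
          updatedb(done);                          // update from the finished fetch...
        }
        next = (i < rounds) ? generate(i) : null;  // ...and generate while fetching
      }
      running.get();
      updatedb(fetching);                          // fold in the final fetch
      pool.shutdown();
    }
  }

With this shape, the fetcher is idle only while the very first fetch
list is generated; every later generate and update runs while a fetch
is in flight.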
This could be implemented by: (a) adding a status for generated crawl
data; (b) adding an option to updatedb to include the generated output
from some segments. Then, in the above algorithm, the first time we'd
update with only the generator output, but after that we could combine
the updates with both fetcher and generator output. This way, in the
course of a crawl, we rewrite the crawldb only one additional time,
rather than twice as many times. Does this make sense?
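As a self-contained toy model of (a) and (b), the updatedb merge could
treat generator and fetcher records roughly like this (the CrawlState
enum and merge() rule are invented for illustration, not the existing
Nutch API):

  public class GeneratedStatusSketch {

    enum CrawlState { UNFETCHED, GENERATED, FETCHED }

    // Merge rule applied during updatedb: generator output marks an
    // entry GENERATED so a concurrent generate will skip it; fetcher
    // output clears the mark by moving the entry to FETCHED.
    static CrawlState merge(CrawlState old, CrawlState event) {
      switch (event) {
        case GENERATED:
          // Don't let a stale generator record undo a completed fetch.
          return old == CrawlState.FETCHED ? old : CrawlState.GENERATED;
        case FETCHED:
          return CrawlState.FETCHED;
        default:
          return old;
      }
    }

    public static void main(String[] args) {
      CrawlState s = merge(CrawlState.UNFETCHED, CrawlState.GENERATED);
      System.out.println(s);                             // GENERATED
      System.out.println(merge(s, CrawlState.FETCHED));  // FETCHED
    }
  }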
And a final note: CrawlDB.update() uses the initial score value
recorded in the segment, and NOT the value that is actually found in
CrawlDB at the time of the update. This means that if there was
another update in the meantime, your new score in CrawlDB will be
overwritten with the score based on an older initial value. This is
counter-intuitive; I think CrawlDB.update() should always use the
latest score value found in the current CrawlDB. I.e., in
CrawlDBReducer, instead of doing:
result.setScore(result.getScore() + scoreIncrement);
we should do:
result.setScore(old.getScore() + scoreIncrement);
The change is not quite that simple, since 'old' is sometimes null.
So perhaps we need to add a 'score' variable that is set to old.score
when old != null, and to 1.0 otherwise (for newly linked pages).
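For example, a minimal sketch of that rule, using a Float stand-in for
the old CrawlDatum's score rather than the actual CrawlDBReducer types:

  public class ScoreUpdateSketch {

    static float updatedScore(Float oldScore, float scoreIncrement) {
      // Base the new score on the value currently stored in CrawlDB;
      // 'oldScore' is null for newly linked pages with no entry yet,
      // so fall back to the default initial score of 1.0.
      float base = (oldScore != null) ? oldScore : 1.0f;
      return base + scoreIncrement;
    }

    public static void main(String[] args) {
      System.out.println(updatedScore(2.5f, 0.4f)); // existing page: 2.5 + 0.4
      System.out.println(updatedScore(null, 0.4f)); // new page: 1.0 + 0.4
    }
  }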
The reason I didn't do it that way was to permit the Fetcher to modify
scores, since I was thinking of the Fetcher as the actor whose actions
are being processed here, and of the CrawlDb as the passive thing
acted on. But indeed, if you have another process that's updating a
CrawlDb while a Fetcher is running, this may not be the case. So if
we want to switch things so that the Fetcher is not permitted to
adjust scores, then this seems like a reasonable change.
I would vote for implementing this change. The reason is that the active
actor that computes new scores is CrawlDb.update(). Fetcher may provide
additional information to affect the score, but IMHO the logic to
calculate new scores should be concentrated in the update() method.
I agree: +1. I was just trying to explain the existing logic. I think
this would provide a significant improvement, with little lost.
Doug