Re: [Nutch-dev] Not renewing CrawlDatum on Inject

Andrzej Bialecki Mon, 09 Jul 2007 12:18:46 -0700

Robert Young wrote:
> I have been trying to get to grips with
> org.apache.nutch.crawl.Injector to help with a requirement I have for
> the project I'm working on and I'm a little confused about one place.
> On lines 120 - 121 any existing CrawlDatum is used instead of the
> newly injected one. This doesn't seem to make sense from my point of
> view, I'm guessing it's just a matter of not being able to see the
> issue from the other side. My scenario is this: when I inject a URL,
> it is because I want it to be re-indexed, maybe because it has
> changed. I don't care if that URL is already in the crawldb; I want
> it re-indexed. As far as I can see, if this weren't the case I
> wouldn't be trying to inject it.
> 
> What am I missing here? Why is the existing CrawlDatum used instead of
> the newly injected one?


That is indeed a place in Nutch that I have planned to change for a
long time. This behavior is not obvious and, what's worse, it's
undocumented.

It would be relatively simple to extend this behavior so that only 
selected parts of data would be updated or replaced when a seed list 
contains the same URL as the one already in CrawlDb.

For now, just add the code that you need in Injector.InjectReducer.
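
As a rough illustration of the kind of change meant here, the sketch below shows one possible merge policy for a URL that is both in the CrawlDb and in the seed list. It deliberately does not use the Nutch or Hadoop APIs: the class InjectMergeSketch, the Datum type, its fields, and the renewOnInject flag are all hypothetical stand-ins, so treat it as an outline of the logic under those assumptions rather than code to paste into Injector.InjectReducer.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical, self-contained sketch; none of these names exist in Nutch.
class InjectMergeSketch {

    /** Stand-in for a CrawlDb entry (NOT the real CrawlDatum). */
    static final class Datum {
        final boolean injected; // true if the record came from the seed list
        final long fetchTime;   // time at which the URL is next due for fetching
        final float score;      // score accumulated by earlier crawl cycles

        Datum(boolean injected, long fetchTime, float score) {
            this.injected = injected;
            this.fetchTime = fetchTime;
            this.score = score;
        }
    }

    /**
     * Merges all records seen for one URL.
     * renewOnInject == false keeps the existing entry untouched, which is the
     * behavior discussed in this thread. renewOnInject == true is an assumed
     * alternative: keep the old score but take the injected fetch time, so
     * that the URL becomes due for fetching again.
     */
    static Datum merge(List<Datum> values, boolean renewOnInject) {
        Datum old = null;
        Datum injected = null;
        for (Datum d : values) {
            if (d.injected) {
                injected = d;
            } else {
                old = d;
            }
        }
        if (old == null) {
            // URL not yet known: store the injected record as a normal entry.
            return injected == null
                    ? null
                    : new Datum(false, injected.fetchTime, injected.score);
        }
        if (injected == null || !renewOnInject) {
            return old; // do not overwrite the existing value
        }
        // Selective update: only the fetch time is replaced.
        return new Datum(false, injected.fetchTime, old.score);
    }

    public static void main(String[] args) {
        Datum old = new Datum(false, 5000L, 2.5f);
        Datum seed = new Datum(true, 100L, 1.0f);
        Datum kept = merge(Arrays.asList(old, seed), false);
        Datum renewed = merge(Arrays.asList(old, seed), true);
        System.out.println("kept fetchTime=" + kept.fetchTime);
        System.out.println("renewed fetchTime=" + renewed.fetchTime
                + " score=" + renewed.score);
    }
}
```

Which fields to carry over from the old entry and which to take from the injected one is exactly the policy decision that would need to be made (and documented); the choice above is only one example.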


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


_______________________________________________
Nutch-developers mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-developers
