Robert Young wrote:
> I have been trying to get to grips with
> org.apache.nutch.crawl.Injector to help with a requirement I have for
> the project I'm working on and I'm a little confused about one place.
> On lines 120 - 121 any existing CrawlDatum is used instead of the
> newly injected one. This doesn't seem to make sense from my point of
> view, I'm guessing it's just a matter of not being able to see the
> issue from the other side. The scenario I am in is as such: when I
> inject a URL it is because I want it to be re-indexed, maybe because
> it's changed. I don't care if that URL is already in the crawldb, I want
> it re-indexed. As far as I can see, if this wasn't the case I wouldn't
> be trying to inject it.
>
> What am I missing here? Why is the existing CrawlDatum used instead of
> the newly injected one?

That's indeed a place in Nutch that I have planned to change for a long time ... This behavior is not obvious and, what's worse, it's undocumented.

It would be relatively simple to extend this behavior so that only selected parts of the data would be updated or replaced when a seed list contains the same URL as one already in CrawlDb.

For now, just add the code that you need in Injector.InjectReducer.

-- 
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
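
To illustrate the kind of change suggested above for Injector.InjectReducer, here is a minimal, self-contained sketch of the merge decision. It does not use the Nutch or Hadoop classes: `Datum`, `Status`, the `overwrite` flag and the `reduce` helper are all hypothetical stand-ins invented for this example, and the real reducer's fields and signatures may differ. The only behavior taken from the thread is that an existing CrawlDb entry currently wins over a newly injected one.

```java
import java.util.Arrays;
import java.util.List;

public class InjectMergeSketch {

    // Hypothetical stand-in for the status codes kept on a CrawlDatum.
    enum Status { INJECTED, DB_UNFETCHED, DB_FETCHED }

    // Hypothetical, simplified stand-in for CrawlDatum.
    static final class Datum {
        final Status status;
        final long fetchTime;
        final float score;

        Datum(Status status, long fetchTime, float score) {
            this.status = status;
            this.fetchTime = fetchTime;
            this.score = score;
        }
    }

    /**
     * Merges all values seen for one URL.
     * overwrite == false: an existing entry wins (the behavior described in the thread).
     * overwrite == true:  the injected entry replaces the existing one (assumed extension).
     */
    static Datum reduce(List<Datum> values, boolean overwrite) {
        Datum injected = null;
        Datum existing = null;
        for (Datum d : values) {
            if (d.status == Status.INJECTED) {
                injected = d;
            } else {
                existing = d;
            }
        }
        if (injected == null && existing == null) {
            throw new IllegalArgumentException("no values for this URL");
        }
        // Keep the old entry when nothing was injected, or when overwriting is off.
        if (existing != null && (injected == null || !overwrite)) {
            return existing;
        }
        // Otherwise store the injected entry as an unfetched page so it is fetched again.
        return new Datum(Status.DB_UNFETCHED, injected.fetchTime, injected.score);
    }

    public static void main(String[] args) {
        Datum old = new Datum(Status.DB_FETCHED, 100L, 2.0f);
        Datum fresh = new Datum(Status.INJECTED, 500L, 1.0f);
        List<Datum> values = Arrays.asList(old, fresh);

        System.out.println("default:   " + reduce(values, false).status);
        System.out.println("overwrite: " + reduce(values, true).status);
    }
}
```

The same structure could be extended to update only selected fields (for example, resetting the fetch time but keeping the old score) instead of replacing the whole entry.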
