[EMAIL PROTECTED] wrote:
> Hi All,
>
> I think I am missing something. In the Injector reduce code we have the
> following.
>
> ------------------------------------------------------------------------
> while (values.hasNext()) {
>   CrawlDatum val = (CrawlDatum)values.next();
>   if (val.getStatus() == CrawlDatum.STATUS_INJECTED) {
>     injected = val;
>     injected.setStatus(CrawlDatum.STATUS_DB_UNFETCHED);
>   } else {
>     old = val;
>   }
> }
>
> CrawlDatum res = null;
> if (old != null) res = old; // don't overwrite existing value
> else res = injected;
> ------------------------------------------------------------------------
>
> Basically, if a value exists that was not just injected, don't overwrite
> it. But I don't see how the input could contain a CrawlDatum that wasn't
> just injected and so could carry previous values. Is this just in case
> someone uses the Injector as a Reducer and not a Mapper, or am I missing
> how this condition can occur?
>
This handles an important case: when you inject URLs that already exist
in the DB, the reducer receives both the old value and the newly created
value under the same key. In previous versions of Injector, CrawlDatum-s
for such URLs could be overwritten with new values, and you could lose
valuable metadata accumulated in the old values.
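
To illustrate, here is a minimal standalone sketch of that merge rule. It
does not use the real Nutch classes: Datum, the status constants and their
values, and the "note" field are hypothetical stand-ins for CrawlDatum and
its metadata, so treat it as an approximation of the logic quoted above
rather than the actual Injector code.

```java
import java.util.Arrays;
import java.util.Iterator;

class InjectMergeSketch {
    // Hypothetical status codes; the numeric values are arbitrary.
    static final int STATUS_INJECTED = 1;
    static final int STATUS_DB_UNFETCHED = 2;
    static final int STATUS_DB_FETCHED = 3;

    // Hypothetical, simplified stand-in for CrawlDatum.
    static final class Datum {
        int status;
        final String note; // stands in for accumulated metadata

        Datum(int status, String note) {
            this.status = status;
            this.note = note;
        }
    }

    // Merges all values seen under one URL key, as the quoted reduce code does.
    static Datum merge(Iterator<Datum> values) {
        Datum old = null;
        Datum injected = null;
        while (values.hasNext()) {
            Datum val = values.next();
            if (val.status == STATUS_INJECTED) {
                injected = val;
                injected.status = STATUS_DB_UNFETCHED; // new URLs start unfetched
            } else {
                old = val; // value that was already in the DB
            }
        }
        // Don't overwrite an existing value with the freshly injected one.
        return (old != null) ? old : injected;
    }

    public static void main(String[] args) {
        // URL already in the DB and injected again: the existing value is kept.
        Datum existing = new Datum(STATUS_DB_FETCHED, "existing");
        Datum again = new Datum(STATUS_INJECTED, "re-injected");
        System.out.println(merge(Arrays.asList(existing, again).iterator()).note);

        // Brand new URL: only the injected value exists, so it is used.
        Datum fresh = new Datum(STATUS_INJECTED, "fresh");
        System.out.println(merge(Arrays.asList(fresh).iterator()).note);
    }
}
```

In the first call both values arrive under the same key and the old one
is returned unchanged; in the second there is no old value, so the
injected one is returned with its status switched to unfetched.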
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com