Doğacan Güney wrote:
> Andrzej, nice to see you working on this.
>
> There is one thing that I don't understand about your presentation.
> Assume that page A is the only URL in our crawldb and it contains n
> outlinks.
>
> t = 0 - Generate runs, A is generated.
>
> t = 1 - Page A is fetched and its cash is distributed to its outlinks.
>
> t = 2 - Generate runs, pages P0-Pn are generated.
>
> t = 3 - P0-Pn are fetched and their cash is distributed to their
>         outlinks.
>       - At this time, it is possible that page Pk links to page A.
>         So, now page A's cash > 0.
>
> t = 4 - Generate runs, page A is considered but is not generated
>         (since its next fetch time is later than the current time).
>       - Won't page A become a temporary sink? The time between
>         subsequent fetches may be as large as 30 days in the default
>         configuration. So, page A will accumulate cash for a long time
>         without distributing it.

Yes. That's why Abiteboul used a history (several cycles long) to smooth
out temporary imbalances in cash redistribution. The history component
described in the paper could span either several cycles or a specific
period of time. In our case I think the history for rarely updated pages
should span the db.max.interval period plus some, and for frequently
updated pages it should span several cycles.

> - I don't see how we can achieve that, but, IMO, if a page is
>   considered but not generated, Nutch should distribute its cash to
>   the outlinks that are stored in its parse data. (I know that
>   this is incredibly hard (if not impossible) to do.)

Actually we store outlinks in two places. One place is obviously the
segments. The other, less obvious place is the linkdb: the data is
there, it just needs to be inverted (again). So, theoretically, we could
modify the updatedb process to consider the complete webgraph, i.e. all
link information collected so far. But the main attraction of OPIC is
that it's incremental, so that you don't have to consider the whole
webgraph for small incremental updates.

-- 
Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

_______________________________________________
Nutch-developers mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-developers
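
The t = 0 to t = 4 scenario discussed in this thread can be reproduced with a small toy model. The following is only a sketch under stated assumptions, not Nutch code: the function `simulate`, its parameters, the one-cycle default refetch interval and the 30-cycle interval for page A are all made up for illustration, and one loop iteration stands in for one generate/fetch/update round.

```python
from collections import defaultdict, deque


def simulate(links, seed, cycles, fetch_interval, history_len):
    """Toy OPIC-style loop; the scheduling rule and all names are illustrative."""
    cash = defaultdict(float)
    cash[seed] = 1.0
    known = {seed}
    next_fetch = defaultdict(int)  # pages never fetched before are due at once
    history = defaultdict(lambda: deque(maxlen=history_len))
    held = []  # cash sitting on each known page at the end of every cycle

    for t in range(cycles):
        # "Generate": only pages whose next fetch time has arrived are selected.
        due = [p for p in sorted(known) if next_fetch[p] <= t]
        # "Fetch": collect the cash of all selected pages first, then hand it
        # out, so cash never travels more than one hop per cycle.
        taken = {}
        for p in due:
            if links.get(p):  # pages without outlinks simply keep their cash
                taken[p] = cash[p]
                cash[p] = 0.0
        for p, amount in taken.items():
            history[p].append(amount)  # bounded window of distributed cash
            outs = links[p]
            for q in outs:
                cash[q] += amount / len(outs)
                known.add(q)
            next_fetch[p] = t + fetch_interval.get(p, 1)
        held.append({p: cash[p] for p in sorted(known)})
    return held, history


# Page A links to P0..P3, and every Pk links back to A (hypothetical graph).
links = {"A": ["P0", "P1", "P2", "P3"],
         "P0": ["A"], "P1": ["A"], "P2": ["A"], "P3": ["A"]}
held, history = simulate(links, seed="A", cycles=32,
                         fetch_interval={"A": 30}, history_len=5)
for t in (0, 1, 2, 29, 30):
    print(t, held[t])
```

In this toy run page A holds all of the cash from cycle 1 until it is refetched at cycle 30, which is the temporary sink described above. With `history_len=5` the window of each Pk contains only zeros for most of that time, which illustrates why a history window for rarely refetched pages would have to cover the whole refetch interval.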
