[ http://issues.apache.org/jira/browse/NUTCH-272?page=comments#action_12412642 ]
Matt Kangas commented on NUTCH-272:
-----------------------------------

Ok, I just re-read Generator.java ( http://svn.apache.org/viewvc/lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Generator.java?view=markup )

* Selector.map() keeps values where crawlDatum.getFetchTime() <= curTime
* Selector.reduce() collects until "limit" is reached, optionally skipping the url if "hostCount.get() > maxPerHost"

So it caps _URLs/host going in this fetchlist_. Not total URLs/host. That's what I thought, and it is insufficient for the reasons stated above. (Will incrementally fetch everything.)

If the cap is 50k and a host has 70k active URLs in the crawldb, what Generate needs to say is "Here are the first 50k URLs added for this site, and I see only 3 are scheduled. We'll put 3 in this fetchlist."

Generate can only enforce a limit if it knows which 50k were _first_ added to the db, and _never_ fetches any of the latter 20k.

Hmm... it seems straightforward to modify Generator.java to count total URLs/host during map(), regardless of fetchTime. But I don't see what action we could take besides halting all fetches for the site.

We'd have to traverse the crawldb in order of record-creation time to be able to see which were the first N added to the crawldb. (I think the crawldb is sorted by url, not ctime.)

> Max. pages to crawl/fetch per site (emergency limit)
> ----------------------------------------------------
>
>          Key: NUTCH-272
>          URL: http://issues.apache.org/jira/browse/NUTCH-272
>      Project: Nutch
>         Type: Improvement
>     Reporter: Stefan Neufeind
>
> If I'm right, there is no way in place right now for setting an "emergency
> limit" to fetch a certain max. number of pages per site. Is there an "easy"
> way to implement such a limit, maybe as a plugin?

--
This message is automatically generated by JIRA.
- If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
- For more information on JIRA, see:
   http://www.atlassian.com/software/jira

_______________________________________________
Nutch-developers mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-developers
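
For illustration, the point in the comment above (a per-fetchlist host cap does not bound the total number of pages fetched from a host) can be reproduced with a small simulation. This is only a sketch and not Nutch code: the `Entry` class, the `generate` method, and the fixed iteration order are assumptions made for the example. It mirrors just the two rules described for Selector.map() and Selector.reduce().

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HostCapSketch {
    /** Hypothetical stand-in for a crawldb record; not Nutch's CrawlDatum. */
    static final class Entry {
        final String url;
        final String host;
        long fetchTime; // time at which the URL is next due for fetching

        Entry(String url, String host, long fetchTime) {
            this.url = url;
            this.host = host;
            this.fetchTime = fetchTime;
        }
    }

    /**
     * One simplified generate pass (assumption: entries are visited in list order).
     * Keeps only due entries, stops at the overall limit, and skips a URL once
     * its host has been seen more than maxPerHost times in this pass.
     */
    static List<Entry> generate(List<Entry> crawlDb, long curTime, int limit, int maxPerHost) {
        Map<String, Integer> hostCount = new HashMap<>();
        List<Entry> fetchlist = new ArrayList<>();
        for (Entry e : crawlDb) {
            if (e.fetchTime > curTime) {
                continue; // map() rule: keep only fetchTime <= curTime
            }
            if (fetchlist.size() >= limit) {
                break; // reduce() rule: collect until "limit" is reached
            }
            int count = hostCount.merge(e.host, 1, Integer::sum);
            if (count > maxPerHost) {
                continue; // reduce() rule: per-host cap for this fetchlist only
            }
            fetchlist.add(e);
        }
        return fetchlist;
    }

    public static void main(String[] args) {
        List<Entry> crawlDb = new ArrayList<>();
        for (int i = 0; i < 7; i++) {
            crawlDb.add(new Entry("http://example.com/" + i, "example.com", 0L));
        }
        int totalFetched = 0;
        for (int round = 1; round <= 2; round++) {
            List<Entry> fetchlist = generate(crawlDb, 1L, 100, 5);
            for (Entry e : fetchlist) {
                e.fetchTime = 1000L; // pretend it was fetched; next due far in the future
            }
            totalFetched += fetchlist.size();
            System.out.println("round " + round + ": " + fetchlist.size() + " URLs");
        }
        System.out.println("total fetched from host: " + totalFetched);
    }
}
```

With seven due URLs on one host and a cap of five, the first pass selects five and the second pass selects the remaining two, so all seven are fetched despite the cap. Enforcing a true per-host total would need extra state, such as the record-creation order mentioned above.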
