[ http://issues.apache.org/jira/browse/NUTCH-272?page=comments#action_12412845 ]

Matt Kangas commented on NUTCH-272:
-----------------------------------

Scratch my last comment. :-) I assumed that URLFilters.filter() was applied
while traversing the segment, as it was in 0.7. Not true in 0.8... it's
applied during Generate. (Wow. This means the crawldb will accumulate lots
of junk URLs over time. Is this a feature or a bug?)

> Max. pages to crawl/fetch per site (emergency limit)
> ----------------------------------------------------
>
>          Key: NUTCH-272
>          URL: http://issues.apache.org/jira/browse/NUTCH-272
>      Project: Nutch
>         Type: Improvement
>     Reporter: Stefan Neufeind
>
> If I'm right, there is no way in place right now for setting an "emergency
> limit" to fetch a certain max. number of pages per site. Is there an "easy"
> way to implement such a limit, maybe as a plugin?

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
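The per-site cap the reporter asks about could, in principle, be expressed as a stateful URL filter that counts URLs per host and rejects the overflow. The sketch below is a hypothetical, standalone illustration of that logic only: the class name, the `filter` signature, and the cap value are assumptions for this example, not the actual Nutch URLFilter plugin API (which, as the comment notes, is applied during Generate in 0.8).

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of an "emergency limit" filter: admit at most
// maxPerHost URLs per host, drop the rest. Not the real Nutch API.
public class MaxPagesPerHostFilter {
    private final int maxPerHost;
    private final Map<String, Integer> seen = new HashMap<>();

    public MaxPagesPerHostFilter(int maxPerHost) {
        this.maxPerHost = maxPerHost;
    }

    /** Returns the URL if its host is still under the cap, or null to drop it. */
    public String filter(String url) {
        String host;
        try {
            host = URI.create(url).getHost();
        } catch (IllegalArgumentException e) {
            return null; // malformed URL: drop it
        }
        if (host == null) {
            return null; // no host component: drop it
        }
        int count = seen.merge(host, 1, Integer::sum); // increment per-host count
        return count <= maxPerHost ? url : null;
    }
}
```

Note that a real plugin would need the counts to survive across Generate runs (or be recomputed from the crawldb), since an in-memory map resets on every invocation.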
