[ 
http://issues.apache.org/jira/browse/NUTCH-289?page=comments#action_12414114 ] 

Doug Cutting commented on NUTCH-289:
------------------------------------

It should be possible to partition by IP and limit fetchlists by IP.  Resolving 
only in the fetcher is too late to implement these features.   Ideally we 
should arrange things for good DNS cache utilization, so that URLs with the 
same host are resolved in a single map or reduce task.  Currently this is the 
case during fetchlist generation, where lists are partitioned by host.  Might 
that be a good place to insert DNS resolution?  The fetchlists would need to be 
processed one more time, to re-partition and re-limit by IP, but fetchlists are 
relatively small, so this might not slow things too much.  The map task itself 
could directly cache IP addresses, and perhaps even avoid many DNS lookups by 
using the IP from another CrawlDatum from the same host.  A multi-threaded 
mapper might also be used to allow for network latencies.
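
As a rough illustration of the per-task caching idea above, here is a minimal 
sketch in Java. It is not Nutch code: the class HostIpCache, its methods, and 
the injectable resolver are hypothetical names chosen for this example. The 
only real API used for resolution is java.net.InetAddress. The cache assumes 
that every URL from the same host can share one lookup, which is the 
assumption the comment makes about host-partitioned fetchlists.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical helper (not part of Nutch): one cached IP per host within a task. */
class HostIpCache {
    private final Map<String, String> cache = new HashMap<>();
    private final Function<String, String> resolver;
    private int lookups = 0;

    /** The resolver is injectable so the cache can be exercised without network access. */
    HostIpCache(Function<String, String> resolver) {
        this.resolver = resolver;
    }

    /** Resolver backed by the JVM's DNS lookup; returns null for unknown hosts. */
    static String dnsLookup(String host) {
        try {
            return InetAddress.getByName(host).getHostAddress();
        } catch (UnknownHostException e) {
            return null;
        }
    }

    /** Returns the IP for a host, calling the resolver only on the first request. */
    String ipFor(String host) {
        String key = host.toLowerCase(Locale.ROOT);
        if (cache.containsKey(key)) {
            return cache.get(key); // may be null: failed lookups are cached too
        }
        lookups++;
        String ip = resolver.apply(key);
        cache.put(key, ip);
        return ip;
    }

    /** Number of times the resolver was actually invoked. */
    int lookups() {
        return lookups;
    }
}
```

A task would create one cache with `new HostIpCache(HostIpCache::dnsLookup)` 
and call `ipFor(host)` for each entry it processes. Because the input is 
partitioned by host, repeated hosts hit the cache rather than the resolver. 
This sketch is single-threaded; a multi-threaded mapper would need a 
concurrent map instead.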

This should, at least initially, be an optional feature, and thus the IP should 
probably initially be stored in the metadata.  I think it might be added as a 
re-generate step without changing any other code.
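
The re-generate step could then re-partition and re-limit entries by the 
resolved IP. The following is only a sketch under stated assumptions: 
IpLimiter, partition, and limitByIp are hypothetical names, entries are 
modeled as plain {url, ip} string pairs rather than CrawlDatum objects, and 
hash partitioning is one possible scheme, not necessarily what Nutch would 
use.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch (not Nutch code) of partitioning and limiting by IP. */
class IpLimiter {
    /** Maps an IP to a partition so all URLs on that IP land in the same fetchlist. */
    static int partition(String ip, int numPartitions) {
        // Mask the sign bit so the index is never negative.
        return (ip.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    /**
     * Keeps at most maxPerIp URLs for each IP, preserving input order.
     * Each entry is a two-element array: {url, ip}.
     */
    static List<String> limitByIp(List<String[]> entries, int maxPerIp) {
        Map<String, Integer> counts = new HashMap<>();
        List<String> kept = new ArrayList<>();
        for (String[] entry : entries) {
            int seen = counts.merge(entry[1], 1, Integer::sum);
            if (seen <= maxPerIp) {
                kept.add(entry[0]);
            }
        }
        return kept;
    }
}
```

With a limit applied per IP rather than per hostname, many hostnames that 
resolve to a single address share one quota, which is the effect the issue 
description asks for against domain spammers.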


> CrawlDatum should store IP address
> ----------------------------------
>
>          Key: NUTCH-289
>          URL: http://issues.apache.org/jira/browse/NUTCH-289
>      Project: Nutch
>         Type: Bug

>   Components: fetcher
>     Versions: 0.8-dev
>     Reporter: Doug Cutting

>
> If the CrawlDatum stored the IP address of the host of its URL, then one 
> could:
> - partition fetch lists on the basis of IP address, for better politeness;
> - truncate pages to fetch per IP address, rather than just hostname.  This 
> would be a good way to limit the impact of domain spammers.
> The IP addresses could be resolved when a CrawlDatum is first created for a 
> new outlink, or perhaps during CrawlDB update.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
