[jira] Commented: (NUTCH-289) CrawlDatum should store IP address

Stefan Groschupf (JIRA) Thu, 01 Jun 2006 11:42:24 -0700

    [ 
http://issues.apache.org/jira/browse/NUTCH-289?page=comments#action_12414273 ]


Stefan Groschupf commented on NUTCH-289:
----------------------------------------

Andrzej, I'm afraid I did not communicate my ideas clearly, and we may be 
misunderstanding each other. 
Resolving the IP in ParseOutputFormat would only be necessary for the new 
links discovered in the content. 
Since by default we parse during fetching, we would have the chance to use the 
JVM DNS cache, since I guess many new URLs point to the same host that we 
fetched the page from. This means that if we do not parse separately, we get 
the best use of the JVM cache. 
We do not look up the IPs of the URLs we fetch at this time, since these URLs 
already have an IP that was resolved when they were first discovered in a 
parse process. 
The only problem we need to handle is what happens when the IP of a host 
changes. We can simply look up the IP of a URL that throws a protocol error 
and compare the cached IP with the freshly resolved one.
An alternative approach would be to look up IPs during crawldb update, just 
for the new URLs.
Sorry, I hope that describes my ideas more clearly. 
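
To make the protocol-error check concrete, here is a minimal sketch in plain 
Java. It is not Nutch code: the class IpChangeCheck and its methods are 
hypothetical names, and it assumes the stored IP is kept as the raw address 
bytes. It only relies on java.net.InetAddress, whose successful lookups are 
cached by the JVM, so resolving many outlinks on the same host is typically 
served from that cache.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Arrays;

// Hypothetical helper, not part of Nutch.
class IpChangeCheck {

    /**
     * Resolves a host name to its raw address bytes.
     * Returns null if the host cannot be resolved.
     */
    static byte[] resolve(String host) {
        try {
            return InetAddress.getByName(host).getAddress();
        } catch (UnknownHostException e) {
            return null;
        }
    }

    /**
     * Intended to be called after a protocol error: reports whether the
     * freshly resolved address differs from the one stored earlier.
     * A failed lookup (null) is treated as "no known change".
     */
    static boolean ipChanged(byte[] storedIp, byte[] freshIp) {
        return freshIp != null && !Arrays.equals(storedIp, freshIp);
    }
}
```

A caller would pass the stored address and resolve(host) to ipChanged and, if 
it returns true, replace the stored address before retrying the fetch.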

My personal view is that the IP should be stored in the CrawlDatum itself, 
not in the metadata.






> CrawlDatum should store IP address
> ----------------------------------
>
>          Key: NUTCH-289
>          URL: http://issues.apache.org/jira/browse/NUTCH-289
>      Project: Nutch
>         Type: Bug

>   Components: fetcher
>     Versions: 0.8-dev
>     Reporter: Doug Cutting

>
> If the CrawlDatum stored the IP address of the host of its URL, then one 
> could:
> - partition fetch lists on the basis of IP address, for better politeness;
> - truncate pages to fetch per IP address, rather than just hostname.  This 
> would be a good way to limit the impact of domain spammers.
> The IP addresses could be resolved when a CrawlDatum is first created for a 
> new outlink, or perhaps during CrawlDB update.
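
The two uses listed in the issue description can be sketched as follows. 
This is an illustration only, not Nutch's generator: the Entry class stands 
in for a CrawlDatum that carries an IP string, and partitionFor and capPerIp 
are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not part of Nutch.
class IpFetchList {

    /** Stand-in for a CrawlDatum entry: a URL plus its stored IP. */
    static final class Entry {
        final String url;
        final String ip;

        Entry(String url, String ip) {
            this.url = url;
            this.ip = ip;
        }
    }

    /** All URLs on the same IP land in the same fetch-list partition. */
    static int partitionFor(String ip, int numPartitions) {
        return Math.floorMod(ip.hashCode(), numPartitions);
    }

    /** Keeps at most maxPerIp entries per IP, preserving input order. */
    static List<Entry> capPerIp(List<Entry> entries, int maxPerIp) {
        Map<String, Integer> counts = new HashMap<>();
        List<Entry> kept = new ArrayList<>();
        for (Entry e : entries) {
            int seen = counts.merge(e.ip, 1, Integer::sum);
            if (seen <= maxPerIp) {
                kept.add(e);
            }
        }
        return kept;
    }
}
```

Because the cap is counted per IP rather than per hostname, many hostnames 
that share one address are limited together.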

