[ 
http://issues.apache.org/jira/browse/NUTCH-289?page=comments#action_12413940 ] 

Stefan Groschupf commented on NUTCH-289:
----------------------------------------

+1
Andrzej, I agree that looking up the IP in ParseOutputFormat, as Doug 
suggested, would be best.
The biggest problem Nutch has at the moment is spam. The most common spam 
method is to set up a DNS server that returns the same IP for all subdomains 
and then deliver dynamically generated content. 
The spammers then just randomly generate subdomains within the content. It 
also often happens that they have many URLs, all pointing to the same 
server, i.e. the same IP. 
Buying more IP addresses is possible, but at the moment it is more expensive 
than buying more domains. 

Limiting URLs by IP is a great approach to keep the crawler from getting 
stuck in honeypots with tens of thousands of URLs pointing to the same IP. 
However, to do so we need to have the IP already at generation time, rather 
than looking it up while fetching. 
We would then be able to reuse the IP in the fetcher; we can also wrap those 
parts of the fetcher in a try/catch and, in case the stored IP is no longer 
valid, re-resolve it. 
I don't think round-robin DNS is a huge problem, since only large sites use 
it, and in such cases each IP is able to handle the requests.
In any case, storing the IP in CrawlDatum and using it to limit URLs per IP 
would be a big step forward in the fight against web spam.
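To make the idea concrete, here is a minimal sketch (hypothetical code, not 
actual Nutch classes) of capping the number of URLs kept per resolved IP 
during fetch-list generation. The IP lookup is assumed to have happened 
earlier (e.g. stored in CrawlDatum at parse or update time) and is passed in 
as a plain map, so real DNS via java.net.InetAddress could be swapped in:

```java
import java.util.*;

// Hypothetical sketch: enforce a per-IP URL budget at generation time.
// Assumes URLs were resolved earlier (e.g. the IP stored in CrawlDatum),
// so no DNS lookup happens here.
public class IpLimiter {
    private final int maxPerIp;

    public IpLimiter(int maxPerIp) {
        this.maxPerIp = maxPerIp;
    }

    // Returns the subset of urls that stays within the per-IP budget.
    // ipOf maps each URL to its previously resolved IP address.
    public List<String> limit(List<String> urls, Map<String, String> ipOf) {
        Map<String, Integer> countPerIp = new HashMap<>();
        List<String> kept = new ArrayList<>();
        for (String url : urls) {
            String ip = ipOf.get(url);
            if (ip == null) {
                continue; // unresolved: skip here, re-lookup at fetch time
            }
            int n = countPerIp.getOrDefault(ip, 0);
            if (n < maxPerIp) {
                countPerIp.put(ip, n + 1);
                kept.add(url);
            }
        }
        return kept;
    }
}
```

With this, a spammer generating thousands of random subdomains that all 
resolve to one server contributes at most maxPerIp entries to the fetch 
list, while ordinary sites on distinct IPs are unaffected.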

> CrawlDatum should store IP address
> ----------------------------------
>
>          Key: NUTCH-289
>          URL: http://issues.apache.org/jira/browse/NUTCH-289
>      Project: Nutch
>         Type: Bug

>   Components: fetcher
>     Versions: 0.8-dev
>     Reporter: Doug Cutting

>
> If the CrawlDatum stored the IP address of the host of its URL, then one 
> could:
> - partition fetch lists on the basis of IP address, for better politeness;
> - truncate pages to fetch per IP address, rather than just hostname.  This 
> would be a good way to limit the impact of domain spammers.
> The IP addresses could be resolved when a CrawlDatum is first created for a 
> new outlink, or perhaps during CrawlDB update.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
