[jira] Commented: (NUTCH-289) CrawlDatum should store IP address
[ https://issues.apache.org/jira/browse/NUTCH-289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508445 ]

Doğacan Güney commented on NUTCH-289:
-------------------------------------

It seems this issue has died down, but it would be a great feature to have. Here is how I think we can do it (my proposal is _heavily_ based on Stefan Groschupf's work):

* Add ip as a field to CrawlDatum.
* The Fetcher always resolves the IP and stores it in crawl_fetch (even if the CrawlDatum already has an IP).
* An IpAddressResolver tool that reads crawl_fetch and crawl_parse (and probably crawldb) and optionally runs before updatedb:
  - map: <url, CrawlDatum> -> <host of url, (url, CrawlDatum)>. A field is added to the CrawlDatum's metadata to indicate where it comes from (crawldb, crawl_fetch, or crawl_parse); this field is removed again in reduce. No lookup is performed in map().
  - reduce: <host, list of (url, CrawlDatum)> -> <url, CrawlDatum>. If any CrawlDatum already contains an IP address (IP addresses in crawl_fetch taking precedence over those in crawldb), output all crawl_parse datums with that IP address. Otherwise, perform a lookup. This way we will not have to resolve the IP for most URLs (in a way, we still get the benefit of the JVM's DNS cache :). A downside of this approach is that we either have to read crawldb twice or perform IP lookups for hosts that are in crawldb but not in crawl_fetch.
* Use the cached IP during generation, if it exists.

CrawlDatum should store IP address
----------------------------------

                Key: NUTCH-289
                URL: https://issues.apache.org/jira/browse/NUTCH-289
            Project: Nutch
         Issue Type: Bug
         Components: fetcher
   Affects Versions: 0.8
           Reporter: Doug Cutting
        Attachments: ipInCrawlDatumDraftV1.patch, ipInCrawlDatumDraftV4.patch, ipInCrawlDatumDraftV5.1.patch, ipInCrawlDatumDraftV5.patch

If the CrawlDatum stored the IP address of the host of its URL, then one could:
- partition fetch lists on the basis of IP address, for better politeness;
- truncate pages to fetch per IP address, rather than just per hostname.

This would be a good way to limit the impact of domain spammers. The IP addresses could be resolved when a CrawlDatum is first created for a new outlink, or perhaps during CrawlDB update.

--
This message is automatically generated by JIRA. You can reply to this email to add a comment to the issue online.
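The reduce step of the proposed IpAddressResolver might be sketched as follows. This is a simplified illustration, not Nutch code: there is no Hadoop API here, the Source enum stands in for the proposed metadata field, and cachedIp() implements only the precedence rule described in the comment (crawl_fetch over crawldb, with null meaning a real DNS lookup is still needed):

```java
import java.util.List;

public class IpResolveReduceSketch {
    // Stand-in for the proposed metadata field marking where a datum came from.
    enum Source { CRAWLDB, CRAWL_FETCH, CRAWL_PARSE }

    // Simplified stand-in for (url, CrawlDatum) with an optional cached IP.
    record Datum(Source source, String url, String ip) {}

    // For one host's group of datums: reuse a cached IP if any datum has one
    // (crawl_fetch takes precedence over crawldb); return null if the host
    // still needs an actual DNS lookup.
    static String cachedIp(List<Datum> group) {
        String fromCrawlDb = null;
        for (Datum d : group) {
            if (d.ip() == null) continue;
            if (d.source() == Source.CRAWL_FETCH) return d.ip();
            if (d.source() == Source.CRAWLDB) fromCrawlDb = d.ip();
        }
        return fromCrawlDb;
    }
}
```

With this shape, the reducer would call cachedIp() once per host group and fall back to a DNS lookup only when it returns null, which is how most URLs avoid being re-resolved.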
[jira] Commented: (NUTCH-289) CrawlDatum should store IP address
[ http://issues.apache.org/jira/browse/NUTCH-289?page=comments#action_12450315 ]

Uros Gruber commented on NUTCH-289:
-----------------------------------

One question: why does the IP need to be a field in CrawlDatum rather than in its metadata?

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Commented: (NUTCH-289) CrawlDatum should store IP address
[ http://issues.apache.org/jira/browse/NUTCH-289?page=comments#action_12413996 ]

Andrzej Bialecki commented on NUTCH-289:
----------------------------------------

Re: lookup in ParseOutputFormat: I respectfully disagree. Consider the scenario where you run the Fetcher in non-parsing mode. This means you have to make two DNS lookups - once when fetching, and a second time when parsing. These lookups are executed from different processes, so there is no benefit from caching inside the Java resolver, i.e. the process will have to call the DNS server twice. The solution I proposed (record IPs in the Fetcher, but somewhere other than ParseOutputFormat, e.g. the crawl_fetch CrawlDatum) avoids this problem.

Another issue is virtual hosting, i.e. many sites resolving to a single IP (web hotels). It's true that in many cases these are spam sites, but as often as not they are real, legitimate sites. If we generate/fetch by IP address we run the risk of dropping legitimate sites.

Regarding the timing: it's true that during the first run we won't have IPs during generate (and subsequently for any newly injected URLs). In fact, since a significant part of the crawlDB is usually unfetched, we won't have this information for many URLs - unless we run this step in the Generator to resolve ALL hosts, and then run an equivalent of updatedb to actually record them in the crawldb.

And the last issue that needs to be discussed: should we use metadata, or add a dedicated field in CrawlDatum? If the core should rely on IP addresses, we should add a dedicated field. If it will be purely optional (e.g. for use by optional plugins), then metadata seems a better place.
[jira] Commented: (NUTCH-289) CrawlDatum should store IP address
[ http://issues.apache.org/jira/browse/NUTCH-289?page=comments#action_12414114 ]

Doug Cutting commented on NUTCH-289:
------------------------------------

It should be possible to partition by IP and to limit fetchlists by IP. Resolving only in the fetcher is too late to implement these features.

Ideally we should arrange things for good DNS cache utilization, so that URLs with the same host are resolved in a single map or reduce task. Currently this is the case during fetchlist generation, where lists are partitioned by host. Might that be a good place to insert DNS resolution? The fetchlists would need to be processed one more time, to re-partition and re-limit by IP, but fetchlists are relatively small, so this might not slow things down too much. The map task itself could directly cache IP addresses, and perhaps even avoid many DNS lookups by reusing the IP from another CrawlDatum with the same host. A multi-threaded mapper might also be used to hide network latencies.

This should, at least initially, be an optional feature, and thus the IP should probably initially be stored in the metadata. I think it could be added as a re-generate step without changing any other code.
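The per-host IP caching Doug suggests for the map task could look roughly like this. It is a sketch under assumptions: HostIpCache is a hypothetical class, and the resolver is injected so the DNS call can be swapped out in tests; since fetchlist generation already partitions by host, all URLs of one host would hit the same cache entry within a task:

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

public class HostIpCache {
    private final Map<String, String> cache = new HashMap<>();
    private final Function<String, String> resolver;

    HostIpCache(Function<String, String> resolver) {
        this.resolver = resolver;
    }

    // Plain JVM lookup; returns null on failure so the caller can retry later.
    static String dnsLookup(String host) {
        try {
            return InetAddress.getByName(host).getHostAddress();
        } catch (UnknownHostException e) {
            return null;
        }
    }

    // One lookup per distinct host per task. computeIfAbsent does not store
    // a null result, so failed lookups are retried on the next call.
    String resolve(String host) {
        return cache.computeIfAbsent(host, resolver);
    }
}
```

A multi-threaded mapper, as suggested above, would additionally need a concurrent map (e.g. ConcurrentHashMap) so threads can share the cache safely.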
[jira] Commented: (NUTCH-289) CrawlDatum should store IP address
[ http://issues.apache.org/jira/browse/NUTCH-289?page=comments#action_12413939 ]

Matt Kangas commented on NUTCH-289:
-----------------------------------

+1 to saving the IP address in CrawlDatum, wherever the value comes from (Fetcher or otherwise).
[jira] Commented: (NUTCH-289) CrawlDatum should store IP address
[ http://issues.apache.org/jira/browse/NUTCH-289?page=comments#action_12413940 ]

Stefan Groschupf commented on NUTCH-289:
----------------------------------------

+1. Andrzej, I agree with Doug's suggestion that looking up the IP in ParseOutputFormat would be best. The biggest problem Nutch has at the moment is spam. The most common spam method is to set up a DNS server that returns the same IP for all subdomains and then deliver dynamically generated content; spammers then just randomly generate subdomains within that content. It also often happens that they have many URLs, all of them pointing to the same server, i.e. the same IP. Buying more IP addresses is possible, but currently more expensive than buying more domains.

Limiting URLs by IP is a great approach to keep the crawler from getting stuck in honeypots with tens of thousands of URLs pointing to the same IP. However, to do so we need to have the IP already at generation time, not look it up when fetching. We would be able to reuse the IP in the fetcher, and we can try/catch those parts in the fetcher so that if the IP is not available we can re-resolve it. I don't think round-robin DNS is a huge problem, since only large sites use it, and in such cases each IP is able to handle the requests. In any case, storing the IP in the CrawlDatum and using it to limit URLs per IP would be a big step forward in the fight against web spam.
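The per-IP cap argued for above could be sketched as below. The class name and cap are hypothetical (Nutch's Generator would apply something like this per segment, analogous to its existing per-host limit), but it shows the core mechanism: count URLs as they are admitted to the fetchlist and drop everything beyond the cap for a given IP:

```java
import java.util.HashMap;
import java.util.Map;

public class PerIpLimiter {
    private final int maxPerIp;                      // hypothetical cap, e.g. a generate.max.per.ip setting
    private final Map<String, Integer> counts = new HashMap<>();

    PerIpLimiter(int maxPerIp) {
        this.maxPerIp = maxPerIp;
    }

    // Returns true if a URL resolving to this IP may still be emitted into
    // the fetchlist; beyond the cap, a honeypot's extra URLs are simply
    // dropped for this segment (they remain in the crawldb for later cycles).
    boolean admit(String ip) {
        int seen = counts.merge(ip, 1, Integer::sum);
        return seen <= maxPerIp;
    }
}
```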
[jira] Commented: (NUTCH-289) CrawlDatum should store IP address
[ http://issues.apache.org/jira/browse/NUTCH-289?page=comments#action_12413604 ]

Andrzej Bialecki commented on NUTCH-289:
----------------------------------------

I'm not sure how to address round-robin DNS with your approach... Also, I think the best place to resolve and record the IPs is in the fetcher, because it has to do the lookup anyway. When generating we won't know the IPs until the next cycle, but the load on the DNS will be much lower and more evenly distributed.