[ https://issues.apache.org/jira/browse/NUTCH-926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12925543#action_12925543 ]
Andrzej Bialecki commented on NUTCH-926:
-----------------------------------------

bq. Nutch continues to crawl the WRONG subdomains! But it should not do this!!

No need to shout, we hear you :) Indeed, Nutch's behavior when following redirects doesn't play well with the rule of ignoring external outlinks. Strictly speaking, redirects are not outlinks, but the silent assumption behind ignoreExternalOutlinks is that we crawl content only from that hostname. And your patch would solve this particular issue.

However, this is not as simple as it seems... My favorite example is www.ibm.com -> www8.ibm.com/index.html . If we apply your fix, you won't be able to crawl www.ibm.com unless you inject all wwwNNN load-balanced hosts... so a simple equality of hostnames may not be sufficient. We have utilities to extract domain names, so we could compare domains, but then we may mistreat money.cnn.com vs. weather.cnn.com ...

> Nutch follows wrong url in <META http-equiv="refresh" tag
> ---------------------------------------------------------
>
>                 Key: NUTCH-926
>                 URL: https://issues.apache.org/jira/browse/NUTCH-926
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.2
>         Environment: gnu/linux centOs
>            Reporter: Marco Novo
>            Priority: Critical
>             Fix For: 1.3
>
>         Attachments: ParseOutputFormat.java.patch
>
>
> We have nutch set to crawl a domain urllist and we want to fetch only passed
> domains (hosts) not subdomains.
> So
> WWW.DOMAIN1.COM
> ..
> ..
> ..
> WWW.RIGHTDOMAIN.COM
> ..
> ..
> ..
> ..
> WWW.DOMAIN.COM
> We set nutch to:
> NOT FOLLOW EXTERNAL LINKS
> During crawling of WWW.RIGHTDOMAIN.COM
> if a page contains
> <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
> <html>
> <head>
> <title></title>
> <META http-equiv="refresh" content="0;
> url=http://WRONG.RIGHTDOMAIN.COM">
> </head>
> <body>
> </body>
> </html>
> Nutch continues to crawl the WRONG subdomains! But it should not do this!!
> During crawling of WWW.RIGHTDOMAIN.COM
> if a page contains
> <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
> <html>
> <head>
> <title></title>
> <META http-equiv="refresh" content="0;
> url=http://WWW.WRONGDOMAIN.COM">
> </head>
> <body>
> </body>
> </html>
> Nutch continues to crawl the WRONG domain! But it should not do this! If that
> happens we will spider the whole web....
> We think the problem is in org.apache.nutch.parse.ParseOutputFormat. We have
> made a patch and will attach it.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
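
For illustration only, here is a rough sketch of the trade-off raised in the comment: exact hostname equality rejects www.ibm.com -> www8.ibm.com, while a domain comparison also lets money.cnn.com redirect to weather.cnn.com. The class RedirectScope and its methods are hypothetical names, not Nutch code, and naiveDomain() simply keeps the last two labels of the host, which is an assumption that breaks on suffixes such as .co.uk. Nutch's own domain utilities are not used here.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical helper, not part of Nutch: compares a redirect source and target.
public class RedirectScope {

    // Lower-cased host of a URL, or "" if the URL has no host part.
    static String host(String url) {
        String h = URI.create(url).getHost();
        return h == null ? "" : h.toLowerCase(Locale.ROOT);
    }

    // ASSUMPTION: the "domain" is just the last two labels of the host.
    // This is wrong for multi-label public suffixes (e.g. example.co.uk).
    static String naiveDomain(String host) {
        String[] labels = host.split("\\.");
        int n = labels.length;
        return n <= 2 ? host : labels[n - 2] + "." + labels[n - 1];
    }

    // Strict rule: the redirect target must be on exactly the same host.
    static boolean sameHost(String from, String to) {
        String a = host(from);
        return !a.isEmpty() && a.equals(host(to));
    }

    // Looser rule: the redirect target only has to share the (naive) domain.
    static boolean sameDomain(String from, String to) {
        String a = naiveDomain(host(from));
        return !a.isEmpty() && a.equals(naiveDomain(host(to)));
    }

    public static void main(String[] args) {
        String[][] redirects = {
            {"http://www.ibm.com/", "http://www8.ibm.com/index.html"},
            {"http://money.cnn.com/", "http://weather.cnn.com/"},
            {"http://WWW.RIGHTDOMAIN.COM/", "http://WRONG.RIGHTDOMAIN.COM"},
            {"http://WWW.RIGHTDOMAIN.COM/", "http://WWW.WRONGDOMAIN.COM"},
        };
        for (String[] r : redirects) {
            System.out.printf("%s -> %s  sameHost=%b sameDomain=%b%n",
                    r[0], r[1], sameHost(r[0], r[1]), sameDomain(r[0], r[1]));
        }
    }
}
```

Neither rule gives the desired answer for every case in the list, which is the point made in the comment: the reporter's WRONG.RIGHTDOMAIN.COM case needs the strict rule, the ibm.com case needs the loose one.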