[ https://issues.apache.org/jira/browse/NUTCH-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13558228#comment-13558228 ]
Tejas Patil commented on NUTCH-1329:
------------------------------------

I am not able to reproduce this bug with the default config. Were you using any specific configuration settings?

> parser not extract outlinks to external web sites
> -------------------------------------------------
>
>                 Key: NUTCH-1329
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1329
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.4
>            Reporter: behnam nikbakht
>              Labels: parse
>             Fix For: 1.7, 2.2
>
>
> I found a bug in
> /src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java:
> outlinks like www.example2.com found on www.example1.com are inserted as
> www.example1.com/www.example2.com.
> I corrected this by testing whether the outlink (www.example2.com) is a valid
> URL on its own; if it is not, it is resolved against its base URL.
> So I replaced these lines:
>     URL url = URLUtil.resolveURL(base, target);
>     outlinks.add(new Outlink(url.toString(),
>         linkText.toString().trim()));
> with:
>     String host_temp = null;
>     try {
>         host_temp = URLUtil.getDomainName(new URL(target));
>     }
>     catch (Exception eiuy) {
>         host_temp = null;
>     }
>     URL url = null;
>     if (host_temp == null) // it is an internal outlink
>         url = URLUtil.resolveURL(base, target);
>     else // it is an external link
>         url = new URL(target);
>     outlinks.add(new Outlink(url.toString(),
>         linkText.toString().trim()));

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
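
For anyone trying to reproduce the report, the resolution behaviour it describes can be sketched with plain JDK classes, without a Nutch checkout. This is only an approximation under stated assumptions: `resolveAgainstBase` is a hypothetical stand-in for `URLUtil.resolveURL(base, target)` that uses `java.net.URI` resolution, and `resolveLikeProposedPatch` mirrors only the `new URL(target)` check from the quoted patch, not the `URLUtil.getDomainName` call. The class name, method names, and sample URLs are made up for illustration.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

public class OutlinkResolutionSketch {

    // Hypothetical stand-in for URLUtil.resolveURL(base, target):
    // standard relative-reference resolution via java.net.URI.
    static String resolveAgainstBase(String base, String target) {
        return URI.create(base).resolve(target).toString();
    }

    // Approximates the quoted patch: use the target as-is when it parses
    // as an absolute URL, otherwise resolve it against the base.
    static String resolveLikeProposedPatch(String base, String target) {
        try {
            // Throws MalformedURLException when the target has no scheme.
            return new URL(target).toString();
        } catch (MalformedURLException e) {
            return resolveAgainstBase(base, target);
        }
    }

    public static void main(String[] args) {
        String base = "http://www.example1.com/index.html";
        String[] targets = {
            "www.example2.com",         // scheme-less, as in the report
            "http://www.example2.com/", // absolute external link
            "/about.html"               // internal link
        };
        for (String target : targets) {
            System.out.println(target
                + " | base resolution: " + resolveAgainstBase(base, target)
                + " | patch-like: " + resolveLikeProposedPatch(base, target));
        }
    }
}
```

In this sketch a scheme-less `href` such as `www.example2.com` is a relative reference, so both methods resolve it to `http://www.example1.com/www.example2.com`, while an absolute link such as `http://www.example2.com/` comes out unchanged either way. If Nutch's `URLUtil.resolveURL` behaves like the stand-in, that would be consistent with the bug not reproducing for well-formed absolute links under the default config.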