[ https://issues.apache.org/jira/browse/NUTCH-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tejas Patil closed NUTCH-1329.
------------------------------
    Resolution: Cannot Reproduce

Closing for now by marking it "cannot reproduce".

> parser not extract outlinks to external web sites
> -------------------------------------------------
>
>                 Key: NUTCH-1329
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1329
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.4
>            Reporter: behnam nikbakht
>              Labels: parse
>             Fix For: 2.3, 1.8
>
>
> I found a bug in
> /src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java:
> an outlink such as www.example2.com found on www.example1.com is inserted as
> www.example1.com/www.example2.com.
> I corrected this by testing whether the outlink (www.example2.com) is a valid
> URL on its own; if it is not, it is resolved against its base URL.
> So I replaced these lines:
>     URL url = URLUtil.resolveURL(base, target);
>     outlinks.add(new Outlink(url.toString(),
>                              linkText.toString().trim()));
> with:
>     String host_temp = null;
>     try {
>         host_temp = URLUtil.getDomainName(new URL(target));
>     }
>     catch (Exception eiuy) {
>         host_temp = null;
>     }
>     URL url = null;
>     if (host_temp == null) // it is an internal outlink
>         url = URLUtil.resolveURL(base, target);
>     else // it is an external link
>         url = new URL(target);
>     outlinks.add(new Outlink(url.toString(),
>                              linkText.toString().trim()));

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
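
For readers who want to check the behaviour described in the report, here is a minimal sketch that uses only java.net.URL. It does not call Nutch's URLUtil or DOMContentUtils, so it only approximates what the parser does. The class name OutlinkResolveDemo and its two helper methods are hypothetical and exist purely for illustration. It shows how a target with and without a scheme resolves against a base URL, and whether the target parses as an absolute URL on its own (the test that the proposed patch relies on).

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical demo class; not part of Nutch.
public class OutlinkResolveDemo {

    // Resolve a link target against the page's base URL with plain java.net.URL.
    static String resolve(String base, String target) {
        try {
            return new URL(new URL(base), target).toString();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Mirrors the check in the proposed patch: does the target parse
    // as an absolute URL without any base?
    static boolean parsesAsAbsolute(String target) {
        try {
            new URL(target);
            return true;
        } catch (MalformedURLException e) {
            // A target without a scheme, e.g. "www.example2.com", ends up here.
            return false;
        }
    }

    public static void main(String[] args) {
        String base = "http://www.example1.com/";
        String[] targets = {"www.example2.com", "http://www.example2.com/"};
        for (String target : targets) {
            System.out.println(target
                    + " | parses as absolute: " + parsesAsAbsolute(target)
                    + " | resolved: " + resolve(base, target));
        }
    }
}
```

With java.net.URL, the scheme-less target www.example2.com is treated as a relative path and resolves to http://www.example1.com/www.example2.com, while http://www.example2.com/ resolves to itself. Because new URL("www.example2.com") throws MalformedURLException, a check of this kind still sends a scheme-less target down the relative branch; only targets that already carry a scheme are kept as external links.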