[ https://issues.apache.org/jira/browse/NUTCH-1822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sebastian Nagel resolved NUTCH-1822.
------------------------------------
    Resolution: Won't Fix

(closing 2.x issue as this version isn't maintained anymore)

> Page outlinks clearance is not appropriate
> -------------------------------------------
>
> Key: NUTCH-1822
> URL: https://issues.apache.org/jira/browse/NUTCH-1822
> Project: Nutch
> Issue Type: Bug
> Components: parser
> Affects Versions: 2.1
> Environment: Nutch-2.1
> Hadoop-0.20.205
> HBase-0.90.6
> hbase-gora-0.2.1
> Reporter: Riyaz Shaik
> Priority: Major
>
> 1. When a page is re-crawled and new outlink URLs are found along with
> the existing URLs, the old outlinks are removed and only the new URLs are
> written to HBase.
> Ex:
> Crawl cycle 1 for www.123.com, identified outlinks are
> ol --> abc.com
> ol --> pqr.com
> Crawl cycle 2 of the same www.123.com, the outlinks are
> (note that abc.com is removed and xyz.com is added)
> ol --> pqr.com
> ol --> xyz.com
> At the end of crawl cycle 2, HBase has only xyz.com as an outlink
> ol --> xyz.com
> Expected:
> ol --> pqr.com
> ol --> xyz.com
>
> 2. If some of the outlinks of a page are removed and no new outlinks are
> added to the page, then re-crawling the page does not clear the
> obsolete/removed outlinks from HBase.
> Ex: Cycle 1 crawled page: www.test.com, identified outlinks are
> ol --> link1
> ol --> link2
> ol --> link3
> Cycle 2, same page (www.test.com) re-crawled, identified outlinks are
> (Note: only link2 was removed, no new links were added)
> ol --> link1
> ol --> link3
> But at the end of cycle 2, all 3 outlinks are still in HBase
> In HBase:
> ol --> link1
> ol --> link2
> ol --> link3
> Expected:
> ol --> link1
> ol --> link3
>
> As per the code in ParseUtil.java, it seems to remove the old links and
> insert only the new links.
> if (page.getOutlinks() != null) { page.getOutlinks().clear(); }
> http://lucene.472066.n3.nabble.com/Nutch-New-outlinks-removes-old-valid-outlinks-td4146676.html
>
> Thanks
> Riyaz

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
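
For reference, the behaviour the reporter expects in both cases is that the stored outlinks end up equal to the outlinks found in the latest parse. A minimal sketch of that rule follows, using plain java.util maps only. The class OutlinkSync and the method sync are hypothetical names, not part of Nutch, and the sketch does not model how changes are persisted to HBase.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Nutch: models outlinks as URL -> anchor text.
final class OutlinkSync {

    // Make `stored` contain exactly the outlinks found in the latest parse.
    static void sync(Map<String, String> stored, Map<String, String> parsed) {
        // Drop outlinks that no longer appear on the page (case 2 in the report).
        stored.keySet().retainAll(parsed.keySet());
        // Add new outlinks and refresh anchors of the surviving ones (case 1).
        stored.putAll(parsed);
    }

    public static void main(String[] args) {
        // Crawl cycle 1 result for www.123.com, as in the report.
        Map<String, String> stored = new LinkedHashMap<>();
        stored.put("abc.com", "");
        stored.put("pqr.com", "");

        // Crawl cycle 2: abc.com is gone, xyz.com is new.
        Map<String, String> parsed = new LinkedHashMap<>();
        parsed.put("pqr.com", "");
        parsed.put("xyz.com", "");

        sync(stored, parsed);
        System.out.println(stored.keySet()); // prints [pqr.com, xyz.com]
    }
}
```

The same call covers the second example: with link1, link2 and link3 stored and only link1 and link3 parsed, retainAll drops link2 and nothing new is added.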