Hi

Looks like yet another bug with Nutch 2.x. Could you open a JIRA and tag
the issue for 2.3? In the meantime I'd advise you to use Nutch 1.x, which is
more reliable, has more features, and is also an awful lot faster.
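In case it helps while the JIRA is open, here is a minimal sketch of merging the
newly parsed outlinks into the stored ones instead of clearing them. For
illustration it uses a plain String map rather than the Avro-backed map the real
Nutch 2.x WebPage holds, so treat it as the idea only, not a drop-in patch:

```java
import java.util.HashMap;
import java.util.Map;

public class OutlinkMerge {

    // Hypothetical fix sketch: rather than calling clear() on the stored
    // outlinks, copy the existing entries and overlay the newly parsed ones,
    // so URLs absent from the latest fetch are preserved.
    static Map<String, String> mergeOutlinks(Map<String, String> existing,
                                             Map<String, String> parsed) {
        Map<String, String> merged = new HashMap<>(existing);
        merged.putAll(parsed); // new anchors win for URLs seen in both cycles
        return merged;
    }

    public static void main(String[] args) {
        // Crawl cycle 1 outlinks (URL -> anchor text)
        Map<String, String> cycle1 = new HashMap<>();
        cycle1.put("http://abc.com", "abc");
        cycle1.put("http://pqr.com", "pqr");

        // Crawl cycle 2 outlinks: abc.com gone, xyz.com new
        Map<String, String> cycle2 = new HashMap<>();
        cycle2.put("http://pqr.com", "pqr");
        cycle2.put("http://xyz.com", "xyz");

        // After merging, all three URLs survive
        System.out.println(OutlinkMerge.mergeOutlinks(cycle1, cycle2).keySet());
    }
}
```

Whether keeping stale outlinks forever is actually what you want is a separate
question, of course; a real patch would probably need an expiry policy too.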

Julien


On 11 July 2014 10:01, mesenthil1 <
senthilkumar.arumu...@viacomcontractor.com> wrote:

> Hi,
>
> When a page is re-crawled on Nutch and new outlink URLs are identified
> along with the existing URLs, the old outlinks are removed and only the
> new URLs are updated to HBase. For example:
>
> Crawl cycle 1 for www.123.com, identified outlinks are:
>   abc.com
>   pqr.com
>
> Crawl cycle 2 of the same www.123.com, the outlinks are (note that
> abc.com is removed and xyz.com is added):
>   pqr.com
>   xyz.com
>
> At the end of crawl cycle 2, HBase has only xyz.com (expected: have
> pqr.com and xyz.com). As per the code in ParseUtil.java, it seems to be
> removing the old links and inserting only the new links:
>
>   if (page.getOutlinks() != null) {
>     page.getOutlinks().clear();
>   }
>
> Has anyone faced this issue, and is there any fix for it?
>
> Details of our cluster:
>   10 node EC2 instances on hadoop-0.20.205
>   Nutch 2.1
>   HBase 0.90.6
>
> Thanks,
> Senthil
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Nutch-New-outlinks-removes-old-valid-outlinks-tp4146676.html
> Sent from the Nutch - User mailing list archive at Nabble.com.




-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
