[ https://issues.apache.org/jira/browse/NUTCH-2222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16565805#comment-16565805 ]

Hudson commented on NUTCH-2222:
-------------------------------

SUCCESS: Integrated in Jenkins build Nutch-nutchgora #1613 (See [https://builds.apache.org/job/Nutch-nutchgora/1613/])
NUTCH-2222 re-fetch deletes all metadata except _csh_ and _rs_ (lewis.mcgibbney: [https://github.com/apache/nutch/commit/c43c2c85874295ef94982694fc28c068d5447234])
* (edit) src/java/org/apache/nutch/fetcher/FetcherJob.java


> re-fetch deletes all metadata except _csh_ and _rs_
> ----------------------------------------------------
>
>                 Key: NUTCH-2222
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2222
>             Project: Nutch
>          Issue Type: Bug
>          Components: crawldb
>    Affects Versions: 2.3.1
>         Environment: CentOS 6, mongodb 2.6 and mongodb 3.0 and hbase-0.98.8-hadoop2
>            Reporter: Adnane B.
>            Assignee: Furkan KAMACI
>            Priority: Major
>             Fix For: 2.4
>
>         Attachments: NUTCH-2222.patch, TestReFetch.java, index.html
>
>
> This problem happens the second time I crawl a page.
> First time:
> {code}
> bin/nutch inject urls/
> bin/nutch generate -topN 1000
> bin/nutch fetch -all
> bin/nutch parse -force -all
> bin/nutch updatedb -all
> {code}
> Second time (re-fetch):
> {code}
> bin/nutch generate -topN 1000   # batchid changes for all existing pages
> bin/nutch fetch -all            # *** metadata is deleted for all pages already crawled ***
> bin/nutch parse -force -all
> bin/nutch updatedb -all
> {code}
> I reproduced it with mongodb 2.6, mongodb 3.0, and hbase-0.98.8-hadoop2.
> It happens only if the page has not changed.
> To reproduce it easily, add the following to nutch-site.xml:
> {code}
> <property>
>   <name>db.fetch.interval.default</name>
>   <value>60</value>
>   <description>The default number of seconds between re-fetches of a page (1 minute)
>   </description>
> </property>
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)