[ https://issues.apache.org/jira/browse/NUTCH-2222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney reassigned NUTCH-2222:
-------------------------------------------

    Assignee: Lewis John McGibbney

> fetch deletes all metadata except _csh_ and _rs_
> -------------------------------------------------
>
>                 Key: NUTCH-2222
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2222
>             Project: Nutch
>          Issue Type: Bug
>          Components: crawldb
>    Affects Versions: 2.3.1
>         Environment: CentOS 6, mongodb 2.6 and mongodb 3.0, and hbase-0.98.8-hadoop2
>            Reporter: Adnane B.
>            Assignee: Lewis John McGibbney
>
> This problem happens the second time I crawl a page.
>
> First time:
> bin/nutch inject urls/
> bin/nutch generate -topN 1000
> bin/nutch fetch -all
> bin/nutch parse -force -all
> bin/nutch updatedb -all
>
> Second time:
> bin/nutch generate -topN 1000 --> batchid changes for all existing pages
> bin/nutch fetch -all --> *** metadata is deleted for all pages already crawled ***
> bin/nutch parse -force -all
> bin/nutch updatedb -all
>
> I can reproduce it with mongodb 2.6, mongodb 3.0, and hbase-0.98.8-hadoop2.
> It happens only if the page has not changed.
>
> To reproduce it easily, add the following to nutch-site.xml:
> <property>
>   <name>db.fetch.interval.default</name>
>   <value>60</value>
>   <description>The default number of seconds between re-fetches of a page (1 minute)
>   </description>
> </property>

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)