[ https://issues.apache.org/jira/browse/NUTCH-2222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Adnane B. updated NUTCH-2222:
-----------------------------
    Description:
This problem happens the second time I crawl a page.

First time:
bin/nutch inject urls/
bin/nutch generate -topN 1000
bin/nutch fetch -all
bin/nutch parse -force -all
bin/nutch updatedb -all

Second time:
bin/nutch generate -topN 1000  --> batchid changes for all existing pages
bin/nutch fetch -all  --> *** metadata is deleted for all pages already crawled ***
bin/nutch parse -force -all
bin/nutch updatedb -all

I reproduced it with mongodb 2.6, mongodb 3.0, and hbase-0.98.8-hadoop2.
It happens only if the page has not changed.

To reproduce easily, please add to nutch-site.xml:
<property>
  <name>db.fetch.interval.default</name>
  <value>60</value>
  <description>The default number of seconds between re-fetches of a page (1 minute)
  </description>
</property>

  was:
This problem happens the second time I crawl a page.

First time:
bin/nutch inject urls/
bin/nutch generate -topN 1000
bin/nutch fetch -all
bin/nutch parse -force -all
bin/nutch updatedb -all

Second time:
bin/nutch generate -topN 1000  --> batchid changes for all existing pages
bin/nutch fetch -all  --> *** metadata is deleted for all pages already crawled ***
bin/nutch parse -force -all
bin/nutch updatedb -all

I reproduced it with mongodb and hbase-0.98.8-hadoop2.
It happens only if the page has not changed.

To reproduce easily, please add to nutch-site.xml:
<property>
  <name>db.fetch.interval.default</name>
  <value>60</value>
  <description>The default number of seconds between re-fetches of a page (1 minute)
  </description>
</property>

> fetch deletes all metadata except _csh_ and _rs_
> -------------------------------------------------
>
>                 Key: NUTCH-2222
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2222
>             Project: Nutch
>          Issue Type: Bug
>          Components: crawldb
>    Affects Versions: 2.3.1
>         Environment: Centos 6, mongodb 2.6 and mongodb 3.0 and hbase-0.98.8-hadoop2
>            Reporter: Adnane B.
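The two crawl rounds from the description can be scripted so the report is easier to re-run. The following is only a sketch under stated assumptions: the helper names `run`, `crawl_round` and `reproduce`, and the `NUTCH`, `DRY_RUN` and `WAIT_SECONDS` variables, are hypothetical and not part of Nutch; the pause between rounds assumes `db.fetch.interval.default` is set to 60 as suggested above, so that the pages are due for a re-fetch. By default it only prints the commands.

```shell
#!/bin/sh
# Sketch of the reproduction steps from the issue description.
# NUTCH, DRY_RUN, WAIT_SECONDS and the function names are hypothetical helpers.
NUTCH="${NUTCH:-bin/nutch}"

# Print the command in dry-run mode (the default), otherwise execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# One generate/fetch/parse/updatedb cycle, with the flags used in the report.
crawl_round() {
  run "$NUTCH" generate -topN 1000
  run "$NUTCH" fetch -all
  run "$NUTCH" parse -force -all
  run "$NUTCH" updatedb -all
}

reproduce() {
  run "$NUTCH" inject urls/
  crawl_round
  # Assumption: wait slightly longer than the 60 s fetch interval
  # so the unchanged pages are selected again by generate.
  run sleep "${WAIT_SECONDS:-61}"
  crawl_round
}

reproduce
```

Run it with `DRY_RUN=0` from the Nutch runtime directory to actually execute the commands, then compare the stored metadata of an unchanged page before and after the second fetch.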