[jira] [Commented] (NUTCH-2222) re-fetch deletes all metadata except _csh_ and _rs_
[ https://issues.apache.org/jira/browse/NUTCH-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15193561#comment-15193561 ]

Adnane B. commented on NUTCH-:
------------------------------

Hi Lewis,
Any news about this, please?

> re-fetch deletes all metadata except _csh_ and _rs_
> ---------------------------------------------------
>
>                 Key: NUTCH-
>                 URL: https://issues.apache.org/jira/browse/NUTCH-
>             Project: Nutch
>          Issue Type: Bug
>          Components: crawldb
>    Affects Versions: 2.3.1
>         Environment: Centos 6, mongodb 2.6 and mongodb 3.0 and hbase-0.98.8-hadoop2
>            Reporter: Adnane B.
>            Assignee: Lewis John McGibbney
>             Fix For: 2.3.2
>
>         Attachments: TestReFetch.java, index.html
>
>
> This problem happens the second time I crawl a page:
> {code}
> bin/nutch inject urls/
> bin/nutch generate -topN 1000
> bin/nutch fetch -all
> bin/nutch parse -force -all
> bin/nutch updatedb -all
> {code}
> Second time (re-fetch):
> {code}
> bin/nutch generate -topN 1000   # batchid changes for all existing pages
> bin/nutch fetch -all            # *** metadata is deleted for all pages already crawled ***
> bin/nutch parse -force -all
> bin/nutch updatedb -all
> {code}
> I reproduce it with mongodb 2.6, mongodb 3.0, and hbase-0.98.8-hadoop2.
> It happens only if the page has not changed.
> To reproduce easily, please add to nutch-site.xml:
> {code}
> <property>
>   <name>db.fetch.interval.default</name>
>   <value>60</value>
>   <description>The default number of seconds between re-fetches of a page (1 minute)</description>
> </property>
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
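When reproducing the report, it may help to compare a page's metadata keys before and after the second fetch. The following is only a minimal sketch: it works on plain `Map` snapshots rather than on Nutch's own page objects, and the class name `MetadataDiff`, the method `lostKeys`, and the sample key `my_key` are hypothetical. Only `_csh_` and `_rs_` come from the issue title.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class MetadataDiff {

    /** Returns the keys present in {@code before} but missing from {@code after}, sorted. */
    public static Set<String> lostKeys(Map<String, String> before, Map<String, String> after) {
        Set<String> lost = new TreeSet<>(before.keySet());
        lost.removeAll(after.keySet());
        return lost;
    }

    public static void main(String[] args) {
        // Hypothetical snapshots of one page's metadata, taken by hand from the data store.
        Map<String, String> beforeRefetch = Map.of("_csh_", "a", "_rs_", "b", "my_key", "c");
        Map<String, String> afterRefetch = Map.of("_csh_", "a", "_rs_", "b");

        // Any key listed here disappeared between the two crawl cycles.
        System.out.println("Lost after re-fetch: " + lostKeys(beforeRefetch, afterRefetch));
    }
}
```

The snapshots would have to be exported from MongoDB or HBase by whatever means is at hand; the sketch does not talk to either store.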
[jira] [Comment Edited] (NUTCH-2222) re-fetch deletes all metadata except _csh_ and _rs_
[ https://issues.apache.org/jira/browse/NUTCH-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15178745#comment-15178745 ]

Adnane B. edited comment on NUTCH- at 3/3/16 10:43 PM:
-------------------------------------------------------

Hi [~lewismc]
The problem seems to happen only with gora. With MemStore it works fine.
Please install mongodb, copy index.html into the build/test/data/refetch-test-site directory, then run the test.
Please find the attached files.
Please let me know if it's OK for you to debug the issue with this test.
{code}
// comment this line to test with org.apache.gora.memory.store.MemStore
conf.set("storage.data.store.class", "org.apache.gora.mongodb.store.MongoStore");
{code}

was (Author: abenjell):
Hi [~lewismc]
The problem seems to happen only with gora. With MemStore it works fine.
Please install mongodb, copy index.html into the build/test/data/refetch-test-site directory, then run the test.
Please find the attached files.
Please let me know if it's OK for you to debug the issue with this test.
[jira] [Comment Edited] (NUTCH-2222) re-fetch deletes all metadata except _csh_ and _rs_
[ https://issues.apache.org/jira/browse/NUTCH-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15178745#comment-15178745 ]

Adnane B. edited comment on NUTCH- at 3/3/16 10:42 PM:
-------------------------------------------------------

Hi [~lewismc]
The problem seems to happen only with gora. With MemStore it works fine.
Please install mongodb, copy index.html into the build/test/data/refetch-test-site directory, then run the test.
Please find the attached files.
Please let me know if it's OK for you to debug the issue with this test.

was (Author: abenjell):
Hi [~lewismc]
The problem seems to happen only with gora. With MemStore it works fine.
Please install mongodb, copy index.html into the build/test/data/refetch-test-site directory, then run the test.
Please let me know if it's OK for you to debug the issue with this test.
[jira] [Comment Edited] (NUTCH-2222) re-fetch deletes all metadata except _csh_ and _rs_
[ https://issues.apache.org/jira/browse/NUTCH-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15178745#comment-15178745 ]

Adnane B. edited comment on NUTCH- at 3/3/16 10:41 PM:
-------------------------------------------------------

Hi [~lewismc]
The problem seems to happen only with gora. With MemStore it works fine.
Please install mongodb, copy index.html into the build/test/data/refetch-test-site directory, then run the test.
Please let me know if it's OK for you to debug the issue with this test.

was (Author: abenjell):
Hi [~lewismc]
The problem seems to happen only with gora. With MemStore it works fine.
Please install mongodb, copy index.html into the build/test/data/refetch-test-site directory, then run the test.
Please let me know if it's OK for you to debug the problem with this test.
[jira] [Updated] (NUTCH-2222) re-fetch deletes all metadata except _csh_ and _rs_
[ https://issues.apache.org/jira/browse/NUTCH-?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adnane B. updated NUTCH-:
-------------------------

    Attachment: index.html
                TestReFetch.java

Hi [~lewismc]
The problem seems to happen only with gora. With MemStore it works fine.
Please install mongodb, copy index.html into the build/test/data/refetch-test-site directory, then run the test.
Please let me know if it's OK for you to debug the problem with this test.
[jira] [Commented] (NUTCH-2222) re-fetch deletes all metadata except _csh_ and _rs_
[ https://issues.apache.org/jira/browse/NUTCH-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15177088#comment-15177088 ]

Adnane B. commented on NUTCH-:
------------------------------

I wrote a testReFetch test, but it ignores the mongodb configuration in nutch-site.xml. Any idea, please?
<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.mongodb.store.MongoStore</value>
  <description>Default class for storing data</description>
</property>
[jira] [Comment Edited] (NUTCH-2222) re-fetch deletes all metadata except _csh_ and _rs_
[ https://issues.apache.org/jira/browse/NUTCH-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15177088#comment-15177088 ]

Adnane B. edited comment on NUTCH- at 3/3/16 3:23 AM:
------------------------------------------------------

I wrote a testReFetch test, but it ignores the mongodb configuration in nutch-site.xml. Any idea, please?
{code}
<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.mongodb.store.MongoStore</value>
  <description>Default class for storing data</description>
</property>
{code}

was (Author: abenjell):
I wrote a testReFetch test, but it ignores the mongodb configuration in nutch-site.xml. Any idea, please?
<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.mongodb.store.MongoStore</value>
  <description>Default class for storing data</description>
</property>
[jira] [Commented] (NUTCH-2222) re-fetch deletes all metadata except _csh_ and _rs_
[ https://issues.apache.org/jira/browse/NUTCH-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175528#comment-15175528 ]

Adnane B. commented on NUTCH-:
------------------------------

Hello,
Can you please tell me if there is any workaround? My deadline is very tight and my project depends on nutch.
Best regards,
[jira] [Commented] (NUTCH-2222) re-fetch deletes all metadata except _csh_ and _rs_
[ https://issues.apache.org/jira/browse/NUTCH-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15171889#comment-15171889 ]

Adnane B. commented on NUTCH-:
------------------------------

Thank you very much!
[jira] [Commented] (NUTCH-2222) re-fetch deletes all metadata except _csh_ and _rs_
[ https://issues.apache.org/jira/browse/NUTCH-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15170580#comment-15170580 ]

Adnane B. commented on NUTCH-:
------------------------------

Please let me know if this issue does not exist with any other persistent storage configuration.
[jira] [Commented] (NUTCH-2222) re-fetch deletes all metadata except _csh_ and _rs_
[ https://issues.apache.org/jira/browse/NUTCH-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15170226#comment-15170226 ]

Adnane B. commented on NUTCH-:
------------------------------

Hello,
Were you able to reproduce this issue?
Please let me know if I can help to reproduce it.
[jira] [Updated] (NUTCH-2222) fetch deletes all metadata except _csh_ and _rs_
[ https://issues.apache.org/jira/browse/NUTCH-?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adnane B. updated NUTCH-:
-------------------------

    Description:
This problem happens the second time I crawl a page:
bin/nutch inject urls/
bin/nutch generate -topN 1000
bin/nutch fetch -all
bin/nutch parse -force -all
bin/nutch updatedb -all
Second time:
bin/nutch generate -topN 1000 --> batchid changes for all existing pages
bin/nutch fetch -all --> *** metadata is deleted for all pages already crawled ***
bin/nutch parse -force -all
bin/nutch updatedb -all
I reproduce it with mongodb 2.6, mongodb 3.0, and hbase-0.98.8-hadoop2.
It happens only if the page has not changed.
To reproduce easily, please add to nutch-site.xml:
<property>
  <name>db.fetch.interval.default</name>
  <value>60</value>
  <description>The default number of seconds between re-fetches of a page (1 minute)</description>
</property>

  was:
This problem happens the second time I crawl a page:
bin/nutch inject urls/
bin/nutch generate -topN 1000
bin/nutch fetch -all
bin/nutch parse -force -all
bin/nutch updatedb -all
Second time:
bin/nutch generate -topN 1000 --> batchid changes for all existing pages
bin/nutch fetch -all --> *** metadata is deleted for all pages already crawled ***
bin/nutch parse -force -all
bin/nutch updatedb -all
I reproduce it with mongodb and hbase-0.98.8-hadoop2.
It happens only if the page has not changed.
To reproduce easily, please add to nutch-site.xml:
<property>
  <name>db.fetch.interval.default</name>
  <value>60</value>
  <description>The default number of seconds between re-fetches of a page (1 minute)</description>
</property>
[jira] [Updated] (NUTCH-2222) fetch deletes all metadata except _csh_ and _rs_
[ https://issues.apache.org/jira/browse/NUTCH-?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adnane B. updated NUTCH-:
-------------------------

    Environment: Centos 6, mongodb 2.6 and mongodb 3.0 and hbase-0.98.8-hadoop2  (was: Centos 6, mongodb 2.6 and mongodb 3.0)
[jira] [Issue Comment Deleted] (NUTCH-2222) fetch deletes all metadata except _csh_ and _rs_
[ https://issues.apache.org/jira/browse/NUTCH-?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adnane B. updated NUTCH-:
-------------------------

    Comment: was deleted

(was: To reproduce easily, please add to nutch-site.xml:
<property>
  <name>db.fetch.interval.default</name>
  <value>60</value>
  <description>The default number of seconds between re-fetches of a page (1 minute)</description>
</property>
)
[jira] [Updated] (NUTCH-2222) fetch deletes all metadata except _csh_ and _rs_
[ https://issues.apache.org/jira/browse/NUTCH-?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adnane B. updated NUTCH-:
-------------------------

    Description:
This problem happens the second time I crawl a page:
bin/nutch inject urls/
bin/nutch generate -topN 1000
bin/nutch fetch -all
bin/nutch parse -force -all
bin/nutch updatedb -all
Second time:
bin/nutch generate -topN 1000 --> batchid changes for all existing pages
bin/nutch fetch -all --> *** metadata is deleted for all pages already crawled ***
bin/nutch parse -force -all
bin/nutch updatedb -all
I reproduce it with mongodb and hbase-0.98.8-hadoop2.
It happens only if the page has not changed.
To reproduce easily, please add to nutch-site.xml:
<property>
  <name>db.fetch.interval.default</name>
  <value>60</value>
  <description>The default number of seconds between re-fetches of a page (1 minute)</description>
</property>

  was:
This problem happens the second time I crawl a page:
bin/nutch inject urls/
bin/nutch generate -topN 1000
bin/nutch fetch -all
bin/nutch parse -force -all
bin/nutch updatedb -all
Second time:
bin/nutch generate -topN 1000 --> batchid changes for all existing pages
bin/nutch fetch -all --> *** metadata is deleted for all pages already crawled ***
bin/nutch parse -force -all
bin/nutch updatedb -all
I reproduce it with mongodb and hbase-0.98.8-hadoop2.
It happens only if the page has not changed.
To reproduce easily, please add to nutch-site.xml:
<property>
  <name>db.fetch.interval.default</name>
  <value>60</value>
  <description>The default number of seconds between re-fetches of a page (1 minute)</description>
</property>
Reply
[jira] [Updated] (NUTCH-2222) fetch deletes all metadata except _csh_ and _rs_
[ https://issues.apache.org/jira/browse/NUTCH-?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adnane B. updated NUTCH-:
-------------------------

    Description:
This problem happens the second time I crawl a page:
bin/nutch inject urls/
bin/nutch generate -topN 1000
bin/nutch fetch -all
bin/nutch parse -force -all
bin/nutch updatedb -all
Second time:
bin/nutch generate -topN 1000 --> batchid changes for all existing pages
bin/nutch fetch -all --> *** metadata is deleted for all pages already crawled ***
bin/nutch parse -force -all
bin/nutch updatedb -all
I reproduce it with mongodb and hbase-0.98.8-hadoop2.
It happens only if the page has not changed.
To reproduce easily, please add to nutch-site.xml:
<property>
  <name>db.fetch.interval.default</name>
  <value>60</value>
  <description>The default number of seconds between re-fetches of a page (1 minute)</description>
</property>
Reply

  was:
This problem happens the second time I crawl a page:
bin/nutch inject urls/
bin/nutch generate -topN 1000
bin/nutch fetch -all
bin/nutch parse -force -all
bin/nutch updatedb -all
Second time:
bin/nutch generate -topN 1000 --> batchid changes for all existing pages
bin/nutch fetch -all --> *** metadata is deleted for all pages already crawled ***
bin/nutch parse -force -all
bin/nutch updatedb -all
I reproduce it with mongodb and hbase-0.98.8-hadoop2.
It happens only if the page has not changed.
[jira] [Commented] (NUTCH-2222) fetch deletes all metadata except _csh_ and _rs_
[ https://issues.apache.org/jira/browse/NUTCH-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15160086#comment-15160086 ]

Adnane B. commented on NUTCH-:
------------------------------

To reproduce easily, please add to nutch-site.xml:
<property>
  <name>db.fetch.interval.default</name>
  <value>60</value>
  <description>The default number of seconds between re-fetches of a page (1 minute)</description>
</property>
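Setting `db.fetch.interval.default` to 60 means a page fetched in the first cycle becomes eligible for re-fetch about a minute later, so the second cycle can be run almost immediately. The sketch below is a deliberately simplified model of that rule and is not Nutch's actual scheduling code; the class `RefetchDue` and the method `isDue` are hypothetical names used for illustration only.

```java
import java.time.Duration;
import java.time.Instant;

public class RefetchDue {

    /** True when at least {@code interval} has elapsed since the previous fetch. */
    public static boolean isDue(Instant lastFetch, Duration interval, Instant now) {
        // Simplified assumption: a page is due once lastFetch + interval is not in the future.
        return !now.isBefore(lastFetch.plus(interval));
    }

    public static void main(String[] args) {
        // 60 seconds, the value suggested for db.fetch.interval.default in this thread.
        Duration interval = Duration.ofSeconds(60);
        // Hypothetical time of the first fetch.
        Instant firstFetch = Instant.parse("2016-03-03T10:00:00Z");

        System.out.println("After 30 s: " + isDue(firstFetch, interval, firstFetch.plusSeconds(30)));
        System.out.println("After 90 s: " + isDue(firstFetch, interval, firstFetch.plusSeconds(90)));
    }
}
```

With a much longer interval, the second `generate` would not select the already-fetched pages, and the metadata loss would not show up until the interval had elapsed.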
[jira] [Updated] (NUTCH-2222) fetch deletes all metadata except _csh_ and _rs_
[ https://issues.apache.org/jira/browse/NUTCH-?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adnane B. updated NUTCH-:
-
Description:
This problem happens the second time I crawl a page
bin/nutch inject urls/
bin/nutch generate -topN 1000
bin/nutch fetch -all
bin/nutch parse -force -all
bin/nutch updatedb -all
second time:
bin/nutch generate -topN 1000 --> batchid changes for all existing pages
bin/nutch fetch -all --> *** metadata are deleted for all pages already crawled **
bin/nutch parse -force -all
bin/nutch updatedb -all
I can reproduce it with mongodb and hbase-0.98.8-hadoop2
It happens only if the page has not changed

was:
This problem happens the second time I crawl a page
bin/nutch inject urls/
bin/nutch generate -topN 1000
bin/nutch fetch -all
bin/nutch parse -force -all
bin/nutch updatedb -all
second time:
bin/nutch generate -topN 1000 --> batchid changes for all existing pages
bin/nutch fetch -all --> *** metadata are deleted for all pages already crawled **
bin/nutch parse -force -all
bin/nutch updatedb -all
I can reproduce it with mongodb and hbase
It happens only if the page has not changed

> fetch deletes all metadata except _csh_ and _rs_
> -
>
> Key: NUTCH-
> URL: https://issues.apache.org/jira/browse/NUTCH-
> Project: Nutch
> Issue Type: Bug
> Components: crawldb
> Affects Versions: 2.3.1
> Environment: Centos 6, mongodb 2.6 and mongodb 3.0
> Reporter: Adnane B.
>
> This problem happens the second time I crawl a page
> bin/nutch inject urls/
> bin/nutch generate -topN 1000
> bin/nutch fetch -all
> bin/nutch parse -force -all
> bin/nutch updatedb -all
> second time:
> bin/nutch generate -topN 1000 --> batchid changes for all existing pages
> bin/nutch fetch -all --> *** metadata are deleted for all pages already crawled **
> bin/nutch parse -force -all
> bin/nutch updatedb -all
> I can reproduce it with mongodb and hbase-0.98.8-hadoop2
> It happens only if the page has not changed
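The symptom reported here (every metadata entry except `_csh_` and `_rs_` disappearing after the second fetch) can be checked by taking a snapshot of a page's metadata map before and after the re-fetch and comparing the keys. The following is a minimal Python sketch, assuming the metadata has already been read out of the store into plain dictionaries; the helper name `lost_metadata_keys` and the `custom_key_*` entries are hypothetical, only `_csh_` and `_rs_` come from the issue.

```python
def lost_metadata_keys(before, after):
    """Return, sorted, the metadata keys present before the re-fetch but missing after it."""
    return sorted(set(before) - set(after))


# Hypothetical snapshots of one page's metadata map.
before = {
    "_csh_": b"\x00",
    "_rs_": b"\x01",
    "custom_key_a": b"value-a",  # hypothetical user metadata
    "custom_key_b": b"value-b",  # hypothetical user metadata
}
after = {
    "_csh_": b"\x00",
    "_rs_": b"\x01",
}

print(lost_metadata_keys(before, after))
```

An empty result would mean the re-fetch preserved all keys; a non-empty result lists what was dropped.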
[jira] [Updated] (NUTCH-2222) fetch deletes all metadata except _csh_ and _rs_
[ https://issues.apache.org/jira/browse/NUTCH-?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adnane B. updated NUTCH-:
-
Description:
This problem happens the second time I crawl a page
bin/nutch inject urls/
bin/nutch generate -topN 1000
bin/nutch fetch -all
bin/nutch parse -force -all
bin/nutch updatedb -all
second time:
bin/nutch generate -topN 1000 --> batchid changes for all existing pages
bin/nutch fetch -all --> *** metadata are deleted for all pages already crawled **
bin/nutch parse -force -all
bin/nutch updatedb -all
I can reproduce it with mongodb and hbase
It happens only if the page has not changed

was:
This problem happens the second time I crawl a page
bin/nutch inject urls/
bin/nutch generate -topN 1000
bin/nutch fetch -all
bin/nutch parse -force -all
bin/nutch updatedb -all
second time:
bin/nutch generate -topN 1000 --> batchid changes for all existing pages
bin/nutch fetch -all --> *** metadata are deleted for all pages already crawled **
bin/nutch parse -force -all
bin/nutch updatedb -all
I'm using mongodb

> fetch deletes all metadata except _csh_ and _rs_
> -
>
> Key: NUTCH-
> URL: https://issues.apache.org/jira/browse/NUTCH-
> Project: Nutch
> Issue Type: Bug
> Components: crawldb
> Affects Versions: 2.3.1
> Environment: Centos 6, mongodb 2.6 and mongodb 3.0
> Reporter: Adnane B.
>
> This problem happens the second time I crawl a page
> bin/nutch inject urls/
> bin/nutch generate -topN 1000
> bin/nutch fetch -all
> bin/nutch parse -force -all
> bin/nutch updatedb -all
> second time:
> bin/nutch generate -topN 1000 --> batchid changes for all existing pages
> bin/nutch fetch -all --> *** metadata are deleted for all pages already crawled **
> bin/nutch parse -force -all
> bin/nutch updatedb -all
> I can reproduce it with mongodb and hbase
> It happens only if the page has not changed
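The two crawl cycles from the description can be written down as data, which makes the difference between the first run and the re-fetch explicit (only the first run injects seed URLs). This is a minimal Python sketch that only builds and prints the command lines quoted in the issue; the function name `crawl_cycle` is hypothetical, and nothing is executed.

```python
def crawl_cycle(first_run):
    """Return the bin/nutch command lines for one crawl cycle, as argv lists.

    The commands and flags are copied from the issue description;
    only the first cycle includes the inject step.
    """
    steps = []
    if first_run:
        steps.append(["bin/nutch", "inject", "urls/"])
    steps += [
        ["bin/nutch", "generate", "-topN", "1000"],
        ["bin/nutch", "fetch", "-all"],
        ["bin/nutch", "parse", "-force", "-all"],
        ["bin/nutch", "updatedb", "-all"],
    ]
    return steps


# Print the first crawl followed by the re-fetch cycle.
for argv in crawl_cycle(True) + crawl_cycle(False):
    print(" ".join(argv))
```

To actually run the cycles, each argv list could be passed to `subprocess.run(argv, check=True)` from the Nutch runtime directory, waiting long enough between the two cycles for the pages to become due again.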
[jira] [Updated] (NUTCH-2222) fetch deletes all metadata except _csh_ and _rs_
[ https://issues.apache.org/jira/browse/NUTCH-?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adnane B. updated NUTCH-:
-
Description:
This problem happens the second time I crawl a page
bin/nutch inject urls/
bin/nutch generate -topN 1000
bin/nutch fetch -all
bin/nutch parse -force -all
bin/nutch updatedb -all
second time:
bin/nutch generate -topN 1000 --> batchid changes for all existing pages
bin/nutch fetch -all --> *** metadata are deleted for all pages already crawled **
bin/nutch parse -force -all
bin/nutch updatedb -all
I'm using mongodb

was:
This problem happens the second time a crawl a page
bin/nutch inject urls/
bin/nutch generate -topN 1000
bin/nutch fetch -all
bin/nutch parse -force -all
bin/nutch updatedb -all
second time:
bin/nutch generate -topN 1000 --> batchid changes for all existing pages
bin/nutch fetch -all --> *** metadata are deleted for all pages already crawled **
bin/nutch parse -force -all
bin/nutch updatedb -all
I'm using mongodb

> fetch deletes all metadata except _csh_ and _rs_
> -
>
> Key: NUTCH-
> URL: https://issues.apache.org/jira/browse/NUTCH-
> Project: Nutch
> Issue Type: Bug
> Components: crawldb
> Affects Versions: 2.3.1
> Environment: Centos 6, mongodb 2.6 and mongodb 3.0
> Reporter: Adnane B.
>
> This problem happens the second time I crawl a page
> bin/nutch inject urls/
> bin/nutch generate -topN 1000
> bin/nutch fetch -all
> bin/nutch parse -force -all
> bin/nutch updatedb -all
> second time:
> bin/nutch generate -topN 1000 --> batchid changes for all existing pages
> bin/nutch fetch -all --> *** metadata are deleted for all pages already crawled **
> bin/nutch parse -force -all
> bin/nutch updatedb -all
> I'm using mongodb
[jira] [Updated] (NUTCH-2222) fetch deletes all metadata except _csh_ and _rs_
[ https://issues.apache.org/jira/browse/NUTCH-?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adnane B. updated NUTCH-:
-
Description:
This problem happens the second time a crawl a page
bin/nutch inject urls/
bin/nutch generate -topN 1000
bin/nutch fetch -all
bin/nutch parse -force -all
bin/nutch updatedb -all
second time:
bin/nutch generate -topN 1000 --> batchid changes for all existing pages
bin/nutch fetch -all --> *** metadata are deleted for all pages already crawled **
bin/nutch parse -force -all
bin/nutch updatedb -all
I'm using mongodb

was:
This problem happens the second time a crawl a page
bin/nutch inject urls/
bin/nutch generate -topN 1000
bin/nutch fetch -all
bin/nutch parse -force -all
bin/nutch updatedb -all
second time:
bin/nutch generate -topN 1000 --> bachid changes for all existing pages
bin/nutch fetch -all --> *** metadata are deleted for all pages already crawled **
bin/nutch parse -force -all
bin/nutch updatedb -all
I'm using mongodb

> fetch deletes all metadata except _csh_ and _rs_
> -
>
> Key: NUTCH-
> URL: https://issues.apache.org/jira/browse/NUTCH-
> Project: Nutch
> Issue Type: Bug
> Components: crawldb
> Affects Versions: 2.3.1
> Environment: Centos 6, mongodb 2.6 and mongodb 3.0
> Reporter: Adnane B.
>
> This problem happens the second time a crawl a page
> bin/nutch inject urls/
> bin/nutch generate -topN 1000
> bin/nutch fetch -all
> bin/nutch parse -force -all
> bin/nutch updatedb -all
> second time:
> bin/nutch generate -topN 1000 --> batchid changes for all existing pages
> bin/nutch fetch -all --> *** metadata are deleted for all pages already crawled **
> bin/nutch parse -force -all
> bin/nutch updatedb -all
> I'm using mongodb
[jira] [Updated] (NUTCH-2222) fetch deletes all metadata except _csh_ and _rs_
[ https://issues.apache.org/jira/browse/NUTCH-?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adnane B. updated NUTCH-:
-
Description:
This problem happens the second time a crawl a page
bin/nutch inject urls/
bin/nutch generate -topN 1000
bin/nutch fetch -all
bin/nutch parse -force -all
bin/nutch updatedb -all
second time:
bin/nutch generate -topN 1000 --> bachid changes for all existing pages
bin/nutch fetch -all --> *** metadata are deleted for all pages already crawled **
bin/nutch parse -force -all
bin/nutch updatedb -all
I'm using mongodb

was:
This problem happens the second time a crawl a page
bin/nutch inject urls/
bin/nutch generate -topN 1000
bin/nutch fetch -all
bin/nutch parse -force -all
bin/nutch updatedb -all
second time:
bin/nutch generate -topN 1000
bin/nutch fetch -all --> *** metadata are deleted for all pages already crawled **
bin/nutch parse -force -all
bin/nutch updatedb -all
I'm using mongodb

> fetch deletes all metadata except _csh_ and _rs_
> -
>
> Key: NUTCH-
> URL: https://issues.apache.org/jira/browse/NUTCH-
> Project: Nutch
> Issue Type: Bug
> Components: crawldb
> Affects Versions: 2.3.1
> Environment: Centos 6, mongodb 2.6 and mongodb 3.0
> Reporter: Adnane B.
>
> This problem happens the second time a crawl a page
> bin/nutch inject urls/
> bin/nutch generate -topN 1000
> bin/nutch fetch -all
> bin/nutch parse -force -all
> bin/nutch updatedb -all
> second time:
> bin/nutch generate -topN 1000 --> bachid changes for all existing pages
> bin/nutch fetch -all --> *** metadata are deleted for all pages already crawled **
> bin/nutch parse -force -all
> bin/nutch updatedb -all
> I'm using mongodb
[jira] [Updated] (NUTCH-2222) fetch deletes all metadata except _csh_ and _rs_
[ https://issues.apache.org/jira/browse/NUTCH-?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adnane B. updated NUTCH-:
-
Description:
This problem happens the second time a crawl a page
bin/nutch inject urls/
bin/nutch generate -topN 1000
bin/nutch fetch -all
bin/nutch parse -force -all
bin/nutch updatedb -all
second time:
bin/nutch generate -topN 1000
bin/nutch fetch -all --> *** metadata are deleted for all pages already crawled **
bin/nutch parse -force -all
bin/nutch updatedb -all
I'm using mongodb

was:
This problem happens at the second update on a crawled page **that has not changed** with the -all option
bin/nutch updatedb -all
Not tested with other options
I'm using mongodb

> fetch deletes all metadata except _csh_ and _rs_
> -
>
> Key: NUTCH-
> URL: https://issues.apache.org/jira/browse/NUTCH-
> Project: Nutch
> Issue Type: Bug
> Components: crawldb
> Affects Versions: 2.3.1
> Environment: Centos 6, mongodb 2.6 and mongodb 3.0
> Reporter: Adnane B.
>
> This problem happens the second time a crawl a page
> bin/nutch inject urls/
> bin/nutch generate -topN 1000
> bin/nutch fetch -all
> bin/nutch parse -force -all
> bin/nutch updatedb -all
> second time:
> bin/nutch generate -topN 1000
> bin/nutch fetch -all --> *** metadata are deleted for all pages already crawled **
> bin/nutch parse -force -all
> bin/nutch updatedb -all
> I'm using mongodb
[jira] [Updated] (NUTCH-2222) fetch deletes all metadata except _csh_ and _rs_
[ https://issues.apache.org/jira/browse/NUTCH-?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adnane B. updated NUTCH-:
-
Summary: fetch deletes all metadata except _csh_ and _rs_ (was: updatedb deletes all metadata except _csh_ and _rs_)

> fetch deletes all metadata except _csh_ and _rs_
> -
>
> Key: NUTCH-
> URL: https://issues.apache.org/jira/browse/NUTCH-
> Project: Nutch
> Issue Type: Bug
> Components: crawldb
> Affects Versions: 2.3.1
> Environment: Centos 6, mongodb 2.6 and mongodb 3.0
> Reporter: Adnane B.
>
> This problem happens at the second update on a crawled page **that has not changed** with the -all option
> bin/nutch updatedb -all
> Not tested with other options
> I'm using mongodb
[jira] [Updated] (NUTCH-2222) updatedb deletes all metadata except _csh_ and _rs_
[ https://issues.apache.org/jira/browse/NUTCH-?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adnane B. updated NUTCH-:
-
Description:
This problem happens at the second update on a crawled page **that has not changed** with the -all option
bin/nutch updatedb -all
Not tested with other options
I'm using mongodb

was:
This problem happens at the second update on a crawled page with the -all option
bin/nutch updatedb -all
Not tested with other options
I'm using mongodb

> updatedb deletes all metadata except _csh_ and _rs_
> -
>
> Key: NUTCH-
> URL: https://issues.apache.org/jira/browse/NUTCH-
> Project: Nutch
> Issue Type: Bug
> Components: crawldb
> Affects Versions: 2.3.1
> Environment: Centos 6, mongodb 2.6 and mongodb 3.0
> Reporter: Adnane B.
>
> This problem happens at the second update on a crawled page **that has not changed** with the -all option
> bin/nutch updatedb -all
> Not tested with other options
> I'm using mongodb
[jira] [Created] (NUTCH-2222) updatedb deletes all metadata except _csh_ and _rs_
Adnane B. created NUTCH-:

Summary: updatedb deletes all metadata except _csh_ and _rs_
Key: NUTCH-
URL: https://issues.apache.org/jira/browse/NUTCH-
Project: Nutch
Issue Type: Bug
Components: crawldb
Affects Versions: 2.3.1
Environment: Centos 6, mongodb 2.6 and mongodb 3.0
Reporter: Adnane B.

This problem happens at the second update on a crawled page with the -all option
bin/nutch updatedb -all
Not tested with other options
I'm using mongodb